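arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

Large Language Models / LLM

Large language models, pre-training, instruction tuning, post-training, and language model applications.

Included for today / the current date: 63. Signal sources: cs.CL, cs.AI, cs.LG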
2606.18431 2026-06-18 cs.LG cs.DC New submission 85%

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta

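Affiliations * Cornell University, Computer Science Department; Cornell University, Electrical and Computer Engineering Department; Cornell University, Operations Research and Information Engineering Department; Microsoft Azure System Research; NVIDIA Corporation

Topic match Other LLM: proposes an LLM inference scheduling framework that optimizes tail latency

AI summary To address the fragility of length-prediction-based scheduling in LLM inference under distribution shift and its weak control of tail latency, the paper proposes a prediction-free, distribution-aware scheduling framework that applies soft priority boosting driven by lightweight statistical signals, combined with cache-aware preemption, reducing P99 TTLT by 35-50% and TTFT by 34-47% across workloads.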

Journal ref Forty-Third International Conference on Machine Learning (2026)

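Details
AI Chinese abstract

LLM serving exhibits extreme length variability, which makes size-based scheduling hard in practice. Recent LLM schedulers approximate SJF/SRPT with predicted decode lengths or ranks and mainly report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shift, bursty arrivals, and GPU memory pressure, and that they offer limited control over the tail latency (P90-P99) that dominates user experience, even with perfect knowledge of decode lengths. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluations on production and open-source traces show that, relative to SRPT with perfect length knowledge, our method reduces P99 TTLT by up to 35-50% and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.

English abstract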

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.
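
As an illustration only, soft priority boosting driven by a lightweight statistical signal might look like the sketch below. The abstract does not state the scoring rule, so everything here is an assumption: the `Request` fields, the `soft_priority` formula, the use of a running P90 of decode lengths as the signal, and the `boost` parameter are hypothetical and are not the paper's method.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival: float  # arrival time in seconds
    decoded: int    # tokens generated so far


def soft_priority(req, now, p90_decoded, boost=2.0):
    """Lower score is served first. Hypothetical rule, not the paper's."""
    wait = now - req.arrival
    # Base score favours requests with little attained service and long waits.
    score = req.decoded - wait
    # Soft boost: a request already past the observed P90 decode length is
    # in the tail, so its score is lowered in proportion to the overshoot.
    if req.decoded > p90_decoded:
        score -= boost * (req.decoded - p90_decoded)
    return score


def pick_next(queue, now, p90_decoded, boost=2.0):
    """Return the queued request with the best (lowest) score."""
    return min(queue, key=lambda r: soft_priority(r, now, p90_decoded, boost))
```

In this sketch a request that has decoded past the observed P90 is nudged toward completion rather than being starved by a steady stream of shorter requests; no per-request length prediction is involved.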

2606.18394 2026-06-18 cs.CL New submission 85%

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang

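Affiliations * UC San Diego; Zhejiang University; UIUC; Nanjing University; StepFun

Topic match Other LLM: proposes parallel tree drafting to accelerate LLM speculative decoding

AI summary Proposes the JetFlow framework, which combines a causal parallel draft head with tree speculative decoding to turn larger draft budgets into longer accepted prefixes and higher end-to-end speedup, reaching up to 9.64x speedup on Qwen3 models.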

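Details
AI Chinese abstract

Speculative decoding (SD) accelerates autoregressive large language models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limit: a larger draft budget improves speed only when the acceptance rate stays high and drafting overhead stays low. This ceiling is hard to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters generate path-conditioned candidates that suit tree speculative decoding and yield higher acceptance lengths, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions at once, but their branch-agnostic marginals can form trees that are individually plausible yet mutually inconsistent, wasting budget and lowering acceptance. We propose JetFlow, a head-based SD framework that combines single-forward-pass drafting efficiency with branch-level causal conditioning. JetFlow trains a causal parallel draft head on fused hidden states of the frozen target model and generates candidate trees aligned with the target model's autoregressive factorization. This lets JetFlow convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. On math, coding, and chat benchmarks with dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, and vLLM integration further reduces latency under realistic serving loads. Our code and models are available at this https URL.

English abstract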

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.
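
For readers new to the "accepted prefix" that the abstract refers to, here is a minimal sketch of the generic greedy verification step for a single drafted path. It is not JetFlow's tree drafting or its draft head; the function and argument names are hypothetical, and the target model's outputs are passed in as a plain list rather than computed.

```python
def accepted_prefix(draft, target_next):
    """Greedy verification of one drafted path.

    draft[i] is the i-th drafted token. target_next[i] is the target model's
    greedy token given the prompt plus draft[:i], so target_next has one more
    entry than draft (all positions are scored in one parallel pass).
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must have len(draft) + 1 entries")
    n = 0
    # Accept drafted tokens while they match what the target would have chosen.
    while n < len(draft) and draft[n] == target_next[n]:
        n += 1
    # The target's own token at the first mismatch (or after a full match)
    # is always emitted, so every verification step yields at least one token.
    return draft[:n] + [target_next[n]]
```

A longer accepted prefix means more tokens per target forward pass, which is why a larger draft budget only helps when acceptance stays high.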

2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE 85%

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

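Affiliations * Yokohama National University

Topic match Other LLM: proposes the OPTS method for optimizing LLM prompt strategies

AI summary This paper proposes OPTS, which improves EvoPrompt by explicitly selecting prompt design strategies; a Thompson sampling mechanism, validated on BIG-Bench Hard, achieves the best results.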

Comments Accepted to ACL 2025 Findings

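Details
AI Chinese abstract

Prompt optimization aims to find effective prompts that improve the performance of large language models. Although existing methods have discovered effective prompts, these often differ from the sophisticated prompts carefully designed by human experts. Prompt design strategies, as best practices for improving prompt performance, are key to optimizing prompts. Recently, the Autonomous Prompt Engineering Toolbox (APET) integrated multiple prompt design strategies into the prompt optimization process. In APET, the LLM must implicitly select and apply suitable strategies, because prompt design strategies can have negative effects. This implicit selection may perform poorly because of the LLM's limited optimization ability. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based method, and integrate them into EvoPrompt. In prompt-optimization experiments on Llama-3-8B-Instruct and GPT-4o mini using BIG-Bench Hard, the results show that selecting prompt design strategies improves EvoPrompt's performance and that the Thompson sampling mechanism achieves the best overall results. Our experimental code is available at https://github.com/shiralab/OPTS.

English abstract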

Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM is needed to implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS .
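
To make explicit strategy selection concrete, the following is a minimal Beta-Bernoulli Thompson sampling sketch. It is a generic bandit, not the OPTS code: the class name, the binary "did the prompt improve" reward, and the strategy names used in the usage note are all assumptions made for illustration.

```python
import random


class StrategyBandit:
    """Beta-Bernoulli Thompson sampling over prompt design strategies."""

    def __init__(self, strategies, seed=None):
        self.strategies = list(strategies)
        # Beta(1, 1) prior for every strategy (uniform over success rates).
        self.wins = {s: 1 for s in self.strategies}
        self.losses = {s: 1 for s in self.strategies}
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success rate per strategy and pick the largest draw."""
        draws = {
            s: self.rng.betavariate(self.wins[s], self.losses[s])
            for s in self.strategies
        }
        return max(draws, key=draws.get)

    def update(self, strategy, improved):
        """Record whether applying the strategy improved the prompt score."""
        if improved:
            self.wins[strategy] += 1
        else:
            self.losses[strategy] += 1
```

A hypothetical loop would call `bandit.select()` to choose, say, between `"chain_of_thought"` and `"role_prompting"`, apply the chosen strategy when mutating a prompt, score the result, and call `bandit.update(...)`. Strategies that tend to hurt are then sampled less often instead of being left to the LLM's implicit choice.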

2506.09822 2026-06-18 cs.CE cs.AI 85%

Superstudent intelligence in thermodynamics

Rebecca Loubet, Pascal Zittlau, Marco Hoffmann, Luisa Vollmer, Sophie Fellenz, Heike Leitte, Fabian Jirasek, Johannes Lenhard, Hans Hasse

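Affiliations * Laboratory of Engineering Thermodynamics (LTD); Visual Information Analysis Research Group (VIA); Machine Learning Research Group (ML)

Topic match Other LLM: evaluates the o3 model's performance on a thermodynamics exam

AI summary The study shows that OpenAI's o3 model outperformed all students in a thermodynamics exam, demonstrating machine capability on complex tasks, with implications for engineering education and practice.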

Comments This document is the unedited Author's version of a yet to be Submitted Work to Physical Review Physics Education Research. 15 pages, 2 figures, Graphical Abstract, Highlights and SI available (12 pages)

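Details
AI Chinese abstract

In this paper, we report and analyze a striking event: OpenAI's large language model o3 beat all students in a thermodynamics exam. The thermodynamics exam is a hurdle for most students, who must demonstrate mastery of the fundamentals of this important subject. Consequently, failure rates are high, and A grades are rare and regarded as proof of exceptional intellectual ability. This is because pattern learning does not help in the exam. The problems can only be solved by creatively combining principles of thermodynamics. We gave our latest thermodynamics exam not only to the students but also to o3, OpenAI's most powerful reasoning model, and assessed its answers in the same way. In zero-shot mode, o3 solved all problems correctly, better than every student who took the exam; its overall score was in the range of the best scores in more than 10,000 similar exams since 1985. This marks a turning point: machines now excel at complex tasks that are usually taken as proof of human intellectual capability. We discuss the implications for the work of engineers and the education of future engineers.

English abstract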

In this short note, we report and analyze a striking event: OpenAI's large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students' exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI's most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.

2504.12347 2026-06-18 cs.CL cs.AI cs.CY 85%

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

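Affiliations * Faculty of Information Technology; University of Jyväskylä; Faculty of Humanities and Social Sciences

Topic match Other LLM: evaluates LLM ability on upper secondary mathematics exams

AI summary This paper evaluates the mathematical ability of different large language models on the Finnish matriculation examination and finds that performance improves markedly as models evolve, with some models near perfect, showing the rapid progress of LLM mathematical ability and its potential in education.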

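Details
AI Chinese abstract

Large language models (LLMs) show growing promise in educational settings, but their mathematical reasoning has been regarded as still evolving. This study evaluates the mathematical ability of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests showed moderate performance corresponding to mid-range grades, but later evaluations showed substantial improvement as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, on par with top students and meeting university admission requirements. Our findings highlight the rapid progress of LLM mathematical ability and show its potential as a tool for supporting learning and teaching.

English abstract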

Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential as underlying tools to support learning and teaching in a variety of ways.

2606.19256 2026-06-18 cs.AI New submission 80%

X+Slides: Benchmarking Audience-Conditioned Slide Generation

Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

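Affiliations * Shanghai Jiao Tong University; Harbin Institute of Technology; SenseTime

Topic match Other LLM: benchmark for LLM slide generation

AI summary Proposes the X+Slides benchmark, which uses a dynamic evaluation framework and audience-specific weights to measure slide generation systems on audience coverage, domain-wise coverage, efficiency, and correctness, revealing that existing systems fall short in recovering audience-critical information.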

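Details
AI Chinese abstract

Automatically generating slides from source documents is an important application of large language models (LLMs). Existing benchmarks mainly assess slide completeness and technical depth while overlooking the target audience, a key real-world factor. For example, specialists need rigorous proofs, whereas decision-makers prioritize actionable conclusions. To close this gap, we introduce X+Slides, a benchmark designed specifically for audience-conditioned slide generation. Built on a diverse corpus covering 113 topics and seven presentation scenes, X+Slides uses a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures the utility delivered per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems recover a substantial but still incomplete share of audience-essential information: at τ_A=0.7, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853, while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

English abstract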

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $τ_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

2606.18946 2026-06-18 cs.CL New submission 80%

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

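Affiliations * Northwestern Polytechnical University; Zhejiang Lab

Topic match Other LLM: detects LLM-generated text by modeling inter-sentence dependencies

AI summary For sentence-level AI-text detection in human-AI hybrid documents, proposes the SenFlow model, which models inter-sentence dependencies through graph propagation and CRF decoding, improving cross-domain F1 by 4.15 percentage points on the MOSAIC benchmark.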

Comments 16 pages, 4 figures, 9 tables

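Details
AI Chinese abstract

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, in which humans and LLMs co-author the same text, faces two gaps: existing methods classify each sentence in isolation and ignore inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct the MOSAIC benchmark, which contains 16,000 hybrid documents from PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under strict quality controls, including a perplexity-consistency filter missing from prior benchmarks. We recast S-AGTD as structured prediction over the document's sentence sequence and instantiate it as SenFlow, which combines graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow achieves state-of-the-art performance on MOSAIC, improving average Macro-F1 by 4.15 percentage points on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors can still exploit. Code and data: this https URL

English abstract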

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow
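
The difference between classifying sentences in isolation and decoding them jointly can be seen in a small, generic Viterbi decoder for a linear-chain model. This is a textbook sketch, not SenFlow's implementation: the score matrices would come from a trained model, and the numbers in the usage note are made up.

```python
def viterbi(emissions, transitions):
    """Best label sequence under a linear-chain model.

    emissions[t][k]: score of label k for sentence t.
    transitions[j][k]: score of moving from label j to label k.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    back = []  # back[t-1][k] = best previous label when sentence t has label k
    for em in emissions[1:]:
        prev, score, ptr = score, [], []
        for k in range(n_labels):
            j = max(range(n_labels), key=lambda j: prev[j] + transitions[j][k])
            ptr.append(j)
            score.append(prev[j] + transitions[j][k] + em[k])
        back.append(ptr)
    # Trace back from the best final label.
    k = max(range(n_labels), key=lambda k: score[k])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]
```

With labels 0 (human) and 1 (AI), emissions `[[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]` and all-zero transitions reproduce the per-sentence argmax `[0, 1, 0]`. With transitions `[[1.0, -1.0], [-1.0, 1.0]]` that reward label continuity, the weakly scored middle sentence is relabeled and the result becomes `[0, 0, 0]`.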

2606.18922 2026-06-18 cs.CL cs.AI New submission 80%

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

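Affiliations * Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

Topic match Other LLM: evaluates LLM understanding of negation and figurative language

AI summary By developing a newly annotated dataset, this study tests how well several large language models understand negation in figurative language and finds that the combination of negation and figurativeness challenges the models and that performance depends heavily on prompt style.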

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

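Details
AI Chinese abstract

Figurative language and negation are two areas that challenge current language models, yet both are widely used in written and spoken language. Large language models (LLMs) are also widely used in everyday settings where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand how well LLMs correctly interpret text that contains both negation and figurative language. To investigate this, we develop a new set of annotations for an existing figurative language dataset and test a range of language models on it. We find that the combination of negation and figurativeness can pose a particular challenge, and that performance overall and across different negation types depends strongly on the prompt style used.

English abstract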

Figurative language and negation are two areas that challenge current language models, however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

2606.18797 2026-06-18 cs.CL New submission 80%

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

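Affiliations * Nanyang Technological University; Technical University of Munich; Alibaba; University of Glasgow; University of Massachusetts Boston

Topic match Other LLM: LLM-based evaluation metrics for radiology reports

AI summary Motivated by the clinical accuracy required in radiology report evaluation, studies how well LLM-based metrics distinguish clinical errors from harmless variants, identifies a discrimination bias, and trains lightweight metrics on synthetic data that outperform larger models in cost-sensitive deployment.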

Comments Under Review

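Details
AI Chinese abstract

Reliable evaluation of generated radiology reports requires strict clinical accuracy, because omitting critical findings or mischaracterizing radiographic observations directly affects patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although large language models (LLMs) have rich medical knowledge, they also struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate the clinical significance of metrics in two respects: detecting true clinical errors ("Discrimination") and tolerating insignificant variation ("Robustness"). In experiments with 8 LLM evaluators under one-pass and two-pass settings, we find a widespread discrimination bias: models detect errors effectively but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4,000 report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting does not consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest that one-pass trained metrics are the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the discrimination-robustness balance is critical. We will release the dataset and metric.

English abstract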

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using ReEvalMed benchmark as testbed and evaluate metric-level clinical significance from detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where D-R balance is critical. We will release the dataset and metric.

2606.18741 2026-06-18 cs.DC New submission 80%

ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving

Haipeng Yuan, Kaining Zheng, Yongshu Bai, Yuchen Zhang, Yunquan Zhang, Baodong Wu, Xiang Gao, Daning Cheng

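Topic match Other LLM: model-parallelism reconfiguration for LLM inference serving with low downtime.

AI summary Proposes the ReMP framework, which uses techniques such as decoupling topology from runtime state and two-dimensional KV cache migration to adjust the model-parallel topology online in LLM inference serving, cutting reconfiguration downtime from minutes to 1-7 seconds.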

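Details
AI Chinese abstract

Current large language model (LLM) inference systems commonly deploy ultra-large models with a combination of tensor parallelism (TP) and pipeline parallelism (PP). However, existing systems treat the model-parallel topology as a static configuration that cannot be adjusted flexibly at runtime. This rigid design fundamentally conflicts with the dynamically changing inference loads of real-world scenarios. State-of-the-art systems lack online reconfiguration and can switch configurations only by restarting the service, causing several minutes of service interruption, KV cache loss, and high recomputation overhead. To solve this problem, this paper proposes ReMP, a runtime model-parallelism reconfiguration framework with low downtime. ReMP achieves dynamic adjustment through three key techniques: (1) decoupling the model-parallel topology from runtime state to avoid fully rebuilding the service; (2) a two-dimensional KV cache migration mechanism that preserves reusable cache state after TP/PP changes; and (3) end-to-end online reconfiguration. Experiments show that ReMP completes most topology switches within 1-7 seconds on models from 7B to 70B parameters, a speedup of tens to over a hundred times compared with restarting. In addition, under dynamic loads ReMP significantly outperforms fixed configurations in TTFT, TPOT, and output throughput.

English abstract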

Current large language model (LLM) inference systems universally deploy ultra-large-scale models using a combination of Tensor Parallelism (TP) and Pipeline Parallelism (PP). However, existing systems treat the model parallelism topology as a static configuration that cannot be flexibly adjusted at runtime. This rigid design creates a fundamental contradiction with the dynamically changing inference workloads in real-world scenarios. State-of-the-art systems lack online reconfiguration capabilities and can only switch configurations by restarting the service, resulting in several minutes of service interruption, KV cache loss, and prohibitive recomputation overhead. To address this problem, this paper presents ReMP, a runtime model parallelism reconfiguration framework that supports low downtime. ReMP achieves dynamic adjustment through three key techniques: (1) decoupling the model parallelism topology from runtime state to avoid full service reconstruction; (2) designing a two-dimensional KV cache migration mechanism to preserve reusable cache states after TP/PP changes; and (3) implementing end-to-end online reconfiguration. Experiments demonstrate that ReMP can complete most topology switches within 1-7 seconds on models ranging from 7B to 70B parameters, achieving speedups of tens to over a hundred times compared to the restart approach. Moreover, ReMP significantly outperforms fixed configurations under dynamic workloads, delivering superior performance in terms of TTFT, TPOT, and output throughput.

2606.18677 2026-06-18 cs.LG cs.AI New submission 80%

Bounded Context Management for Tabular Foundation Models on Stream Learning

Jinmo Lee, Doyun Choi, Moongi Choi, Jaemin Yoo

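Affiliations * Seoul National University; KAIST

Topic match Other LLM: context management for stream learning with tabular foundation models

AI summary To address distribution drift in tabular stream learning, proposes the context management policy CURE, which manages the context through uncertainty-gated admission and redundancy-aware eviction and achieves up to 27.0% relative improvement across seven streams.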

Comments Accepted as a spotlight oral (top 5%) at the 2nd ICML Workshop on Foundation Models for Structured Data (FMSD@ICML2026)

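Details
AI Chinese abstract

Tabular stream learning requires predicting sequentially arriving samples under distribution drift. While standard methods adapt by updating model state, tabular foundation models (TFMs) predict in context, conditioned on a labeled context, which makes them a natural alternative for stream learning. This shifts the challenge from how to update the model to how to manage the context. We propose a future-information view that yields three practical requirements for context management: preserve recent samples, retain uncertain samples, and remove redundant samples. We instantiate these requirements as CURE (Context management via Uncertainty-aware admission and Redundancy-aware Eviction), a context management policy with entropy-gated admission and redundancy-aware eviction. Across seven streams, CURE achieves up to 27.0% relative improvement over classical stream learners, remains robust across multiple TFM backbones, and ranks first among the policy variants. Code and datasets are available at this https URL.

English abstract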

Tabular stream learning requires predictions on sequentially arriving examples under distribution shift. While standard methods adapt by updating model states, tabular foundation models (TFMs) make predictions conditioned on a labeled context in an in-context manner, making them a natural alternative for stream learning. This shifts the challenge from how to update the model to how to manage the context. We propose a future information view that yields three practical requirements for context management: preserve recent examples, retain uncertain examples, and remove redundant examples. We instantiate these requirements as CURE (Context management via Uncertainty-aware admission and Redundancy aware Eviction), a context-managing policy with entropy-gated admission and redundancy-aware eviction. Across seven streams, CURE shows up to 27.0% relative improvement over classical stream learners, remains robust across multiple TFM backbones, and ranks first among other policy variants. Code and datasets are available at https://github.com/morcellinus/CURE-ICML-FMSD.
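
The three requirements named in the abstract (keep recent, keep uncertain, drop redundant) can be sketched as a bounded buffer policy. This is a guess at the general shape, not the CURE algorithm: the entropy threshold `tau`, the nearest-neighbour distance used as a redundancy measure, and the oldest-first tie-break are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_redundant(context):
    """Index of the example closest to any other example.

    Ties go to the earliest (oldest) index, so recent examples are preserved.
    """
    best_i, best_d = 0, float("inf")
    for i, (xi, _) in enumerate(context):
        d = min(math.dist(xi, xj) for j, (xj, _) in enumerate(context) if j != i)
        if d < best_d:
            best_i, best_d = i, d
    return best_i


def update_context(context, x, y, probs, capacity, tau):
    """Return the context after seeing example (x, y) predicted with probs."""
    if entropy(probs) < tau:
        return context  # confident prediction: the example adds little
    context = context + [(x, y)]  # admit an uncertain example (new list)
    if len(context) > capacity:
        context.pop(most_redundant(context))  # evict to stay within budget
    return context
```

Here `context` is a list of `(features, label)` pairs, with features given as tuples of floats, and `probs` is the model's predicted class distribution for `x` before its label arrived.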

2606.18557 2026-06-18 cs.AI cs.LG cs.LO New submission 80%

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez

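Affiliations * University of Colorado Boulder

Topic match Other LLM: evaluates defeasible abductive reasoning in foundation models

AI summary Proposes the DeFAb benchmark, which converts knowledge bases into verifiable abduction instances to evaluate the creativity and theoretical reasoning of foundation models in defeasible reasoning, and finds that frontier models are far less accurate than a symbolic solver.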

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

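Details
AI Chinese abstract

A rule-based logic solver solves every instance in our benchmark in under 50 microseconds with 100% accuracy, whereas the best frontier language model reaches at most 65% and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formal defeasible abduction instances: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks (valid derivation, conservativity, and minimality), DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS), generating 372,648+ instances from 18 sources over 33.75M materialized rules, in three levels with polynomial-time verifiable gold standards. Four frontier models fail to reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (about 36 percentage points) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 percentage point Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant with 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, with a judge-free verifier; a pilot finds zero novel concepts). The same verifier also serves as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under the MIT license at this https URL.

English abstract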

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

2606.18383 2026-06-18 cs.LG cs.CL New submission 80%

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

Dibyanayan Bandyopadhyay, Asif Ekbal

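Affiliations * Department of Computer Science and Engineering, Indian Institute of Technology Patna

Topic match Other LLM: certifies SAE-based interpretability of language models

AI summary Proposes a post-hoc generalization framework that certifies a language model through a sparse proxy (SAE reconstruction), derives an upper bound on expected risk, and verifies non-vacuous bounds on models such as GPT-2 Small, showing that deeper layers are easier to certify and that the feature decomposition separates semantic alignment from statistical sparsity.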

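Details
AI Chinese abstract

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), but a central question remains: when can an SAE-based explanation be regarded as a faithful view of the underlying frozen LM? We study this question through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk from four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy stays behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence: deeper layers become easier to certify, which is associated with stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

English abstract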

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM? We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

2606.18042 2026-06-18 cs.DC New submission 80%

Latency Prediction for LLM Inference on NPU Systems

Juhyun Park, Seungwoo Jeong, Jingyu Lee, Kyungyong Lee

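Topic match Other LLM: predicts LLM inference latency on NPUs

AI summary Predicting LLM inference latency on NPUs is challenged by undisclosed microarchitectures, unpredictable compiler optimizations, and bucketing-induced nonlinear latency; the paper proposes the LENS latency estimator, which composes two end-to-end measurements per bucket to predict latency for arbitrary input-output length combinations, with a mean prediction error of 2.15%.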

Comments 12 pages, 9 figures

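Details
AI Chinese abstract

Deploying large language models (LLMs) requires exploring a large configuration space that covers parallelization strategies, batching techniques, and scheduling policies. Exhaustive measurement over this space is impractical, so latency prediction is essential for system optimization. Although NPUs have emerged as accelerators designed for LLM inference, no prediction methodology has been established for them. Specifically, applying prior work to LLM inference latency prediction on NPUs faces three challenges: the undisclosed microarchitecture of commercial NPUs, unpredictable compiler optimizations, and latency nonlinearity caused by bucketing. We propose LENS, a latency estimator that predicts NPU inference latency without microarchitecture or compiler information and captures the nonlinear latency caused by bucketing. LENS profiles each bucket with two end-to-end (E2E) measurements and composes the results to predict latency for arbitrary input-output length combinations. We validate LENS on NPUs from multiple vendors, multiple LLMs, and diverse workloads, with a mean prediction error of 2.15%. We further compare LENS with two methodologically related baselines, confirming the validity of its approach.

English abstract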

Deploying Large Language Models (LLMs) requires exploring a large configuration space spanning parallelization strategies, batching techniques, and scheduling policies. Exhaustive measurement across this space is impractical, making latency prediction essential for system optimization. While NPUs have emerged as accelerators designed for LLM inference, no prediction methodology has been established for them. Specifically, applying prior work to LLM inference latency prediction on NPUs faces three challenges: undisclosed microarchitecture of commercial NPUs, unpredictable compiler optimizations, and latency non-linearity induced by bucketing. We present LENS, a latency estimator that predicts NPU inference latency without information on the microarchitecture or compiler, and captures the non-linear latency induced by bucketing. LENS profiles each bucket with two end-to-end (E2E) measurements and composes the results to predict latency for arbitrary input-output length combinations. We validate LENS across NPUs from multiple vendors, several LLMs, and diverse workloads, achieving a mean prediction error of 2.15%. We further compare LENS against two methodologically related baselines, confirming the validity of its approach.
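
One way to picture "two end-to-end measurements per bucket" is a per-bucket linear fit. The abstract does not describe how LENS composes its measurements, so this sketch is an assumption throughout: it treats latency inside one input bucket as a fixed cost plus a constant per-output-token cost, ignores any bucketing on the decode side, and uses hypothetical function names and made-up numbers.

```python
import bisect


def fit_bucket(e2e_short, e2e_long, out_short, out_long):
    """Fit latency = base + per_token * output_len from two E2E runs."""
    per_token = (e2e_long - e2e_short) / (out_long - out_short)
    base = e2e_short - per_token * out_short
    return base, per_token


def predict(buckets, params, input_len, output_len):
    """Predict E2E latency for an arbitrary input/output length pair.

    buckets: sorted upper bounds of the input-length buckets.
    params: one (base, per_token) pair per bucket, from fit_bucket.
    """
    # Inputs are padded up to the smallest bucket that can hold them, so
    # latency is a step function of input length across buckets.
    i = bisect.bisect_left(buckets, input_len)
    if i == len(buckets):
        raise ValueError("input_len exceeds the largest bucket")
    base, per_token = params[i]
    return base + per_token * output_len
```

With buckets `[128, 256]`, a 100-token and a 128-token input share the same parameters, while a 129-token input jumps to the next bucket; that jump is the kind of non-linearity a single smooth regression over input length would miss.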

2606.12629 2026-06-18 cs.LG cs.AI 新提交 80%

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Bag of Dims:通过维度级符号模式实现无需训练的机制可解释性

Varun Reddy Nalagatla

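Affiliation * Amazon Web Services

Topic match (Other LLM): a training-free mechanistic interpretability method for transformers.

AI summary: Proposes the Bag of Dims framework, arguing that the standard basis of transformer hidden states already serves as a training-free feature basis in which per-dimension sign patterns encode semantics, and validates it across seven models spanning language, vision, and audio.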

Comments 22 pages, 5 figures, 27 tables

AI中文摘要

我们表明,Transformer隐藏状态的标准基已经提供了一个无需训练、架构通用的特征基。单个维度通过其符号编码语义内容,通过其幅度编码置信度,充当独立的二进制寄存器。我们通过四个渐进实验在三个模型家族(Qwen 3.5-4B、Gemma 3-4B、Mistral 7B)上验证了这种Bag of Dims框架。仅符号模式就携带预测性内容:将所有幅度替换为1,通过LM头实现72-93%的top-5下一个token准确率,而无需任何解码器的纯汉明评分达到80-90%的top-4096准确率。这些符号模式组织成语义特征:使用单token类型缓存(每个词汇token一次前向传播,无上下文),我们通过每维度符号一致性(平均AUC 0.80)从50个锚点发现了175个类别,无需任何训练。一个训练过的探针仅增加+0.018 AUC并收敛到轴对齐的权重,证实了可忽略的跨维度结构。这种结构扩展到注意力:所有175个类别在K和V投影中仍然可发现。在写入端,静态FFN权重检查将20%的特征与单个写入神经元联系起来(一致性>0.70;随机对照:0%),通过多数投票,top-200神经元联盟在99.9%的原型上实现>0.70的一致性。完全无监督的发现(随机种子,无标签)在所有三个模型上扩展到1500个特征,产量100%,稀疏度99%,成对互信息为0.0014比特,证实了低维度间耦合。这些结果确立了标准基已经足以在整个Transformer计算路径中进行特征读取,无需训练、无需优化,且每个词汇token仅需一次前向传播,无需GPU天数。

英文摘要

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
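Reading a feature "by counting sign agreements with no learned rotation" can be sketched in a few lines. The hidden vector, the two feature patterns, and the function names below are invented for illustration; the paper's actual scoring and thresholds may differ.

```python
def sign(x):
    """Map a real value to +1 or -1 (zero is treated as positive here)."""
    return 1 if x >= 0 else -1

def sign_agreement(hidden, feature):
    """Fraction of the feature's dimensions whose sign in `hidden` matches.

    feature: dict mapping dimension index -> expected sign (+1 or -1).
    Magnitudes are ignored, so this is a Hamming-style score on signs only.
    """
    matches = sum(1 for dim, expected in feature.items()
                  if sign(hidden[dim]) == expected)
    return matches / len(feature)

# Toy hidden state and two hypothetical features over subsets of its dimensions.
hidden = [0.7, -1.2, 0.1, -0.3, 2.0]
feature_a = {0: +1, 1: -1, 3: -1}   # agrees on every listed dimension
feature_b = {0: -1, 2: +1, 4: -1}   # agrees on one of three

print(sign_agreement(hidden, feature_a), sign_agreement(hidden, feature_b))
```

Because the score uses only the standard basis, nothing is trained or rotated; the only inputs are the hidden state and a stored sign pattern.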

2606.08532 2026-06-18 cs.AI 新提交 80%

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

DN-Hypo-Pipeline:一种基于大语言模型和科学解释的AI驱动假设生成工作流

Lei Lin, Ronghao Wang, Chunbao Zhou, Jue Wang, Yangang Wang

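Affiliation * Computer Network Information Center, Chinese Academy of Sciences, China

Topic match (Other LLM): an LLM-driven hypothesis generation workflow.

AI summary: Proposes DN-Hypo-Pipeline, which uses large language models with scientific explanations as prior knowledge to derive new hypotheses from existing literature. In data science modeling, statistical inference backed by LLM-as-judge and human expert evaluation shows that it outperforms direct generation, and algorithms built from the top-scoring generated hypotheses outperform the baselines of the original papers.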

AI中文摘要

科学假设是研究的第一步并经过实验验证,但它也反映了对科学现象的深刻理解和推理。我们引入了DN-Hypo-Pipeline,一种基于大语言模型的AI驱动工作流,旨在通过利用科学解释作为先验知识来支持结构化科学思维和假设生成。该流水线帮助研究人员从现有文献中推导出新假设。给定研究论文的解释项(即结论),它识别潜在的定律、理论和原理,并为观察到的现象重构一个新的、尚未验证的解释。我们在数据科学建模领域使用三篇高被引论文评估了DN-Hypo-Pipeline。由LLM作为评判者和人类专家评估支持的统计推断表明,我们的流水线比直接生成方法更有效。此外,我们通过开发相应新颖算法验证了得分最高的两个生成假设,这些算法优于原始论文中提出的基线模型。除了在数据科学中的应用,DN-Hypo-Pipeline还提供了一个理论框架,不仅包含了理论指导的数据科学建模方法,还揭示了建模过程更基础的结构。此外,这种方法本质上是理论指导建模的推广,具有扩展到其他领域和更广泛科学学科的潜力。

英文摘要

A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of and reasoning about scientific phenomena. We introduce DN-Hypo-Pipeline, an AI-powered workflow based on large language models, designed to support structured scientific thinking and hypothesis generation by leveraging scientific explanations as prior knowledge. This pipeline assists researchers in deriving novel hypotheses from existing literature. Given the explanandum (i.e., the conclusion) of a research paper, it identifies underlying laws, theories, and principles, and reconstructs a new, yet-to-be-verified explanation for the observed phenomenon. We evaluated DN-Hypo-Pipeline in the field of data science modeling using three highly cited papers. Statistical inference, supported by both LLM-as-judge assessment and human expert evaluation, demonstrates that our pipeline is more effective than direct generation methods. Additionally, we validated the two highest-scoring generated hypotheses by developing corresponding novel algorithms, which outperformed the baseline models presented in the original papers. Beyond application in data science, DN-Hypo-Pipeline provides a theoretical framework that not only encompasses theory-guided data science modeling methods but also reveals a more fundamental structure of the modeling process. Moreover, this approach is essentially a generalization of theory-guided modeling, offering potential for extension to other domains and across a broader range of scientific disciplines.

2606.18829 2026-06-18 cs.LG cs.CL 新提交 75%

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

GateMem:多主体共享内存代理中的内存治理基准

Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Affiliations * School of Artificial Intelligence, Jilin University; Shanghai Jiao Tong University; King Abdullah University of Science and Technology (KAUST); Tsinghua University; National University of Singapore

Topic match (Other LLM): evaluates memory governance in multi-principal shared-memory LLM agents.

AI summary: Proposes the GateMem benchmark, which evaluates multi-principal shared-memory agents on utility, access control, and forgetting, and finds that no existing method satisfies all three at once.

Comments 24 pages, 8 figures. Code and dataset are available at https://github.com/rzhub/GateMem and https://huggingface.co/datasets/Ray368/GateMem

AI中文摘要

LLM代理的内存基准主要假设单用户设置,而医院、工作场所、校园和家庭中的共享助手研究不足。在这些部署中,多个主体写入公共内存池并根据不同角色、范围和关系进行查询,因此内存质量需要治理和召回。我们引入GateMem,一个多主体共享内存代理的基准。GateMem联合评估合法长期请求的效用(含状态更新)、跨上下文授权边界的访问控制,以及显式删除请求后的主动遗忘。它涵盖医疗、办公、教育和家庭领域,包含长形式多方情节、增量内存注入、隐藏检查点、结构化评判和泄漏目标注释。在多种基线和骨干模型上,没有方法能同时实现强效用、鲁棒访问控制和可靠遗忘。长上下文提示通常以高令牌成本获得最佳治理分数,而基于检索和外部内存的方法降低成本但仍泄漏未授权或已删除信息。这些结果表明,当前内存代理远未达到可靠的共享机构部署水平。

英文摘要

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.

2606.18389 2026-06-18 cs.CL 新提交 75%

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

想要更好的合成数据?引导它:面向低资源语言生成的激活引导

Jan Cegin, Daniil Gurgurov, Yusser Al Ghussin, Simon Ostermann

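Affiliations * Kempelen Institute of Intelligent Technologies; German Research Institute for Artificial Intelligence (DFKI)

Topic match (Other LLM): activation steering for synthetic data generation in low-resource languages.

AI summary: Proposes activation steering, in the form of Language Steering and Quality Steering, as an alternative for low-resource synthetic data generation. Experiments show that steering early layers improves data diversity and often downstream model performance.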

Comments 25 pages

AI中文摘要

大型语言模型(LLMs)已成为合成数据生成的有效工具,包括低资源语言,生成的数据可以提升下游任务性能。当前最佳方法通常依赖于目标语言示例的少样本提示,这增加了推理成本,并可能通过词汇锚定降低多样性。在这项工作中,我们研究激活引导作为低资源合成数据生成的替代方案。我们研究了两种引导策略:语言引导,针对语言的 linguistic identity;以及质量引导,通过对比人类撰写和反向翻译的文本表示来捕捉良好形式性。我们在四个开源LLM、多个层和11种类型多样的语言上评估这些方法,通过生成情感和主题分类数据并微调较小的分类器。引导在零样本和少样本提示设置中应用,并与非引导对应方法进行比较。我们的结果表明,早期层的引导一致地提高了生成数据的多样性,同时通常产生更强的下游模型性能,特别是对于低资源语言。

英文摘要

Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.

2606.18304 2026-06-18 cs.LG cs.AI 新提交 75%

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

基于归因引导和覆盖最大化的结构MoE剪枝

Yifu Ding, Jiacheng Wang, Ge Yang, Yongcheng Jing, Jinyang Guo, Xianglong Liu, Dacheng Tao

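Affiliations * School of Computer Science and Engineering, Beihang University; School of Artificial Intelligence, Beihang University; Nanyang Technological University

Topic match (Other LLM): structural pruning for MoE models, under LLM compression and deployment.

AI summary: Expert-level pruning of MoE models is too coarse to identify fine-grained redundancy. The paper proposes an attribution-guided, coverage-maximized structural pruning framework that recasts prune-ratio allocation as a channel-score coverage optimization problem, preserves accuracy at 50% pruning combined with 4-bit quantization, and reduces memory footprint by 5.27× on Qwen3-30B-A3B.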

Comments 9 pages, 5 figures. Submitted to ICML 2026

AI中文摘要

混合专家(MoE)模型在计算上高效扩展,但由于其巨大的内存占用和推理开销,部署成本仍然很高。先前的压缩方法主要在专家级别操作,要么移除整个专家,要么通过粗粒度的重要性分数对专家进行排序。然而,这种专家级别的决策通常过于粗糙,无法捕捉细粒度的冗余,导致剪枝预算分配不当和压缩效果有限。为了解决这个问题,我们观察到MoE专家内的信息高度集中在一小部分通道中,即使在被认为重要的专家中也存在大量冗余。基于这一观察,我们提出了一种针对MoE模型量身定制的结构剪枝框架。我们的方法将剪枝比例分配重新表述为通道分数覆盖最大化问题,并使用基于归因的近似方法高效求解。在DeepSeek和Qwen MoE模型上的实验表明,我们的方法在结合4位量化时,在50%或25%的结构化剪枝下仍能保持模型精度。在Qwen3-30B-A3B上,我们的方法将内存占用减少了5.27倍,并在各种基准测试中持续优于最先进的基线方法。

英文摘要

Mixture-of-Experts (MoE) models scale compute efficiently, yet remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior compression methods mainly operate at the expert level, either removing entire experts or ranking experts by coarse-grained importance scores. However, such expert-wise decisions are often too coarse to capture fine-grained redundancy, leading to misallocated pruning budgets and limited compression. To address this problem, we observe that information within MoE experts is highly concentrated in a small subset of channels, leaving substantial redundancy even in experts deemed important. Based on this observation, we propose a structural pruning framework tailored for MoE models. Our method reformulates prune-ratio allocation as a channel-score coverage maximization problem and solves it efficiently using an attribution-based approximation. Experiments on DeepSeek and Qwen MoE models show that our method preserves model accuracy under 50% or 25% structured pruning when combined with 4-bit quantization. On Qwen3-30B-A3B, our approach reduces memory footprint by 5.27× and consistently outperforms state-of-the-art baselines across diverse benchmarks.
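To make the notion of allocating prune ratios by channel-score coverage more concrete, here is a deliberately simplified sketch. It assumes each channel already has a non-negative importance score and that the goal is to keep a fixed total number of channels with the largest summed score, which a global top-k selection achieves. The scores, the budget, and the function names are hypothetical; the paper's attribution scores and its actual optimization are not reproduced here.

```python
def allocate_kept_channels(scores, keep_total):
    """Keep the keep_total highest-scoring channels across all experts.

    scores: {expert_name: [score per channel]}.
    Returns {expert_name: sorted list of kept channel indices}.
    """
    ranked = sorted(
        ((s, expert, ch)
         for expert, chans in scores.items()
         for ch, s in enumerate(chans)),
        key=lambda item: item[0],
        reverse=True,
    )
    kept = {expert: [] for expert in scores}
    for _, expert, ch in ranked[:keep_total]:
        kept[expert].append(ch)
    return {expert: sorted(chs) for expert, chs in kept.items()}

def prune_ratios(scores, kept):
    """Per-expert fraction of channels removed."""
    return {e: 1 - len(kept[e]) / len(scores[e]) for e in scores}

# Toy scores: expert e0 concentrates its score in one channel, e1 spreads it out.
scores = {"e0": [0.9, 0.05, 0.03, 0.02],
          "e1": [0.4, 0.3, 0.2, 0.1]}
kept = allocate_kept_channels(scores, keep_total=4)   # 50% overall pruning
print(kept, prune_ratios(scores, kept))
```

In this toy case the overall pruning rate is 50%, but the expert with concentrated scores loses far more channels than the other, which is the kind of non-uniform budget that expert-level decisions cannot express.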

2606.18105 2026-06-18 cs.NI cs.LG 新提交 75%

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

OmniPlan:一种用于及时且近乎最优的网络规划优化的自适应框架

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

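Affiliations * Zhejiang University; Fuzhou University; Yangzhou University; The State Key Laboratory of Blockchain and Data Security; College of Computer Science and Technology

Topic match (Other LLM): uses an LLM to parse user intents for network planning.

AI summary: Proposes OmniPlan, an adaptive framework that uses a large language model to interpret user intents and a mixture-of-experts architecture to select dynamically among MIP solvers, heuristics, and deep reinforcement learning models, achieving timely and near-optimal network planning. On distributed ML inference offloading it reduces latency by up to 97.8% and resource consumption by up to 11.5%.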

Comments Accepted by ACM KDD 2026

AI中文摘要

网络规划优化是跨多个领域(包括交通系统、通信网络和电网)的基本问题。它需要在复杂约束下同时优化多个相互竞争的目标。现有的网络规划优化框架依赖混合整数规划(MIP)求解器、启发式算法和深度强化学习(DRL)模型来计算规划决策。然而,它们缺乏对多样化和动态用户意图的有效适应性,从而导致执行时间与最优性之间的权衡。在本文中,我们提出OmniPlan,一种自适应框架,在网络规划优化中同时实现及时性和近乎最优性。为了实现现有解决方案所缺乏的适应性,OmniPlan采用基于大语言模型(LLM)的解释器,将异构的自然语言意图转换为统一且可量化的用户偏好向量。然后,它采用混合专家架构,集成MIP求解器、启发式算法和DRL模型作为专门专家,OmniPlan通过动态选择及时且近乎最优的专家来适应多样化的意图。最后,它包含一个基于DRL的专家配置模块,该模块微调优化目标权重,使规划决策与用户特定偏好对齐。我们使用代表性的真实工作负载(即分布式机器学习(ML))评估OmniPlan,其中我们利用OmniPlan将广泛的ML推理任务(例如决策树、SVM、朴素贝叶斯、XGBoost和随机森林)卸载到硬件设备网络。我们在真实测试平台上的实验表明,OmniPlan为真实ML推理任务实现了近乎最优且低执行时间的卸载,延迟降低高达97.8%,网络设备资源消耗降低高达11.5%。

英文摘要

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, which forces a trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.

2606.17276 2026-06-18 cs.IR cs.LG 新提交 75%

On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

LLM在生成式推荐中的记忆行为:观察、启示与训练策略

Sunwoo Kim, Sunkyung Lee, Clark Mingxuan Ju, Donald Loveland, Bhuvesh Kumar, Kijung Shin, Neil Shah, Liam Collins

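Affiliations * KAIST; Sungkyunkwan University; Snap Inc.

Topic match (Other LLM): studies the memorization behavior of LLMs in generative recommendation.

AI summary: Studies the memorization tendency of LLMs in generative recommendation, finds that they rely heavily on one-hop memorization, and proposes the IIRG training strategy for learning multi-hop collaborative and semantic relations, which markedly improves recommendations for users not covered by one-hop memorization.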

AI中文摘要

生成式推荐(GR)已成为推荐系统的一个有前景的方向。最近,大型语言模型(LLM)越来越多地被用于GR,因为其丰富的预训练知识有望帮助它们泛化到传统以记忆为导向的基线所能捕捉的常见用户行为模式之外。然而,现有的基于LLM的GR工作很大程度上忽略了LLM众所周知的记忆倾向,如果这种倾向存在于为GR微调的LLM中,将限制它们对预训练知识的利用。在这项工作中,我们通过检查一跳记忆(即模型推荐训练数据中项目的直接后继项目)来研究这一担忧。我们表明,LLM比非LLM的GR模型更频繁地这样做——事实上,它们相对于GR基线的大部分增益实际上来自那些目标项目可以通过一跳记忆预测的用户。我们直觉认为,提高剩余用户的性能需要LLM学习更丰富的项目-项目关系,超越一跳转换。为此,我们提出了IIRG,一种新颖的训练策略,教导LLM捕获:(1)从用户序列中跨多跳的项目共现导出的协同关系,以及(2)具有相似主题的项目之间的语义关系,这两者都可以作为有用的推荐信号。我们表明,IIRG显著优于仅使用标准下一项目预测训练的LLM,尤其是对于那些测试项目在训练时的一跳转换中未覆盖的用户,增益尤为显著。

英文摘要

Generative recommendation (GR) has emerged as a promising direction for recommender systems. Recently, large language models (LLMs) have been increasingly adopted for GR, as their rich pretrained knowledge is expected to help them generalize beyond common user behavior patterns that traditional memorization-oriented baselines can capture. However, existing LLM-based GR works largely ignore LLMs' well-known tendency to memorize, which, if present in LLMs fine-tuned for GR, would restrict their utilization of pretrained knowledge. In this work, we investigate this concern by examining one-hop memorization, where a model recommends items that are direct successors of items in the training data. We show that LLMs do this more than non-LLM-based GR models; in fact, the vast majority of their gains over GR baselines are actually on users whose target items can be predicted through one-hop memorization. We intuit that improving performance on the remaining users requires LLMs to learn richer item-item relations beyond one-hop transitions. To achieve this, we propose IIRG, a novel training strategy that teaches LLMs to capture: (1) collaborative relations derived from item co-occurrences across multiple hops in user sequences, and (2) semantic relations among items with similar themes, both of which can serve as useful recommendation signals. We show that IIRG significantly improves over LLMs trained solely with standard next-item prediction, with especially large gains for users whose test items are not covered by train-time one-hop transitions.
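The one-hop memorization check described above can be sketched as a lookup over train-time transitions. The sketch assumes a user counts as "one-hop predictable" when the target item directly follows some item of that user's history somewhere in the training sequences; the paper may use a narrower definition (for example, only the most recent item). The item IDs and function names are made up.

```python
def one_hop_successors(train_sequences):
    """Map each item to the set of items that directly follow it in training data."""
    successors = {}
    for seq in train_sequences:
        for current_item, next_item in zip(seq, seq[1:]):
            successors.setdefault(current_item, set()).add(next_item)
    return successors

def is_one_hop_predictable(history, target, successors):
    """True if `target` is a direct train-time successor of any item in `history`."""
    return any(target in successors.get(item, ()) for item in history)

# Toy interaction sequences (item IDs are placeholders).
train = [["a", "b", "c"], ["b", "d"]]
succ = one_hop_successors(train)

print(is_one_hop_predictable(["x", "b"], "d", succ))  # b -> d was seen in training
print(is_one_hop_predictable(["a"], "c", succ))       # a -> c needs two hops
```

Splitting test users by this flag is one way to separate gains that come from memorized transitions from gains on the remaining users.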

2606.19317 2026-06-18 cs.LG cs.AI 新提交 70%

Explaining Attention with Program Synthesis

用程序合成解释注意力机制

Amiri Hayes, Belinda Li, Jacob Andreas

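Affiliations * NJIT; MIT; MIT CSAIL

Topic match (Other LLM): explains attention heads with program synthesis.

AI summary: Proposes approximating the behavior of deep network components with executable programs. For transformer attention heads, it generates Python programs that reproduce their attention patterns, yielding interpretable descriptions.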

AI中文摘要

可解释深度学习研究的一个长期目标是,用人类可理解的符号描述取代不透明的神经计算。本文提出了一种用可执行程序近似深度网络组件行为的方法。我们专注于Transformer语言模型中的注意力头。对于给定的注意力头,我们首先在一组随机选择的训练样本上计算其关联的注意力矩阵。接着,我们向预训练语言模型提供这些矩阵的摘要,并指示它生成一组Python程序,这些程序仅根据输入句子中的文本即可再现相关的注意力模式。最后,我们根据最终程序集在保留输入上预测行为的效果对程序进行重新排序。我们证明,少于1000个这样的生成程序即可再现GPT-2、TinyLlama-1.1B和Llama-3B中注意力头的注意力模式,在TinyStories上平均交并比相似度超过75%。此外,最佳匹配程序可以替代神经注意力头而不会显著影响模型行为:在三个模型中用程序替代25%的注意力头仅导致平均困惑度增加16%,同时在各种下游问答基准上保持性能。这项工作为使用人类可读、可执行的代码逆向工程Transformer模型中的注意力头提供了一个可扩展的流程,推动了神经模型向符号透明性的发展。

英文摘要

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predict behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
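As an illustration of scoring a text-only program against a head's observed attention, the sketch below compares a hand-written "attend to the previous token" program with a toy attention matrix using Intersection-over-Union. The program, the 0.5 binarization threshold, and the attention values are all invented; the paper's generated programs and its exact IoU definition may differ.

```python
def previous_token_program(tokens):
    """Hypothetical program: each token attends to the token before it.

    Returns a set of (query_index, key_index) pairs; the first token attends to itself.
    """
    pairs = {(i, i - 1) for i in range(1, len(tokens))}
    pairs.add((0, 0))
    return pairs

def binarize(attention, threshold=0.5):
    """Turn an attention matrix into the set of (query, key) pairs above threshold."""
    return {(i, j)
            for i, row in enumerate(attention)
            for j, weight in enumerate(row)
            if weight >= threshold}

def iou(predicted, observed):
    """Intersection-over-Union of two sets of attention edges."""
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 1.0

tokens = ["The", "cat", "sat"]
attention = [[1.0, 0.0, 0.0],
             [0.9, 0.1, 0.0],
             [0.2, 0.2, 0.6]]   # the last token mostly attends to itself

score = iou(previous_token_program(tokens), binarize(attention))
print(score)
```

Ranking many candidate programs by such a score on held-out sentences mirrors the re-ranking step the abstract describes.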

2606.19264 2026-06-18 cs.LG cs.CL 新提交 70%

Structured Inference with Large Language Gibbs

大语言吉布斯结构化推理

Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

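Affiliation * University of Edinburgh, School of Informatics

Topic match (Other LLM): structured probabilistic inference using LLM conditional distributions.

AI summary: Proposes Large Language Gibbs, which uses an LLM's conditional distributions as transition operators for structured probabilistic inference, iteratively resampling variables to avoid order-dependent bias, and evaluates it on synthetic distributions, consistent reasoning, and Bayesian structure learning.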

Comments Code: https://github.com/hyeok9855/large-language-gibbs

AI中文摘要

大型语言模型(LLMs)中编码的知识可以作为描述复杂世界变量的结构化推理的基础,但以概率一致的方式访问这些知识构成了一个困难的推理问题。我们提出了大语言吉布斯,一种结构化概率推理方案,它使用LLM的条件分布作为转移算子。不是通过单次自回归生成来采样结构化对象,而是利用LLM的下一个标记条件分布,在给定其他变量的条件下迭代地重采样单个变量。这种方法避免了顺序依赖偏差,并产生一个反映所有局部条件分布之间折衷的平稳分布。我们将这种方法应用于从合成分布中采样、一致性推理任务和贝叶斯结构学习。结果表明,在通过噪声LLM条件分布可访问的世界先验下,MCMC中使用LLM条件分布是用于结构化概率推理的一次性生成的实际替代方案。

英文摘要

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.
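The resampling loop at the heart of this scheme is ordinary Gibbs sampling with the conditionals supplied by a model. The sketch below uses a hard-coded toy conditional as a stand-in for an LLM's next-token distribution; in the paper's setting that function would query a language model. The variable names, the probabilities, and the function names are assumptions made for illustration.

```python
import random

def gibbs(init, conditional, n_sweeps, rng):
    """Repeatedly resample each variable from conditional(name, state).

    conditional returns a dict {value: probability} for variable `name`
    given the current values of all other variables in `state`.
    """
    state = dict(init)
    samples = []
    for _ in range(n_sweeps):
        for name in state:
            dist = conditional(name, state)
            values, probs = zip(*dist.items())
            state[name] = rng.choices(values, weights=probs, k=1)[0]
        samples.append(dict(state))
    return samples

def toy_conditional(name, state):
    """Stand-in for an LLM conditional: 'rain' and 'wet_grass' tend to agree."""
    other = state["wet_grass"] if name == "rain" else state["rain"]
    return {other: 0.9, (not other): 0.1}

samples = gibbs({"rain": True, "wet_grass": False},
                toy_conditional, n_sweeps=2000, rng=random.Random(0))
agreement = sum(s["rain"] == s["wet_grass"] for s in samples) / len(samples)
print(round(agreement, 2))
```

Because every variable is resampled given all the others, no variable is privileged by appearing first in a prompt, which is how the approach sidesteps the order dependence of single-pass generation.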

2606.19218 2026-06-18 cs.CL 新提交 70%

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

RECOM:开放式 Reddit 问答中自动评估指标的有效性与区分性权衡

Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

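Affiliations * University of Alabama Huntsville; University of North Alabama; Stanford University; Meta AI; Amazon GenAI

Topic match (Other LLM): automatic metrics for evaluating LLM-generated text, an LLM application.

AI summary: Introduces the RECOM dataset and finds that automatic metrics for open-ended question answering cannot deliver validity and discriminative power at the same time: cosine similarity has high validity but poor discrimination, while BERTScore's discrimination is confounded by response length and its validity is weak.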

AI中文摘要

自动评估指标是评估 LLM 生成文本的默认方法,但一个指标被默默要求完成两项任务:区分真实内容对齐与表面巧合(有效性),以及区分更好的系统与更差的系统(区分性)。在开放式、观点驱动的问答中,这两者存在矛盾。我们引入了 RECOM(Reddit Evaluation for Correspondence of Models),一个无污染评估数据集,包含 15,000 个 r/AskReddit 问题(2025 年 9 月),每个问题都配有真实的社区回复,这些回复的发布时间晚于所有被评估模型的训练截止日期。我们用每条社区回复为五个开源 LLM(7-10B)的输出打分,并为每个指标配上随机错位排列的噪声基线,结果发现没有指标能同时做好这两项工作。余弦相似度能很好地区分真实回答与随机回答(Cohen's d ≈ 2),但无法对五个模型进行排序(|d| < 0.1);BERTScore 精确度看似能对模型排序(原始 |d| 高达 0.63),但一旦控制回复长度,这一数值骤降至 |d| = 0.09,且其有效性较弱(d ≈ 0.8,而余弦相似度约为 2)。由于每个指标对相同的输出进行评分,这种有效性与区分性的权衡是指标的属性,而非模型的属性,我们认为这源于表示设计。三个独立的 LLM 评判员再现了有效性差距,同样只能微弱地区分五个模型。我们建议在两个轴上报告指标,并明确给出随机基线。RECOM 已公开提供:https://anonymous.4open.science/r/recom-D4B0

英文摘要

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7-10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's d ≈ 2) but cannot rank the five models (|d| < 0.1); BERTScore precision appears to rank the models (raw |d| up to 0.63), but once response length is controlled this collapses to |d| = 0.09, and its validity is weak (d ≈ 0.8, versus cosine's ≈ 2). Because every metric scores the same outputs, this validity-discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0
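Two ingredients of this analysis are easy to sketch: Cohen's d as the effect size, and a random derangement that pairs every question with a reply from a different question to form the noise floor. The sketch uses the common pooled-standard-deviation form of Cohen's d and rejection sampling for the derangement; whether the paper uses exactly these variants is an assumption, and the sample scores are made up.

```python
import random
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def derangement(n, rng):
    """Random permutation of range(n) with no fixed points (rejection sampling)."""
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm

# Made-up metric scores: matched question-reply pairs vs. deranged (mismatched) pairs.
matched_scores = [2, 4, 6]
deranged_scores = [1, 3, 5]
print(cohens_d(matched_scores, deranged_scores))

# Index i of the result gives the question whose reply is paired with question i.
pairing = derangement(5, random.Random(0))
```

A large d between matched and deranged pairs corresponds to what the abstract calls validity, while d between two models' scores corresponds to discriminative power.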

2606.19172 2026-06-18 cs.AI 新提交 70%

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

用户作为印迹:将每用户记忆内化为局部参数编辑

Bojie Li

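Affiliation * Pine AI

Topic match (Other LLM): internalizes user memory as parametric edits, under LLM personalization.

AI summary: Proposes User as Engram, which stores user facts as local edits to the hash-keyed memory table of an Engram model and carries reasoning skill in one shared adapter, achieving high indirect-reasoning accuracy with a very small memory footprint.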

AI中文摘要

语言模型中的个人记忆涉及两个问题:内容和推理技能。大脑将两者分开(每个情节在海马体中有一个稀疏的局部印迹,解释它的共享技能在缓慢的新皮层中),因此新事实不必覆盖其他一切。如今大多数个性化方法将用户事实保存在权重之外,存储在自然语言记忆文件或检索索引中。当事实被写入模型时,标准方法是每用户的LoRA适配器,这与大脑相反,将内容和技能折叠成一个全局权重增量。将用户事实写为LoRA会污染与它们无关的文本;将相同事实写为局部Engram行则数学上保持不变,导致内存占用大约减少33,000倍。因此,我们提出User as Engram:将用户内容存储为对Engram模型的哈希键控记忆表的手术式编辑,并将推理技能携带在一个共享适配器中。这种分层设计匹配了每用户LoRA的直接召回,同时平均提供5.6倍更高的间接推理准确性,并且从未使单个用户在推理方面比未触及的基座更差。编辑是一个玻璃盒:写入一个事实会在精确触发时打开其查找,添加答案所需的值,保持其他每个位置不变到最后一位,如果写入错误层则失败。由于不同用户的事实落在不相交的哈希槽中,它们的编辑可组合:许多用户同时共享一个表,可加性且无损地堆叠,而每用户LoRA(一个全局权重增量)只允许一个。在检索时,每用户Engram表不会随着检索器必须搜索的群体增长,因此在大约100个事实后,它超越了在2.5倍更大模型上的检索流水线。

英文摘要

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.
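The claim that edits from different users compose because they land in disjoint hash slots can be pictured with a toy hash-keyed table. This is only a data-structure sketch: the key format (user plus trigger text), the table size, the vector width, and the class name are all hypothetical, and it says nothing about how an actual Engram model keys or reads its memory.

```python
import hashlib

TABLE_SIZE = 2 ** 32   # hypothetical number of slots
DIM = 4                # hypothetical width of each stored vector

def slot(user, trigger):
    """Hash a (user, trigger) pair to a table slot."""
    digest = hashlib.sha256(f"{user}|{trigger}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE

class MemoryTable:
    """Sparse table: rows that were never written read back as zeros."""

    def __init__(self):
        self.rows = {}

    def write(self, user, trigger, value):
        """Add `value` to one row; every other row is left untouched."""
        s = slot(user, trigger)
        row = self.rows.get(s, [0.0] * DIM)
        self.rows[s] = [r + v for r, v in zip(row, value)]

    def lookup(self, user, trigger):
        return self.rows.get(slot(user, trigger), [0.0] * DIM)

table = MemoryTable()
table.write("alice", "favorite tea", [1.0, 0.0, 0.0, 0.0])
table.write("bob", "home city", [0.0, 2.0, 0.0, 0.0])

print(table.lookup("alice", "favorite tea"))
print(table.lookup("carol", "pet name"))   # never written
```

Each write touches a single row, so as long as two users' keys hash to different slots their edits stack in one shared table without interfering, in contrast to a single global weight delta.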

2606.18851 2026-06-18 eess.SY cs.SY 新提交 70%

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

从令牌到能量灵活性:面向LLM推理工作负载的数据中心量化使能需求响应

Bojun Du, Xiaoyi Fan, Ershun Du, Long Chen, Jianpei Han, Qingchun Hou, Ning Zhang, Chongqing Kang

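Topic match (Other LLM): demand response for data centers serving LLM inference, managed through quantization.

AI summary: Proposes a quantization-enabled energy management framework that builds a quantization-to-power model and a two-stage demand response model and co-optimizes across multiple campuses, reducing data-center operating cost by 34.3%.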

Comments 10 pages, 7 figures

AI中文摘要

大型语言模型(LLM)推理的快速增长正在造成显著的数据中心负载,在日益紧张的电网条件和需求响应(DR)要求下,这些负载面临着越来越多的能量管理挑战。传统的数据中心能量管理主要依赖于时间和空间上的工作负载转移以及园区级能量资产调度,但通常将LLM推理需求视为聚合负载。因此,这些方法未能利用LLM服务的内部特性,从而忽视了模型量化等LLM特定技术所提供的灵活性。为了释放这种灵活性,本文提出了一种面向电网响应型LLM推理数据中心的量化使能能量管理框架。首先,建立了一个量化-功率模型,将每个模型-量化配置映射到一个紧凑的可调度参数集。其次,开发了一个两阶段量化使能的需求响应模型,以考虑模型实例切换、请求路由和精度选择。第三,引入了一种多园区协同优化方法,通过将电网侧电力和碳信号与量化使能的需求响应模型相结合,参与需求响应。案例研究表明,所提出的框架在不减少服务令牌量的情况下,将数据中心总运营成本降低了34.3%,验证了模型量化作为电网响应型LLM数据中心能量管理的有效灵活性杠杆。

英文摘要

The rapid growth of large language model (LLM) inference is creating significant data-center loads that face increasing energy-management challenges under tightening grid conditions and demand response (DR) requirements. Conventional data-center energy management mainly relies on temporal and spatial workload shifting and campus-level energy asset scheduling, but it usually treats LLM inference demand as an aggregate load. As a result, these approaches fail to exploit the internal characteristics of LLM serving and therefore overlook the flexibility offered by LLM-specific techniques such as model quantization. To unlock this flexibility, this paper proposes a quantization-enabled energy management framework for grid-responsive LLM inference data centers. First, a quantization-to-power model is established to map each model-quantization configuration to a compact set of dispatchable parameters. Second, a two-stage quantization-enabled DR model is developed to account for model instance switching, request routing, and precision selection. Third, a multi-campus co-optimization method is introduced for DR participation by integrating grid-side electricity and carbon signals with the quantization-enabled DR model. Case studies show that the proposed framework reduces total data-center operating cost by 34.3% without curtailing served token volume, validating model quantization as an effective flexibility lever for grid-responsive LLM data-center energy management.

2606.18832 2026-06-18 cs.LG cs.AI 新提交 70%

Target-confidence Recourse Using tSeTlin machines: TRUST

使用Tsetlin机器的目标置信度追索:TRUST

K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats

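Affiliation * Group Research and Development, Det Norske Veritas (DNV)

Topic match (Other LLM): proposes the TRUST framework, which generates counterfactual explanations with probabilistic Tsetlin machines.

AI summary: Proposes the TRUST framework, which uses a probabilistic Tsetlin machine and Bayesian optimization to search directly for the minimal input change that meets a user-specified confidence target, producing more robust and interpretable counterfactual explanations.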

AI中文摘要

反事实解释被广泛用于高风险决策系统中的算法追索。大多数现有方法寻求最小化改变输入以翻转模型决策。然而,决策者通常不仅依赖预测标签,还依赖置信度阈值和风险边际。刚好越过决策边界的反事实在噪声或模型变化下可能脆弱且不稳定。本文提出使用Tsetlin机器的目标置信度追索(TRUST),一种用户明确指定追索所需预测置信度的框架。TRUST不是先生成反事实再评估置信度,而是直接搜索满足用户定义置信度目标的最小变化,从而在成本、置信度和鲁棒性方面比较追索选项。我们使用概率Tsetlin机器(PTM)结合贝叶斯优化实例化TRUST。PTM基于概率子句的结构将预测置信度与决策规则的稳定性联系起来。我们表明,满足相同规则的反事实在可靠性上可能差异很大,取决于它们满足这些规则的安全程度,揭示了决策是由稳健还是脆弱的子句激活支持的。在合成和真实数据集上的实验表明,目标置信度反事实比传统的基于边界的方法产生更稳健和可解释的追索。在多个基准测试中,TRUST实现了完美的鲁棒性,同时保持较低的追索成本,包括在Haberman数据集上以0.92置信度达到0.10的L2距离。通过显式控制置信度和暴露规则级稳定性,TRUST为高风险决策支持提供了可操作的追索。

英文摘要

Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.

2606.18795 2026-06-18 cs.SI New submission 70%

Opinion Polarization in LLM-Based Social Networks: Manipulation and Mitigation

基于LLM的社交网络中的意见极化:操纵与缓解

Ali Safarpoor Dehkordi, Mohammad Shirzadi, Ahad N. Zehmakan

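Topic match Other LLM: a study of opinion polarization in LLM-based social networks

AI summary Studies how an adversary with a limited budget can amplify opinion polarization in an LLM-simulated social network, and evaluates two classes of defense (reactive and proactive), finding that neither fully restores the network to its baseline polarization state.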

Comments 14 pages, 7 figures

AI Chinese abstract

在线社交网络在面对试图通过操纵意见来放大意见极化的对手时有多脆弱?缓解这种操纵有多困难?现有研究使用意见动态的数学模型来探讨这一问题。虽然这些模型提供了有价值的理论见解,但它们依赖于关于交互、消息内容和意见更新的简化假设,限制了它们能够捕捉的对抗策略及其发现在现实环境中的适用性。基于大语言模型的模拟提供了一种更丰富的替代方案:智能体可以被赋予多样化的角色,通过自然语言进行交流,并以上下文相关的方式回应说服性或对抗性内容。这使得研究难以用经典数学模型表示的操纵策略成为可能。据我们所知,本研究首次在基于LLM的模拟社交网络框架中系统分析了极化的放大和缓解。在我们的框架中,具有多样化角色的LLM智能体通过交换自然语言帖子在社交网络上进行交互,并相应地更新他们的意见。我们表明,即使预算有限的对手也能显著增加极化。然后,我们研究了两类防御机制:反应性缓解(指派特定用户主动对抗操纵)和主动性干预(通过不针对特定用户的一般机制增加抵抗力)。我们的结果表明,尽管这些机制减少了对抗攻击的影响,但它们通常无法将网络恢复到其基线极化状态。这些发现表明,这两种方法都不能完全克服网络的脆弱性,凸显了此类攻击的潜在风险。

English abstract

How vulnerable are online social networks to adversaries who seek to amplify opinion polarization by manipulating opinions, and how difficult is it to mitigate such manipulation? Existing studies have examined this question using mathematical models of opinion dynamics. While these models offer valuable theoretical insights, they rely on simplified assumptions about interactions, message content, and opinion updates, limiting the adversarial strategies they can capture and the applicability of their findings to real-world settings. Large language model (LLM)-based simulations provide a richer alternative: agents can be assigned diverse personas, communicate through natural language, and respond to persuasive or adversarial content in a context-dependent way. This enables the study of manipulation strategies that are difficult to represent using classical mathematical models. To the best of our knowledge, this study provides the first systematic analysis of polarization amplification and mitigation in an LLM-based simulated social network framework. In our framework, LLM agents with diverse personas interact over a social network by exchanging natural language posts and updating their opinions accordingly. We show that even an adversary with a limited manipulation budget can considerably increase polarization. We then study two classes of defense mechanisms: reactive mitigations, which assign specific users to actively counter manipulation, and proactive interventions, which increase resistance through general mechanisms not tied to particular users. Our results show that although these mechanisms reduce the impact of adversarial attacks, they generally do not restore the network to its baseline polarization state. These findings suggest that neither approach fully overcomes the vulnerability of the network, highlighting the potential risk of such attacks.

2606.18726 2026-06-18 cs.LG cs.AI New submission 70%

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

基于图锚定交叉注意力Transformer神经网络的预测过程监控中结构约束完整事件序列生成

Fang Wang, Ernesto Damiani

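Affiliations * Department of Computer Science, University of Milan

Topic match Other LLM: predictive process monitoring, graph-grounded cross-attention Transformer

AI summary Proposes the Graph Grounded Cross Attention Transformer (GGATN), which uses a global process graph as structured memory, Transformer self-attention to encode sequence positions, and graph-grounded cross attention to inject process topology. Combined with Viterbi-style graph-constrained decoding, it generates full event sequences in a single pass and outperforms LLM baselines on six benchmark event logs.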

Comments 40 pages

AI Chinese abstract

结构约束的事件序列生成仍然具有挑战性,因为生成的路径必须保持转移可行性、时间顺序、终止和属性一致性。在预测过程监控(PPM)中,这一挑战表现为完整事件序列生成,而现有工作主要处理子任务,如下一个活动、剩余时间、结果和属性预测。本文提出了图锚定交叉注意力Transformer神经网络(GGATN)用于这一统一的PPM任务。GGATN使用全局过程图作为结构化活动记忆,通过Transformer自注意力对序列位置进行上下文化,并通过图锚定交叉注意力注入过程拓扑。与自回归解码不同,GGATN一次性生成活动、时间戳、长度以及事件级和序列级属性,随后进行维特比风格的图约束解码以获得可行路径和显式终止。在六个基准事件日志上的实验表明,其生成质量优于局部指令提示的LLM基线。GGATN在序列相似性、Damerau-Levenshtein相似性、基于二元组的控制流相似性和持续时间分布方面取得了强劲性能,同时保持零幻觉活动和零序列级属性不一致。消融分析证实了全局图编码器作为稳定的结构先验。可解释性分析展示了图结构、序列上下文、反馈细化和约束解码如何塑造生成过程。

English abstract

Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge appears as full event sequence generation, whereas existing work mainly addresses component tasks such as next activity, remaining time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self attention, and injects process topology through graph grounded cross attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event level and sequence level attributes in a single pass, followed by Viterbi style graph constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than local instruction prompted LLM baselines. GGATN achieves strong performance on sequence similarity, Damerau Levenshtein similarity, bigram based control flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.
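
A minimal sketch of Viterbi-style graph-constrained decoding, to make the idea concrete. It is not GGATN's decoder: it assumes a fixed sequence length, per-position log-scores for each activity, and a process graph given as a set of allowed transitions. The activity names, scores, and the helpers `constrained_viterbi` and `is_feasible` are hypothetical.

```python
def is_feasible(path, allowed):
    """True if every consecutive pair in the path is an edge of the process graph."""
    return all((a, b) in allowed for a, b in zip(path, path[1:]))

def constrained_viterbi(scores, allowed, start_acts, end_acts):
    """Highest-scoring path that starts in start_acts, ends in end_acts,
    and only uses transitions in `allowed`.

    scores: list of dicts, one per position, mapping activity -> log-score.
    Returns (path, score), or (None, -inf) when no feasible path exists.
    """
    # best[a] = (score, path) of the best feasible prefix ending in activity a
    best = {a: (s, [a]) for a, s in scores[0].items() if a in start_acts}
    for step in scores[1:]:
        nxt = {}
        for b, s_b in step.items():
            cands = [(sc + s_b, path) for a, (sc, path) in best.items()
                     if (a, b) in allowed]
            if cands:
                sc, path = max(cands, key=lambda c: c[0])
                nxt[b] = (sc, path + [b])
        best = nxt
    finals = [(sc, path) for a, (sc, path) in best.items() if a in end_acts]
    if not finals:
        return None, float("-inf")
    sc, path = max(finals, key=lambda c: c[0])
    return path, sc

# Hypothetical process graph and single-pass scores for four positions.
allowed = {("register", "check"), ("check", "approve"), ("check", "reject"),
           ("approve", "close"), ("reject", "close")}
scores = [
    {"register": -0.1, "check": -2.0},
    {"approve": -0.2, "check": -0.5},
    {"approve": -0.3, "reject": -1.0, "close": -2.0},
    {"close": -0.1},
]

# Position-wise argmax ignores the graph and can produce an infeasible path.
greedy = [max(step, key=step.get) for step in scores]
path, score = constrained_viterbi(scores, allowed, {"register"}, {"close"})

print("greedy:     ", greedy, is_feasible(greedy, allowed))
print("constrained:", path, is_feasible(path, allowed))
```

Here the position-wise argmax jumps from `register` straight to `approve`, which the graph does not allow, while the constrained decoder returns a path that follows the graph and terminates in an end activity.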

2606.18717 2026-06-18 cs.CL cs.AI New submission 70%

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus: 一种面向土耳其语的形态感知神经分词器和词嵌入器

Tolga Şakar

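Affiliations * Independent Researcher

Topic match Other LLM: morphology-aware tokenizer and word embedder for Turkish

AI summary Targeting the agglutinative structure of Turkish, proposes Morpheus, a neural morpheme-boundary model that provides lossless, reversible tokenization and structured word embeddings. Among reversible tokenizers it attains the lowest bits-per-character (1.425), raises morpheme-alignment F1 to 0.61, and uses about 19% less GPU memory.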

AI Chinese abstract

土耳其语是粘着语:意义由词素承载,然而驱动现代语言模型的子词分词器根据语料库统计分割单词,切碎了承载语义的后缀,并且在WordPiece和基于规则的分析器的情况下,无法将其输出解码回原始文本。本文提出Morpheus,一个面向土耳其语的神经词素边界模型,它同时是一个无损的、形态感知的分词器和一个词嵌入生成器。一个可微的泊松-二项式动态规划程序在训练期间将每个字符的边界概率转化为软词素隶属度,在推理时转化为精确的片段,无需字符串归一化,因此$\mathrm{decode}(\mathrm{encode}(w)) = w$由构造保证。由于该模型是神经模型,相同的正向传播在分词的同时也输出结构化的词嵌入。在可逆分词器中——唯一适用于生成的分词器——Morpheus达到了最低的比特每字符(1.425),将子词家族的金标准词素对齐大致翻倍(MorphScore宏F1从约0.32提升至0.61),并且相比64K词汇量的子词分词器节省了约19%的GPU内存。作为嵌入器,冻结的Morpheus向量在词汇检索(根家族MAP 0.85)和同根验证(ROC-AUC 1.00)上领先,超越了多语言检索器BGE-M3和BERTurk;在上下文和屈折依赖的任务(NER、格/数探测)上,更重的上下文编码器仍然领先——我们将这一权衡归因于Morpheus以词根为中心的几何结构。代码: https://github.com/lonewolf-rd/TurkishMorpheus ;模型: https://huggingface.co/lonewolflab/Morpheus-TR-50K ;交互演示: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo

English abstract

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character ($1.425$), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 $0.61$ vs. ${\sim}0.32$), and uses ${\sim}19\%$ less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP $0.85$) and same-root verification (ROC-AUC $1.00$), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
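
One plausible reading of the Poisson-binomial step can be sketched as follows; this is an assumption about the general technique, not the released Morpheus code. If each gap between characters carries an independent boundary probability, then the morpheme index of a character is the number of boundaries before it, which follows a Poisson-binomial distribution computable by a simple recurrence. The probabilities below are hypothetical, and `soft_memberships` and `segment` are illustrative names.

```python
def soft_memberships(boundary_probs):
    """boundary_probs[i] = P(morpheme boundary between character i and i+1).

    Returns rows where rows[i][k] = P(character i belongs to morpheme k),
    i.e. P(exactly k boundaries occur before character i).
    """
    n = len(boundary_probs) + 1
    rows = [[1.0] + [0.0] * (n - 1)]  # the first character is always in morpheme 0
    for p in boundary_probs:
        prev = rows[-1]
        cur = [prev[0] * (1.0 - p)]
        for k in range(1, n):
            # stay in morpheme k (no boundary) or advance from k-1 (boundary)
            cur.append(prev[k] * (1.0 - p) + prev[k - 1] * p)
        rows.append(cur)
    return rows

def segment(word, boundary_probs, threshold=0.5):
    """Hard segmentation at inference: cut wherever the probability exceeds threshold.

    Pieces are plain slices of the input, so joining them restores the word.
    """
    pieces, start = [], 0
    for i, p in enumerate(boundary_probs, start=1):
        if p > threshold:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces

# "evlerde" = ev + ler + de ("in the houses"); probabilities are made up.
word = "evlerde"
probs = [0.05, 0.9, 0.05, 0.05, 0.9, 0.05]

pieces = segment(word, probs)
print(pieces, "".join(pieces) == word)
for ch, row in zip(word, soft_memberships(probs)):
    print(ch, [round(v, 3) for v in row[:3]])
```

Because `segment` only slices the original string and never normalizes it, concatenating the pieces returns the input unchanged, which mirrors the round-trip property $\mathrm{decode}(\mathrm{encode}(w)) = w$ stated in the abstract. The recurrence uses only products and sums of the probabilities, so it stays differentiable with respect to them when written in an autodiff framework.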