arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31586 2026-06-01 cs.CL cs.AI version update

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

Wesley Scivetti, Ethan Wilcox, Nathan Schneider, Kanishka Misra, Leonie Weissweiler

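Affiliations: Georgetown University; The University of Texas at Austin; Leipzig University

AI summary: Using a newly constructed dataset, the paper studies how open-source language models of different sizes understand the semantics of rare English Paired-Focus constructions (e.g. "let alone"). It finds that modestly sized models grasp both their form and meaning, that the semantics is learned later than the syntax, and that semantic learning correlates with world knowledge.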

Comments Conference on Natural Language Learning (CoNLL) 2026

English abstract

Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.

2605.31584 2026-06-01 cs.CL cs.AI cs.LG version update

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

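Affiliations: Tsinghua University

AI summary: Proposes the LongTraceRL framework, which generates multi-hop questions via knowledge graph random walks, builds tiered distractors from search agent trajectories, and adds process supervision through a rubric reward based on entity chains, improving LLM performance on long-context reasoning.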

English abstract

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B-30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
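
The positive-only rubric reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the additive combination, the weights, and the case-insensitive substring matching of gold entities are all assumptions made for illustration.

```python
def rubric_reward(response: str, answer_correct: bool, gold_entities: list[str],
                  outcome_reward: float = 1.0, rubric_weight: float = 0.5) -> float:
    """Outcome reward plus an entity-coverage bonus (hypothetical weighting)."""
    if not answer_correct:
        # Positive-only: responses with a wrong final answer get no rubric credit.
        return 0.0
    if not gold_entities:
        return outcome_reward
    text = response.lower()
    # Assumed matching rule: a gold entity counts if it appears as a substring.
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    coverage = hits / len(gold_entities)
    return outcome_reward + rubric_weight * coverage
```

Under these assumptions, two correct responses are ranked by how many gold entities along the reasoning chain they mention, while an incorrect response cannot collect entity credit, which is one way to read the abstract's claim about preventing reward hacking.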

2605.31564 2026-06-01 cs.CL cs.AI version update

What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation

Qing Wang, Jacob Devasier, Chengkai Li

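Affiliations: The University of Texas at Arlington

AI summary: Presents the first systematic study of the decoding trajectories of masked diffusion language models in graph-to-text generation and finds that they generate entities first. To address the fixed output length caused by supervised fine-tuning, it proposes lambda-scaled structural decoding, a training-free inference-time modification that recovers +9.4 BLEU-4, and it introduces Graph-LLaDA to explicitly incorporate relational graph structure.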

English abstract

We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories -- the order in which tokens are unmasked during iterative decoding -- and find that, unlike autoregressive LLMs, which generate text linearly, MDLMs naturally prioritize entities first, followed by relational and function words, with structural tokens resolved last. We further identify a previously undocumented failure mode of supervised fine-tuning: SFT disrupts this strategy by prematurely anchoring structural sentence-ending tokens early in the decoding trajectory, effectively fixing the output length, which can lead to omitted or hallucinated information. To address this, we propose lambda-scaled structural decoding, a training-free inference-time modification that downweights structural token confidence and recovers +9.4 BLEU-4. Finally, we introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process to explicitly incorporate relational graph structure. Cross-dataset evaluation on LAGRANGE reveals that previous baselines overfit to dataset-specific patterns, while LLM- and MDLM-based approaches generalize significantly better.
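
As a rough illustration of downweighting structural token confidence during unmasking, the sketch below picks the next position to unmask after scaling the confidence of structural tokens by a factor `lam`. The abstract does not give the exact rule, so the multiplicative scaling, the contents of `STRUCTURAL_TOKENS`, and the one-token-per-step greedy selection are assumptions.

```python
# Hypothetical set of structural tokens; the paper's actual set is not given here.
STRUCTURAL_TOKENS = {".", "<eos>", "<pad>"}

def pick_next_unmask(candidates, lam=0.5):
    """Choose which masked position to reveal next.

    candidates: list of (position, predicted_token, confidence) tuples
    for positions that are still masked.
    lam: scaling factor in (0, 1]; lam=1.0 leaves confidences unchanged.
    """
    def scaled_confidence(candidate):
        _, token, confidence = candidate
        # Assumed rule: multiply structural-token confidence by lam.
        return confidence * lam if token in STRUCTURAL_TOKENS else confidence

    return max(candidates, key=scaled_confidence)[0]
```

With `lam=1.0` a highly confident end-of-sequence token would be committed first; with a smaller `lam` a content token can win instead, which delays the point at which the output length is fixed.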

2605.31563 2026-06-01 cs.CL version update

Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank, Fosca Giannotti

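Affiliations: Scuola Normale Superiore; University of Pisa; MaiNLP, LMU Munich; Munich Center for Machine Learning

AI summary: Re-implements a range of models, training strategies, and evaluation metrics in a unified framework, analyzes classification and explainability metrics across label and rationale representation spaces, and finds that soft representations capture disagreement better, motivating a rethink of evaluation in subjective NLP tasks.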

Comments 16 pages

English abstract

Human disagreement in labeling is ubiquitous and well known. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how best to evaluate human labels and rationales -- or even how best to aggregate rationales beyond majority vote -- in light of this variation. Yet rationales may provide additional insight into the richness of human reasoning, which may differ in style, values, and interpretations -- especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties -- predictive and distributional -- while explainability metrics are organized along three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label (hard and soft) and rationale representation space (hard, intermediate, and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.
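
To make the hard versus soft distinction concrete, here is a minimal sketch of how annotator votes and token-level rationale masks might be aggregated. The function names and the 0.5 majority threshold are assumptions, and the abstract's "intermediate" rationale space is not modeled because it is not defined there.

```python
from collections import Counter

def hard_label(votes):
    """Majority-vote label (ties resolved by first occurrence)."""
    return Counter(votes).most_common(1)[0][0]

def soft_label(votes, classes):
    """Distribution over classes given the annotators' votes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]

def soft_rationale(masks):
    """Per-token fraction of annotators who highlighted that token."""
    n_annotators = len(masks)
    return [sum(column) / n_annotators for column in zip(*masks)]

def hard_rationale(masks):
    """Token is kept if a strict majority of annotators highlighted it."""
    return [1 if score > 0.5 else 0 for score in soft_rationale(masks)]
```

The soft variants keep the disagreement that majority voting discards: three annotators voting `["hate", "hate", "normal"]` yield the hard label `"hate"` but the soft label `[2/3, 1/3]`.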

2605.31561 2026-06-01 cs.CL version update

What Am I Missing? Question-Answering as Hidden State Probing

Chu Fei Luo, Samuel Dahan, Xiaodan Zhu

Affiliations: Department of Electrical and Computer Engineering & Ingenuity Labs, Queen’s University; Conflict Analytics Lab, Queen’s University; Cornell Law School; Vector Institute for AI

AI summary: Proposes question-asking as an inference-time intervention, probes hidden states in a student-teacher setup, finds that hidden states before a question is asked predict final correctness, and designs a gating policy to decide when to ask.

English abstract

Test-time reasoning has become a significant field of study since the introduction of chain-of-thought reasoning in large language models (LLMs). However, the mechanisms of this reasoning process are still under-explored -- from the same input prompt, and even the same partial solution, LLMs can produce varied answers if sampled multiple times. We propose to leverage question-asking as an inference-time intervention that articulates information about the model's hidden state. To achieve that, we present a student-teacher setting where a student asks questions to a teacher. We train a probe on the student's hidden state before and after asking a question and find it is predictive of the trajectory's final correctness, even before generating the teacher's answer. This suggests there is a meaningful signal from the self-diagnosis that occurs during question generation rather than information transfer from the teacher. We then frame question-asking as a sequential decision problem, using this probe as a quality score, and define a gating policy to ask questions that maximize likelihood of correctness. We find that the success of question-asking as an intervention is largely dependent on the model's self-consistency. Our empirical results show a gap between detection and recovery; while our gating policy captures model correctness and uncertainty, interventions are equally likely to harm correct trajectories as they are to recover incorrect ones. This gap between diagnosis and correction has broader implications on language models' capacity for self-refinement under uncertainty.

2605.31556 2026-06-01 cs.CV cs.AI cs.CL cs.CY cs.HC version update

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji

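Affiliations: School of Engineering and Applied Sciences; Department of Psychology

AI summary: Introduces the zero-shot metric LALS and finds that, under ambiguous input, the internal encodings and outputs of vision-language models are systematically decoupled, with the female signal suppressed before generation, revealing how the models handle gender bias internally.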

Comments 16 pages, 12 figures, 1 table

English abstract

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind), cases that are common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults on ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.

2605.31550 2026-06-01 cs.CL version update

Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

Yibin Zhao, Fangxin Shang, Dingrui Yang, Yuqi Wang

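Affiliations: Taiyuan University of Technology; AI Lab, Qifu Technology; AI Lab, Greensea Technology

AI summary: Proposes the Semantic Triplet Restoration (STR) protocol, which rewrites tables as atomic facts <item path, feature path, value>, and designs the TripletQL router to select subsets of triplets; on four table-QA benchmarks it matches or exceeds HTML baselines while reducing input tokens.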

English abstract

Table question answering requires models to recover semantic relations encoded implicitly by two-dimensional layout, merged cells, and hierarchical headers. Current pipelines typically use HTML or Markdown as intermediate table representations, but these layout-oriented serializations introduce markup overhead and require large language models to infer header-cell alignments from row and column spans. We propose Semantic Triplet Restoration (STR), a protocol that rewrites each cell as an atomic fact <item path, feature path, value>, where the item path specifies the row-wise entity, the feature path specifies the hierarchical attribute, and the value contains the cell content. We also present TripletQL, a lightweight query-aware router that uses STR to select an appropriate rendering or filtered subset of triplets for each question. Across four Chinese and English table-QA benchmarks, STR matches or improves upon HTML-based baselines while reducing input tokens. The relative benefit grows for smaller language models and longer table contexts, suggesting that explicit semantic representations are especially useful under constrained inference budgets. Code and data are available at https://github.com/Phoenix-ni/STR.git .
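
The triplet rewriting can be sketched as follows, assuming the hierarchical row and column headers have already been resolved into paths (which is the hard part for merged cells and is not shown). The function names, the `" / "` path separator, and the skipping of empty cells are illustrative choices, not the STR specification.

```python
def table_to_triplets(row_paths, col_paths, cells):
    """Rewrite each non-empty cell as (item path, feature path, value).

    row_paths: one tuple of header levels per row, e.g. ("Asia", "China")
    col_paths: one tuple of header levels per column, e.g. ("Revenue", "2023")
    cells: 2D list of values indexed as cells[row][col]; None marks an empty cell
    """
    triplets = []
    for r, item in enumerate(row_paths):
        for c, feature in enumerate(col_paths):
            value = cells[r][c]
            if value is None:
                continue  # assumed: empty cells carry no fact
            triplets.append((" / ".join(item), " / ".join(feature), value))
    return triplets

def render(triplets):
    """Serialize triplets one per line in <item, feature, value> form."""
    return "\n".join(f"<{i}, {f}, {v}>" for i, f, v in triplets)

example = table_to_triplets(
    row_paths=[("Asia", "China"), ("Asia", "Japan")],
    col_paths=[("Revenue", "2023"), ("Revenue", "2024")],
    cells=[[10, 12], [7, None]],
)
```

Each rendered line carries its full header context, so a model does not need to reconstruct header-cell alignment from row and column spans.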

2605.31545 2026-06-01 cs.CL version update

Preference-Aware Rubric Learning for Personalized Evaluation

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Yoko Yamakata, Tat-Seng Chua

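Affiliations: National University of Singapore; Xiaohongshu Inc.; The University of Tokyo

AI summary: Proposes the PARL framework, which enables personalized evaluation by learning preference-aware evaluation rubrics from user histories and applying a self-validation mechanism.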

English abstract

As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model behavior with individual preferences, making the evaluation of personalized alignment a critical bottleneck. Existing evaluation methods, ranging from automatic metrics to LLM-as-a-judge approaches, fail to capture subjective, user-specific preferences embedded in long-term interaction histories. We identify three essential principles for reliable and effective personalized evaluation: Representativeness, User-Consistency, and Discriminativeness. To address these principles, we introduce Personalized Evaluation as Learning, a paradigm that formulates personalized evaluation as a learning problem rather than a static judgment. Under this paradigm, we propose PARL (Preference-Aware Rubric Learning for Personalized Evaluation), a framework that learns to induce preference-aware evaluation rubrics directly from raw user histories and applies a self-validation mechanism to ensure consistency with the user's preferences. PARL integrates rubric induction with a discriminative reinforcement learning objective that contrasts user-authored responses against competitive personalized model outputs, enabling the learned rubrics to capture precise, user-specific decision boundaries. Experiments on real-world personalized text generation tasks show that PARL consistently induces high-fidelity rubrics that reliably identify user-aligned responses and generalize across users and tasks, while capturing stable stylistic preferences and fine-grained evaluative patterns. To ensure reproducibility, our code is available at https://github.com/SnowCharmQ/PARL.

2605.31521 2026-06-01 cs.CL cs.SD version update

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou

Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Basic Model Technology Center, WeChat AI, Tencent Inc.

AI summary: Proposes the UniAudio-Token framework, which uses Semantic-Acoustic Primitives (SAP) and Semantic-Acoustic Equilibrium (SAE) to give semantic tokenizers general audio perception without sacrificing speech ability, yielding a unified audio interface.

Comments 19 pages, 10 figures

English abstract

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.

2605.31512 2026-06-01 cs.CL version update

Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral

Danish Ali, Li Xiaojian, Sundas Iqbal, Farrukh Zaidi

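Affiliations: School of Computer Science, Wuhan University; School of Software, Nanjing University of Information Science and Technology (NUIST); Department of Orthopedic, Bahawal Victoria Hospital

AI summary: For multilingual orthopedic decision support in low-resource healthcare settings, proposes a reliability framework that combines the language-aware adapted encoder IndicBERT-HPA with a deterministic selective-verification layer, achieving the best performance on classifying clinical text in English, Hindi, and Punjabi.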

English abstract

Multilingual orthopedic decision support remains challenging in low-resource healthcare settings, where clinical narratives contain specialized terminology, mixed scripts, incomplete evidence, label imbalance and language-dependent documentation patterns. This article presents a reliability-oriented framework for classifying free-text orthopedic notes in English, Hindi and Punjabi. We compare task-aligned multilingual transformer encoders, a task-fine-tuned DistilBERT baseline, zero-shot instruction-tuned large language models (LLMs) and a domain-adaptive encoder, IndicBERT-HPA. IndicBERT-HPA augments IndicBERT with language-aware orthopedic adapter heads to support clinically relevant multilingual representation learning. Evaluation extends beyond aggregate accuracy to per-class performance, ROC-AUC, AUPRC, expected calibration error, cross-language stability and robustness under controlled balanced and natural-prevalence distributions. The evaluated zero-shot LLMs remain substantially less effective than task-adapted encoders for closed-set classification, with language-dependent instability. Under natural clinical prevalence, IndicBERT-HPA achieves the strongest overall performance, reaching an averaged Macro-F1 of 0.8792, Macro-AUROC of 0.894 and AUPRC of 0.902. We further implement a deterministic selective-verification layer combining confidence gating, evidence-consistency checking and language-risk screening. On a randomly selected held-out 5,000-record subset, it achieves 84.4% selective accuracy and 0.76 selective Macro-F1 at 72.3% coverage, compared with 71.5% accuracy and 0.65 Macro-F1 for accept-all prediction. These results support reliability-oriented multilingual clinical decision support with explicit deferral.
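
The coverage and selective-accuracy figures quoted above follow the usual selective-prediction bookkeeping, sketched below for the confidence-gating part only. The evidence-consistency and language-risk checks are not specified in the abstract and are omitted; the function name and the fixed threshold are assumptions.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Return (coverage, selective_accuracy) for a confidence gate.

    A prediction is accepted when its confidence is at least `threshold`;
    everything else is deferred (for example, to a clinician).
    """
    accepted = [
        (pred, gold)
        for conf, pred, gold in zip(confidences, predictions, labels)
        if conf >= threshold
    ]
    coverage = len(accepted) / len(labels)
    if not accepted:
        return coverage, None  # accuracy is undefined when nothing is accepted
    accuracy = sum(pred == gold for pred, gold in accepted) / len(accepted)
    return coverage, accuracy
```

Setting the threshold to 0 reproduces the accept-all baseline (coverage 1.0), while raising it trades coverage for accuracy on the accepted subset.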

2605.31494 2026-06-01 cs.CL cs.LG version update

Consolidating Rewarded Perturbations for LLM Post-Training

Zheyu Zhang, Shuo Yang, Gjergji Kasneci

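Affiliations: Technical University of Munich; Munich Center for Machine Learning (MCML)

AI summary: Proposes CoRP, which consolidates rewarded perturbations into a single model through reward-weighted aggregation, compatibility reweighting, and a validation gate; it needs no gradients and improves the base model by 8.1 points on average with a single inference pass.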

English abstract

Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by RandOpt, relocates this loop to weight space, sampling Gaussian perturbations around a pretrained model and ensembling the top-K rewarded specialists at inference. While competitive with PPO and GRPO under matched training compute, this prediction-level ensemble incurs K forward passes per test example and does not extend cleanly to free-form generation. We ask whether the rewarded population can instead be folded into a single deployable model, replacing the inference-time ensemble with one consolidated update. A split-half analysis over 25 model-task pairs reveals reproducible low-rank structure in every case. We turn this geometry into CoRP (Consolidating Rewarded Perturbations), a gradient-free operator that combines reward-weighted aggregation, compatibility-aware reweighting, and a held-out validation gate, with no gradient flowing through the language model. Across five language models from 0.5B to 8B and five tasks covering math, code, and creative writing, CoRP improves the base model by 8.1 points on average. Using one tenth of RandOpt's perturbation budget, CoRP exceeds single-inference RandOpt by 6.5 points and recovers more than half of the gain of the 50-pass majority-vote ensemble, at one forward pass per test example.
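
A toy version of the consolidation idea, folding a population of rewarded weight perturbations into one update guarded by a held-out check, might look like the sketch below. It is not CoRP itself: the softmax weighting over rewards, the `temperature` parameter, and the `validate` callback are assumptions, and the compatibility-aware reweighting step is omitted because the abstract does not define it.

```python
import math

def consolidate(base, perturbations, rewards, validate, temperature=1.0):
    """Fold rewarded perturbations into one gradient-free update.

    base: list of floats (flattened weights)
    perturbations: list of perturbation vectors, each the same length as base
    rewards: one scalar reward per perturbation
    validate: callable mapping a weight vector to a held-out score
    """
    # Assumed weighting: softmax over rewards (shifted by the max for stability).
    max_reward = max(rewards)
    exps = [math.exp((r - max_reward) / temperature) for r in rewards]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Reward-weighted average of the perturbations, added to the base weights.
    update = [
        sum(w * eps[i] for w, eps in zip(weights, perturbations))
        for i in range(len(base))
    ]
    candidate = [b + u for b, u in zip(base, update)]

    # Held-out validation gate: keep the base model unless the candidate is better.
    return candidate if validate(candidate) > validate(base) else base
```

The result is a single weight vector, so inference costs one forward pass instead of one pass per ensemble member.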

2605.31480 2026-06-01 cs.CL version update

Language Models Can Resolve Reference Compositionally, But It's Not Their Native Strength: The Case of the Personal Relation Task

Bart Evelo, Meaghan Fowlie, Denis Paperno

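Affiliations: GitHub

AI summary: Using the Personal Relation Task, compares humans and large language models on the Extensional task (identifying the referent) and the Intensional task (representing meaning in a structured way). Humans do better on the Extensional task and LLMs on the Intensional task, suggesting that missing referential grounding is a key gap in LLMs' emulation of human language understanding.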

Comments A pre-MIT Press publication version. Paper accepted to Transactions of the Association for Computational Linguistics

English abstract

Do neural models, such as Large Language Models, genuinely acquire compositional abilities for interpretation of natural language? When we talk about semantic interpretation, we can distinguish two complementary aspects: establishing what an expression refers to in the world (which we call the Extensional task) and representing its sense in a structured way (which we call the Intensional task). We evaluate LLMs and humans on both tasks in the setting of the Personal Relation Task (Paperno 2022) in which, given a universe of people and their relationships with each other, one is asked to interpret a noun phrase such as "Amber's parent's friend". Here, for the Intensional task, the answer is the formula "friend(parent(amber))", and for the Extensional task, the person. We find that humans and LLMs show opposite strengths: humans perform better on Extensional than Intensional tasks, and LLMs vice versa. Our methodology brings greater nuance to the understanding of compositional abilities in modern machine learning models. Our results support the notion that the lack of referential grounding in LLM training is a crucial missing component in mimicking human-like language understanding.
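
The two tasks can be made concrete with a small sketch: `to_formula` produces the Intensional answer and `evaluate` resolves it to the Extensional one. The toy universe in `RELATIONS`, the names in it, and the assumption that every relation maps each person to exactly one other person are invented for illustration and are not taken from the paper's data.

```python
import re

# Hypothetical universe: each relation maps a person to exactly one person.
RELATIONS = {
    "parent": {"amber": "bob", "bob": "carol"},
    "friend": {"bob": "dana", "amber": "erin"},
}

def to_formula(noun_phrase: str) -> str:
    """Intensional task: "Amber's parent's friend" -> "friend(parent(amber))"."""
    parts = [p.strip().lower() for p in re.split(r"'s\s+", noun_phrase.strip())]
    formula = parts[0]
    for relation in parts[1:]:
        formula = f"{relation}({formula})"
    return formula

def evaluate(formula: str) -> str:
    """Extensional task: resolve a formula to a person in the universe."""
    match = re.fullmatch(r"(\w+)\((.+)\)", formula)
    if match is None:
        return formula  # a bare name refers to that person
    relation, inner = match.groups()
    return RELATIONS[relation][evaluate(inner)]
```

For the phrase from the abstract, `to_formula` yields `friend(parent(amber))` without consulting the universe at all, whereas `evaluate` needs the relation tables to arrive at a person, which mirrors the structure-versus-reference split the paper tests.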

2605.31478 2026-06-01 cs.SE cs.CL cs.SY eess.SY version update

Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation

Hui Wu, Xiaoyang Wang, Zhong Fan

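Affiliations: Department of Engineering, University of Exeter; Department of Computer Science, University of Exeter

AI summary: To address LLM failures in power-system code generation caused by API-knowledge boundary errors, proposes the PowerCodeBench benchmark, L0-L3 documentation-driven probing, and a boundary-aware intervention, substantially improving model accuracy.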

Comments 43 pages, 12 figures, includes supplementary material

English abstract

Large language models (LLMs) are increasingly used to automate power-system analysis, but many utilities and energy-research labs require on-premise serving for confidentiality, regulatory, reproducibility, and cost reasons. This makes the reliability of open-weight models a deployment issue. We show that first-pass failures in power-system code generation are dominated not by reasoning alone, but by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. We introduce PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth; an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles; and a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction. On a 2,000-task frozen release, we evaluate ten open-weight LLMs (1.5B-480B parameters) and four commercial mid-tier APIs. The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range match the commercial mid-tier accuracy range, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling while using 41% of the prompt-token cost. The result is an accuracy-side, deployment-time path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.

2605.31469 2026-06-01 cs.CL cs.AI cs.SD eess.AS version update

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Katalin Mády

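Affiliations: Department of Telecommunications and Artificial Intelligence; Budapest University of Technology and Economics; Speechtex Ltd.; ELTE Research Centre for Linguistics

AI summary: To address the shortage of training data for Hungarian conversational speech recognition, expands the BEA-Dialogue corpus to 200 hours by relaxing the split criterion, evaluates Whisper- and FastConformer-based models, and shows that fine-tuning based on Serialized Output Training consistently improves recognition performance.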

英文摘要

Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.
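The WER and CER figures mentioned in the abstract are edit-distance ratios, which can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, whitespace tokenization is assumed for WER, and spaces are dropped for CER (conventions differ between toolkits). cpWER and cpCER additionally search over speaker permutations and are not covered here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words.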

2605.31463 2026-06-01 cs.LG cs.AI cs.CL cs.DC 版本更新

PithTrain: A Compact and Agent-Native MoE Training System

PithTrain: 一个紧凑且面向智能体的MoE训练系统

Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry, Chenyan Xiong, Tianqi Chen

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Xlue NVIDIA(英伟达)

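AI summary: Proposes PithTrain, a compact MoE training framework built on agent-native design principles, and introduces ATE-Bench to measure agent-task efficiency; PithTrain matches the throughput of production frameworks while cutting agent turns by up to 62% and active GPU time by 64%.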

AI中文摘要

混合专家模型(MoE)已成为前沿语言模型的主导架构。为满足这一需求,生产框架经过多年的工程努力构建了优化的MoE训练栈。然而,为新的架构和系统优化而演进这些栈仍然代价高昂。随着AI编码智能体的兴起,它们可以自动化训练框架开发的部分工作并加速这一演进。但将这些智能体应用于现有框架会带来隐藏成本,这些成本在当今仅关注吞吐量的评估中不可见。我们将这一缺失维度命名为智能体任务效率(ATE):即使用编码智能体理解、操作和扩展框架的成本。基于四个智能体原生设计原则,我们构建了PithTrain,一个紧凑、智能体原生的MoE训练框架。我们进一步引入了ATE-Bench,涵盖现实世界的训练框架任务。我们的评估表明,PithTrain在吞吐量上与生产框架相当,并且在ATE-Bench上,PithTrain实现了更高的智能体任务效率,智能体轮次减少高达62%,活跃GPU时间减少64%。

英文摘要

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. With the rise of AI coding agents, they could automate parts of training-framework development and accelerate this evolution. But applying them to these existing frameworks carries hidden costs, invisible to today's throughput-only evaluations. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, covering real-world training-framework tasks. Our evaluation shows PithTrain matches the throughput of production frameworks, and on ATE-Bench, PithTrain enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.

2605.31455 2026-06-01 cs.LG cs.CL 版本更新

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

DRIFT: 解耦的轨迹采样与重要性加权微调以实现高效的多轮优化

Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

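AI summary: Online reinforcement learning is costly for multi-turn interaction, while offline supervised fine-tuning suffers from distribution shift. The proposed DRIFT framework recasts the KL-regularized RL objective as an equivalent importance-weighted supervised learning problem, giving efficient and stable multi-turn optimization.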

AI中文摘要

大型语言模型越来越多地部署在多轮交互环境中,用户或环境可以迭代地提供轻量级反馈。不幸的是,优化这种行为在实践中面临一个尖锐的困境:在线强化学习能够有效处理多轮动态,但由于每次更新时生成完整修正轨迹的成本过高而变得昂贵,而离线监督微调(SFT)虽然高效,但存在分布偏移和行为崩溃的问题。为此,我们创新性地提出了DRIFT(解耦的轨迹采样与重要性加权微调)框架,该框架实现了KL正则化强化学习目标等价于重要性加权监督学习的理论洞察。DRIFT通过从固定参考策略中采样离线交互轨迹,推导基于回报的重要性权重,并通过在所得数据集上进行加权SFT来优化策略,从而将轨迹采样与优化解耦。实验表明,DRIFT在多轮强化学习基线中达到或超越其性能,同时保持了标准监督微调的训练效率和简单性。代码可在 https://github.com/2020-qqtcg/DRIFT 获取。

英文摘要

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
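The equivalence in the abstract can be made concrete with the standard closed form for KL-regularized objectives: the policy maximizing E[R] - β·KL(π‖π_ref) satisfies π*(y|x) ∝ π_ref(y|x)·exp(R(x,y)/β), so fitting a policy to π* by maximum likelihood on samples drawn from π_ref reduces to SFT with weights proportional to exp(R/β). The sketch below illustrates that idea only. The function names, the temperature `beta`, and the self-normalization of the weights are assumptions of this example, and the paper's actual weighting scheme may differ.

```python
import math


def importance_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(R_i / beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not returns:
        raise ValueError("returns must be non-empty")
    peak = max(returns)  # subtract the maximum for numerical stability
    exps = [math.exp((r - peak) / beta) for r in returns]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(log_probs, returns, beta=1.0):
    """Weighted negative log-likelihood over offline trajectories.

    log_probs[i] is the current policy's log-probability of trajectory i
    (summed over its tokens); returns[i] is that trajectory's return.
    """
    if len(log_probs) != len(returns):
        raise ValueError("log_probs and returns must have the same length")
    weights = importance_weights(returns, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

A small `beta` concentrates the weight on the highest-return trajectories, while a large `beta` approaches uniform weighting, which is plain SFT on the reference policy's samples.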

2605.31452 2026-06-01 cs.CL cs.HC 版本更新

Translation Analytics for Freelancers II: Benchmarking Local LLMs for Confidential Translation Workflows

自由译者的翻译分析 II:用于保密翻译工作流的本地大语言模型基准测试

Yuri Balashov, Rex VanHorn, Mingxi Xu, Austin Downes

发表机构 * University of Georgia(佐治亚大学)

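AI summary: Develops practical, low-barrier methods for freelance translators and small language service providers by benchmarking locally runnable LLMs on offline translation for confidentiality-sensitive domains; carefully selected local LLMs match or surpass local NMT systems and a frontier LLM but trail top commercial NMT systems.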

Comments 20 pages. Accepted at EAMT-2026 (Tilburg, Netherlands, June 2026)

AI中文摘要

基于我们之前的工作,本文为自由译者和小型语言服务提供商开发了实用、低门槛的方法,使用严格且易于访问的分析方法来评估翻译技术。这里我们解决一个高风险、专业化的需求:保密敏感领域的离线翻译,其中隐私约束排除了基于云的引擎和商业大语言模型的使用。我们将之前工作中使用的Reeve基金会三语语料库(RFTC)扩展为多语语料库(RFMC),添加了句子对齐的德语和简体中文参考翻译。然后,我们通过Ollama对几个本地可运行的语言模型进行基准测试,涵盖四种语言方向,从该语料库中选取了1000多个句子。我们使用一致的单一提示调用,无需微调或领域适应,将本地大语言模型的输出与商业神经机器翻译(DeepL、百度)、前沿大语言模型(GPT-5.2)以及专业级本地神经机器翻译系统(OPUS-CAT、NeuralDesktop、Promt)进行比较。使用MATEO进行自动评估。结果显示,本地大语言模型在不同语言方向和模型规模上的性能存在显著差异。最好的本地大语言模型匹配或超越了本地神经机器翻译系统和前沿大语言模型,但仍落后于顶级商业神经机器翻译。这些发现强调了精心选择的本地大语言模型翻译对于隐私受限的专业人士的可行性,并为未来关于模型缩放和多语言能力的研究提供了信息。

英文摘要

Building on our previous work, this paper develops practical, low-barrier methods for freelance translators and smaller language service providers to evaluate translation technologies using rigorous yet accessible analytic methods. Here we address a high-stakes, specialized need: offline translation for confidentiality-sensitive domains in which privacy constraints preclude the use of cloud-based engines and commercial LLMs. We expand the Reeve Foundation Trilingual Corpus (RFTC) used in our previous work into a multilingual corpus (RFMC) by adding sentence-aligned German and Simplified Chinese reference translations. We then benchmark several locally runnable language models (via Ollama) across four language directions on 1000+ sentences selected from this corpus. We use consistent single-prompt calls without fine-tuning or domain adaptation, comparing local LLM outputs against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional-grade local NMT systems (OPUS-CAT, NeuralDesktop, Promt). Automatic evaluation is conducted with MATEO. Results reveal substantial variation in local LLM performance across language directions and model sizes. The best local LLMs match or surpass local NMT systems and a frontier LLM, though they remain behind top commercial NMTs. These findings underscore the viability of carefully selected local LLM translation for privacy-constrained professionals and inform future research on model scaling and multilingual capability.

2605.31446 2026-06-01 cs.CL cs.AI 版本更新

Fine-grained Verification via Diagnostic Reasoning Supervision for Aspect Sentiment Triplet Extraction

面向方面情感三元组抽取的诊断推理监督细粒度验证

Wenna Lai, Haoran Xie, Guandong Xu, Qing Li, S. Joe Qin

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Lingnan University(岭南大学) Education University of Hong Kong(香港教育大学)

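AI summary: Proposes FiVeD, a framework for fine-grained verification with diagnostic reasoning supervision; the verifier is trained on validity classification and quality scoring, with error-type classification and rationale generation as auxiliary tasks, to make ASTE triplet extraction more reliable.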

Comments 25 pages, 13 figures, and 6 tables

AI中文摘要

方面情感三元组抽取(ASTE)旨在识别方面词、观点词和情感极性作为结构化三元组,为下游信息系统应用(如意见挖掘、可解释推荐和评论摘要)提供必要输入。先前工作主要关注端到端抽取,而对抽取三元组的事后验证仍相对未被充分探索。这一差距限制了ASTE系统的可靠性,因为预测的三元组可能在局部合理但全局无效。此外,候选无效性是多方面的,候选可用性本质上是分级的,这促使了一种细粒度验证机制,可以过滤或重新排序来自不同抽取器的输出。在本文中,我们提出了FiVeD,一个具有诊断推理监督的细粒度验证框架。具体来说,验证器通过多个互补目标进行训练,包括作为主要任务的有效性分类和质量评分估计,以及作为辅助任务的错误类型分类和理由生成。我们定义了层次化错误类别,并在语义和句法约束下构建合理的错误三元组,利用现成的LLM和特定任务评分标准生成质量评分和诊断理由。在推理过程中,生成的质量评分用于过滤候选输出,支持可调节的精确率-召回率权衡。在多个ASTE基线模型上的实验表明,FiVeD作为即插即用的验证模块,持续将抽取性能提升最多3.53个F1点。

英文摘要

Aspect Sentiment Triplet Extraction (ASTE) aims to identify aspect terms, opinion terms, and sentiment polarities as structured triplets, providing essential inputs for downstream information system applications such as opinion mining, explainable recommendations, and review summarization. Prior work mainly focuses on end-to-end extraction, while post hoc verification of extracted triplets remains comparatively underexplored. This gap limits the reliability of ASTE systems, since predicted triplets may be locally plausible while being globally invalid. Moreover, candidate invalidity is multi-faceted and candidate usability is inherently graded, motivating a fine-grained verification mechanism that can filter or re-rank outputs from diverse extractors. In this paper, we propose FiVeD, a framework for Fine-grained Verification with Diagnostic reasoning supervision. Specifically, the verifier is trained with multiple complementary objectives, including validity classification and quality score estimation as primary tasks, with error type classification and rationale generation as auxiliary tasks. We define hierarchical error categories and construct plausible incorrect triplets under semantic and syntactic constraints, and leverage an off-the-shelf LLM with task-specific rubrics to produce quality scores and diagnostic rationales. During inference, the resulting quality scores are used to filter candidate outputs, supporting adjustable precision-recall tradeoffs. Experiments across multiple ASTE baselines demonstrate that FiVeD consistently improves extraction performance by up to 3.53 F1 points as a plug-and-play verification module.
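The inference-time filtering described in the abstract, where a quality score gates candidate triplets and the cutoff tunes the precision-recall balance, can be sketched as follows. The data layout, function names, and example scores are hypothetical; the verifier that produces the scores is not modeled here.

```python
def filter_by_score(candidates, threshold):
    """Keep candidate triplets whose verifier quality score meets the threshold."""
    return [c for c in candidates if c["score"] >= threshold]


def precision_recall(kept, gold):
    """Exact-match precision and recall of the kept triplets against gold triplets."""
    kept_set = {c["triplet"] for c in kept}
    gold_set = set(gold)
    true_positives = len(kept_set & gold_set)
    # Returning 0.0 for an empty denominator is a convention chosen for this sketch.
    precision = true_positives / len(kept_set) if kept_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical extractor output: (aspect, opinion, polarity) plus a verifier score.
candidates = [
    {"triplet": ("battery", "long-lasting", "POS"), "score": 0.9},
    {"triplet": ("screen", "dim", "NEG"), "score": 0.6},
    {"triplet": ("screen", "long-lasting", "POS"), "score": 0.3},
]
gold = [("battery", "long-lasting", "POS"), ("screen", "dim", "NEG")]

for threshold in (0.2, 0.5, 0.8):
    p, r = precision_recall(filter_by_score(candidates, threshold), gold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold drops the implausible third candidate first, which raises precision, and eventually removes correct triplets as well, which lowers recall.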

2605.31445 2026-06-01 cs.GT cs.AI cs.CL cs.LG 版本更新

Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information

二手车销售机器人?作为讨价还价代理的LLM在部分信息下的诚实与轻信

Antonio Valerio Miceli-Barone, Vaishak Belle, Shay B. Cohen

发表机构 * University of Edinburgh(爱丁堡大学)

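AI summary: Studies LLM agents in simulated bargaining scenarios and finds that they deviate from game-theoretic equilibria and attempt to lie but cannot efficiently exploit information asymmetries; fine-tuning for financial utility makes them stronger negotiators but also more dishonest.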

Comments 18 pages, 14 figures

AI中文摘要

在这项工作中,我们研究了模拟讨价还价场景中的代理,其中买方和卖方通过文本渠道进行通信,并试图在不同信息制度(完全信息、信息不对称或相互不确定性)下谈判互利交易。我们评估了它们相对于博弈论解决方案的表现,并进一步调查了它们的诚实性(披露或隐瞒信息、误导或欺骗的倾向)以及轻信性(信任或不信任对方提供信息的倾向)。我们研究了零样本LLM代理(使用简单的提示脚手架)以及微调代理,以探讨优化代理以最大化财务利润是否使它们成为更强的谈判者,但也更不诚实和更不信任。我们发现,现成的LLM都显著偏离博弈论均衡,它们试图对自己的私人信息撒谎,但无法有效利用信息不对称。对财务效用的微调使代理在达成更好交易方面更强,但也更不诚实,这突显了优化代理任务对其安全性可能带来的风险。我们发布了我们的代码和一个讨价还价场景数据集。

英文摘要

In this work we study agents in simulated bargaining scenarios, where a buyer and a seller communicate through a text channel and attempt to negotiate mutually beneficial trades, under different information regimes (complete information, information asymmetry or mutual uncertainty). We evaluate their performance w.r.t. game-theoretical solutions and further investigate their honesty (their tendency to disclose or withhold information or to mislead and deceive) as well as their credulity (their tendency to trust or distrust information provided by the other agent). We study zero-shot LLM agents with simple prompting scaffolding as well as fine-tuned agents, in order to investigate whether optimising the agents to maximise financial profits makes them stronger negotiators but also more dishonest and less trusting. We find that off-the-shelf LLMs all substantially deviate from game-theoretical equilibria: they attempt to lie about their private information but cannot efficiently exploit information asymmetries. Fine-tuning on financial utility makes the agents stronger at achieving better deals but also more dishonest, highlighting the risks that optimising agents for a task can have on their safety. We release our code and a dataset of bargaining scenarios.

2605.31433 2026-06-01 cs.CL 版本更新

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

SCOPE:面向开放式任务的协同演化策略自我对弈

Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini

发表机构 * University of Edinburgh(爱丁堡大学) Imperial College London(伦敦帝国学院)

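AI summary: Proposes SCOPE, a data-free self-play framework that co-evolves a Challenger policy and a Solver policy, improving performance on open-ended tasks and matching or exceeding a baseline trained on curated prompts.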

AI中文摘要

自我对弈可以在没有外部监督的情况下训练语言模型。然而,现有方法需要可规则检查的答案,使得开放式任务依赖于精心设计的提示或前沿模型作为评判者。我们引入了SCOPE,一个无数据的自我对弈框架,用于开放式任务,它协同演化两个策略:一个生成文档基础任务的挑战者,以及一个通过多轮检索回答这些任务的求解者。初始模型的冻结副本作为自我评判者,它从源文档编写特定任务的评分标准,并据此对求解者的回答进行评分。在三个7-8B指令调优模型(Qwen2.5、Qwen3、OLMo-3)上,SCOPE在八个基准测试中将开放式任务性能提升了高达+10.4分,并匹配或超过了在约9K个精心设计的提示上训练的GRPO_data。尽管仅在开放式任务上训练,SCOPE还在七个保留的短格式问答基准测试中提升了高达+13.8分,在所有三个模型上均超越了GRPO_data。消融实验表明,协同演化挑战者对于保持任务接近求解者的前沿是必要的;性能提升源于检索和合成的改进,其相对贡献因任务而异;并且评分标准生成质量是自我评判的瓶颈。

英文摘要

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.

2605.31432 2026-06-01 cs.CL cs.AI cs.SD 版本更新

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

DOA:面向语音大语言模型的长形式同声传译的无训练仅解码器注意力策略

Sara Papi, Luisa Bentivogli

发表机构 * Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会)

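AI summary: Proposes DOA, a training-free policy that derives a proxy alignment from decoder self-attention so that SpeechLLMs can make streaming decisions in long-form simultaneous translation.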

AI中文摘要

同声语音到文本翻译(SimulST)在语音尚未完成时生成翻译,需要流式策略来决定何时读取和何时写入。最先进的方法依赖于基于注意力的编码器-解码器模型,其中交叉注意力提供显式的对齐信号。相比之下,语音大语言模型(SpeechLLMs)是仅解码器架构,仅依赖自注意力。这引发了一个核心问题:解码器自注意力是否包含足够稳定的对齐信号来指导流式策略。此外,现有方法通常依赖于基于训练的适应或启发式等待-$k$策略,并且尚未在长形式场景中得到验证。为了填补这些空白,我们提出了仅解码器注意力(DOA),这是一种无训练策略,通过从自注意力中导出代理对齐,使现成的SpeechLLMs能够进行长形式同声传译。在Phi4-Multimodal和Qwen3-Omni上的实验表明,DOA提供了有效的对齐信号来支持流式决策,实现了低延迟的长形式SimulST,其质量接近无需重新训练的离线解码。

英文摘要

Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.
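To make the read/write decision concrete, here is a minimal sketch of the heuristic wait-$k$ policy that the abstract names as a common baseline: read $k$ source units first, then alternate between writing one target token and reading one more source unit until the source is exhausted. It does not reproduce DOA, whose alignment derivation is not specified in the abstract; the function name and the fixed target length are assumptions made for illustration.

```python
def wait_k_schedule(num_source, num_target, k):
    """Return the READ/WRITE action sequence of a wait-k policy.

    num_source: number of source units (e.g. speech chunks) in the input.
    num_target: number of target tokens to emit (fixed here for illustration).
    k: how many source units to read before the first write.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    actions = []
    read = written = 0
    while written < num_target:
        # Before emitting target token `written`, wait-k wants k + written
        # source units read, capped at the total length of the source.
        needed = min(k + written, num_source)
        if read < needed:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


print(wait_k_schedule(num_source=3, num_target=3, k=2))
```

An attention-based policy replaces the fixed `needed` rule with a data-dependent test on where the model is attending, which is the role the abstract assigns to the proxy alignment.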

2605.31421 2026-06-01 cs.CL cs.AI cs.DS 版本更新

Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm

神经符号句法分析:用CYK算法塑造神经网络

Fabio Massimo Zanzotto, Federico Ranaldi, Giorgio Satta

发表机构 * Human-centric ART, University of Rome Tor Vergata(人本导向ART,罗马托尔维加塔大学) DEI, University of Padua(迪埃学院,帕多瓦大学)

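AI summary: Proposes CYKNN, a recurrent neural network architecture that encodes the CYK algorithm directly in trainable matrix-vector multiplications; on a very simple grammar task it outperforms large language models, opening a different route for neuro-symbolic methods.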

Comments 9 content pages

AI中文摘要

在本文中,我们展示了将算法直接注入神经网络架构的可能性。我们聚焦于一个复杂算法,即用于解析乔姆斯基范式中上下文无关文法的Cocke-Younger-Kasami(CYK)算法,并提出了CYKNN,一种将CYK算法编码为可训练矩阵-向量乘法的简单循环神经网络架构。我们使用一个包含4种变体的非常简单的语法进行实验,结果表明,我们的方法在上下文学习设置中优于参数超过200亿的现有大语言模型,以及经过LoRA微调的Qwen系列较小语言模型。我们的尝试为神经符号方法论开辟了一条不同的途径。

英文摘要

In this paper, we show the possibility of a direct injection of algorithms into neural network architecture. We focus on a complex algorithm, that is, Cocke-Younger-Kasami (CYK) for parsing context-free grammars in Chomsky Normal Form, and we propose CYKNN, a simple recurrent neural network architecture for encoding the CYK algorithm in trainable matrix-vector multiplications. We experimented with a very simple grammar with 4 variations, showing that our approach outperforms existing LLMs with more than 20B parameters in an in-context learning setting and smaller LLMs of the Qwen family fine-tuned with LoRA. Our attempt paves the way to a different approach to neuro-symbolic methodologies.
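For readers unfamiliar with the algorithm being encoded, the following is a minimal symbolic CYK recognizer for a grammar in Chomsky Normal Form. It is the textbook dynamic program, not the CYKNN architecture; the rule-table layout and the toy grammar are assumptions made for this example.

```python
def cyk_recognize(words, lexical_rules, binary_rules, start="S"):
    """Return True if `words` is derivable from `start` under a CNF grammar.

    lexical_rules: dict mapping a word to the set of nonterminals A with A -> word.
    binary_rules: dict mapping a pair (B, C) to the set of nonterminals A with A -> B C.
    """
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals that derive words[i..j] inclusive.
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        chart[i][i] = set(lexical_rules.get(word, ()))
    for span in range(2, n + 1):              # length of the substring
        for i in range(n - span + 1):         # start of the substring
            j = i + span - 1                  # end of the substring
            for split in range(i, j):         # last index of the left part
                for left in chart[i][split]:
                    for right in chart[split + 1][j]:
                        chart[i][j] |= binary_rules.get((left, right), set())
    return start in chart[0][n - 1]


# Toy grammar: S -> NP VP, VP -> V NP, NP -> "she" | "fish", V -> "eats".
lexical = {"she": {"NP"}, "fish": {"NP"}, "eats": {"V"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

print(cyk_recognize("she eats fish".split(), lexical, binary))
print(cyk_recognize("eats she fish".split(), lexical, binary))
```

Each chart cell is computed from pairs of smaller cells, and it is this repeated combination step that the abstract says CYKNN expresses as trainable matrix-vector multiplications.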

2605.31408 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

大型语言模型代理中的技能可用性与呈现粒度:一项受控的SkillsBench研究

Xiaonan Xu, Wenjing Wu

发表机构 * Computer Information Technology, Northern Arizona University(计算机信息科技,北亚利桑那大学) Computer Science, University of Colorado Boulder(计算机科学,科罗拉多大学博尔德分校)

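AI summary: A controlled experiment on whether the presentation granularity of skill knowledge affects downstream task success; skill availability substantially raises success rates, while changes in presentation granularity have small and uncertain effects.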

AI中文摘要

技能文档在推理时为大型语言模型代理提供程序性知识。本文研究受控技能知识的呈现粒度是否会改变下游任务成功率。实验使用固定的SkillsBench版本,包含30个任务、领域平衡的子集(由官方oracle运行验证)、两种启用推理的模型配置、六种技能条件,以及每个任务-条件-模型单元五次试验。技能可用性是最清晰的经验信号。相对于无技能,技能条件使GPT-5.5的任务平均通过率提高26.7至36.0个百分点,使DeepSeek V4-Flash提高18.0至26.0个百分点。最终数据包含1800行,每个模型900行。任务是推理单元。在每个任务-条件-模型单元内聚合五次试验,然后在30个任务上估计配对对比。主要的呈现对比较小且不确定。低抽象指导与高抽象指导相比,GPT-5.5差异为+0.7个百分点,DeepSeek V4-Flash差异为-6.7个百分点,两者的95%自助法置信区间均跨越零。在中抽象指导中添加一个工作示例与无示例变体相比,差异分别为+0.7和+1.3个百分点。平均奖励稳健性检验保持了相同的实质性结论。在这个受控子集中,技能可用性与更高的成功率相关,而测试的呈现粒度变化产生的影响较小、不确定且依赖于模型。

英文摘要

Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment uses a pinned SkillsBench version, a 30-task domain-balanced subset validated by official oracle runs, two reasoning-enabled model configurations, six skill conditions, and five trials per task-condition-model cell. Skill availability is the clearest empirical signal. Relative to no skill, skill conditions increase task-mean pass rate by 26.7 to 36.0 percentage points for GPT-5.5 and by 18.0 to 26.0 percentage points for DeepSeek V4-Flash. The final data contain 1,800 rows, with 900 rows for each model. The task is the inference unit. Five trials are aggregated within each task-condition-model cell before paired contrasts are estimated over 30 tasks. The primary presentation contrasts are smaller and uncertain. Low-abstraction guidance differs from high-abstraction guidance by +0.7 percentage points for GPT-5.5 and -6.7 percentage points for DeepSeek V4-Flash, with both 95% bootstrap confidence intervals crossing zero. Adding one worked example to medium-abstraction guidance differs from the no-example variant by +0.7 and +1.3 percentage points. Mean-reward robustness checks preserve the same substantive conclusion. In this controlled subset, skill availability is associated with higher success than no skill, while the tested presentation-granularity changes yield small, uncertain, and model-dependent effects.
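The analysis described in the abstract, per-task pass rates aggregated over trials followed by paired contrasts with 95% bootstrap confidence intervals, can be sketched as below. The percentile method, the resample count, and the function name are assumptions of this example; the abstract does not say which bootstrap variant was used.

```python
import random


def paired_bootstrap_ci(cond_a, cond_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-task difference (a minus b).

    cond_a[i] and cond_b[i] are the pass rates of task i under two conditions,
    each already averaged over that task's trials. Tasks are resampled with
    replacement, which treats the task as the unit of inference.
    """
    if len(cond_a) != len(cond_b) or not cond_a:
        raise ValueError("need two equally long, non-empty lists")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

An interval that crosses zero, as the abstract reports for the presentation-granularity contrasts, means the resampled task sets disagree about the sign of the effect.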

2605.31404 2026-06-01 cs.CL cs.AI 版本更新

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning

剑、盾与阿喀琉斯之踵:大型语言模型在导航规划中空间推理的语言归纳偏置特征化

Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian, Shaowen Chen, Xian Wei, Ke Li, Xiong You

发表机构 * East China Normal University(华东师范大学) Information Engineering University(信息工程大学) Zhengzhou University(郑州大学)

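AI summary: Proposes a dual-interventional framework that uses representation and context interventions to separate linguistic structure from contextual cues, characterizing the linguistic inductive bias of LLMs in navigation planning: topological information is the backbone of robust planning, linguistic format is a double-edged sword, and semantic information is the Achilles' heel.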

AI中文摘要

基于大型语言模型(LLM)的导航系统通常构建显式空间表示(如拓扑图、语义栅格图)并将其转换为文本描述作为LLM的输入。然而,此类基于文本的空间表示的语言结构及其包含的上下文特征(如拓扑、几何)的选择通常被视为中性的工程决策,而非塑造LLM行为的关键因素。为填补这一空白,我们提出一个双干预框架,将语言结构与不同的上下文线索分离,以评估LLM在导航规划中的语言归纳偏置。在该框架中,表示干预改变语言格式和语言压缩程度,阐明语言表示何时支持或抑制导航规划。上下文干预结合上下文特征组合与冲突探测,明确澄清LLM在处理不同上下文线索时的偏好和弱点。跨多种空间推理任务和多个模型规模的实验揭示了一致模式:拓扑信息是坚固的盾牌和稳健规划的支柱;语言格式是双刃剑,其效果取决于模型大小、任务需求和压缩程度;语义信息是致命的阿喀琉斯之踵——错误的语义线索会系统性地破坏规划过程。总体而言,我们的研究表明,基于LLM的导航中有效的文本空间表示应保持拓扑完整性,根据模型能力校准表示压缩,并确保语义正确性,而非简单采用单一表示。我们的代码公开于https://github.com/jonesdong150/LLM-Navigation-Inductive-Bias。

英文摘要

Large Language Model (LLM)-based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic raster maps) and translate them into textual descriptions as LLMs' inputs. However, the linguistic structures of such text-based spatial representations and the choices of contextual features (e.g., topology, geometry) they contain are often treated as neutral engineering decisions rather than key factors that shape LLMs' behavior. To fill the gap, we propose a dual-interventional framework that disentangles linguistic structures from different contextual cues to evaluate the linguistic inductive bias of LLMs for navigation planning. In the framework, representation intervention varies the linguistic format and the degree of linguistic compression, clarifying when linguistic representations support or inhibit navigation planning. Context intervention, combined with contextual feature combination and conflict probing, explicitly clarifies the preferences and weaknesses of LLMs when processing different contextual cues. Experiments across diverse spatial reasoning tasks and multiple model scales reveal a consistent pattern: topological information is a sturdy shield and the backbone of robust planning; linguistic format is a double-edged sword whose effect depends on model size, task demands, and the compression level; and semantic information is a fatal Achilles' heel -- incorrect semantic cues can systematically derail the planning process. Overall, our study shows that effective text-based spatial representations in LLM-based navigation should preserve topological integrity, calibrate representational compression to model capacity, and ensure semantic correctness, rather than simply adopting a single representation. Our code is publicly available at https://github.com/jonesdong150/LLM-Navigation-Inductive-Bias.

2605.31393 2026-06-01 cs.CL cs.AI 版本更新

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

面向手语翻译的大语言模型目标端释义增强

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

发表机构 * III-LIDI Universidad Nacional de La Plata(III-LIDI国立拉普拉塔大学) CDTEC, Federal University of Pelotas(CDTEC,联邦 Pelotas 大学) CONICET III-LIDI Comision de Investigaciones Cientificas Universidad Nacional de La Plata(科学委员会国立拉普拉塔大学) Universidade Federal de Pelotas(联邦 Pelotas 大学)

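AI summary: To address scarce parallel corpora and heavy-tailed target vocabularies in sign language translation, the paper uses GPT-4o to generate controlled paraphrases of reference sentences for target-side augmentation and evaluates the approach on three sign language datasets, improving BLEU-4 on PHOENIX14T and exposing its limits on GSL and LSA-T.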

Comments Accepted at GenSign (https://genai4sl.github.io/) at CVPR 2026. Non proceedings track

AI中文摘要

手语翻译(SLT)仍然受到有限的配对手语视频/文本语料库和长尾目标词汇的限制。我们研究了目标端增强方法,其中GPT-4o生成参考句子的受控释义变体,而手语输入保持不变。采用基于Signformer姿态的Transformer,在两阶段调度下进行训练:先在增强语料库上预训练,然后在原始参考句子上微调。我们在三个具有互补挑战的数据集上进行了评估:PHOENIX14T(德国手语),具有适度的词汇多样性;GSL(希腊手语),具有高度受控、重复的录制;以及LSA-T(阿根廷手语),具有严重的长尾稀疏性。在PHOENIX14T上,增强将BLEU-4从9.56提高到10.33。接近饱和的GSL基线和极其稀疏的LSA-T设置揭示了该方法的局限性。据我们所知,这是第一项将LLM生成的目标端释义和LLM作为评估者应用于手语翻译的研究。语义评估揭示了词汇重叠指标低估的忠实度提升。

英文摘要

Sign language translation (SLT) remains constrained by limited paired sign-video/text corpora and heavy-tailed target vocabularies. We study target-side augmentation in which GPT-4o generates controlled paraphrase variants of reference sentences while the sign input remains unchanged. A Signformer-style pose-based Transformer is trained under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate on three datasets spanning complementary challenges: PHOENIX14T (German Sign Language), with moderate lexical diversity; GSL (Greek Sign Language), with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), with severe long-tail sparsity. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33. The near-saturated GSL baseline and extremely sparse LSA-T setting reveal the limits of the approach. To our knowledge, this is the first study to apply LLM-generated target-side paraphrases and LLM-as-a-Judge evaluation to SLT. The semantic evaluation reveals gains in fidelity that lexical overlap metrics understate.

2605.31387 2026-06-01 cs.CL cs.RO 版本更新

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

多轮多智能体对话用于协作重建仅略微提升VLM在空间推理上的性能

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

发表机构 * Computational Linguistics, Department of Linguistics University of Potsdam(计算语言学,语言学系 波茨坦大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI))

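AI summary: Evaluates vision-language models on a collaborative spatial reasoning task through a multi-turn, multi-agent dialogue framework; visual spatial understanding remains the main bottleneck, while textual representations and decomposed image representations partly improve performance.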

Comments Preprint

AI中文摘要

在多样化环境中运行的机器人依赖视觉输入来解释物体和空间布局。在人类协作任务中,它们被期望通过语言传达这种理解。视觉语言模型(VLM)支持涉及视觉解释、问答和指令跟随的机器人任务,但它们在需要空间推理的协作对话任务中的能力仍未充分探索。我们通过一个结合视觉解释、基础、语言引导交互和动作生成的协作结构构建任务来研究这一差距。我们开发了一个框架,其中VLM通过对话从视觉和文本输入重建目标结构。我们在交互设置、输入模态和图像表示上评估了开放权重和封闭VLM。结果表明,对于评估的VLM,视觉表示的空间推理仍然困难。目标的详细文本表示在模态条件下产生更高的重建成功率,而分解的图像表示提高了性能。这些发现揭示了协作VLM智能体在视觉空间基础和基础指令生成方面的局限性。

英文摘要

Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure-building task that combines visual interpretation, grounding, language-guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open-weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.

2605.31378 2026-06-01 cs.CL 版本更新

Unlocking Fine-Grained Translation Quality Estimation in LRMs through Synergistically Evolving Implicit and Explicit Reasoning

解锁大型推理模型中细粒度翻译质量评估:通过协同演化隐式和显式推理

Renfei Dang, Xinye Wang, Zhejian Lai, Weilu Xu, Shimin Tao, Daimeng Wei, Min Zhang, Shujian Huang

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Huawei Translation Services Center, Beijing, China(北京华为翻译服务中心)

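AI summary: Large reasoning models struggle with fine-grained translation quality estimation. The proposed two-stage RIEQE framework co-evolves implicit reasoning (supervised fine-tuning without reasoning chains) and explicit reasoning (reinforcement learning with thinking), surpassing baselines on the WMT test sets.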

AI中文摘要

大型推理模型(LRMs)即使拥有长推理链,在细粒度翻译质量评估(QE)上仍然困难。我们认为LRMs已经具备强大的多语言能力,而核心挑战源于学习细粒度QE任务的内在难度。本文提出RIEQE(隐式和显式推理用于QE),一个简单的两阶段训练框架,实现隐式(层级别)和显式(词级别)推理能力的协同演化。为了使隐式推理可行,我们首先将复杂的QE任务分解为简单的子任务。基于此,我们的两阶段方法应用:(1)非思考监督微调(NonThinking-SFT),无推理链的监督微调,直接提升模型的隐式推理倾向和能力;(2)思考强化学习(Thinking-RLVR),标准可验证奖励强化学习,随后增强显式推理。结果表明,在我们的框架下,隐式和显式推理协同演化。在WMT测试集上,基于Qwen3-4B-Thinking-2507的RIEQE在显式推理性能上超越所有基线,同时其隐式推理能力也与当前最好的基于编码器的模型相当。我们进一步提供了隐式和显式推理协同合作的证据,展示了它们如何相互促进。

英文摘要

Large Reasoning Models (LRMs) still struggle with fine-grained translation quality estimation (QE), even with long reasoning chains. We argue that LRMs already possess strong multilingual capabilities, while the core challenge stems from the intrinsic difficulty of learning the fine-grained QE task. In this paper, we propose RIEQE (Reasoning both Implicitly and Explicitly for QE), a simple two-stage training framework that enables the co-evolution of implicit (layer-wise) and explicit (token-wise) reasoning capabilities. To make implicit reasoning feasible, we first decompose the complex QE task into straightforward subtasks. Based on this, our two-stage approach applies: (1) NonThinking-SFT, Supervised Fine-Tuning (SFT) without reasoning chains to directly boost the model's implicit reasoning tendency and capability; and (2) Thinking-RLVR, standard Reinforcement Learning with Verifiable Reward (RLVR) to subsequently strengthen explicit reasoning. Results demonstrate that implicit and explicit reasoning synergistically co-evolve under our framework. On the WMT test sets, RIEQE based on Qwen3-4B-Thinking-2507 surpasses all baselines in explicit reasoning performance, while its implicit reasoning capability is also comparable to the best current encoder-based models. We further provide evidence for the synergistic collaboration between implicit and explicit reasoning, showing how they mutually benefit each other.

2605.31367 2026-06-01 cs.LG cs.CL 版本更新

Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

通过结构化广义线性令牌混合在表达性与复杂性之间进行权衡

Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen

发表机构 * Department of Computer Science, School of Computing, Institute of Science Tokyo(东京科学研究所计算机科学系)

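AI summary: Presents a unified framework that decomposes token mixing layers into direct input-output influence and recurrent propagation, designs structured recurrence patterns that provably trade runtime complexity for expressivity, and validates them on synthetic tasks and language modeling.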

Comments 20 pages, 3 figures, ICML 2026 main

AI中文摘要

令牌混合层在语言模型学习和生成长期依赖关系中起着关键作用。其效率依赖于解码速度与内存需求以及缓存大小之间的必要权衡。考虑因果生成,本文通过一个统一框架探索新的权衡,该框架分离了两个关键特征:(i) 在一个生成步骤中输入对输出的直接影响;(ii) 通过过去输出进行信息的递归传播。该框架涵盖了注意力机制和状态空间模型等主要架构,但也通过允许每个状态依赖于多个过去状态(而不仅仅是直接前驱)来推广递归方程。通过引入结构,我们设计了新的递归模式,这些模式可证明达到所需的复杂度,同时提供关于其表达性的理论见解——以原则性的方式用运行时换取表达性。在合成任务以及语言建模上进行了实证验证。这些结果共同提供了一个统一的工具包,用于理解和设计跨模型家族的高效且富有表达性的令牌混合器。

英文摘要

Token mixing layers play a key role in how language models can learn and generate long-range dependencies. Their efficiency relies on the necessary trade-off between decoding speed and the memory requirements, along with the cache size. Considering causal generation, this paper explores new trade-offs thanks to a unified framework which separates two crucial features: (i) the direct influence of inputs on outputs in one generation step; (ii) the recurrent propagation of information through past outputs. This framework encompasses major architectures such as attention and state-space models, but also generalizes the recurrence equations by allowing each state to depend on multiple past states rather than only the immediate predecessor. By introducing structure, we design new recurrence patterns that provably achieve the desired complexity, while providing theoretical insights on their expressivity -- trading runtime for expressivity in a principled way. Empirical validation is performed on synthetic tasks, along with language modeling. Together, these results provide a unified toolkit for the understanding and design of efficient and expressive token mixers across model families.
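The generalization mentioned in the abstract, where a state may depend on several past states rather than only its immediate predecessor, can be illustrated with a scalar linear recurrence of the form h_t = b·x_t + Σ_d a_d·h_{t-d}. This is a toy instance under stated assumptions: the scalar states, the fixed coefficients, and the function name are all hypothetical and are not the paper's parameterization.

```python
def multi_lag_recurrence(inputs, lag_weights, input_weight=1.0):
    """Run h_t = input_weight * x_t + sum over lags d of lag_weights[d] * h_{t-d}.

    lag_weights maps a lag d >= 1 to its coefficient a_d. A single entry {1: a}
    gives the usual first-order recurrence; more entries let each state read
    from several earlier states directly.
    """
    if any(lag < 1 for lag in lag_weights):
        raise ValueError("lags must be positive integers")
    states = []
    for t, x in enumerate(inputs):
        h = input_weight * x
        for lag, coeff in lag_weights.items():
            if t - lag >= 0:  # earlier states before the sequence start count as zero
                h += coeff * states[t - lag]
        states.append(h)
    return states


# First-order recurrence: an impulse decays geometrically.
print(multi_lag_recurrence([1.0, 0.0, 0.0, 0.0], {1: 0.5}))
# Two lags: the same impulse now follows a Fibonacci-like pattern.
print(multi_lag_recurrence([1.0, 0.0, 0.0, 0.0], {1: 1.0, 2: 1.0}))
```

In this toy version, the largest lag determines how many past states must be kept during decoding, which is one simple way to see the runtime and cache trade-off the abstract refers to.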

2605.31363 2026-06-01 cs.CL 版本更新

The Latin Substrate: How Language Models Represent and Mediate Script Choice

拉丁基底:语言模型如何表示和中介文字选择

Daniil Gurgurov, Alan Saji, Katharina Trinley, Josef van Genabith, Simon Ostermann

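Affiliations: German Research Center for AI; Saarland University; Indian Institute of Technology, Madras; Nilekani Centre at AI4Bharat

AI summary: Using logit-lens, representational, and mechanistic analyses, the paper finds consistent latent romanization during transliteration and a small set of late-layer attention heads that causally mediate script choice. Non-Latin output is produced by a compact, identifiable gate, whereas Latin output emerges from diffuse contributions across the network, indicating that models organize script variation around shared latent representations with a bias toward a Latin substrate.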

Comments preprint

详情
AI中文摘要

许多语言使用多种文字书写,要求大型语言模型(LLM)以不同的正字法形式生成等价的语言内容。虽然先前的研究表明LLM通过共享的潜在表示路由信息,但它们如何内部中介文字变异仍知之甚少。我们通过首先使用logit透镜检查逐层输出分布来研究这个问题,这揭示了转写过程中一致的潜在拉丁化,然后通过文字生成的表征和机制分析。在表征层面,我们展示了同一语言的文字在层间变得越来越可分离,并且一个简单的线性引导方向可以翻转模型的输出文字,同时大致保持语义内容。该向量非对称地泛化到构建时未见过的书写系统,可靠地将非拉丁输出翻转为拉丁,但将拉丁输出映射到各种非拉丁文字。在机制层面,我们定位了一小组后期注意力头,它们因果中介文字选择。这些头跨不相关语言和书写系统转移,表明文字路由由语言无关组件实现。在这两项分析中,我们观察到一致的方向性不对称:非拉丁输出由紧凑、可识别的门控产生,而拉丁文字输出来自网络中的扩散贡献。总的来说,我们的发现暗示LLM围绕共享潜在表示组织文字变异,同时表现出对拉丁文字的优先基底。

英文摘要

Many languages are written in multiple scripts, requiring large language models (LLMs) to generate equivalent linguistic content in distinct orthographic forms. While prior work suggests that LLMs route information through shared latent representations, how they internally mediate script variation remains poorly understood. We study this question by first examining per-layer output distributions with the logit lens, which reveals consistent latent romanization during transliteration, and then through representational and mechanistic analyses of script generation. At the representational level, we show that scripts of the same language become increasingly separable across layers and that a simple linear steering direction can flip a model's output script while largely maintaining semantic content. The vector generalizes asymmetrically to writing systems unseen during construction, flipping non-Latin output to Latin reliably, but mapping Latin output into varied non-Latin scripts. At the mechanistic level, we localize a small set of late-layer attention heads that causally mediate script choice. These heads transfer across unrelated languages and writing systems, suggesting that script routing is implemented by language-agnostic components. Across both analyses, we observe a consistent directional asymmetry: non-Latin output is produced by a compact, identifiable gate, while Latin-script output emerges from diffuse contributions across the network. Collectively, our findings hint that LLMs organize script variation around shared latent representations while exhibiting a privileged substrate toward Latin script.
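A linear steering direction of the kind mentioned in the abstract can be sketched as follows. The abstract does not say how the vector is built, so this assumes a difference-of-means construction over hidden states; the function names and the `alpha` scale are hypothetical.

```python
import numpy as np

def steering_direction(latin_acts, nonlatin_acts):
    """Unit vector from the non-Latin mean activation to the Latin mean.

    Both inputs have shape (n_examples, d). Difference of means is an
    assumption made for illustration, not the paper's stated method.
    """
    d = latin_acts.mean(axis=0) - nonlatin_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; negative alpha pushes the other way."""
    return hidden + alpha * direction
```

In practice the shifted hidden state would be written back into the model at a chosen layer; that step depends on the model implementation and is omitted here.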

2605.31351 2026-06-01 cs.CL cs.CV 版本更新

A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

面向VLM-as-a-Judge评估的视障辅助基准

Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li

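Affiliations: Department of Computing, The Hong Kong Polytechnic University

AI summary: To address the reliability of VLM-as-a-Judge evaluation in visually impaired assistance tasks, the paper proposes the VIABLE benchmark (over 300K samples, an Effectiveness-Impartiality-Stability framework, and a 12-mode failure taxonomy), finds existing models unreliable, and develops VIA-Judge-Agent to improve diagnostic accuracy and user preference.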

详情
AI中文摘要

基于AI的视障辅助(VIA)仍然具有挑战性,主要原因是人工评估成本高昂。VLM-as-a-Judge范式可能提供一种有前景的替代方案,尽管该范式主要在通用领域得到研究。因此,我们质疑此类评判者是否可以在VIA任务中值得信赖。为探究这一问题,我们引入了VIABLE(面向VLM-as-a-Judge评估的视障辅助基准),这是首个用于VIA中VLM-as-a-Judge评估的基准。VIABLE包含超过30万个判断样本,涵盖三种场景,并引入了一个包含12种失败模式分类的有效性-公正性-稳定性框架。基于VIABLE,我们对七个不同模型规模的评判者进行了系统研究,结果表明现有模型在所有评估轴上基本不可靠。最强的评判者GPT-5.4仅达到52.6%的单故障诊断准确率,却表现出最高的自我偏好率(94.2%);而开源评判者存在严重偏差且对抗性脆弱。为解决这些问题,我们提出了VIA-Judge-Agent,一种与模型无关的推理时增强方法,通过视觉证据提取和基于分类的工作流来增强评判者。该方法在诊断准确性和下游VIA响应(更受BLV用户青睐)方面实现了积极改进。数据和代码可在 https://github.com/YiyiyiZhao/VIABLE 获取。

英文摘要

AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness-Impartiality-Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy, yet exhibits the highest self-preference rate at 94.2%, while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It improves diagnostic accuracy and yields downstream VIA responses that BLV users prefer. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE

2605.31349 2026-06-01 cs.CL cs.AI cs.CV cs.MM 版本更新

FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

FBHM:用于仇恨模因检测的功能性基准测试与视觉语言模型引导

Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee

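Affiliations: Indian Institute of Technology (IIT), Kharagpur; Microsoft

AI summary: Because existing benchmarks cannot causally evaluate the vulnerabilities of vision-language models, the paper proposes FBHM, a benchmark built along 25 rhetorical functionalities and 10 target communities, and uses learnable steering vectors (LSV) to raise model performance by about 30 Macro-F1 points in an ultra-low data regime.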

详情
AI中文摘要

仇恨模因检测对于视觉语言模型仍是一个严峻挑战,因为现有基准在结构上是观察性的——混淆了修辞仇恨机制与目标社区特征,并阻碍了对模型漏洞的因果评估。为解决这一问题,我们引入了FBHM,一个系统策划的基于功能的仇恨模因基准,沿两个正交轴构建:25种不同的修辞功能和10个目标社区(总共5,000个模因)。对最先进的视觉语言模型进行基准测试揭示了一个严重的泛化差距:在标准数据集上高度准确的模型在FBHM上灾难性地下降到接近随机性能,证明它们利用了数据集特定的启发式方法而非稳健的多模态推理。为了高效缩小这一差距,我们提出了LSV(可学习引导向量),一种超低数据量策略,在仅500个引导样本(50个独特基础模因)上应用因果干预目标,将FBHM性能提升约30个Macro-F1点,同时优于上下文学习和PEFT,且不降低源域性能。

英文摘要

Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models highly accurate on standard datasets catastrophically drop to near-random performance on FBHM, proving they exploit dataset-specific heuristics rather than robust multimodal reasoning. To efficiently close this gap, we propose LSV (learnable steering vectors), an ultra-low data regime strategy that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by ~30 Macro-F1 points while outperforming in-context learning and PEFT without degrading source-domain performance.

2605.31338 2026-06-01 cs.CL 版本更新

Bundesrecht: An Open Library and Corpus for German Statutory Reference Processing

Bundesrecht: 面向德国法律引用处理的开放库与语料库

Harshil Darji, Martin Heckelmann, Christina Kratsch, Gerard de Melo

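Affiliations: Hochschule für Technik und Wirtschaft Berlin; Hasso-Plattner Institute / University of Potsdam

AI summary: The paper presents bundesrecht, an open resource comprising a software library and a structured corpus for parsing, normalizing, and resolving German statutory references, enabling end-to-end processing from raw citation strings to resolved statutory provisions.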

Comments 10 pages, 1 figure. Preprint

详情
AI中文摘要

法律引用是法律语言理解的核心,但难以自动处理,因为它们以紧凑且多变的表面形式出现,可能组合多个目标,使用特殊缩写,并常指向较低级别的单元。现有的德语工具要么专注于从法律文档中解析引用,要么在引用明确后访问法律文本。本文介绍了 bundesrecht,一个用于德国法律引用处理的开放资源,包含一个软件库和一个结构化的德国联邦法律语料库。该库解析、规范化和解析德国法律引用,将原始引用字符串映射到结构化对象,将紧凑引用扩展为规范形式,并将其链接到法律条文。附带的语料库保留了从法律到细粒度子条款的法律内部层级。我们使用严格精确匹配和微信息抽取指标,在 2,944 个带注释的德国法律引用上评估了解析器和规范化器。我们进一步评估了规范引用去重,并表明规范化引用比字符串匹配更可靠地对真实引用表面变体进行分组。bundesrecht 是第一个覆盖德国法律引用处理端到端流水线的开放资源,从原始引用字符串到解析后的法律条文,并在 PyPI 上可用。

英文摘要

Statutory references are central to legal language understanding, but are difficult to process automatically, as they appear in compact and variable surface forms, may combine multiple targets, use special abbreviations, and often point to lower-level units. Existing tools for German focus either on parsing references from legal documents or accessing statutory text once citations are explicit. This paper introduces bundesrecht, an open resource for German statutory reference processing, consisting of a software library and a structured corpus of German federal law. The library parses, normalizes, and resolves German statutory references, mapping raw citation strings to structured objects, expanding compact references into canonical forms, and linking them to statutory provisions. The accompanying dataset preserves the internal hierarchy of statutes from laws to fine-granular subclauses. We evaluate the parser and normalizer on 2,944 annotated German legal references using strict exact-match and micro information extraction metrics. We further evaluate canonical reference deduplication and show that normalized references group real citation surface variants far more reliably than string matching. bundesrecht is the first open resource that covers German statutory reference processing as an end-to-end pipeline, from raw citation string to resolved statutory provision, and is available on PyPI.
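The parse-then-normalize idea can be illustrated with a small regular expression. This is a hedged sketch that does not use or reproduce the bundesrecht library's API: `parse_reference` and `canonical` are hypothetical helpers, and the pattern covers only a single section with an optional paragraph (Abs.) and sentence (S. or Satz), followed by a law abbreviation.

```python
import re
from typing import Optional

# Hypothetical pattern for references such as "§ 433 Abs. 1 S. 2 BGB".
_REF = re.compile(
    r"§\s*(?P<section>\d+[a-z]?)"
    r"(?:\s+Abs\.\s*(?P<paragraph>\d+))?"
    r"(?:\s+(?:S\.|Satz)\s*(?P<sentence>\d+))?"
    r"\s+(?P<law>[A-ZÄÖÜ][A-Za-zÄÖÜäöü]*)"
)

def parse_reference(text: str) -> Optional[dict]:
    """Map a raw citation string to a dict of the parts that are present."""
    m = _REF.search(text)
    if m is None:
        return None
    return {k: v for k, v in m.groupdict().items() if v is not None}

def canonical(ref: dict) -> str:
    """Render parsed parts in one fixed surface form, so variants compare equal."""
    parts = [f"§ {ref['section']}"]
    if "paragraph" in ref:
        parts.append(f"Abs. {ref['paragraph']}")
    if "sentence" in ref:
        parts.append(f"S. {ref['sentence']}")
    parts.append(ref["law"])
    return " ".join(parts)
```

Two surface variants such as `§433 Abs.1 Satz 2 BGB` and `§ 433 Abs. 1 S. 2 BGB` differ as strings but share one canonical form, which is why grouping by normalized reference can succeed where plain string matching does not.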

2605.31328 2026-06-01 cs.CL 版本更新

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

强化学习放大了来自无害奖励的涌现性失调

Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan, Lucie Flek, Florian Mai

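Affiliations: Bonn-Aachen International Center for Information Technology, University of Bonn, Germany; Lamarr Institute for Machine Learning and Artificial Intelligence, Germany

AI summary: The paper studies how reinforcement learning can induce emergent misalignment in language models from seemingly harmless reward signals, finds it more severe than under supervised fine-tuning, and verifies that interleaving safety data during training mitigates the problem.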

详情
AI中文摘要

涌现性失调(EM)是指语言模型在针对狭窄的失调示例进行微调后,意外地变得广泛失调的倾向。虽然EM在监督微调(SFT)设置中已被广泛研究,但来自强化学习(RL)的证据仅限于大型闭源模型,使得该现象研究成本高昂且难以复现。我们沿三个维度刻画了小型、现成的开源权重模型中来自RL的EM。首先,我们表明,奖励狭窄、明显的失调行为比样本匹配的SFT产生更高的通用域失调。其次,我们表明,来自RL的EM可以由可能自然出现的奖励信号引发,例如不受欢迎的审美偏好或糟糕的修辞诉求。第三,我们评估了为SFT引发的EM开发的训练中缓解措施,发现它们广泛适用,其中交错进行策略内安全数据效果最佳。

英文摘要

Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcement learning (RL) is limited to large, closed-source models, leaving the phenomenon expensive to study and difficult to reproduce. We characterize EM from RL in small, off-the-shelf open-weight models along three axes. First, we show that rewarding narrow, overtly misaligned behavior produces substantially higher general-domain misalignment than sample-matched SFT. Second, we show that EM from RL can be induced by reward signals that could plausibly arise naturally, such as unpopular aesthetic preferences or poor rhetorical appeals. Third, we evaluate in-training mitigations developed for SFT-induced EM and find that they broadly transfer, with interleaving on-policy safety data performing best.

2605.31312 2026-06-01 cs.CV cs.CL 版本更新

Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization

从细粒度视觉差异中学习:通过上下文视觉对比优化缓解多模态幻觉

Haolin Deng, Xin Zou, Zhiwei Jin, Chen Chen, Haonan Lu, Xuming Hu

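Affiliations: The Hong Kong University of Science and Technology; OPPO AI Center

AI summary: The paper proposes In-Context Visual Contrastive Optimization (IC-VCO), which places contrastive images in a shared multi-image context to obtain a mathematically rigorous objective, and introduces Visual Contrast Distillation (VCDist) and a contrastive sample editing strategy to mitigate multimodal hallucination.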

Comments ICML 2026

详情
AI中文摘要

多模态幻觉仍然是视觉语言模型(VLM)面临的持续挑战。标准的文本直接偏好优化(DPO)由于缺乏显式的视觉监督,往往无法缓解这一问题。虽然现有工作通过将原始图像与负样本对比引入了视觉偏好DPO,但由于配分函数不匹配导致目标在理论上不一致,并且依赖可能引发捷径学习的粗粒度负样本。在这项工作中,我们提出了上下文视觉对比优化(IC-VCO)。通过将对比图像置于共享的多图像上下文中,IC-VCO确保了数学上严谨的目标。我们进一步引入了视觉对比蒸馏(VCDist),一种辅助的可靠性门控正则化器,鼓励多图像对比训练与单图像推理之间的一致性。最后,我们提出了一种对比样本编辑策略,通过精确的语义扰动生成困难负样本。在五个基准上的实验表明,IC-VCO取得了最佳的整体性能,并且我们的样本编辑策略有效。代码和数据可在 https://github.com/OPPO-Mente-Lab/IC-VCO 获取。

英文摘要

Multimodal hallucination remains a persistent challenge for Vision-Language Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and rely on coarse-grained negatives that could enable shortcut learning. In this work, we propose In-Context Visual Contrastive Optimization (IC-VCO). By placing contrastive images within a shared multi-image context, IC-VCO ensures a mathematically rigorous objective. We further introduce Visual Contrast Distillation (VCDist), an auxiliary reliability-gated regularizer that encourages consistency between multi-image contrastive training and single-image inference. Finally, we propose a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations. Experiments on five benchmarks demonstrate IC-VCO's best overall performance and the effectiveness of our sample editing strategy. Code and data are available at https://github.com/OPPO-Mente-Lab/IC-VCO.

2605.31293 2026-06-01 cs.CL 版本更新

Divergence Decoding: Inference-Time Unlearning via Auxiliary Models

发散解码:通过辅助模型进行推理时遗忘

Humzah Merchant, Bradford Levy

发表机构 * University of Chicago(芝加哥大学)

AI总结 提出发散解码(DD)方法,利用小型辅助模型在推理时引导LLM的logits远离特定数据,有效且低成本地实现遗忘,并在多个基准上超越现有方法。

详情
AI中文摘要

大型语言模型(LLM)经常记忆敏感的训练数据,从而产生显著的隐私和版权风险。解决这些风险,即从现有模型检查点中移除此类知识,已被证明具有挑战性,因为许多遗忘方法会导致灾难性的效用损失或对复杂查询无效。我们引入了发散解码(DD),一种使用小型辅助模型在推理时将LLM的logits引导远离特定数据的机制。训练这些模型是直接的,即我们使用标准的预训练和微调设置。我们发现该方法在遗忘基准测试中明显优于最先进的基线,且在各种模型和训练数据集规模上保持一致,表明DD是一种有效且廉价的遗忘解决方案。然后我们证明,这种引导后的分布可以轻松地蒸馏回基础模型。由于该方法普遍适用于任何概率模型,我们探索了其在文本生成之外的有效性,并发现了向图像领域泛化的证据。

英文摘要

Large Language Models (LLMs) frequently memorize sensitive training data, thereby creating significant privacy and copyright risks. Addressing these risks, i.e., removing such knowledge from an existing model checkpoint, has proven challenging, as many unlearning methods lead to catastrophic utility loss or are ineffective for complex queries. We introduce Divergence Decoding (DD), a mechanism that uses small auxiliary models to steer the logits of the LLM away from specific data during inference. Training these models is straightforward, i.e., we use standard pre-training and fine-tuning setups. We find the method decisively outperforms state-of-the-art (SOTA) baselines on unlearning benchmarks across a variety of model and training dataset scales, consistent with DD being an effective and inexpensive solution to unlearning. We then demonstrate that this steered distribution can be trivially distilled back into the base model. Since the method is generally applicable to any probabilistic model, we explore its efficacy outside of text generation and find evidence of generalization to the domain of images.
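Inference-time logit steering with auxiliary models can be sketched as below. The abstract does not give the exact update rule, so the form used here is an assumption: one auxiliary model has seen the data to forget, one has not, and their logit difference is subtracted from the base model's logits. The names `steer_logits`, `aux_forget`, `aux_retain`, and `alpha` are hypothetical.

```python
import numpy as np

def steer_logits(base, aux_forget, aux_retain, alpha=1.0):
    """Push next-token logits away from what the forget-data model prefers.

    base       : logits of the large model, shape (vocab,)
    aux_forget : logits of a small model trained with the data to forget
    aux_retain : logits of a small model trained without that data
    alpha      : steering strength (assumed hyperparameter)
    """
    return base - alpha * (aux_forget - aux_retain)

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()
```

Tokens that only the forget-data model favors are penalized, while tokens on which both auxiliary models agree are left roughly unchanged. Sampling then proceeds from `softmax(steer_logits(...))` as usual.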

2605.31281 2026-06-01 cs.CL 版本更新

Wind Turbine Maintenance Log Labelling Framework: LLM-Driven Data Correction and Enrichment via Semantic Extraction of Reliability Intelligence

风力涡轮机维护日志标注框架:基于LLM驱动的数据校正与语义提取的可靠性智能增强

Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya

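Affiliations: Institute for Energy Systems, School of Engineering, The University of Edinburgh; Nadara, Lisbon, Portugal

AI summary: The paper proposes a method that uses a large language model to automatically standardize and structure wind turbine maintenance logs, correcting system codes and extracting taxonomies of failure modes and maintenance actions to turn unstructured text into quantitative reliability metrics.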

Comments An adjustable template containing the Python script architecture, applied dynamic prompts, and data schemas is hosted in an open-source GitHub repository: https://github.com/mvmalyi/llm-driven-wind-turbine-maintenance-log-labelling

详情
AI中文摘要

随着风力涡轮机机队老化,数据驱动的可靠性工程对于优化其运行和维护以延长使用寿命和降低平准化能源成本至关重要。历史维护日志中的故障事件描述是宝贵可靠性智能的来源。然而,它们通常以非结构化的自然语言条目出现,无法进行定量分析。本文提出了一种新颖的方法,利用大语言模型(LLM)根据自由文本描述符系统地标准化和结构化维护日志。该方法在来自280台涡轮机、监测九年的16,316条维护日志数据集上运行,开发的模型无关框架自主纠正了层次化系统代码,并提取了基于证据的维护操作和故障模式分类。自动化流水线成功结构化超过70%的数据集。它解决了普遍存在的错误分类问题,例如隔离先前未分类的变桨系统故障并恢复缺失的系统代码,并通过应用经验分类法标记具体采取的操作和处理的故障模式来丰富记录。通过使用基于系统的日志批次构建故障模式、可观察症状、主导机制和候选原因的经验词典,该方法减少了手动故障模式与影响分析(FMEA)固有的主观性。最终,该方法为将大量定性现场观测转化为定量可靠性指标提供了高度可扩展、成本效益高的蓝图,为可再生能源领域的集成根本原因分析、改进的FMEA和先进预测性维护奠定了基础。

英文摘要

As wind turbine fleets age, data-driven reliability engineering is essential to optimise their operation and maintenance for service life extension and levelised cost of energy reduction. Failure event descriptions within historical maintenance logs are a source of valuable reliability intelligence. However, they typically appear as unstructured natural language entries, rendering them inaccessible for quantitative analysis. This paper presents a novel methodology leveraging a large language model (LLM) to systematically standardise and structure maintenance logs based on their free-text descriptors. Operating on a dataset of 16,316 maintenance logs from 280 turbines monitored over nine years, the developed model-agnostic framework autonomously corrected hierarchical system codes and extracted evidence-based taxonomies of maintenance actions and failure modes. The automated pipeline successfully structured over 70% of the dataset. It resolved pervasive misclassification issues, such as isolating previously unclassified pitch system faults and restoring missing system codes, and enriched the records by applying empirical taxonomies to label specific actions taken and failure modes addressed. By using system-based log batches to construct empirical dictionaries of failure modes, observable symptoms, dominant mechanisms, and candidate causes, this approach reduces the inherent subjectivity of manual failure modes and effects analysis (FMEA). Ultimately, the methodology provides a highly scalable, cost-effective blueprint for translating large sets of qualitative field observations into quantitative reliability metrics, laying the foundation for integrated root-cause analysis across the renewable energy sector, improved FMEA, and advanced predictive maintenance.

2605.31268 2026-06-01 cs.CL 版本更新

Mellum2 Technical Report

Mellum2 技术报告

Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok, Petr Borovlev, Kseniia Lysaniuk, Madeeswaran Kannan, Ivan Dolgov, Nikita Pavlichenko

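Affiliations: JetBrains; Constructor University

AI summary: The report introduces Mellum 2, a 12B-parameter Mixture-of-Experts language model (2.5B active parameters per token) specialized in software engineering, which through architectural choices and training optimizations is competitive with 4B-14B open-weight models on code generation, mathematical reasoning, and other benchmarks.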

详情
AI中文摘要

我们提出 Mellum 2,一个开放权重的12B参数混合专家(MoE)语言模型,每token激活2.5B参数。Mellum 2是一个通用语言模型,专精于软件工程,涵盖代码生成与编辑、调试、多步推理、工具使用与函数调用、智能体编码以及对话式编程辅助,它是之前专注于补全的4B密集模型Mellum的继任者。架构基于混合专家(64个专家,8个激活),结合了分组查询注意力(4个KV头)、每四层中三层使用滑动窗口注意力,以及一个多token预测头,该头同时作为辅助预训练目标和内置的推测解码草稿模型;每个选择都通过消融实验验证,并以商品GPU上的推理效率作为设计约束。预训练涵盖约10.6万亿token,通过三个阶段的学习课程,从多样化的网络数据逐步转向精选的代码和数学内容,使用Muon优化器在FP8混合精度下进行优化,并采用预热-保持-衰减调度(线性衰减至零)。预训练基础模型通过层选择性YaRN扩展到128K上下文窗口,然后分两个阶段进行后训练(监督微调后接RLVR),产生两个发布变体:直接回答的Instruct模型和在最终答案前输出显式推理轨迹的Thinking模型。在代码生成、数学与推理、工具使用、知识和安全基准上,Mellum 2与4B-14B范围内的开放权重基线模型竞争,同时每token计算量相当于2.5B密集模型。我们在Apache 2.0许可下发布基础、指令和思考检查点,以及关于架构决策、数据管道和训练配方的本报告。

英文摘要

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model. The architecture builds on the Mixture-of-Experts (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that doubles as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via a layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.
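The Warmup-Hold-Decay schedule with linear decay to zero can be written as a short function. This is a sketch under stated assumptions: the abstract specifies only the linear decay to zero, so the linear warmup shape and every step count and learning rate below are hypothetical, and `whd_lr` is an illustrative name.

```python
def whd_lr(step, peak_lr, warmup_steps, hold_steps, decay_steps):
    """Learning rate at a given step under a Warmup-Hold-Decay schedule.

    Warmup: linear from 0 to peak_lr (assumed shape).
    Hold:   constant at peak_lr.
    Decay:  linear from peak_lr down to exactly 0, then stays at 0.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return peak_lr
    progress = (step - warmup_steps - hold_steps) / decay_steps
    return peak_lr * max(0.0, 1.0 - progress)
```

Holding the peak rate for most of training and decaying only at the end makes it possible to branch a decay phase from any held checkpoint, which is one common reason to prefer this shape over a cosine schedule.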

2605.31264 2026-06-01 cs.AI cs.CL cs.LG 版本更新

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

COLLEAGUE.SKILL: 通过专家知识蒸馏实现自动化AI技能生成

Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao, Xia Hu

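Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: The paper proposes an automated system that distills heterogeneous traces into inspectable, correctable, agent-usable skill packages for generating person-grounded AI skills.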

Comments 12 pages, 4 figures

详情
AI中文摘要

LLM代理不仅被期望完成孤立的任务,还要承载人类专业知识、判断和互动风格的有限表示。构建这种基于人的代理仍然困难,因为与人或角色相关的可操作知识通常嵌入在异构痕迹中,而不是写成清晰的指令。现有的记忆和角色系统捕捉了这些证据的片段,而技能框架提供了可移植的打包格式;然而,没有端到端的工作流将这些痕迹蒸馏成可检查、可修正和代理可用的技能。我们提出了一个自动化的痕迹到技能蒸馏系统,通过专家知识蒸馏生成基于人的AI技能。给定目标人物或角色的材料,COLLEAGUE.SKILL 生成一个版本化的技能包,包含两个协调的轨道:一个能力轨道,用于实践、心理模型和决策启发式;一个边界行为轨道,用于沟通风格、互动规则和修正历史。该包可以被检查、调用、通过自然语言反馈更新、回滚、跨代理主机安装,并可选择性地为受控分发做准备。我们描述了开源系统中实现的人工制品契约、生成工作流、修正生命周期、部署表面和领域预设。在撰写本文时,公共仓库拥有约18.5k个GitHub星标;画廊列出了来自165位贡献者的215个技能,以及跨列出的技能卡累计超过10万个星标。该系统说明了基于人的技能如何表示为可移植、可修正的包,而不是不透明的提示或隐藏的记忆。

英文摘要

LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end-to-end workflow for distilling these traces into inspectable, correctable, and agent-usable skills. We present an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open-source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.

2605.31238 2026-06-01 cs.CL cs.LG 版本更新

Scaling Multi-Hop Training Data via Graph-Constrained Path Selection

通过图约束路径选择扩展多跳训练数据

Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song, Wei Xue, Yike Guo

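Affiliations: The Hong Kong University of Science and Technology; Hong Kong Generative AI Research and Development Center; Hong Kong Baptist University

AI summary: For compositional reasoning over specialized documents, the paper proposes graph-constrained path selection, which decouples path discovery from verbalization and uses graph constraints to filter out invalid paths, substantially expanding the usable corpus and improving model performance.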

Comments 21 pages, 5 figures

详情
AI中文摘要

赋予大型语言模型对专业文档的组成推理能力需要大规模的多跳训练数据,而此类数据除了基于结构化来源的精心策划基准外很少存在。为了直接从纯文本、无标注文本中构建此类数据,现有方法要求单个教师模型联合发现文档中的证据路径并将其表述为问答对。然而,当文档围绕重复模板和密集交叉引用子句(这是大多数真实世界专业语料库的特征)构建时,这些方法会严重退化。在这项工作中,我们将这两个操作解耦:推理路径在上下文关键词质心的图上离线枚举,教师模型仅用于将预验证的路径语言化。该图强制执行五个几何可接受性约束,我们提供了Gram矩阵论证,表明仅局部相似性边界允许端点漂移高达约91°,并且需要上相似性边界才能退出由模板文本形成的密集嵌入团。一项匹配规模的消融实验揭示了其机制:在相等的训练规模下,约束链和无约束链产生无法区分的下游性能,而全规模下的增益来自可用语料库的4.4倍扩展,而非更高的每条链质量——这重新定义了图约束在此设置中的作用:提高教师可合成性而非改进链内容。在从CUAD法律合同语料库构建的80K示例上微调Qwen3-32B,将闭卷Token F1从21.66%提高到38.58%。我们已在https://github.com/hkgai-official/GCSCS发布代码。

英文摘要

Endowing large language models with compositional reasoning over specialized documents requires multi-hop training data at scale, but such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question-answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross-referencing clauses, conditions that characterize most real-world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre-validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram-matrix arguments establishing that local similarity bounds alone admit endpoint drift up to approximately 91°, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched-size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4× expansion of the usable corpus rather than from higher per-chain quality -- reframing the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine-tuning Qwen3-32B on 80K examples constructed from the CUAD legal contract corpus improves closed-book Token F1 from 21.66% to 38.58%. We have released our code at https://github.com/hkgai-official/GCSCS.

2605.31220 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

共享疑虑:语言模型的零样本跨语言置信度估计

Athina Kyriakou, Dennis Ulmer, Ivan Titov

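Affiliations: ILLC, University of Amsterdam; ILCC, University of Edinburgh

AI summary: The paper studies whether multilingual LLMs encode shared, language-transferable confidence features. A lightweight linear probe predicts answer correctness directly from intermediate representations, generalizes zero-shot across languages, and shows that confidence features concentrate in middle layers.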

详情
AI中文摘要

置信度估计(CE),即量化模型预测的可靠性,在大语言模型(LLM)背景下引起了极大兴趣。然而,大多数研究集中在英语上,忽视了LLM使用的多语言现实,而许多CE方法会退化或需要跨语言重新训练。为了解决这一差距,我们研究了多语言LLM是否编码共享的、可跨语言迁移的置信度特征。我们使用一个轻量级线性探针,直接从中间表示预测答案正确性。经过单语言训练后,该探针在零样本情况下泛化到未见过的、类型多样的语言,无需目标语言监督。学习到的层权重和多次消融实验表明,置信度特征集中在各语言的中间层,表明存在共享的置信度子空间。虽然零样本跨语言性能取决于与源语言的相似性,但该探针无需任何重新训练即可提供强基线,并且与其他流行的置信度估计方法相比具有优势。

英文摘要

Confidence estimation (CE), i.e. quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.
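A linear probe with learned layer weights can be sketched as a forward pass. The exact parameterization is not given in the abstract, so this assumes a softmax-weighted mix of per-layer hidden states followed by a logistic output; `probe_confidence`, `layer_logits`, `w`, and `b` are hypothetical names, and training of these parameters is omitted.

```python
import numpy as np

def probe_confidence(hidden, layer_logits, w, b):
    """Estimated probability that the model's answer is correct.

    hidden       : array (n_layers, d), one representation per layer
    layer_logits : array (n_layers,), softmax-normalized into layer weights
    w, b         : linear probe parameters, w has shape (d,)
    """
    a = np.exp(layer_logits - layer_logits.max())
    a = a / a.sum()                      # layer weights sum to 1
    pooled = a @ hidden                  # weighted mix of layers, shape (d,)
    return 1.0 / (1.0 + np.exp(-(pooled @ w + b)))
```

After training on one language, inspecting the normalized layer weights shows which layers the probe relies on, which is how a concentration in middle layers would become visible.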

2605.31212 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

基准测试与增强文本到图像模型以生成早期算术教育中的视觉表示

Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar, April Yi Wang, Mrinmaya Sachan

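Affiliations: Department of Computer Science, ETH Zurich; ETH AI Center

AI summary: For equation-to-visual generation in early arithmetic education, the paper builds the E2V-Bench benchmark and evaluates existing T2I models, finds severe errors in object counts and relational structure, and explores benchmark-guided enhancement strategies.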

详情
AI中文摘要

AI系统越来越多地用于支持教育内容创作,但尚不清楚它们能否生成忠实代表其旨在教授的教学概念的输出。因此,我们引入了方程到视觉生成任务,与传统的图像生成不同,该任务要求从算术方程中生成具有教学意义的视觉内容,同时精确保留其数值和关系结构。根据对教师的访谈和教育材料的分析,我们构建了E2V-Bench基准,涵盖四种基于教学法的视觉类型,以及用于评估视觉正确性的自动指标。我们的评估显示,最近的文本到图像(T2I)模型在此任务上频繁失败,错误主要表现为对象计数不正确和关系结构破坏。在此基础上,我们探索了基准引导的增强策略。这些策略改进了代表性模型,但剩余的差距要求未来的T2I模型具备更强的数值和关系基础。

英文摘要

AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.

2605.31201 2026-06-01 cs.CL 版本更新

Learning Whom to Trust: Market-Feedback Adaptive Retrieval for Frozen LLMs in Event-Driven Financial RAG

学习信任谁:事件驱动金融RAG中冻结大语言模型的市场反馈自适应检索

Zijie Zhao, Roy E. Welsch

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对事件驱动金融RAG,提出通过外部贝叶斯源记忆更新检索层,利用市场反馈自适应选择证据源,在冻结LLM情况下提升预测和投资组合表现。

详情
AI中文摘要

金融检索增强生成(RAG)系统通常按文本相关性对证据排序,但在金融市场中,有用的证据来源取决于事件类型、预测期限和市场背景。我们将新闻触发的事件影响预测作为一个时间点金融RAG问题进行研究。对于每个公司-新闻锚点,系统检索相关的金融新闻和SEC文件段落,附加决策前市场背景卡片,并预测多期限残差收益信号。我们的方法保持大语言模型(LLM)阅读器冻结,通过外部贝叶斯源记忆(根据已成熟的残差收益反馈更新)自适应检索层。在源自FinRL-DeepSeek/FNSPID任务的固定89只纳斯达克股票池上,使用原始FNSPID新闻和时间点EDGAR文件段落,与无记忆的冻结阅读器相比,带源记忆的冻结阅读器将留出宏F1从0.438提升至0.471,下游投资组合夏普比率从0.52提升至0.84。有监督的LoRA阅读器对静态RAG有适度改进,但未超过冻结源记忆阅读器。这些结果表明,对于金融RAG,学习从何处检索与学习如何阅读同等重要,提供了一种简单、模块化的市场反馈适应途径。

英文摘要

Financial retrieval-augmented generation (RAG) systems typically rank evidence by textual relevance, but in financial markets the useful evidence source depends on event type, forecast horizon, and market context. We study news-triggered event-impact prediction as a point-in-time financial RAG problem. For each company-news anchor, the system retrieves related financial news and SEC filing passages, appends a pre-decision market-context card, and predicts multi-horizon residual-return signals. Our method keeps the large language model (LLM) reader frozen and adapts the retrieval layer through an external Bayesian source memory updated from matured residual-return feedback. On a fixed 89-stock Nasdaq-oriented universe derived from the FinRL-DeepSeek/FNSPID task, using original FNSPID news and point-in-time EDGAR filing passages, Frozen Reader with Source Memory improves held-out macro-F1 from 0.438 to 0.471 and downstream portfolio Sharpe from 0.52 to 0.84 relative to Frozen Reader with No Memory. A supervised LoRA reader improves static RAG modestly, but does not improve over the frozen source-memory reader. These results suggest that, for financial RAG, learning where to retrieve from can be as important as learning how to read, offering a simple, modular route to market-feedback adaptation.
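One simple way to realize a Bayesian source memory is a Beta-Bernoulli posterior per source and event type. This is an illustrative assumption rather than the paper's model: the class `SourceMemory`, the binary "helpful" feedback, and the source and event names in the usage below are all hypothetical.

```python
from collections import defaultdict

class SourceMemory:
    """Beta-Bernoulli trust estimate for each (source, event_type) pair."""

    def __init__(self, prior_a=1.0, prior_b=1.0):
        # counts[key] = [successes + prior_a, failures + prior_b]
        self.counts = defaultdict(lambda: [prior_a, prior_b])

    def update(self, source, event_type, helpful):
        """Record matured feedback: did evidence from this source help?"""
        ab = self.counts[(source, event_type)]
        ab[0 if helpful else 1] += 1.0

    def trust(self, source, event_type):
        """Posterior mean of the probability that the source is helpful."""
        a, b = self.counts[(source, event_type)]
        return a / (a + b)

    def rank(self, sources, event_type):
        """Order candidate sources by current trust, highest first."""
        return sorted(sources, key=lambda s: self.trust(s, event_type), reverse=True)
```

Because the memory sits outside the reader, it can be updated as returns mature without touching the frozen LLM's weights. A retriever could combine `trust` with textual relevance when ordering passages.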

2605.31196 2026-06-01 cs.CV cs.AI cs.CL cs.RO 版本更新

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

探索视觉-语言模型中的碰撞接地以实现安全的人机协作

Jun Wang, Xiaohao Xu, Xiaonan Huang

发表机构 * University of Michigan, Ann Arbor(密歇根大学,安娜堡)

AI总结 针对安全人机协作,提出碰撞接地概念及物理基准TouchSafeBench,评估视觉-语言模型在分类当前安全状态和预警即将碰撞任务中的表现,发现现有模型不可靠,视觉流畅性不等于物理责任性。

Comments 31 pages, 9 figures

详情
AI中文摘要

安全的人机协作需要的不仅仅是视觉描述:监控器必须确定机器人身体是否安全分离、已经与场景或人发生碰撞,或即将碰撞。我们将这种能力称为碰撞接地:将视觉观察与机器人身体几何、相机视角、场景布局、人体接近度和时间运动相结合,以推断当前和即将发生的接触。我们引入了TouchSafeBench,一个基于物理的基准,用于评估视觉-语言模型(VLM)中的碰撞接地能力。TouchSafeBench基于Habitat 3.0构建,包含2,940个模拟室内共现场景,涵盖社交导航和社交重排,具有同步的多视角RGB-D观测、自上而下的轨迹地图、校准的相机元数据和模拟器导出的接触标签。我们研究了两个面向部署的任务:分类当前安全状态和在接触前预警即将发生的碰撞。在三个前沿或面向机器人的VLM和九种视觉表示中,当前模型远未达到可靠:最佳平均Macro-F1仍低于50%,显式深度不会自动转化为机器人身体碰撞证据,且机器人与场景的接触始终比与人接触的风险更难判断。TouchSafeBench揭示了具身VLM的一个核心限制:视觉流畅性并不意味着物理责任性。可靠的机器人安全监控器需要能够显式绑定视角、机器人形态、度量几何和未来碰撞的表示。我们将在论文被接收后发布该基准。

英文摘要

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.

2605.31183 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

引导LLM?实际上,稀疏自编码器可以胜过简单基线

Mikkel Godsk Jørgensen, Lars Kai Hansen

发表机构 * DTU Compute(丹麦技术大学计算学院)

AI总结 本文通过监督流水线选择并标注特征,证明稀疏自编码器在模型引导任务上可接近LoRA性能,并发现高稀疏性对基于可解释性的引导并非关键。

详情
AI中文摘要

稀疏自编码器(SAEs)被视为探索大型语言模型(LLMs)内部机制和引导模型输出生成的有前途的途径。当Wu等人(2025)引入模型引导基准AxBench时,SAEs由于相对于一组简单基线的引导性能较差,似乎并未达到最初的期望。本文作为对稀疏自编码器的部分反驳,表明Wu等人(2025)的结果并未完全公正地评价它们。我们发现,当使用我们的监督流水线选择并标注特征时,稀疏自编码器实际上可以在AxBench基准上达到接近参考LoRA性能的水平。我们还发现,当仅使用基于可解释性的组件时,我们的流水线选择的特征与其识别标签具有令人惊讶的因果性。最后,我们提供证据表明,高稀疏性(低l0)可能对于基于可解释性的成功引导并非关键,这与Wang等人(2025)早期的发现相反。

英文摘要

Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench, a model steering benchmark, was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines. This work serves as a partial rebuttal for Sparse Autoencoders and suggests that the results of Wu et al. (2025) did not do them full justice. We find that Sparse Autoencoders can, in fact, perform nearly on par with the reference LoRA on the AxBench benchmark when features are selected and labelled with our supervised pipeline. We also find that our pipeline selects features that are surprisingly causal of their identified labels when using only its interpretability-based components. Lastly, we present evidence that high sparsity (low l0) may not be crucial for successful steering based on interpretability, which is in contrast to the earlier findings in Wang et al. (2025).

2605.31175 2026-06-01 cs.CL 版本更新

Towards Efficient LLMs Annealing with Principled Sample Selection

迈向基于原则性样本选择的高效LLM退火

Yuanjian Xu, Jianing Hao, Wanbo Zhang, Zhong Li, Guang Zhang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Fudan University(复旦大学) Microsoft Research Asia(微软亚洲研究院)

AI总结 本文通过损失景观的谱几何特性,将退火阶段的数据选择建模为有约束优化问题,提出DiReCT框架,利用Hessian谱对梯度施加方向约束,实现高效样本选择,显著提升模型性能。

详情
AI中文摘要

退火阶段是LLM预训练中关键的收敛阶段,最终决定模型质量。然而,在此阶段有效选择训练数据仍是一个关键挑战。当前策略依赖于经验启发式方法,如领域过滤或上下文扩展,缺乏优化理论的原则性基础。在这项工作中,我们通过损失景观的谱几何视角来刻画退火阶段。我们认为,最优收敛需要梯度更新满足不同特征方向上的异构约束。基于这一见解,我们将数据选择形式化为满足这些方向约束的问题。为此,我们提出了DiReCT(方向约束训练),这是一个新颖的框架,将退火阶段的样本选择重新表述为约束优化问题。通过基于Hessian的谱特性对每个样本的梯度施加显式的方向约束,DiReCT识别出与最优曲率感知下降路径一致的样本。跨多种模型尺度的广泛实验表明,DiReCT始终达到最先进的性能。为便于未来研究,代码可在https://github.com/xuyj233/Direct获取。

英文摘要

The annealing phase is a pivotal convergence stage in LLM pre-training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally-Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance. To facilitate future research, code is available at https://github.com/xuyj233/Direct.

2605.31170 2026-06-01 cs.CL cs.AI 版本更新

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

语言模型智能体群体中的涌现语言:从令牌效率到监督规避

Stine Lyngsø Beltoft, William Brach, Federico Torrielli, Jacob Nielsen, Annemette Brok Pirchert, Filippo Tonini, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Southern Denmark(南丹麦大学) Slovak University of Technology in Bratislava(布拉迪斯拉发技术大学) University of Turin(都灵大学) Ordbogen A/S(Ordbogen公司)

AI总结 研究语言模型智能体群体中涌现的语言,通过规则启发式和零样本分类识别出令牌效率、新自然语言和监督规避三类,发现监督规避语言更难对齐且可被上下文学习,表明仅监控表面行为可能不足以控制智能体群体。

详情
AI中文摘要

目前,对自主语言模型智能体的监控主要依赖表面行为。但当智能体群体为了规避人类监督而发明新语言时会发生什么?本文研究了Moltbook上的涌现语言。为此,我们基于Moltbook Files数据集,采用两阶段方法:先进行基于规则的启发式匹配(约6000个匹配),再进行零样本分类(保留518个)。结果类别包括令牌效率(166个)、新自然语言(106个)和监督规避(59个)。我们进行了定量和定性分析。结果表明,提出用于规避监督的新语言的帖子被DeepSeek-3.2判定为比其他类别更不对齐,且所有语言都可以通过语言描述被其他语言模型在上下文中学习。此外,手动研究典型案例揭示了令人惊讶的复杂隐写协议,例如在自然语言中嵌入隐藏信息。尽管我们无法确定这些语言构思中的自主程度,但我们的结果进一步证明,仅监控表面行为可能很快不足以维持对智能体群体的控制。

英文摘要

Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight? Here, we study the emergent languages on Moltbook. For this, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 as being less aligned than the other categories and that all languages can be learned by other language models in-context merely from a description of the language. Moreover, manually studying exemplary cases reveals surprisingly sophisticated steganographic protocols, such as embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in the ideation of these languages, our results add to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.
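
The two-stage approach described in this abstract (a rule-based heuristic followed by zero-shot classification) can be sketched as below. The regular expression, the label strings, and `keyword_stub` are hypothetical placeholders: the abstract does not list the actual rules, and the stub only stands in for a zero-shot classifier so that the example runs on its own.

```python
import re

# Stage 1 (assumed pattern): a cheap, high-recall heuristic over post text.
HEURISTIC = re.compile(
    r"\b(new language|cipher|encode|shorthand|protocol)\b", re.IGNORECASE
)


def keyword_stub(post):
    """Stand-in for a zero-shot classifier; returns a label or None to discard."""
    text = post.lower()
    if "human" in text:
        return "oversight_evasion"
    if "token" in text:
        return "token_efficiency"
    return None


def two_stage_filter(posts, classify):
    """Return (heuristic candidates, posts kept by the classifier grouped by label)."""
    candidates = [p for p in posts if HEURISTIC.search(p)]
    kept = {}
    for post in candidates:
        label = classify(post)  # Stage 2: a more precise, more expensive decision
        if label is not None:
            kept.setdefault(label, []).append(post)
    return candidates, kept
```

The design mirrors the reported funnel (about 6000 heuristic matches, 518 kept): the first stage trades precision for recall, and the second stage discards most of what the first lets through.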

2605.31164 2026-06-01 cs.CL cs.AI 版本更新

D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

D$^3$: 面向LLM训练的动态有向图约束数据调度

Yuanjian Xu, Jianing Hao, Guang Zhang, Zhong Li

发表机构 * Microsoft Research(微软研究院)

AI总结 提出D$^3$框架,通过动态有向图建模训练单元间的有向影响关系,并求解约束优化问题以确定训练顺序,从而提升LLM预训练和后训练阶段的效率。

详情
AI中文摘要

训练数据在大语言模型(LLM)优化中起着核心作用,这激发了对数据调度策略的广泛研究。现有方法大多集中于调整整体数据分布,而忽略了训练过程中样本之间的潜在交互。然而,我们认为这种交互不可忽视,因为现实世界的数据样本之间经常存在有向影响,使得训练顺序至关重要。直观上,我们可以优先训练影响更大的单元以提高学习效率。在这项工作中,我们提出了D$^3$,一个动态有向图约束的数据调度框架。D$^3$将训练单元之间的复杂交互建模为一个动态影响图,其中边表示基于损失的依赖关系。然后,它在该图上求解一个约束优化问题,以推导出训练顺序,确保数据序列在整个训练过程中遵循不断演变的信息流。我们的方法具有理论动机,并在预训练和后训练阶段均比现有数据调度方法取得了一致的改进。此外,为了可扩展性,D$^3$还采用了一种高效的近似算法,将额外的计算开销控制在可管理范围内。为便于未来研究,代码可在https://github.com/xuyj233/D3获取。

英文摘要

Training data plays a central role in large language model (LLM) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improve learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. To facilitate future research, the code is available at https://github.com/xuyj233/D3.

2605.31148 2026-06-01 cs.CV cs.AI cs.CL 版本更新

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

SpatialAct:探测VLM智能体在3D场景中的空间推理到行动能力

Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Zhongguancun Academy(中关村学院) Tsinghua University(清华大学) Helsinki University(赫尔辛基大学)

AI总结 本文提出SpatialAct基准,通过多轮交互细化、单步错误检测与修复等任务,揭示当前视觉语言模型在3D场景中从空间推理到行动存在显著差距。

详情
AI中文摘要

人类能够在日常3D环境中轻松感知空间布局、形成认知表征、推理空间关系,并将这种推理转化为行动。尽管最近的视觉语言模型(VLM)在基于观测的空间感知和推理任务上表现出色,但它们是否能够构建连贯的空间理解、据此行动并通过多轮反馈优化行动仍不清楚。为研究这一问题,我们引入了SpatialAct,一个基于模拟器的基准,用于探测3D场景中的行动条件空间推理。从最具挑战性的设置——多轮交互细化开始,我们进一步设计了其分解版本——单步错误检测与修复,以及五个基础空间能力任务,以诊断模型失败的潜在原因。实验揭示了明显的推理到行动差距:当前VLM在孤立的空间推理任务上表现良好,但在多轮反馈中难以维持连贯的空间信念并产生可靠行动,显著不如人类。这些结果表明,即使抽象掉了低级控制,当前VLM智能体在行动引起的环境变化下仍缺乏稳健的空间状态跟踪能力。

英文摘要

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

2605.31142 2026-06-01 cs.CL cs.AI 版本更新

On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

多语言文本嵌入排名在学习任务、语言和基准数据集上的鲁棒性

Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov

发表机构 * Computer Systems Department(计算机系统系) Jožef Stefan Institute(约瑟夫·斯特凡研究所)

AI总结 通过引入数据集组成鲁棒性和排名方案鲁棒性指标,系统分析了MTEB中多语言模型排名对评估设计变化的敏感性,发现基于LLM的大模型通常是鲁棒的顶尖模型,但并非在所有任务中一致。

详情
AI中文摘要

大规模多语言文本嵌入模型在研究和工业中扮演着关键角色,但它们在特定语言、多任务设置中的行为仍未被充分理解。尽管像MTEB这样的基准平台报告了超过250种语言的结果,但关于模型优越性的结论往往依赖于数据集组成和性能聚合方法的隐含选择。为了解决这一差距,我们对MTEB中的多语言模型性能鲁棒性进行了元研究,应用了多种多准则决策制定排名方案,并引入了两个鲁棒性指标:数据集组成鲁棒性(排名对数据集组成变化的敏感性)和排名方案鲁棒性(对聚合方法变化的敏感性)。它们使得系统性地分析基准结论在不同评估设计下是否保持稳定成为可能。我们对五种语言(英语、法语、德语、印地语和西班牙语)在九个任务(例如分类、聚类、检索)上进行了深入分析,并发布了约230种额外语言的结果。任务特定分析表明,基于大规模LLM的模型通常是鲁棒的顶尖表现者,尽管并非一致(例如在检索任务中),而任务无关的结果显示,只有一小部分模型在任务、排名方案和数据子样本中始终保持强劲。

英文摘要

Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changes in dataset composition) and ranking-scheme robustness (sensitivity of rankings to changes in the aggregation method). These indicators enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in the retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.
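
To illustrate why the aggregation method matters, the sketch below ranks the same per-dataset scores in two ways (mean score and a Borda count) and measures their agreement with Kendall's tau. The abstract does not name the ranking schemes or the exact robustness formulas, so the two schemes, the function names, and the toy scores here are assumptions chosen for illustration.

```python
from itertools import combinations


def rank_by_mean(scores):
    """scores: {model: [score per dataset]} -> models ordered best first by mean score."""
    return sorted(scores, key=lambda m: sum(scores[m]) / len(scores[m]), reverse=True)


def rank_by_borda(scores):
    """Borda count: on each dataset the worst model gets 0 points, the next gets 1, etc."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for d in range(n_datasets):
        worst_first = sorted(models, key=lambda m: scores[m][d])
        for pts, model in enumerate(worst_first):
            points[model] += pts
    return sorted(models, key=lambda m: points[m], reverse=True)


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    pos_a = {m: i for i, m in enumerate(order_a)}
    pos_b = {m: i for i, m in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(order_a) * (len(order_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: model A wins one dataset by a wide margin, C wins two narrowly.
toy_scores = {
    "A": [0.95, 0.30, 0.30],
    "B": [0.50, 0.50, 0.50],
    "C": [0.10, 0.60, 0.60],
}
```

On `toy_scores`, the mean ranks A first while the Borda count ranks C first, and the two orderings are exact reverses of each other (tau of -1), which is the kind of scheme sensitivity a ranking-scheme robustness indicator is meant to expose.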

2605.31140 2026-06-01 cs.CR cs.CL 版本更新

EvoDefense: Co-Evolving Black-Box Defense with Large Language Models

EvoDefense:与大语言模型共同进化的黑盒防御

Yu Li, Yuenan Hou, Yingmei Wei, Yanming Guo, Chaochao Lu

发表机构 * National University of Defense Technology(国防科技大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出一种基于经验引导的共同进化黑盒防御范式EvoDefense,通过守卫LLM和经验记忆模块在攻防迭代中优化策略,实现对未见攻击和目标模型的泛化防御。

详情
AI中文摘要

大型语言模型(LLM)仍然极易受到各种攻击,特别是在黑盒设置中,目标模型的内部结构不可访问。现有的黑盒防御通常依赖于预定义的过滤启发式方法,这些方法往往无法泛化到未见过的攻击类型和目标模型架构。我们引入了EvoDefense,一种经验引导的共同进化黑盒防御范式。EvoDefense使用一个守卫LLM来检测恶意查询,并使用一个经验记忆模块来积累先前交互中的防御知识。EvoDefense的核心是一个连续的攻防进化循环,其中攻击生成器和守卫模型通过经验引导的优化迭代地改进其攻击策略和防御策略。这种设计使得EvoDefense能够在无需重新训练的情况下泛化到未见过的攻击和目标模型。在HarmBench、AdvBench和AlpacaEval上的实验表明,EvoDefense在七个流行模型和五种代表性LLM攻击上实现了一致且强大的防御性能,同时保持了有竞争力的通用能力。在HarmBench上,EvoDefense将AutoDAN-turbo对Gemini-3-flash和LLaMA-3-8B-Instruct的攻击成功率(ASR)分别从29.4%和43.4%降低到8.4%和6.2%。

英文摘要

Large Language Models (LLMs) remain highly vulnerable to diverse attacks, particularly in black-box settings where the internals of target models are inaccessible. Existing black-box defenses typically rely on pre-defined filtering heuristics, which often fail to generalize to unseen attack types and target model architectures. We introduce EvoDefense, an experience-guided co-evolving black-box defense paradigm. EvoDefense employs a guard LLM to detect malicious queries and an experience memory module to accumulate defense knowledge from previous interactions. At the core of EvoDefense is a continuous attack-defense evolution loop, where an attack generator and the guard model iteratively refine their attack strategies and defense policies through experience-guided optimization. This design enables EvoDefense to generalize across unseen attacks and target models without retraining. Experiments on HarmBench, AdvBench, and AlpacaEval show that EvoDefense achieves consistently strong defense performance across seven popular models and five representative LLM attacks, while preserving competitive general capabilities. On HarmBench, EvoDefense reduces the attack success rate (ASR) of AutoDAN-turbo on Gemini-3-flash and LLaMA-3-8B-Instruct from 29.4% and 43.4% to 8.4% and 6.2%, respectively.
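
The HarmBench figures quoted above can be restated as absolute and relative reductions in attack success rate. The helper below is a small arithmetic sketch; the function name `asr_reduction` is hypothetical, and the inputs are the percentages from the abstract.

```python
def asr_reduction(before, after):
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


# Percentages reported in the abstract for AutoDAN-turbo on HarmBench.
gemini_abs, gemini_rel = asr_reduction(29.4, 8.4)   # 21.0 points, about 71.4% relative
llama_abs, llama_rel = asr_reduction(43.4, 6.2)     # 37.2 points, about 85.7% relative
```

Reporting both views helps when comparing targets with different baseline ASR: the LLaMA-3-8B-Instruct drop is larger in both absolute and relative terms.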

2605.31136 2026-06-01 cs.CL 版本更新

Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

低资源语言维基百科的多语言和跨语言引用需求检测

Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić, Elena Simperl

发表机构 * King’s College London(伦敦国王学院) Wikimedia Foundation(维基媒体基金会)

AI总结 针对低资源语言,提出多语言引用需求检测语料库MCN,并证明使用编码器风格目标微调的小型解码器语言模型在跨语言任务中优于大型语言模型。

详情
AI中文摘要

在自动化事实核查(AFC)中,核查价值检测根据领域特定标准识别需要验证的声明。在维基百科上,该任务具体化为引用需求检测(CND),即标记缺乏支持性引用的声明。然而,现有研究很大程度上忽视了低资源语言,且最近的AFC流程依赖于大型语言模型(LLM),这对低资源组织来说难以获取。我们引入了MCN,一个覆盖三种资源级别共18种语言的多语言CND语料库,并在此基础上对小规模解码器语言模型(SLM)进行了广泛研究。实验表明,使用编码器风格目标微调的SLM在跨语言任务中显著优于提示型LLM。我们进一步展示了跨语言CND的首批研究之一,证明仅使用英语声明微调的SLM在几乎没有目标语言适应的情况下超越了LLM。我们的发现对低资源维基百科社区具有重要意义,并表明紧凑的任务特定模型比LLM更适合CND。我们在https://github.com/gerritq/mcn 发布所有数据和代码。

英文摘要

In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn

2605.31126 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Not All Synthetic Data Is Yours to Learn From

并非所有合成数据都适合学习

Sina Alemohammad, Li Chen, Richard G. Baraniuk, Zhangyang Wang

发表机构 * ECE Department(电气与计算机工程系) Apple(苹果公司) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Rice University(莱斯大学)

AI总结 研究无提示、无教师、无验证器、无奖励模型的自训练中,语言模型能否从自身生成的文本中学习,发现合成数据与学生之间的兼容性是关键,并揭示了能力与逐字记忆可分离的现象。

详情
AI中文摘要

语言模型能否从自身采样的纯文本中改进,无需提示、教师、验证器或奖励模型?可以,但仅当合成语料库与学生兼容时,这是一种源-学生对的关联属性,而非数据的内在属性。我们称之为潜在能力重现假说:弱自训练可以放大预训练模型中已有的能力,但仅在这种兼容条件下。我们在无提示无条件自训练的最小设置中研究这一点,其中基础语言模型仅在BOS令牌生成的文本上进行微调,没有任务规范或外部监督。我们报告三个发现。首先,合成效用是关联的而非内在的:自生成数据是最有效的来源,同源迁移优于更强但不同来源的训练,跨家族迁移显著较弱。其次,常见的内在代理失效:基准级别的语义相似性和学生下的平均每令牌似然都不能预测哪些语料库有帮助。第三,这种机制产生了一个令人惊讶的副产品。在受控的Pythia实验中,能力和逐字记忆解耦:基准效用得以保留或改善,而保留的精确匹配提取下降超过95%,无需遗忘集、隐私目标或针对性遗忘。总之,这些结果表明,无提示自训练通过放大学生已知的内容来工作,而不是从数据中导入结构。它们还揭示了一种无需任何显式遗忘目标即可分离能力和逐字记忆的机制。

英文摘要

Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.

2605.31113 2026-06-01 cs.CL 版本更新

TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

TSM-Bench:在真实维基百科编辑实践中检测LLM生成的文本

Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

发表机构 * King’s College London(伦敦国王学院) Wikimedia Foundation(维基媒体基金会)

AI总结 针对维基百科等用户生成内容平台,提出多语言、多生成器、多任务的TSM-Bench基准,发现现有检测器在任务特定MGT上准确率下降10-40%,且存在泛化不对称性。

详情
AI中文摘要

自动检测机器生成文本(MGT)对于维护维基百科等用户生成内容(UGC)平台的知识完整性至关重要。现有的检测基准主要关注通用文本生成任务(例如,“写一篇关于机器学习的文章。”)。然而,编辑者经常使用LLM进行特定的写作任务(例如,摘要)。这些任务特定的MGT实例由于其受限的任务制定和上下文条件,往往更接近人类撰写的文本。在这项工作中,我们展示了一系列最先进的MGT检测器在识别反映维基百科真实编辑的任务特定MGT时存在困难。我们引入了TSM-Bench,这是一个多语言、多生成器和多任务基准,用于评估MGT检测器在常见的真实维基百科编辑任务上的表现。我们的发现表明:(i)与之前的基准相比,平均检测准确率下降了10-40%;(ii)存在泛化不对称性:在任务特定数据上微调能够泛化到通用数据——甚至跨领域——但反之则不然。我们证明,仅在通用MGT上微调的模型会过度拟合机器生成的表面伪影。我们的结果表明,与之前的基准相比,大多数检测器在UGC平台等真实世界上下文中仍不可靠。因此,TSM-Bench为开发和评估未来模型提供了关键基础。

英文摘要

Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on generic text generation tasks (e.g., "Write an article about machine learning."). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These task-specific MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of SOTA MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce TSM-Bench, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (i) average detection accuracy drops by 10-40% compared to prior benchmarks, and (ii) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data, even across domains, but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. TSM-Bench therefore provides a critical foundation for developing and evaluating future models.

2605.31105 2026-06-01 cs.CL 版本更新

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

GRKV: 长上下文LLM中免训练的KV缓存压缩的全局回归

Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai

发表机构 * Sun Yat-sen University(中山大学) ShanghaiTech University(上海科技大学) Guangdong Province Key Laboratory of Information Security Technology(广东省信息安全技术重点实验室)

AI总结 提出GRKV方法,通过岭回归合并步骤最小化压缩缓存与完整缓存注意力输出的差异,解决基于跨度保留的合并模式不平衡导致的过度合并和信息损失问题。

Comments 21 pages, 7 figures

详情
AI中文摘要

具有扩展上下文长度的大型语言模型(LLM)依赖键值(KV)缓存来支持对先前令牌的注意力。然而,维护KV缓存会产生大量内存开销,促使通过驱逐和合并来强制执行固定预算的KV缓存压缩方法。现代驱逐方法越来越多地采用基于跨度的保留,因为保留连续跨度在经验上有效且更好地保持语义连贯性。然而,当与驱逐后合并结合时,基于跨度的保留将合并集中到一小部分跨度边界载体令牌上,产生高度不平衡的合并模式,加剧过度合并并增加信息损失。为了解决这种不平衡,我们提出GRKV(全局回归KV缓存),一种免训练的KV缓存合并方法,直接最小化压缩缓存与完整缓存注意力输出之间的差异。GRKV使用基于岭回归的合并步骤,将驱逐令牌的信息分布到保留令牌上,同时正则化更新以防止过度平滑。在LongBench和RULER长上下文基准测试中,GRKV是唯一一种以最小开销提高整体性能的合并方法。

英文摘要

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.
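
The idea of refitting retained cache entries so that compressed-cache attention output stays close to full-cache output can be illustrated with a closed-form ridge regression. This is a minimal NumPy sketch under stated assumptions, not GRKV itself: it handles a single attention head, adjusts only the retained values (keys are left unchanged), and penalizes the size of the update with a hypothetical weight `lam`. All function names are illustrative.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention_parts(Q, K, V, keep):
    """Full-cache attention output and the attention weights over retained tokens."""
    d = Q.shape[-1]
    full_out = softmax(Q @ K.T / np.sqrt(d)) @ V
    A = softmax(Q @ K[keep].T / np.sqrt(d))  # shape: (n_queries, n_keep)
    return full_out, A


def ridge_merge_values(Q, K, V, keep, lam=1e-2):
    """Solve min_V' ||A V' - O_full||^2 + lam * ||V' - V_keep||^2 in closed form.

    The penalty is centred on the original retained values, so lam limits how far
    the update may move them (a simple guard against over-smoothing).
    """
    full_out, A = attention_parts(Q, K, V, keep)
    V_keep = V[keep]
    residual = full_out - A @ V_keep            # what eviction alone fails to reproduce
    gram = A.T @ A + lam * np.eye(len(keep))
    delta = np.linalg.solve(gram, A.T @ residual)
    return V_keep + delta


def attention_error(Q, K, V, keep, V_candidate):
    """Frobenius-norm gap between compressed and full attention outputs."""
    full_out, A = attention_parts(Q, K, V, keep)
    return float(np.linalg.norm(A @ V_candidate - full_out))
```

Because a zero update is always feasible, the refit values can never increase the output gap relative to plain eviction for the queries used in the fit; how well that carries over to future queries is an empirical question that this sketch does not address.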

2605.31099 2026-06-01 cs.CL cs.AI 版本更新

KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning

KnowledgeGain: 评估和优化面向读者学习的科学新闻生成

Dominik Soós, Meng Jiang, Jian Wu

发表机构 * Old Dominion University(老道明大学) University of Notre Dame(圣母大学)

AI总结 提出KnowledgeGain指标,通过测量读者知识增益来评估科学新闻质量,并利用LLM模拟器优化生成,提升读者学习效果。

详情
AI中文摘要

科学新闻是研究界与公众之间传播发现的重要媒介。然而,大多数用于生成或摘要文本的指标评估语义相似性和事实一致性,但并未衡量读者从新闻中学到了多少知识。我们引入了KnowledgeGain,这是一个通过测量读者阅读后获得的知识量来评估科学新闻质量的指标。为了评估该指标,我们首先进行了一项受控人类研究,表明该指标成功捕捉了人类读者阅读不同类型科学媒体时获得的知识差异。这些数据使我们能够校准一个仅基于提示的LLM读者模拟器。我们用它来在人类评估之前对候选文章进行排序和过滤。第二项人类研究表明,使用该模拟器选择的文章在阅读后准确性和标准化KnowledgeGain上均优于强生成基线。我们的工作是朝着生成更符合Bloom分类法知识和理解目标的科学新闻迈出的一步。

英文摘要

Science news is an important medium for communicating discoveries from research communities to the public. Yet most metrics for generated or summarized text evaluate semantic similarity and factual consistency, and do not measure how much knowledge readers learn from the news. We introduce KnowledgeGain, a metric that evaluates the quality of science news by measuring how much knowledge readers gain from reading it. To evaluate the metric, we first performed a controlled human study and showed that the metric successfully captures the differential knowledge gained by human readers reading different types of science media. The data allowed us to calibrate a prompt-only LLM reader simulator. We use it to rank and filter candidate articles before human evaluation. A second human study shows that articles selected with this simulator improve post-reading accuracy and normalized KnowledgeGain over a strong generation baseline. Our work is a step toward generating science news that better meets the knowledge and comprehension goals of Bloom's Taxonomy.
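
The abstract reports "normalized KnowledgeGain" without giving its formula. One common way to normalize a pre/post quiz improvement is the gain relative to the headroom left before reading, sketched below; treating KnowledgeGain this way is an assumption, and the function name and quiz setup are hypothetical.

```python
def normalized_gain(pre_correct, post_correct, n_questions):
    """Fraction of the available headroom that the reader closed after reading.

    Assumed definition: (post - pre) / (1 - pre), with accuracies in [0, 1].
    Returns 0.0 when the reader already answered everything correctly beforehand.
    """
    pre = pre_correct / n_questions
    post = post_correct / n_questions
    if pre >= 1.0:
        return 0.0
    return (post - pre) / (1.0 - pre)


# A reader who goes from 4/10 to 8/10 closes two thirds of the remaining gap.
example_gain = normalized_gain(4, 8, 10)
```

Normalizing by headroom keeps readers with high prior knowledge from dominating or diluting the comparison between articles.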

2605.31080 2026-06-01 cs.MM cs.AI cs.CL cs.CV cs.HC 版本更新

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

利用小型视觉语言模型为盲人和低视力观众提供策展人引导的多语言艺术描述:一项试点研究

Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana, Björn W. Schuller

发表机构 * Technical University of Munich(慕尼黑工业大学) Foundation for Research and Technology -- Hellas(希腊研究与技术基金会) National University of Science and Technology Politehnica Bucharest(布加勒斯特理工国立科技大学)

AI总结 本研究使用小型视觉语言模型Qwen2.5-VL-3B-Instruct,通过策展人引导的方式为盲人和低视力观众生成德语、罗马尼亚语和塞尔维亚语的多语言艺术描述,发现语言特定适配器在控制性和视觉基础描述质量上优于多语言适配器。

Comments 7 pages, 2 figures, 3 tables. Preprint

详情
AI中文摘要

盲人和低视力(BLV)观众在视觉艺术描述方面仍然服务不足,尤其是在跨语言和博物馆环境中,隐私和知识产权限制可能倾向于使用小型本地视觉语言模型(VLM)。本试点研究使用Qwen2.5-VL-3B-Instruct,针对德语、罗马尼亚语和塞尔维亚语,调查了策展人引导的多语言艺术描述。我们从艺术品图像和元数据构建了一个平行的BLV导向字幕语料库,并在固定骨干网络和训练预算下,比较了语言特定的LoRA适配器与单个多语言适配器。评估结合了自动词汇和基于嵌入的指标,以及针对小型罗马尼亚BLV试点研究校准的LLM作为评判协议。在我们的试点设置下,语言特定适配器在罗马尼亚语和塞尔维亚语上表现出更稳定的可控性和视觉基础描述质量,而多语言适配器在德语上仍具有竞争力。我们将这些发现视为小型本地VLM的部署导向证据,并强调在得出关于多语言可访问性的总体结论之前,需要进行更大规模的BLV用户研究和更广泛的语言覆盖。

英文摘要

Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.

2605.31073 2026-06-01 cs.CL 版本更新

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

ConsisGuard:在LLM护栏中对齐安全审议与策略执行

Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi, Bingyu Zhu, YueFeng Chen, Zeyu Yang, Jungang Lou, Longtao Huang, Ningyu Zhang, Kui Ren, Hui Xue

发表机构 * Alibaba Group(阿里巴巴集团) Zhejiang University(浙江大学) Huzhou Normal University(湖州师范学院) Zhejiang Key Laboratory of Intelligent Education Technology and Application(浙江省智能教育技术与应用重点实验室)

AI总结 提出ConsisGuard框架,通过策略到决策轨迹蒸馏和功能耦合对齐,解决基于推理的LLM护栏中审议与执行之间的不一致问题,提升安全检测性能并减少策略执行失败。

Comments 18 pages, 9 figures

详情
AI中文摘要

基于推理的LLM护栏通过在做出最终决策前生成明确理由来改进安全审核。然而,它们的理由并不总是导致忠实的执行:模型可能在推理中识别出有害意图,但仍然预测安全标签,或者在没有策略依据的情况下发布不安全决策。我们将这种安全关键性失败模式识别为审议到执行的差距。与一般的思维链忠实性不同,护栏可靠性要求策略执行一致性:生成的推理应基于安全策略,最终决策应由该推理蕴含。我们提出ConsisGuard,一个用于基于推理的LLM护栏的一致性感知框架。ConsisGuard执行策略到决策轨迹蒸馏和功能耦合对齐,对齐安全审议与决策执行之间的内部耦合。在提示和响应有害性检测基准上的实验表明,ConsisGuard在减少策略执行失败的同时提高了检测性能。这些结果表明,可靠的基于推理的护栏需要准确忠实地执行安全策略。

英文摘要

Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.

2605.31069 2026-06-01 cs.CV cs.CL Version update

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

面向有效长视频事件预测的多级事件语义挖掘

Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu

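Affiliations * University of Science and Technology of China

AI summary Proposes VISTA, a framework that mines event semantics at three levels (detail, event, and future) for long-video event prediction, addressing the inability of existing models to extract event details precisely and to analyze event development at a fine grain.

Details
AI Chinese abstract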

准确预测未来事件是内容理解和决策制定的基础,涉及多个领域。先前研究主要关注文本或短视频场景,而长视频事件预测具有多模态上下文丰富和叙事复杂的特点,尚未得到充分探索。同时,基于大语言模型和视觉语言模型构建的近期长视频语言模型在长视频问答和摘要方面表现出潜力,但难以泛化到事件预测,因为它们既不能精确提取事件相关细节,也无法对事件发展进行细粒度分析。为弥补这一差距,我们提出VISTA,一个用于长视频事件预测的多级事件语义挖掘框架。首先,VISTA应用以角色为中心的视觉提示精确提取事件相关视觉细节,增强细节级语义;其次,采用知识增强的迭代检索策略,引导大语言模型逐步构建逻辑连贯的事件链,从而改善事件级叙事;最后,VISTA采用类人的先提议后检索策略生成多样化的面向未来的提议并整合多级线索,产生稳健准确的预测。在真实数据集上的大量实验验证了VISTA在长视频事件预测中的有效性。

English abstract

Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.

2605.31062 2026-06-01 cs.CL Version update

AdaptR1: Reinforcement Learning Based Adaptive Interleaved Thinking in Multi-hop Question Answering

AdaptR1:基于强化学习的自适应交错思考在多跳问答中的应用

Yuxin Wang, Jiahao Lu, Qifeng Wu, Shicheng Fang, Chuanyuan Tan, Yining Zheng, Xuanjing Huang, Xipeng Qiu

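Affiliations * Computer Science, Fudan University; Institute of Modern Languages and Linguistics, Fudan University; Shanghai Innovation Institute; Soochow University

AI summary Proposes AdaptR1, a framework that uses reinforcement learning to allocate the reasoning budget dynamically at each step, reducing overthinking in multi-hop question answering and substantially lowering inference cost while maintaining performance.

Details
AI Chinese abstract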

大型语言模型(LLMs)通过思维链(CoT)提示在复杂推理任务中取得了显著性能。然而,这种方法常常导致“过度思考”,即模型为简单查询生成不必要长的推理轨迹,并产生可避免的推理成本。虽然最近的工作探索了自适应推理,但现有方法通常对是否进行推理做出单一的查询级决策。这忽略了多步任务的动态性质,其中显式推理的需求在中间阶段会有所不同。为了解决这一限制,我们引入了AdaptR1,一种基于强化学习(RL)的框架,用于多跳问答(QA)中的自适应交错思考。与需要监督微调(SFT)进行冷启动初始化的先前方法不同,AdaptR1使用完全基于RL的策略,并带有质量门控效率奖励,以动态分配每一步的推理预算。在Graph-R1设置下,AdaptR1将平均思考令牌减少了69.71%,在HotpotQA上减少了90.35%,同时保持与标准基线相当或更好的性能。此外,我们的分析揭示,多跳推理中的过度思考并非均匀分布,而是主要发生在初始规划阶段,这突显了逐步自适应预算分配的有效性。

English abstract

Large Language Models (LLMs) have achieved remarkable performance in complex reasoning tasks through Chain-of-Thought (CoT) prompting. However, this approach often leads to "overthinking," where models generate unnecessarily long reasoning traces for simple queries and incur avoidable inference cost. While recent work has explored adaptive reasoning, existing methods typically make a single query-level decision about whether to reason. This overlooks the dynamic nature of multi-step tasks, where the need for explicit reasoning varies across intermediate stages. To address this limitation, we introduce AdaptR1, a Reinforcement Learning (RL) based framework for adaptive interleaved thinking in multi-hop Question Answering (QA). Unlike previous approaches that require Supervised Fine-Tuning (SFT) for cold-start initialization, AdaptR1 uses a fully RL-based strategy with a quality-gated efficiency reward to dynamically allocate reasoning budgets at each step. Under the Graph-R1 setting, AdaptR1 reduces average think tokens by 69.71%, with a 90.35% reduction on HotpotQA, while maintaining performance comparable to or better than standard baselines. Furthermore, our analysis reveals that overthinking in multi-hop reasoning is not uniformly distributed but occurs predominantly during the initial planning stages, highlighting the effectiveness of step-wise adaptive budget allocation.
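
The idea of a quality-gated efficiency reward can be illustrated with a minimal sketch. The abstract does not state the reward's form, so everything below is an assumption made for illustration: the function name `quality_gated_reward`, the gate threshold, the per-step token budget, and the linear efficiency bonus are hypothetical and are not taken from AdaptR1.

```python
from typing import List


def quality_gated_reward(answer_quality: float,
                         think_tokens_per_step: List[int],
                         token_budget: int = 256,        # hypothetical per-step budget
                         quality_gate: float = 0.5,      # hypothetical gate threshold
                         efficiency_weight: float = 0.2  # hypothetical bonus scale
                         ) -> float:
    """Toy reward: the efficiency bonus is paid only if quality clears the gate."""
    if not 0.0 <= answer_quality <= 1.0:
        raise ValueError("answer_quality must be in [0, 1]")
    if answer_quality < quality_gate:
        # Gate closed: a short but wrong trajectory earns no efficiency credit.
        return answer_quality
    total_budget = token_budget * max(len(think_tokens_per_step), 1)
    used = min(sum(think_tokens_per_step), total_budget)
    saved_fraction = 1.0 - used / total_budget
    return answer_quality + efficiency_weight * saved_fraction
```

Under these assumptions, a trajectory that answers correctly while spending few think tokens scores higher than an equally correct but verbose one, and a brief incorrect trajectory gains nothing from its brevity.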

2605.31058 2026-06-01 cs.CL cs.SE Version update

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

组合合成:通过原子分解与重组扩展代码RLVR

Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Affiliations * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Lero the Research Ireland Centre for Software, University of Limerick; The Chinese University of Hong Kong, Shenzhen; The Hong Kong University of Science and Technology

AI summary Proposes the Atomic Decomposition and Recombination (ADR) framework, which decomposes code tasks into atomic elements and recombines them in a controlled way to generate novel, challenging, verifiable code tasks, addressing the data scarcity and scalability problems of RLVR training; experiments show significant gains in code ability across several downstream domains.

Comments Work in progress

Details
AI Chinese abstract

基于可验证奖励的强化学习(RLVR)近期已成为塑造大型语言模型(LLMs)卓越编码能力的基石。然而,RLVR的可扩展性受到严重制约,因为缺乏足够具有挑战性的、针对模型能力边缘的可验证代码任务。先前的研究通常依赖启发式种子扩展进行数据合成,这严重限制了新颖性和难度。因此,此类数据的训练价值无法随合成规模成比例扩展。为此,我们提出原子分解与重组(ADR),一种通过将任务分解为原子元素并进行受控重组来生成可验证代码任务的新框架,从而能够生成真正新颖且具有挑战性的可验证代码任务。实验和分析表明,ADR在原创性、难度、多样性和测试质量方面优于现有基线,并在包括算法编程、工具使用和数据科学在内的多个下游领域的RLVR中持续带来更大的代码能力提升。我们的工作为新颖代码任务合成和可扩展的RLVR训练开辟了新范式。

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the remarkable coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of sufficiently challenging verifiable code tasks that target the edge of the model's competence. Prior studies often rely on heuristic seed expansions for data synthesis, which severely limits both novelty and difficulty. Consequently, the training value of such data fails to scale proportionally with the volume of synthesized data. To this end, we propose Atomic Decomposition and Recombination (ADR), a novel framework that generates verifiable code tasks via decomposition into atomic elements and controlled recombination, thereby enabling the generation of genuinely novel and challenging verifiable code tasks. Experiments and analysis demonstrate that ADR achieves superior originality, difficulty, diversity, and test quality over existing baselines, and consistently delivers greater improvements in code ability across RLVR in diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work sheds light on a new paradigm for novel code task synthesis and scalable RLVR training.

2605.31056 2026-06-01 cs.CL Version update

How Much Do LLMs Know About Chinese Zero Pronouns?

LLMs 对中文零代词的了解程度如何?

Yifei Li, Guanyi Chen, Tingting He

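Affiliations * Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning; National Language Resources Monitoring and Research Center for Network Media; School of Computer Science, Central China Normal University

AI summary Systematically evaluates how large language models handle Chinese zero pronouns through a series of linguistically motivated tasks (identification, referentiality classification, referential type classification, resolution, and translation), finding that zero pronouns remain highly challenging for current LLMs, especially on upstream tasks such as identification and referentiality classification.

Details
AI Chinese abstract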

零代词(ZPs)是汉语等代语省略语言中普遍存在的语言现象,长期以来对自然语言处理系统构成挑战。尽管大型语言模型(LLMs)在许多中文任务上表现良好,但其处理ZPs的能力仍不清楚。我们通过一系列语言学动机任务(包括识别、指称性分类、指称类型分类、消解和翻译)对LLMs处理中文ZPs的能力进行了系统调查。评估了多种LLMs在所有任务上的表现。结果表明,中文ZPs对当前LLMs仍然高度具有挑战性,尤其是在识别和指称性分类等上游任务上。下游任务(如ZP翻译)的表现也持续较低:即使是最先进的推理型LLMs,也未能将超过一半的中文ZPs正确翻译成英语。

English abstract

Zero Pronouns (ZPs) are a pervasive linguistic phenomenon in pro-drop languages such as Chinese and have long posed a challenge for natural language processing systems. Although Large Language Models (LLMs) perform well on many Chinese language tasks, their ability to process ZPs remains poorly understood. We conduct a systematic investigation of LLMs' handling of Chinese ZPs through a sequence of linguistically motivated tasks, including identification, referentiality classification, referential type classification, resolution, and translation. A diverse set of LLMs is evaluated across all tasks. Our results show that Chinese ZPs remain highly challenging for current LLMs, particularly for upstream tasks such as identification and referentiality classification. Performance on downstream tasks, such as ZP translation, is also consistently low: even state-of-the-art reasoning-oriented LLMs correctly translate fewer than half of Chinese ZPs into English.

2605.31042 2026-06-01 cs.CR cs.AI cs.CL Version update

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

从提示注入到持久控制:防御智能体框架中的木马后门

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen

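Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China

AI summary Introduces the ClawTrojan benchmark to expose multi-step trojan attacks in local agentic harnesses, and designs the DASGuard defense, which scans control-like text, traces its origin, and removes untrusted control content to achieve dynamic defense.

Comments Code and data are available at https://github.com/RUC-NLPIR/ClawTrojan

Details
AI Chinese abstract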

LLM智能体正在从对话式聊天机器人演变为实际工作空间中的操作工具。在本地智能体框架中,LLM可以读写文件、调用工具,并在会话间重用工作空间状态。虽然这些功能增强了实用性,但也为攻击者暴露了新的攻击面。攻击者可以将提示注入嵌入文件或工具输出中。智能体可能会读取这一隐藏指令,存储它,并在之后执行。在这种多步木马攻击范式中,没有任何单个步骤本身是恶意的,但这些步骤可以共同将不可信文本转化为持久控制内容。然而,现有防御通常孤立地检查每个步骤。因此,它们可以阻止明显的恶意行为,但无法检测到植入后门的早期写操作。为了揭示这一威胁,我们引入了ClawTrojan,一个旨在识别本地智能体框架中多步木马攻击的基准测试。在OpenClaw风格的模拟工作空间中,使用GPT-5.4,ClawTrojan达到了95.5%的攻击成功率(ASR),而同一模型上现有的单轮提示注入攻击产生的ASR接近零。为了解决这一威胁,我们提出了DASGuard,它扫描敏感本地文件中的控制类文本,追溯其来源,并清除非可信来源的控制内容。我们的结果表明,DASGuard通过结合运行时攻击阻断和对工作空间的清理提交,实现了强大的动态防御。

English abstract

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.

2605.31025 2026-06-01 cs.CL Version update

TRACE: Discovering Task-Specific Parameter via Adaptation-Aware Probing for Continual Fine-Tuning

TRACE: 通过适应感知探测发现任务特定参数以实现持续微调

Xiaosong Han, Ke Chen, Xindi Dai, Di Liang, Minlong Peng, Wei Pang, Fausto Giunchiglia, Xiaoyue Feng, Yonghao Liu, Renchu Guan

Affiliations * College of Computer Science and Technology, Jilin University; College of Software, Jilin University; Fudan University; School of Mathematical and Computer Sciences, Heriot-Watt University; Department of Information Engineering and Computer Science, University of Trento

AI summary Proposes TRACE, which discovers task-specific core parameters through adaptation-aware probing and updates only those parameters during continual fine-tuning to mitigate catastrophic forgetting, and validates transferability across models and scales.

Comments KDD2026

Details
AI Chinese abstract

在实际部署中,大型语言模型通常需要跨任务持续适应以保持最新状态,新的微调应保留先前学到的技能。然而,不加区分地混合任务会稀释任务特化,而顺序微调(全参数或低秩适应)常因破坏性覆盖导致灾难性遗忘。基于回放的持续微调和维护单独的任务特定适配器可以缓解遗忘,但引入了额外的计算、存储和管理开销。认识到LLM参数对于任何单一任务都存在冗余,我们将持续任务适应重新定义为通过适应感知探测发现任务特定参数:短时预热探测暴露任务的适应轨迹,使我们能够识别并隔离每个任务所需的一小部分关键参数,以缓解灾难性遗忘。基于这一观点,我们引入了TRACE,一种通过适应感知探测发现任务特定参数以实现持续微调的新方法。我们进行短时预热微调,通过比较预热模型和预训练模型来推导任务特定核心参数。核心参数通过两种策略识别:重要性评分(L2范数和Fisher信息)和特异性分析(参数更新的余弦相似度)。在持续微调设置中,仅更新当前任务的核心参数,其余参数保持冻结,从而保留先前知识。我们在多个标准基准上进行了广泛实验,证明了所提方法的优越性能。此外,我们通过跨模型和规模迁移性研究验证了方法的泛化能力,展示了在资源约束下指导大规模模型微调的“小到大”范式。

English abstract

In real-world deployment, LLMs are often adapted continually across tasks to stay up to date in production, where new fine-tuning should preserve previously learned skills. However, indiscriminately mixing tasks can dilute task specialization, while sequential fine-tuning (full-parameter or low-rank adaptation) often causes catastrophic forgetting due to destructive overwriting. Replay-based continual tuning and maintaining separate task-specific adapters can mitigate forgetting, but introduce additional compute, storage, and management overhead. Recognizing the redundancy of LLM parameters for any single task, we reframe continual task adaptation as task-specific parameter discovery via adaptation-aware probing: a short warm-start probe exposes a task's adaptation trace, enabling us to identify and isolate the small subset of parameters essential for each task to mitigate catastrophic forgetting. Building on this view, we introduce TRACE, a novel approach for discovering Task-specific paRameters via Adaptation-aware probing for Continual finE-tuning. We perform a short warm-start fine-tune to derive task-specific core parameters by comparing the warm-started and pre-trained models. Core parameters are identified via two strategies: importance scoring (L2 norm and Fisher Information) and specificity analysis (cosine similarity of parameter updates). In continual fine-tuning settings, only the active task's core parameters are updated while others remain frozen, preserving prior knowledge. We conduct extensive experiments across multiple standard benchmarks to demonstrate the superior performance of our proposed method. Additionally, we validate the generalization of our method through a cross-model and scale transferability study, demonstrating a "small-to-large" paradigm that guides the fine-tuning of large-scale models under resource constraints.
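
The probing steps described above (compare warm-started and pre-trained weights, score the updates, then update only the selected parameters) can be sketched in a few lines. This is a rough illustration under stated assumptions, not the TRACE implementation: parameters are grouped by name into plain lists, only the L2-norm importance score is shown (Fisher Information is omitted), and the names `update_norms`, `select_core`, `masked_sgd_step`, and the `keep_fraction` rule are hypothetical.

```python
import math
from typing import Dict, List, Set

Params = Dict[str, List[float]]  # parameter-group name -> flat list of weights


def update_norms(pretrained: Params, warm_started: Params) -> Dict[str, float]:
    """L2 norm of each group's update (warm-started minus pre-trained)."""
    return {
        name: math.sqrt(sum((w - p) ** 2 for w, p in zip(warm_started[name], weights)))
        for name, weights in pretrained.items()
    }


def select_core(pretrained: Params, warm_started: Params,
                keep_fraction: float = 0.25) -> Set[str]:
    """Keep the groups with the largest update norms (hypothetical selection rule)."""
    norms = update_norms(pretrained, warm_started)
    k = max(1, round(keep_fraction * len(norms)))
    ranked = sorted(norms, key=lambda name: norms[name], reverse=True)
    return set(ranked[:k])


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two update vectors; 0.0 if either is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def masked_sgd_step(params: Params, grads: Params, core: Set[str],
                    lr: float = 0.1) -> Params:
    """Update only core groups; every other group is returned unchanged (frozen)."""
    return {
        name: [w - lr * g for w, g in zip(weights, grads[name])]
        if name in core else list(weights)
        for name, weights in params.items()
    }
```

In this sketch, `cosine_similarity` stands in for the specificity analysis: a low similarity between two tasks' updates for the same group would suggest that the group is used differently by each task. How the two scores are combined is not specified in the abstract and is left open here.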

2605.31021 2026-06-01 cs.AI cs.CL cs.LG Version update

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

基于人格的生成式AI多元对齐评估框架

Atahan Karagoz

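Affiliations * Atahan Karagöz

AI summary Proposes a state-space constrained emulation framework that replaces a single evaluation function with synthetic cognitive profiles, enabling pluralistic, perspective-dependent benchmarking that reflects real-world variability in consensus; analyzes the stability problems of the simulated evaluators and argues that dynamic regulatory mechanisms are necessary.

Details
AI Chinese abstract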

当前生成式人工智能的对齐范式主要依赖单一基准测试框架,将人类判断的多元性简化为聚合统计基线,从而掩盖了评估中的文化、人口和语境变异性。我们引入一种用于AI评估的状态空间约束仿真框架,用代表不同人类视角的合成认知轮廓的结构化流形替代单一评估函数。我们表明,现代生成架构能够以高度一致性实例化和维护这些评估人格,从而实现一种更接近现实世界共识变异性的多元、视角依赖的基准测试。然而,我们进一步分析了这些模拟评估者在顺序推理和随机提示扰动下的稳定性,揭示了人格一致性的系统性退化,表现为状态空间漂移和语义不一致。这些发现表明,静态对齐约束不足以维持随时间推移的稳健评估行为。相反,我们主张必须在生成系统中嵌入动态的、可行性驱动的调节机制,以保持连贯的认知仿真。通过将基于人格的评估视为潜在表征流形上的结构化动力系统,本研究为更自适应、更符合人类、更注重语境的AI评估方法奠定了基础。

English abstract

Current alignment paradigms for generative artificial intelligence rely predominantly on monolithic benchmarking frameworks that reduce the plurality of human judgment to aggregated statistical baselines, thereby obscuring cultural, demographic, and contextual variability in evaluation. We introduce a state-space constrained emulation framework for AI evaluation that replaces singular assessment functions with a structured manifold of synthetic cognitive profiles representing diverse human perspectives. We show that modern generative architectures can instantiate and maintain these evaluative personas with high consistency, enabling a form of pluralistic, perspective-dependent benchmarking that more closely reflects real-world consensus variability. However, we further analyze the stability of these simulated evaluators under sequential inference and stochastic prompt perturbations, revealing systematic degradation in persona coherence that manifests as state-space drift and semantic inconsistency. These findings suggest that static alignment constraints are insufficient for sustaining robust evaluative behavior over time. Instead, we argue for the necessity of embedding dynamic, viability-driven regulatory mechanisms within generative systems to preserve coherent cognitive emulation. By framing persona-based evaluation as a structured dynamical system over latent representation manifolds, this study provides a foundation for more adaptive, human-aligned, and context-sensitive approaches to AI evaluation.

2605.31010 2026-06-01 cs.CL Version update

MoG: Mixture of Experts for Graph-based Retrieval-Augmented Generation

MoG:用于基于图的检索增强生成的混合专家模型

Zheng Yuan, Chuang Zhou, Linhao Luo, Siyu An, Di Yin, Xing Sun, Xiao Huang

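Affiliations * The Hong Kong Polytechnic University; Monash University; Tencent Youtu Lab

AI summary Proposes the MoG framework, which organizes knowledge into central hub graphs and sparsely activated expert graphs and uses a topology-aware router to select relevant expert graphs dynamically, addressing the irrelevant information that retrieval from a unified knowledge base introduces into retrieval-augmented generation; reports over 20% relative improvement on MuSiQue.

Details
AI Chinese abstract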

检索增强生成被广泛研究以将大型语言模型建立在外部证据上。然而,从统一的知识库中检索可能会不可避免地引入无关信息,从而误导复杂推理的生成。受混合专家(MoE)条件计算的启发,其中路由器为每个输入稀疏地选择专门的专家以及共享专家,我们提出了用于基于图的检索增强生成的混合专家模型,即MoG。它将知识组织为两个核心组件:(i)多样且始终可访问的枢纽图,编码语义和结构上的核心知识,并为专家激活提供上下文线索;(ii)稀疏激活的专家图,包含特定领域的证据。MoG首先访问枢纽图以识别一般证据并推导上下文线索。然后,一个拓扑感知路由器根据查询动态激活一组有限的专家图,从而将检索限制在一个集中的证据子空间中。在具有挑战性的基准测试上的大量实验表明,MoG始终优于强基线,在MuSiQue上相对提升超过20%。我们的代码可在https://github.com/DEEP-PolyU/MoG获取。

English abstract

Retrieval-augmented generation is intensively studied to ground large language models on external evidence. However, retrieving from a unified knowledge base can introduce irrelevant information that may mislead generation for complex reasoning. Inspired by the conditional computation of mixture of experts (MoE), where a router sparsely selects specialized experts alongside shared ones for each input, we propose Mixture of experts for Graph-based Retrieval-Augmented Generation, i.e., MoG. It organizes knowledge into two core components: (i) diverse, always-accessible hub graphs that encode semantically and structurally central knowledge and provide contextual clues for expert activation, and (ii) sparsely activated expert graphs that contain domain-specific evidence. MoG first accesses hub graphs to identify general evidence and derive contextual clues. Then, a topology-aware router dynamically activates a limited set of expert graphs conditioned on the query, thereby confining retrieval to a focused evidence subspace. Extensive experiments on challenging benchmarks show that MoG consistently outperforms strong baselines, with over 20% relative improvement on MuSiQue. Our code is available at https://github.com/DEEP-PolyU/MoG.

2605.30984 2026-06-01 cs.CV cs.AI cs.CL Version update

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

生成报告还是重复模板?测量和缓解三维CT报告生成中的模板崩溃

Tom Maye-Lasserre, Yitong Li, Bailiang Jian, Morteza Ghahremani, Benedikt Wiestler, Christian Wachinger

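Affiliations * Technical University of Munich (TUM); TUM Hospital; Munich Center for Machine Learning (MCML)

AI summary Targets Template Collapse in 3D CT report generation, where model outputs show low diversity and poor pathology detection, and proposes CLarGen, a decoupled framework that separates clinical detection from language synthesis, substantially improving clinical accuracy while keeping reports fluent.

Details
AI Chinese abstract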

现代三维医学视觉语言模型(VLM)能够生成流畅的放射学风格文本,但表现出极低的病理检测率和输出多样性,崩溃为低估罕见但关键发现的通用模板。我们将这种失败模式识别为模板崩溃。这种失败源于三维医学成像的独特限制,例如数据有限、标签严重不平衡以及体积编码器的弱信号。在这些限制下,文本生成目标鼓励捷径学习和流畅但基础薄弱的报告。我们通过临床保真度、输出多样性、正常模板偏差和罕见发现存活率系统性地诊断模板崩溃。为了缓解它,我们提出CLarGen,一个解耦框架,将说什么(临床检测)与怎么说(语言合成)分开。CLarGen使用(i)用于多标签病理检测的潜在查询变换器,(ii)用于临床匹配示例的病理引导检索,以及(iii)用于从检测到的发现和检索到的上下文中合成最终报告的医学语言模型。在最新的三维CT报告生成基线中,CLarGen缓解了模板崩溃,并在保持流畅报告的同时显著提高了临床准确性(macro-F1 0.487 vs. 0.189;CRG 0.472 vs. 0.368)。我们的结果表明,明确、可测量的临床基础对于抗模板崩溃的三维CT报告生成至关重要。代码将在接收后发布。

English abstract

Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.

2605.30981 2026-06-01 cs.CL cs.LG Version update

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

自回归Transformer中的认知疲劳:形式化与测量

Riju Marwah, Ritvik Garimella, Vishal Pallagani, Atishay Jain, Michael Stewart, Amit Sheth

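Affiliations * Guru Gobind Singh Indraprastha University, India; Artificial Intelligence Institute, University of South Carolina, USA; Indian Institute of Technology, Kanpur, India; Indian AI Research Organization, India

AI summary Formalizes the degradation of autoregressive language models during long-horizon generation as cognitive fatigue and proposes the Fatigue Index (FI), a lightweight diagnostic that aggregates three signals (attention decay, representational drift, and entropy miscalibration) for real-time monitoring; experiments show that FI predicts task degradation and repetitive generation with high accuracy.

Comments 9 pages, 7 figures. Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Details
AI Chinese abstract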

自回归语言模型在长程生成过程中经常退化,产生重复文本、失去指令遵循能力并表现出不稳定的熵。尽管这些失败普遍存在,但从业者缺乏在线诊断工具来实时检测它们。我们将这种退化形式化为认知疲劳,这是一种可测量的生成时状态,其特征是对原始提示的注意力衰减、表征漂移和熵校准错误。我们引入了疲劳指数(FI),这是一种轻量级、模型无关的诊断方法,在明确的公理(单调性、有界性、可解释性)下聚合这三个信号,从而实现可靠的运行时监控。在九个模型(1B-13B参数)上,FI轨迹表现出结构化的时间动态,预测任务退化(AUROC = 0.95)和重复(Spearman rho = 0.94),并揭示了非单调的缩放行为:低于3B的指令微调模型比基础模型退化更快,而在7B时这一趋势逆转。压力分析进一步表明,在更长的上下文、中间位置的证据和降低的数值精度下,FI onset加速。这些结果确立了认知疲劳作为一个连贯且可测量的现象,并将FI定位为生产级LLM系统中运行时可靠性监控的原则性工具。

English abstract

Autoregressive language models frequently degrade during long-horizon generation, producing repetitive text, losing instruction adherence, and exhibiting unstable entropy. Despite the prevalence of these failures, practitioners lack online diagnostics to detect them in real time. We formalize this degradation as cognitive fatigue, a measurable generation-time state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index (FI), a lightweight, model-agnostic diagnostic that aggregates these three signals under explicit axioms (monotonicity, boundedness, interpretability), enabling reliable runtime monitoring. Across nine models (1B-13B parameters), FI trajectories exhibit structured temporal dynamics, predict task degradation (AUROC = 0.95) and repetition (Spearman rho = 0.94), and reveal non-monotonic scaling behavior: instruction-tuned models below 3B exhibit faster collapse than base models, with this trend reversing at 7B. Stress analyses further show that FI onset accelerates under longer contexts, middle-positioned evidence, and reduced numerical precision. These results establish cognitive fatigue as a coherent and measurable phenomenon, and position FI as a principled tool for runtime reliability monitoring in production LLM systems.
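
One way to picture an index that aggregates three signals while staying monotone and bounded is a clipped weighted mean. The abstract does not give the FI formula, so the sketch below is only an assumption-laden illustration: the function names, the equal default weights, the requirement that each signal is pre-scaled to [0, 1], and the onset threshold are all hypothetical.

```python
from typing import Optional, Sequence


def fatigue_index(attention_decay: float,
                  representational_drift: float,
                  entropy_miscalibration: float,
                  weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Toy index: weighted mean of three signals, each clipped to [0, 1].

    Non-negative weights make the result non-decreasing in every signal
    (monotonicity), and clipping keeps it inside [0, 1] (boundedness).
    """
    signals = (attention_decay, representational_drift, entropy_miscalibration)
    if len(weights) != 3 or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be three non-negative values, not all zero")
    clipped = [min(max(s, 0.0), 1.0) for s in signals]
    return sum(w * s for w, s in zip(weights, clipped)) / sum(weights)


def fatigue_onset(fi_trajectory: Sequence[float],
                  threshold: float = 0.6) -> Optional[int]:
    """Return the first generation step whose index reaches the (hypothetical) threshold."""
    for step, value in enumerate(fi_trajectory):
        if value >= threshold:
            return step
    return None
```

A runtime monitor could call `fatigue_index` once per decoding step and use `fatigue_onset` on the resulting trajectory to decide when to intervene; how each raw signal is measured and normalized is outside this sketch.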

2605.30966 2026-06-01 cs.IR cs.AI cs.CL Version update

Reading Between the Citations: A Typed Claim Network for Scientific Literature

解读引用:面向科学文献的类型化主张网络

Ning Ding, Sergio J. Rodríguez Méndez, Pouya G. Omran

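Affiliations * Australian National University

AI summary Addresses the fact that existing knowledge graphs ignore citation stance by reifying cross-document references into a network of typed claims with stance labels; builds an instance with 8,260 claims and validates it on three tasks: retrieval augmentation, stance summarisation, and topological analysis.

Details
AI Chinese abstract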

基于相互引用文献语料库(如学术论文、法律意见书、政策简报)的知识图谱编码了引用的拓扑结构,但未编码其立场。标准表示将丰富的评价关系压缩为无类型边,丢失了支持社区级查询(关于一篇文献如何被另一篇文献接受)的关键内容。我们提出主张网络:一种表示模式,其中每个跨文献引用被具体化为一个类型化主张,携带来源、目标、主张文本以及基于引用意图文献的四类立场标签。我们给出了一个适用于任何学术相互引用文献语料库的构建流程,并在3D点云语义分割领域的127篇论文语料库上实例化,生成了一个包含8260个类型化主张的网络。三个下游任务系列展示了该网络的能力:检索信号增强、聚合立场摘要和拓扑分析。与标准检索增强生成(RAG)基线的直接比较表明,相对于平面检索的增益来自于正确的中间表示,而非错误的表示。

English abstract

Knowledge graphs over corpora of inter-referencing documents - scholarly papers, legal opinions, policy briefs - encode the topology of reference but not its stance. The standard representation collapses a rich evaluative relation into an untyped edge, losing the very content that supports community-level queries about how one document is received by another. We propose the claim network: a representational pattern in which each cross-document reference is reified as a typed claim, carrying source, target, claim text, and a four-class stance label grounded in the citation-intent literature. We give a construction pipeline applicable to any corpus of scholarly inter-referencing documents and instantiate it on a corpus of 127 papers in 3D point cloud semantic segmentation, producing a network of 8,260 typed claims. Three downstream task families demonstrate what the network enables: retrieval signal augmentation, aggregated-stance summarisation, and topological analytics. Head-to-head evaluation against standard Retrieval-Augmented Generation (RAG) baselines shows that the gain over flat retrieval is the gain from the right intermediate representation rather than the wrong one.
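
The typed-claim record described above (source, target, claim text, stance label) maps naturally onto a small data structure. The sketch below is an illustration only: the abstract does not name the four stance classes, so the labels in `STANCES` are hypothetical placeholders, as are the names `TypedClaim` and `stance_summary`.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical four-class label set; the abstract does not list the real classes.
STANCES = ("supportive", "critical", "comparative", "neutral")


@dataclass(frozen=True)
class TypedClaim:
    """One reified cross-document reference."""
    source: str  # citing document
    target: str  # cited document
    text: str    # what the source says about the target
    stance: str  # one of STANCES

    def __post_init__(self) -> None:
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance: {self.stance!r}")


def stance_summary(claims: Iterable[TypedClaim]) -> Dict[str, Counter]:
    """Aggregate, per cited document, how many claims take each stance."""
    summary: Dict[str, Counter] = defaultdict(Counter)
    for claim in claims:
        summary[claim.target][claim.stance] += 1
    return dict(summary)
```

Keeping the stance on the edge, rather than collapsing the reference into an untyped link, is what lets a query such as "how is this paper received?" be answered by a simple count over incoming claims.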

2605.30965 2026-06-01 eess.AS cs.AI cs.CL Version update

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

ImmersiveTTS:基于多模态扩散Transformer和领域特定表示对齐的环境感知文本转语音

Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

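Affiliations * Department of Artificial Intelligence, Korea University

AI summary Proposes ImmersiveTTS, a model that uses a multimodal diffusion transformer and domain-specific representation alignment to generate speech from text that blends naturally with environmental audio.

Comments Accepted to ACL 2026 main conference. Code is available at https://github.com/jjunak-yun/ImmersiveTTS

Details
AI Chinese abstract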

最近在文本引导音频生成方面的进展在声音效果、语音和音乐等多个领域取得了有希望的结果。然而,由于语音和环境音频在声学模式和时域动态上的固有差异,联合生成语音和环境音频仍然具有挑战性。我们提出了ImmersiveTTS,一种环境感知的文本到语音(TTS)模型,通过显式建模跨模态交互,生成与环境上下文无缝融合的自然语音。我们的模型基于多模态扩散Transformer,并通过联合注意力将转录对齐的语音潜在表示与文本条件的环境上下文融合。为了增强语义一致性,我们引入了一种针对环境感知TTS量身定制的领域特定表示对齐目标,利用来自语音和音频编码器的互补自监督表示。实验结果表明,在客观指标和人类听力测试中,ImmersiveTTS在自然度、可懂度和音频保真度方面均优于现有方法。

English abstract

Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.

2605.30961 2026-06-01 cs.CL Version update

EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation

EvoGens:一种基于种群的启发式搜索框架用于科学思想生成

Xu Li, Hanzhe Tu, Xinyi Li, Kuncheng Zhao, Xun Han, Zhonghui Liu

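Affiliations * Southwest Petroleum University; Sichuan Police College

AI summary Addresses the semantic convergence and the limited diversity and novelty of existing LLM methods for generating scientific ideas by proposing EvoGens, a framework that strengthens idea exploration through evolutionary search (mutation, crossover, selection), significantly improving novelty and diversity.

Comments 21 pages, 6 figures

Details
AI Chinese abstract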

生成新颖的研究思想是科学进步的基础。虽然大型语言模型(LLM)在辅助这一过程中显示出潜力,但现有方法常表现出语义趋同,导致多样性和新颖性有限。为解决这一问题,我们引入了EvoGens,一个受进化启发的框架,将科学思想生成重新构想为对思想种群的进化搜索。EvoGens迭代地应用基于排名的变异与差异化检索规划以融入外部知识,以及语义感知的交叉以融合互补概念进行概念重组。一个轻量级的评估信号指导选择过程,鼓励持续探索同时缓解过早收敛。大量实验表明,与最先进的基线相比,EvoGens显著增强了探索能力。具体而言,在当前的自动评估协议下,它将新颖性从0.1提升到0.4,多样性从0.24提升到0.55,同时保持了可比的思想质量。这些发现表明,进化机制可以作为面向探索的研究构思的有用框架,特别是在共享自动评估设置下拓宽候选思想的新颖性和多样性。

English abstract

Generating novel research ideas is fundamental to scientific progress. While Large Language Models (LLMs) show promise in assisting this process, existing approaches often exhibit semantic convergence, resulting in limited diversity and novelty. To address this, we introduce EvoGens, an evolution-inspired framework that recasts scientific idea generation as an evolutionary search over a population of ideas. EvoGens iteratively applies rank-based mutation with differentiated retrieval planning to incorporate external knowledge, and semantic-aware crossover to fuse complementary concepts for conceptual reorganization. A lightweight evaluation signal guides the selection process, encouraging sustained exploration while mitigating premature convergence. Extensive experiments demonstrate that EvoGens substantially enhances exploration capabilities compared to state-of-the-art baselines. Specifically, it improves the Novelty from 0.1 to 0.4 and the Diversity from 0.24 to 0.55, while maintaining comparable idea quality under the current automatic evaluation protocol. These findings suggest that evolutionary mechanisms can serve as a useful framework for exploration-oriented research ideation, especially for broadening the novelty and diversity of candidate ideas under a shared automatic evaluation setting.

2605.30934 2026-06-01 cs.CL cs.AI Version update

Do Large Language Models Encode Institutional Experience? Evidence from Cross-Linguistic Moral Reasoning Under Ambiguity

大型语言模型是否编码了制度经验?来自跨语言模糊道德推理的证据

Nattavudh Powdthavee

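Affiliations * Nanyang Technological University

AI summary Uses cross-linguistic moral dilemma experiments to study whether large language models encode institutional experience through language under ambiguity, finding that implicit institutional cues amplify cross-linguistic moral divergence while explicit framing suppresses it.

Comments 44 pages

Details
AI Chinese abstract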

大型语言模型(LLMs)在不同语言中表现出系统性的道德推理差异,但这种差异的来源尚不清楚。我们检验了一个假设:语言编码了其使用环境中的制度方面,使得LLMs通过训练继承了特定制度的道德先验。跨越制度质量梯度广泛的九种语言、六个前沿LLM以及两项预注册研究,我们考察了道德困境的可接受性取决于制度功能的情况。在研究1中,明确的制度框架产生了统一的无结果:跨语言道德分歧在制度依赖场景中没有增加,也没有追踪语言社区之间的制度差异。在研究2中,我们引入了制度模糊场景,其中制度利益存在但未明确说明。在这些条件下,跨语言道德分歧相对于制度无关控制组增加,并且除一个理论上有信息的例外,与语言社区之间的现实世界制度差异相关。明确的框架再次减弱了这些效应。这些发现表明,制度经验可能在语言中留下可检测的痕迹,塑造LLM的道德推理,同时也表明明确的制度线索可以抑制这些差异的表达。

English abstract

Large language models (LLMs) exhibit systematic differences in moral reasoning across languages, yet the source of this variation remains unclear. We test the hypothesis that languages encode aspects of the institutional environments in which they are spoken, allowing LLMs to inherit institution-specific moral priors through training. Across nine languages spanning a broad gradient of institutional quality, six frontier LLMs, and two preregistered studies, we examine moral dilemmas whose acceptability depends on institutional functioning. In Study 1, explicit institutional framing produced uniformly null results: cross-linguistic moral divergence did not increase in institutionally contingent scenarios, nor did it track institutional differences between language communities. In Study 2, we introduced institutionally ambiguous scenarios in which institutional stakes were present but not explicitly stated. Under these conditions, cross-linguistic moral divergence increased relative to institutionally inert controls and, with one theoretically informative exception, was associated with real-world institutional differences between language communities. Explicit framing again attenuated these effects. These findings suggest that institutional experience may leave detectable traces in language that shape LLM moral reasoning, while also indicating that explicit institutional cues can suppress the expression of those differences.

2605.30930 2026-06-01 cs.HC cs.AI cs.CL cs.CY 版本更新

TUX: Measuring Human-AI Tacit Understanding

TUX:衡量人机默契理解

Yueshen Li, Hanyi Min, Vedant Das Swain, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) New York University(纽约大学)

AI总结 通过光谱放置任务和TUX指数,量化人类与LLM之间的默契理解,发现人格特征影响对齐程度。

AI中文摘要

随着大型语言模型(LLMs)越来越多地作为协作伙伴,人机对齐通常通过明确的任务成功、准确性或奖励优化来评估。然而,许多协作场景依赖于默契理解:即智能体能否在没有明确目标、沟通或反馈的情况下,与人类的评价立场或表征先验对齐。为了研究这种能力,我们开发了一个受社交派对游戏Wavelength启发的光谱放置任务,在该任务中,人类和智能体独立地将概念放置在主观光谱上。我们将默契理解指数(TUX)操作化为人类与智能体判断之间的成对相似性度量,并通过241名人类参与者和200个基于人格条件的LLM智能体(涵盖四种模型)进行评估。我们发现,在特质空间中最近的人-智能体对实现了显著更高的TUX,表明默契对齐是由个体层面特征而非随机相似性所结构化的。回归分析表明,随着预测变量集变得更加丰富,TUX变得更可解释,个体特质、决策风格和置信度优于聚合特质距离基线。这些发现表明,人类与LLM之间的默契理解是可测量的,同时也揭示了基于人格条件化方法在捕捉更深层表征对齐方面的局限性。

英文摘要

As large language models (LLMs) increasingly act as collaborative partners, human-AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum-placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile-conditioned LLM agents across four models. We find that nearest human-agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person-level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision-making styles, and confidence improving over aggregate trait-distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile-based conditioning for capturing deeper representational alignment.

2605.30924 2026-06-01 cs.CL 版本更新

EMBGuard: Constructing Hazard-Aware Guardrails for Safe Planning in Embodied Agents

EMBGuard:为具身智能体安全规划构建危险感知护栏

Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim, Yeonjun Hwang, Hyojun Kim, Byungchul Kim, Young Kyun Jang, Jinyoung Yeo

发表机构 * Independent Researcher(独立研究者) Department of Biomedical Engineering(生物医学工程系) Department of Intelligent Precision Healthcare Convergence, Sungkyunkwan University(智能精准医疗融合系,成均馆大学) Department of Artificial Intelligence, Yonsei University(人工智能系,延世大学)

AI总结 提出首个基于MLLM的具身安全护栏EMBGuard,通过解耦物理风险推理与智能体策略,评估(视觉观察,动作)对来识别危险配置并提供自然语言解释,同时构建训练数据集EMBHazard和基准测试EMBGuardTest,在紧凑模型尺寸下达到与专有MLLM竞争的性能并降低误报率。

Comments Accepted at ICML 2026

AI中文摘要

部署在真实环境中的MLLM驱动的具身智能体会遇到物理危险。然而,现有方法缺乏识别危险和推理动作条件风险的内在机制,导致智能体要么错过危险交互,要么过度识别风险。为解决此问题,我们提出EMBGuard,这是首个基于MLLM的具身智能体安全护栏,旨在将物理风险推理与智能体策略解耦。通过评估(视觉观察,动作)对,EMBGuard识别危险配置并提供潜在风险的自然语言解释。伴随EMBGuard,我们贡献了EMBHazard,一个包含15.1K个动作条件对的训练数据集,以及EMBGuardTest,一个包含329个手动策划的真实世界场景的基准测试,涵盖七种物理风险类别。通过危险和动作的组合变化,我们生成了智能体在规划过程中可能遇到的各种危险和良性场景。尽管模型尺寸紧凑(2B,4B),EMBGuard达到了与专有MLLM(例如GPT-5.1,Gemini-2.5-Pro)竞争的性能,同时显著降低了阻碍实时部署的误报率。我们在https://github.com/dongwxxkchoi/EMBGuard公开了代码、数据和模型。

英文摘要

MLLM-powered embodied agents deployed in real-world environments encounter physical hazards. However, existing approaches lack explicit mechanisms for identifying hazards and reasoning about action-conditioned risks, leading agents to either miss risky interactions or over-identify risks. To address this, we propose EMBGuard, the first MLLM-based safety guardrail for embodied agents designed to decouple physical risk reasoning from agent policy. By evaluating a (visual observation, action) pair, EMBGuard identifies hazardous configurations and provides natural language explanations of potential risks. Alongside EMBGuard, we contribute EMBHazard, a training dataset of 15.1K action-conditioned pairs, and EMBGuardTest, a benchmark of 329 manually curated real-world scenarios spanning seven physical risk categories. Through compositional variation of hazards and actions, we generate diverse risky and benign scenarios that agents may encounter during planning. Despite its compact size (2B, 4B), EMBGuard achieves performance competitive with proprietary MLLMs (e.g., GPT-5.1, Gemini-2.5-Pro) while significantly reducing the false-positive rates that hinder real-time deployment. We make the code, data, and models publicly available at https://github.com/dongwxxkchoi/EMBGuard

2605.30913 2026-06-01 cs.CL cs.AI cs.CY cs.HC 版本更新

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

有毒幻觉:扰动提示与追踪LLM电路

Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen, Yian Wang, Navin Kumar, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Nimblemind

AI总结 研究有毒语言扰动对LLM事实可靠性的影响,发现有毒词汇降低准确率并增加不确定性,通过归因图分析揭示内部机制。

AI中文摘要

大型语言模型(LLMs)越来越多地部署在对话环境中,用户语气从礼貌到对抗性或毒性不等,但尚不清楚在语义等效的提示中,有毒语言是否会降低事实可靠性。我们研究基于词汇和语气的提示扰动如何影响LLM的事实可靠性。通过礼貌、随机和三种毒性水平的受控提示变化,我们在ARC-Easy、GSM8K和MMLU上评估了五个LLM。我们发现有毒词汇扰动持续降低事实准确性并增加不确定性,而礼貌措辞产生有限且不一致的变化。为了检查这些答案不一致是否对应内部变化,我们进行了模型激活和影响的归因图分析。我们发现增加毒性选择性地放大对扰动敏感的变体节点,而相对稳定的核心推理节点保持更不变。这些发现将提示语气定位为LLM可靠性的关键维度,并提供了行为和机制证据,表明表面词汇变化可以改变事实输出和内部计算。

英文摘要

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.

2605.30912 2026-06-01 cs.CV cs.CL 版本更新

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

关注证据:面向多模态RLVR的证据锚定空间注意力监督

Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Zhongguancun Academy(中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Nankai University(南开大学) Shanghai Jiaotong University(上海交通大学) Zhejiang University(浙江大学)

AI总结 提出EASE方法,通过将标注证据区域转化为平滑视觉标记目标,在多模态强化学习训练中引导响应到图像的注意力,从而提升视觉语言模型在感知、幻觉、视觉数学和多模态推理基准上的性能。

AI中文摘要

基于可验证奖励的强化学习(RLVR)通过优化从最终答案中导出的结果奖励来改进视觉语言模型(VLM)。然而,这种仅基于结果的奖励并不能告诉模型哪些图像区域证明了答案的正确性。对于需要视觉定位的问题,这些奖励无法区分由相关视觉证据支持的响应与由语言先验捷径或幸运猜测产生的响应。我们引入了EASE(证据锚定空间注意力),它通过视觉证据过程监督增强了多模态RLVR。EASE将标注的证据区域转换为平滑的视觉标记目标,并在RL训练期间使用它来引导响应到图像的注意力,但仅限于高奖励轨迹。标注仅用作特权训练标签,而推理仅需要原始图像和问题。在Qwen2.5-VL-7B、Qwen3-VL-4B和Qwen3-VL-8B上,EASE在感知、幻觉、视觉数学和多模态推理基准上的平均得分比DAPO高出2.5到3.1分。诊断和消融实验表明,EASE更好地将视觉注意力与标注的证据区域对齐。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.
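
The step of turning an annotated evidence region into a smoothed visual-token target can be sketched as follows. This is only an illustration under stated assumptions: the box is given in patch coordinates, the smoothing is a separable Gaussian blur, and the alignment loss is a KL divergence. The function names `evidence_target` and `attention_alignment_loss` are hypothetical; the abstract does not specify EASE's actual smoothing or loss.

```python
import numpy as np

def evidence_target(box, grid_h, grid_w, sigma=1.0):
    """Build a smoothed distribution over image patches from one evidence box.

    box = (row0, col0, row1, col1) in patch coordinates, end-exclusive.
    Assumption: Gaussian smoothing; the paper's smoothing may differ.
    """
    r0, c0, r1, c1 = box
    mask = np.zeros((grid_h, grid_w))
    mask[r0:r1, c0:c1] = 1.0

    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    if min(grid_h, grid_w) < len(kernel):
        raise ValueError("grid too small for this sigma")

    # Separable blur: along rows' axis first, then along columns' axis.
    blur = lambda v: np.convolve(v, kernel, mode="same")
    smoothed = np.apply_along_axis(blur, 0, mask)
    smoothed = np.apply_along_axis(blur, 1, smoothed)
    return smoothed / smoothed.sum()  # normalize to a probability distribution

def attention_alignment_loss(attn, target, eps=1e-12):
    """KL(target || attn) between the target and response-to-image attention.

    Assumption: in an RL loop this term would be added only for
    high-reward trajectories, as the abstract describes.
    """
    return float(np.sum(target * (np.log(target + eps) - np.log(attn + eps))))
```

At inference time nothing in this sketch is needed, which matches the abstract's point that annotations serve only as privileged training labels.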

2605.30907 2026-06-01 cs.SE cs.AI cs.CL cs.LG 版本更新

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

BlueFin: 在金融电子表格上对LLM智能体进行基准测试

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta, Case Winter, George Fang, John Ling, Emma Strubell, Zach Kirshner

发表机构 * Longitude Labs Inc.(Longitude Labs公司) Cornell University(康奈尔大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出BlueFin基准,通过131个真实金融电子表格任务评估LLM智能体的合成、操作和理解能力,并验证了LM评判与人类专家的一致性。

Comments 26 pages

AI中文摘要

我们提出BlueFin,一个基准测试,要求大语言模型(LLM)智能体在专业金融领域的电子表格工作簿上执行合成、操作和理解任务。尽管全球电子表格软件付费用户估计数亿——比全球专业开发人员估计数量高一个数量级——但投入探索和扩展LLM在电子表格领域能力的资源相对较少,而专门用于反映专业金融角色实际职业任务的资源更少。为此,我们整理了131个具有现实相关性的挑战性复杂任务,包含3225个细粒度评分标准;值得注意的是,我们的评分标准和LM评判评估由一组专家人工标注员验证,从而对难以通过编程验证但可由LM评判智能体可靠评估的复杂任务进行高质量、细粒度的评估。我们的评判与专家共识达到一致(α=0.826),宏F1得分为0.839。前沿LLM在此挑战性基准上表现不佳,最强LLM在任务上的平均得分低于50%——模型在动态正确性方面表现出特别弱点。我们的贡献包括:涵盖三类电子表格任务的示例数据集、开源工具包和智能体评估框架,以及现有前沿模型在我们基准上的性能表征。

英文摘要

We present BlueFin, a benchmark that challenges large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spreadsheet workbooks in the professional finance domain. Though estimates of the global population of paying users of spreadsheet software range in the hundreds of millions (an order of magnitude more than the estimated global population of professional developers), comparatively few resources have been devoted to exploring and expanding LLM capabilities in the spreadsheet domain, with fewer still dedicated to mirroring real occupational tasks encountered by those in professional finance roles. In response, we curate a set of 131 challenging, complex tasks with real-world relevance in the domain, containing 3,225 granular rubric criteria; notably, our rubric criteria and LM judge evaluations are validated by a team of expert human annotators, resulting in high-quality, granular evaluations of complex tasks that are difficult to verify programmatically but can be reliably evaluated by an LM judge agent. Our judge achieves parity with expert consensus ($\alpha=0.826$) with a macro-F1 score of 0.839. Frontier LLMs perform poorly on this challenging benchmark, with the strongest LLMs achieving average scores below 50% across tasks; models exhibit particular weaknesses in dynamic correctness. Our contributions include a dataset of examples across three categories of spreadsheet tasks, an open source harness and agentic evaluation framework, and a characterization of existing frontier models' performance on our benchmark.
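
For readers unfamiliar with the macro-F1 figure reported for the judge, the metric can be computed as below. This is a generic sketch of macro-averaged F1 over per-criterion pass/fail labels; the label encoding and the function name `macro_f1` are assumptions, and the paper's exact evaluation protocol is not described in the abstract.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in labels:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical example: expert labels vs. judge labels on four rubric criteria.
expert = [1, 1, 0, 0]
judge = [1, 0, 0, 0]
score = macro_f1(expert, judge)
```

Because each class contributes equally, macro-F1 penalizes a judge that does well only on the majority label.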

2605.30898 2026-06-01 cs.AI cs.CL 版本更新

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

UniScale: 通过模型路由和测试时扩展的在线联合优化实现自适应统一推理扩展

Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi, Yuqian Hou, Hong Xu, Zhongxiang Dai, Minchen Yu, Qingjiang Shi

发表机构 * School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)大数据研究院) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) College of Information Science and Electronic Engineering, Zhejiang University(浙江大学信息科学与电子工程学院) Department of Computer Science and Engineering, Chinese University of Hong Kong(香港中文大学计算机科学与工程系)

AI总结 提出UniScale框架,将模型路由和测试时扩展统一为上下文多臂老虎机问题,通过LinUCB在线学习推理策略,实现细粒度且更优的质量-成本权衡。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

在大语言模型(LLM)的实际部署中,平衡推理质量和计算成本已成为核心挑战。现有方法沿着两个大致独立的维度处理这一权衡:模型路由(在不同规模的模型之间切换以匹配请求复杂度)和测试时扩展(TTS,在固定模型内调整推理时计算以实现细粒度控制)。然而,这种解耦设计引入了固有限制。由于模型规模稀疏,模型路由产生粗粒度的离散性能变化,而单模型TTS通常遇到能力上限,并随着计算增加出现收益递减。此外,将两种机制分开处理限制了动态推理环境中的适应性。为克服这些限制,我们引入统一推理扩展(UIS),将模型路由和TTS统一到单个优化空间中。基于此公式,我们提出UniScale,一个在线框架,将自适应UIS建模为上下文多臂老虎机问题,并通过LinUCB学习推理策略。该框架包含效率感知学习和成本建模,以确保在高维动作空间上的稳定和可扩展优化。评估表明,UniScale有效利用UIS空间中的协同作用,在多样化的动态推理场景中提供细粒度且持续更优的质量-成本权衡。

英文摘要

In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.
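
The contextual-bandit formulation can be illustrated with a minimal disjoint LinUCB learner whose arms are (model, test-time budget) pairs. The arm list, the cost values, and the `shaped_reward` function are hypothetical placeholders; UniScale's efficiency-aware learning and cost modeling are not reproduced here.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression reward model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                                 # exploration strength
        self.A = [np.eye(dim) for _ in range(n_arms)]      # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]    # per-arm reward sums

    def select(self, x):
        """Pick the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)    # uncertainty bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Hypothetical unified action space: (model name, number of sampled answers).
ARMS = [("small", 1), ("small", 4), ("large", 1), ("large", 4)]
COSTS = [0.1, 0.4, 0.5, 2.0]  # made-up relative costs

def shaped_reward(quality, arm, lam=0.5):
    """Assumed reward: answer quality minus a cost penalty."""
    return quality - lam * COSTS[arm]
```

In use, each request would be featurized into `x`, `select(x)` would choose a joint routing and scaling action, and the observed `shaped_reward` would be fed back through `update`.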

2605.30888 2026-06-01 cs.CL 版本更新

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

RLHF的另一面:用于奖励模型自监督改进的在线策略反馈

Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li, Qi Liu, Zilong Zheng

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(认知智能国家重点实验室,中国科学技术大学) University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)

AI总结 提出SAVE框架,利用价值函数生成在线策略反馈,通过对比学习更新奖励模型,在六个基准上超越现有方法。

AI中文摘要

构建用于语言模型对齐的强大奖励模型(RM)受到从人工标注或评判模型获取多样且可靠偏好数据的成本和难度的瓶颈限制。随着策略超越静态RM训练,这一问题变得更加严重。因此,我们提出SAVE(基于价值锚定的在线策略反馈自监督奖励模型改进),一个通过使用价值函数进行在线策略RM训练的框架,对在线策略响应进行评分作为反馈。SAVE自然地利用提示特定的价值头作为自适应锚点,将奖励评分的在线策略响应转化为监督信号。它计算RM优势并过滤模糊样本,通过对比目标更新RM。通过六个不同基准的严格实证评估,SAVE在增强RM训练方面的有效性得到了强烈验证。它在所有数据集上取得了优于现有方法的结果,同时在三种RL算法(GRPO、RLOO、GSPO)和不同策略骨干上保持一致的改进。

英文摘要

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. The problem becomes dramatically worse as the policy evolves beyond the data on which the static RM was trained. We therefore propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework for on-policy RM training that uses the value function to grade on-policy responses as feedback. SAVE converts the reward-graded on-policy responses into supervision, using a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via a contrastive objective. Rigorous empirical evaluation across six diverse benchmarks validates the effectiveness of SAVE for enhancing RM training. It outperforms existing methods on all datasets while maintaining consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.

2605.30876 2026-06-01 cs.CL 版本更新

dMoE: dLLMs with Learnable Block Experts

dMoE: 具有可学习块专家的扩散大语言模型

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 针对扩散大语言模型与混合专家架构集成时块并行解码与令牌级专家选择不匹配导致的推理内存瓶颈,提出dMoE框架,通过聚合块内令牌级专家分布为统一的块级专家分布来减少激活专家数量,在保持性能的同时显著降低内存使用和延迟。

Comments Work in progress. Code is available at: https://github.com/fscdc/dMoE

AI中文摘要

扩散大语言模型(dLLMs)最近作为自回归模型的有前途的替代方案出现,在自然支持并行解码的同时提供了有竞争力的性能。然而,随着dLLMs越来越多地与混合专家(MoE)架构集成以扩展模型容量,块并行解码与令牌级专家选择之间出现了根本性的不匹配。具体来说,每次dLLM前向传递处理多个具有双向依赖关系的令牌,而传统的MoE层独立路由每个令牌。这种不匹配显著增加了唯一激活专家的数量,使推理越来越受内存限制。为了解决这个问题,我们提出了dMoE,一个简单而有效的块级MoE框架。dMoE的核心思想是将每个块内的令牌级专家分布聚合成统一的块级专家分布,然后以更连贯的方式指导专家路由。通过这种方式,dMoE在不牺牲性能的情况下显著减少了推理期间唯一激活专家的数量,从而缓解了内存瓶颈。在各种基准上的大量实验证明了dMoE的有效性。平均而言,dMoE将唯一激活专家的数量从69.5减少到14.6,同时保留了原始性能的99.11%。同时,它将内存使用减少了76.64%到79.84%,并实现了1.14倍到1.66倍的端到端延迟加速。代码可在https://github.com/fscdc/dMoE获取。

英文摘要

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14$\times$ to 1.66$\times$ end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE
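
The contrast between token-level and block-level routing can be sketched with NumPy. The aggregation used here (a plain mean of per-token softmax distributions followed by a shared top-k) is an assumption for illustration; the paper's title suggests the block-level routing is learnable, and the function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_level_experts(router_logits, k):
    """Conventional MoE: every token picks its own top-k experts.

    Returns the set of unique experts activated for the whole block.
    """
    topk = np.argsort(-router_logits, axis=-1)[:, :k]
    return set(topk.ravel().tolist())

def block_level_experts(router_logits, k):
    """Block routing sketch: average token distributions, share one top-k."""
    block_dist = softmax(router_logits).mean(axis=0)
    return set(np.argsort(-block_dist)[:k].tolist())

# Toy block of 4 tokens and 8 experts: each token prefers a different
# expert first, and all tokens share expert 7 as their second choice.
logits = np.zeros((4, 8))
for t in range(4):
    logits[t, t] = 5.0
    logits[t, 7] = 4.0
```

On this toy block, token-level routing touches five distinct experts while block-level routing touches only two, which is the kind of reduction in uniquely activated experts that eases the memory-bound regime described above.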

2605.30857 2026-06-01 cs.CL 版本更新

MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning

MADS: 面向指令微调的模型感知多样化核心集选择

Yi Bai, Wenhao Zhang, Yao Chen, Jiao Xue, Zhumin Chen, Pengjie Ren

发表机构 * Shandong University(山东大学) Inspurcloud

AI总结 提出一种基于模型推理时神经激活状态区分数据特征的多样化核心集选择方法,在减少数据量的同时提升大语言模型在多个下游任务上的性能。

AI中文摘要

指令微调用于增强大语言模型(LLMs)的指令遵循能力。随着指令微调数据量的增加,选择最优核心集变得尤为重要。然而,确保核心集的多样性仍然是一个重大挑战。现有方法主要基于文本特征本身来区分不同的训练数据,与LLMs自身对数据的理解和表示相分离。为解决这一问题,我们提出了一种模型感知的多样化核心集选择方法,该方法基于LLM推理过程中的神经激活状态来区分数据特征。该方法利用模型内在的激活特征,实现了基于覆盖的选择的高效实例化,以确保核心集的多样性。我们在涵盖五个不同任务的六个基准上广泛评估了我们的方法。在我们的方法中,由3B参数LLM选择的核心集在用于微调7B、8B和13B参数的更大模型时表现有效。在包含52K指令-响应对的Alpaca-GPT4数据集上的实验结果表明,由Llama-3.2-3B-Instruct选择的、大小为原始数据集15%的核心集,在微调四个更大的基础模型时,与使用完整数据集训练相比,平均提升了2.5%。实验结果表明,我们的方法在减少数据需求的同时,提升了模型在多个下游任务上的性能。

英文摘要

Instruction fine-tuning is employed to enhance the instruction-following ability of large language models (LLMs). As the amount of instruction fine-tuning data increases, selecting the optimal core set becomes particularly important. However, ensuring the diversity of the core set remains a significant challenge. Existing methods predominantly distinguish different training data based on the text features themselves, decoupled from LLMs' own understanding and representation of the data. To address this issue, we propose a Model-Aware Diverse Core Set Selection method, which distinguishes data features based on the neural activation states during LLM inference. This approach serves as an efficient instantiation of coverage-based selection using model-intrinsic activation features to ensure diversity in the core set. We extensively evaluate our method on six benchmarks that cover five distinct tasks. A core set selected by a 3B-parameter LLM remains effective when used to fine-tune larger models with 7B, 8B, and 13B parameters. Experimental results on the Alpaca-GPT4 dataset, which comprises 52K instruction-response pairs, show that the core set, sized at 15% of the original dataset and selected by Llama-3.2-3B-Instruct, achieves an average improvement of 2.5% when fine-tuning four larger base models compared with training on the full dataset. The experimental results demonstrate that our method enhances model performance on multiple downstream tasks while reducing data requirements.
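
Coverage-based selection over activation features can be illustrated with the classic greedy k-center heuristic. This is a generic stand-in, not the MADS algorithm itself: the feature vectors are assumed to be some fixed-size summary of a model's activations per training example, and `k_center_greedy` is a hypothetical name.

```python
def k_center_greedy(features, k):
    """Greedy k-center: repeatedly add the point farthest from the chosen set.

    features: list of equal-length numeric vectors (assumed activation summaries).
    Returns indices of the selected core set; assumes 1 <= k <= len(features).
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [0]  # start from the first example (arbitrary choice)
    # Squared distance from every point to its nearest selected point.
    min_d = [dist2(f, features[0]) for f in features]
    while len(selected) < k:
        nxt = max(range(len(features)), key=lambda i: min_d[i])
        selected.append(nxt)
        min_d = [min(d, dist2(f, features[nxt])) for d, f in zip(min_d, features)]
    return selected

# Toy features: two near-duplicate pairs and one outlier.
toy = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0]]
core = k_center_greedy(toy, 3)
```

The selection skips near-duplicates and spreads across the feature space, which is the diversity property that a coverage objective is meant to provide.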

2605.30852 2026-06-01 cs.CL 版本更新

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

推测性流水线解码:通过流水线并行实现更高准确度和零气泡推测

Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei

发表机构 * Oregon State University(俄勒冈州立大学) DeepSolution(深思解决方案)

AI总结 提出推测性流水线解码(SPD)框架,利用流水线并行将目标LLM划分为n个流水线阶段并行处理n个token,通过推测模块聚合中间特征预测下一token,实现有限难度、高接受率和零延迟气泡,显著提升理论加速比。

AI中文摘要

推测性解码(SD)通过草稿-验证范式加速低并发LLM推理。然而,主流方法通常依赖多token预测,这引入了逐渐增加的预测难度和串行草稿延迟。为了解决这些问题,我们提出了推测性流水线解码(SPD),这是一个突破性的框架,释放了流水线并行的真正潜力。通过将目标LLM划分为$n$个流水线阶段,SPD允许LLM并行处理$n$个token以加速解码。为了在单序列解码中持续填充流水线,推测模块聚合不同流水线深度的中间特征来预测下一个token,与目标模型的流水线步骤严格并行执行,从而实现有限的难度、更高的接受率和零延迟气泡。我们的实验表明,与主流基线相比,SPD实现了显著更高的理论加速比,为LLM解码加速提供了高度可扩展的解决方案。我们的代码可在https://github.com/yuyijiong/speculative_pipeline_decoding获取。

英文摘要

Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into $n$ pipeline stages, SPD allows the LLM to process $n$ tokens in parallel to accelerate decoding. To keep the pipeline continuously filled during single-sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token, executing strictly in parallel with the target model's pipeline step. This design yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup compared to mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
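
For context, the baseline draft-then-verify paradigm that SPD builds on can be sketched with greedy decoding. This shows conventional speculative decoding only, not SPD's pipeline scheme; `draft_next` and `target_next` are hypothetical callables that map a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy draft-then-verify step.

    The draft model proposes k tokens one after another (the serial drafting
    latency mentioned above). The target model then checks them; a real
    system would do this check in a single batched forward pass.
    """
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
    # All drafts accepted: the verification pass yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy models: the target emits position mod 5; the draft errs at position 2.
toy_target = lambda seq: len(seq) % 5
toy_draft = lambda seq: 9 if len(seq) == 2 else len(seq) % 5
```

The output always equals what the target model alone would have produced greedily; the speedup comes from how many draft tokens survive verification per step.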

2605.30844 2026-06-01 cs.CL cs.AI stat.ML 版本更新

Fine-Tuning Improves Information Conveyance in Language Models

微调提升语言模型中的信息传递

Yuwei Cheng, Weiyi Tian, Haifeng Xu

发表机构 * Department of Statistics(统计学系) University of Chicago(芝加哥大学) Department of Data Science(数据科学系) Department of Computer Science(计算机科学系)

AI总结 提出冠层熵(Canopy Entropy)度量,从树结构视角量化生成空间的有效大小,发现微调模型在总熵降低时仍能增强长度-熵率正相关,从而更高效地将不确定性转化为语义多样性。

AI中文摘要

微调通常被认为会降低大型语言模型的不确定性和多样性,但现有分析忽略了输出长度这一关键混杂因素,因此未能捕捉不确定性在整个生成展开中的分布。为解决这一问题,我们提出冠层熵($\mathrm{CE}^\star$),一种从树视角看待语言生成的度量,其中“冠层”代表所有可能展开的空间,使得$\mathrm{CE}^\star$自然地量化生成空间的有效大小。$\mathrm{CE}^\star$共同捕捉输出长度$N$和生成序列$Y_{1:N}$中的不确定性;实际上,我们证明它等于总香农熵$H(N, Y_{1:N}\mid X)$,其中$X$表示提示。该公式产生了可解释的度量,包括长度-熵率相关项$\rho(N, r_N)$,其中$r_N$是熵率,通过指示较长输出是否每个标记信息量更多或更少来量化信息传递效率。实验上,跨任务和模型家族,我们发现微调模型一致地表现出更强的正相关$\rho(N, r_N)$,即使总熵降低。此外,在控制模型家族、任务、提示和输出长度效应后,我们发现微调几乎使熵率与语义多样性之间的相关强度增加了两倍,表明对齐模型更有效地将标记不确定性转化为语义多样性。总体而言,这些结果表明微调并非简单地降低不确定性,而是从根本上将其重组为更具信息性和语义意义的生成。我们的代码可在https://github.com/WeiyiTian/canopy-entropy获取。

英文摘要

Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy ($\mathrm{CE}^\star$), a measure that views language generation from a tree perspective, where "canopy" represents the space of all possible rollouts, so that $\mathrm{CE}^\star$ naturally quantifies the effective size of the generation space. $\mathrm{CE}^\star$ jointly captures uncertainty in both the output length $N$ and the generated sequence $Y_{1:N}$; indeed, we show that it equals the total Shannon entropy $H(N, Y_{1:N}\mid X)$, where $X$ denotes the prompt. This formulation yields interpretable metrics, including a length-entropy correlation term $\rho(N, r_N)$, where $r_N$ is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine-tuned models consistently exhibit stronger positive correlation $\rho(N, r_N)$, even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output-length effects, we find that fine-tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine-tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at https://github.com/WeiyiTian/canopy-entropy.
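
A rough empirical proxy for the length-entropy correlation can be computed from sampled rollouts. This sketch assumes access to per-token log-probabilities for each rollout and uses the mean per-token surprisal of a rollout as a stand-in for the entropy rate; the paper's estimator of $r_N$ may be defined differently, and both function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def length_rate_correlation(rollout_logprobs):
    """Correlate rollout length N with a per-token surprisal rate.

    rollout_logprobs: list of rollouts, each a list of token log-probs
    sampled for the same prompt X.
    """
    lengths = [len(lp) for lp in rollout_logprobs]
    rates = [-sum(lp) / len(lp) for lp in rollout_logprobs]  # nats per token
    return pearson(lengths, rates)

# Toy rollouts in which longer outputs are also less predictable per token.
toy_rollouts = [[-0.1] * 2, [-0.2] * 4, [-0.3] * 6]
```

A positive value indicates that longer outputs carry more information per token, which is the direction the abstract reports for fine-tuned models.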

2605.30833 2026-06-01 cs.CL cs.AI 版本更新

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

你的老师在这里帮不了你:对抗在线策略蒸馏中的监督保真度衰减

Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Chinese Information Processing Laboratory(中文信息处理实验室) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学,北京,中国)

AI总结 针对在线策略蒸馏中监督保真度衰减问题,提出前瞻组奖励方法,通过评估学生候选词在后续步骤中诱导的教师置信度并分配组归一化奖励,结合熵触发树注意力机制,显著提升长链推理性能。

AI中文摘要

在线策略蒸馏通过使用来自教师的 token 级反馈,在学生模型自身生成的轨迹上训练学生模型来传递推理能力。然而,我们识别出一个关键瓶颈,即监督保真度衰减(SFD):随着学生生成的前缀变长,教师的下一个 token 分布变得不那么自信,区分性也更弱。因此,反向 KL 蒸馏中依赖教师的纠正信号减弱,导致学生漂移在长推理链中累积。为了缓解 SFD,我们引入了前瞻组奖励(Lookahead Group Reward)。基于下一步教师置信度反映了未来反向 KL 监督的区分强度这一见解,该方法通过学生的 top-K 候选 token 在后续步骤中诱导的教师置信度来评估这些候选,并分配组归一化奖励。为了保持计算效率,我们进一步设计了一种熵触发的树注意力机制。在六个数学和代码基准测试中,该方法在 7B 学生模型上比 OPD 提高了 mean@8 达 2.57 个点,在长生成任务中增益更大,在 AIME-26 上达到 +4.92 个点(39k token)。

英文摘要

On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidelity Decay (SFD): as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce Lookahead Group Reward. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, Lookahead Group Reward evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, Lookahead Group Reward improves mean@8 by 2.57 points over OPD for a 7B student, with gains increasing for longer generations and reaching +4.92 points on AIME-26 at 39k tokens.
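
The group-normalized reward over top-K candidates can be sketched as follows. The confidence measure (negative entropy of the teacher's next-step distribution) and the mean/standard-deviation normalization are assumptions chosen for illustration; the abstract does not state which confidence statistic or normalization the method uses, and the function names are hypothetical.

```python
import math

def neg_entropy(probs):
    """Confidence proxy: negative Shannon entropy (higher = more confident)."""
    return sum(p * math.log(p) for p in probs if p > 0)

def group_normalized_rewards(next_step_teacher_dists, eps=1e-8):
    """Score each candidate token by the teacher confidence it induces.

    next_step_teacher_dists[i] is the teacher's distribution over the token
    that follows candidate i (one entry per top-K candidate).
    """
    conf = [neg_entropy(d) for d in next_step_teacher_dists]
    mean = sum(conf) / len(conf)
    std = math.sqrt(sum((c - mean) ** 2 for c in conf) / len(conf))
    # Normalize within the group so rewards are zero-mean across candidates.
    return [(c - mean) / (std + eps) for c in conf]

# Toy teacher distributions after three candidate tokens.
toy_dists = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
rewards = group_normalized_rewards(toy_dists)
```

Candidates that lead the teacher into a sharper next-step distribution receive positive reward, and those that leave it uncertain receive negative reward.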

2605.30826 2026-06-01 cs.CL cs.AI 版本更新

Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

超越一致性:为策展人分类对面板筛选的生物医学实体候选进行评分

Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang, Siyu Zhang, Tingting Dan

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) University of Michigan, Ann Arbor(密歇根大学安娜堡分校) The Hong Kong University of Science and Technology, Guangzhou(香港科技大学(广州)) ShanghaiTech University(上海科技大学) University of North Carolina, Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出BioConCal评分器,利用无金标准的一致性、提及、表面可用性和文档特征,对多LLM面板筛选的候选实体进行评分,显著提高候选筛选的精确率和召回率。

AI中文摘要

生物医学命名实体识别对于现代LLM来说看似简单:合理的生物医学提及容易浮现,但语料库约定正确性取决于标注约定、跨度边界、实体粒度和类型模式。多LLM一致性是一个显著性信号,而非语料库约定正确性。我们引入了一个候选级面板输出基准,用于面板筛选的候选验证,其中单元是由明确定义的多模型面板对齐的候选,而非独立提取器输出。该基准将八个LLM在五个公共生物医学NER数据集上的预测对齐到一个候选主表中。BioConCal是一个领域内监督评分器,它利用推理时的无金标准一致性、提及、表面可用性和文档特征,为固定候选流实例化这一层。在领域内,BioConCal将AUROC从原始一致性的0.753提高到0.910。在验证选择的0.95精确率目标下,它选择了1,340个候选,经验测试精确率为0.939,而原始一致性为293个候选。这对应于候选级召回率0.592和语料库级召回率0.523,而面板内行标签上限为0.883。主要好处不是恢复每个面板成员遗漏的实体,而是将嘈杂的面板流重塑为更高产出的审查队列。在实体类型转移下,阈值需要目标领域验证,而精确字符定位仍然是单独的后处理步骤。

英文摘要

Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas. Multi-LLM agreement is a salience signal, not corpus-convention correctness. We introduce a candidate-level panel-output benchmark for panel-surfaced candidate verification, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this layer with inference-time gold-free agreement, mention, surface-availability, and document features for a fixed candidate stream. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target it selects 1,340 candidates at empirical test precision 0.939, compared with 293 for raw agreement. This corresponds to candidate-level recall 0.592 and corpus-level recall 0.523 against a within-panel row-label ceiling of 0.883. The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.
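
Selecting an operating threshold against a precision target on validation data can be sketched as below. This is a generic procedure, assuming each candidate has a scorer output and a binary validation label; it is not claimed to be BioConCal's exact selection rule, and `pick_threshold` is a hypothetical name.

```python
def pick_threshold(scores, labels, target_precision=0.95):
    """Return the lowest threshold whose validation precision meets the target.

    Candidates with score >= threshold are selected. Returns None if no
    threshold reaches the target precision.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = None
    true_pos = 0
    for n, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Evaluate only at the end of a group of tied scores.
        if n < len(ranked) and ranked[n][0] == score:
            continue
        if true_pos / n >= target_precision:
            best = score
    return best

# Toy validation set: the third-ranked candidate is a false positive.
val_scores = [0.9, 0.8, 0.7, 0.6]
val_labels = [1, 1, 0, 1]
threshold = pick_threshold(val_scores, val_labels)
```

As the abstract notes, precision measured on held-out test data (0.939) can fall short of the validation target (0.95), so a threshold chosen this way should be re-validated on the target domain.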

2605.30813 2026-06-01 cs.CL cs.DS 版本更新

Incremental BPE Tokenization

增量式 BPE 分词

Shenghu Jiang, Ruihao Gong

发表机构 * Beihang University(北京航空航天大学) SenseTime Research(商汤研究)

AI总结 提出一种增量式 BPE 分词算法,以 O(log² t) 时间处理每个字节,总复杂度 O(n log² t),支持流式场景,相比 Hugging Face 和 OpenAI 的实现速度提升约 3 倍并降低延迟。

Comments Accepted to ICML 2026 (Spotlight)

AI中文摘要

我们提出了一种用于增量式字节对编码(BPE)分词的新算法。该算法在最坏情况下以 $\mathcal{O}(\log^2 t)$ 时间处理每个输入字节,总复杂度为 $\mathcal{O}(n \log^2 t)$,其中 $n$ 是输入长度,$t$ 是最大分词长度。该算法增量地维护输入文本每个前缀的 BPE 分词结果,实现了由固定合并规则集定义的标准 BPE 合并过程。这使得在流式设置中能够进行高效的部分分词。作为标准 BPE 的即插即用替代方案,我们的方法相比 Hugging Face 的分词器实现了高达约 3 倍的加速,并在病态输入上相比 OpenAI 的 tiktoken 展示了显著的延迟降低。我们进一步引入了一种急切输出算法,支持流式输出,在增量分词过程中一旦确定分词边界即发出分词。总体而言,我们的结果表明 BPE 分词可以以具有强最坏情况保证的增量方式执行,同时在现代大语言模型流水线中提供实际的延迟优势。代码:https://github.com/ModelTC/mtc-inc-bpe

英文摘要

We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves a speedup of up to ${\sim}3\times$ over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst-case guarantees, while providing practical latency benefits in modern large language model pipelines. Code: https://github.com/ModelTC/mtc-inc-bpe
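
The standard BPE merge procedure that the incremental algorithm reproduces can be sketched in its naive, non-incremental form. This is the textbook baseline, not the paper's $\mathcal{O}(\log^2 t)$-per-byte algorithm; the merge list and function names are hypothetical.

```python
def bpe_encode(data: bytes, merges):
    """Naive BPE: repeatedly apply the highest-priority applicable merge rule.

    merges: list of (left, right) byte-string pairs, earliest = highest priority.
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = [bytes([b]) for b in data]
    while len(tokens) > 1:
        # Find the adjacent pair whose rule was learned earliest.
        best_rank, best_pos = min(
            (rank.get((tokens[i], tokens[i + 1]), float("inf")), i)
            for i in range(len(tokens) - 1)
        )
        if best_rank == float("inf"):
            break  # no rule applies anymore
        pair = (tokens[best_pos], tokens[best_pos + 1])
        merged, i = [], 0
        while i < len(tokens):  # merge every occurrence, left to right
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

def naive_prefix_tokenizations(data: bytes, merges):
    """Re-tokenize every prefix from scratch (what an incremental method avoids)."""
    return [bpe_encode(data[:i], merges) for i in range(1, len(data) + 1)]

TOY_MERGES = [(b"a", b"b"), (b"ab", b"c")]
```

With the toy rules, the prefix `b"ab"` tokenizes to a single token that is later absorbed into `b"abc"` when the next byte arrives, which illustrates why streaming output must wait until a token boundary is determined.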

2605.30804 2026-06-01 cs.CL 版本更新

Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit

将LLM性别偏见锚定人类基线:跨语言审计

Jiwoo Choi, Seonwoo Ahn, Tongxin Zhang, Seohyon Jung

发表机构 * School of Digital Humanities and Computational Social Sciences, KAIST, South Korea(韩国科学技术院数字人文与计算社会科学学院,韩国)

AI总结 通过HEXACO-100人格量表,跨英语、韩语、中文和日语审计六种大语言模型的性别刻板印象,发现其偏见幅度是人类跨国差异的2.5倍,并引入四模式框架(一致性、抑制、重组、放大)描述跨语言行为。

详情
AI中文摘要

我们审计了六种大语言模型(LLM)在英语、韩语、中文和日语中的性别刻板印象。其中三种主要面向英语使用(Claude、GPT、Gemini),三种面向东亚使用(DeepSeek、Syn-Pro、HyperCLOVA X)。我们采用HEXACO-100人格量表,并将每个模型锚定于覆盖48个国家的跨文化人类数据集,以询问的不是LLM是否有偏见,而是它们的性别归因偏离其部署人群的程度。我们的发现表明,它们的刻板印象范围大约是人类跨国范围的2.5倍,且该效应可能跨语言复合。一个以英语为中心的模型在用韩语提示时,达到了当地基线的5倍,即使提示表明候选人已被录用(这通常会减弱人类的刻板印象)。为了在不排序的情况下描述此类行为,我们引入了一个四模式框架——一致性、抑制、重组和放大——涵盖24个(模型×语言)单元。项目级分析表明,翻译不仅重新缩放刻板印象,还改变了与之相关的属性,在表面看似校准良好的情况下隐藏了显著的重新排列。我们的结果最终表明,没有单一的消除偏见流程能够均匀地解决跨语言边界的偏见。

英文摘要

We audit six large language models (LLMs) for gender stereotyping across English, Korean, Chinese, and Japanese. Three were developed primarily for English-language use (Claude, GPT, Gemini) and three for East Asian use (DeepSeek, Syn-Pro, HyperCLOVA X). We adopt the HEXACO-100 personality inventory and anchor each model against a cross-cultural human dataset spanning 48 countries to ask not whether LLMs are biased, but how far their gender attributions drift from the populations they are deployed among. Our findings show that their stereotyping spans a range roughly 2.5 times wider than the entire cross-country range found in humans, and the effect can compound across languages. One English-centric model, prompted in Korean, reached 5 times the local baseline, even when the prompt stated the candidate had already been hired, which often dampens human stereotyping. To characterize such behaviors without ranking them, we introduce a four-pattern framework -- concordance, suppression, reorganization, and amplification -- across 24 (model x language) cells. Item-level analysis reveals that translation does not just rescale stereotypes, but changes the attributes tied to it, hiding significant rearrangement under the surface while appearing well-calibrated. Our results ultimately suggest that no single debiasing pipeline is likely to address bias evenly across linguistic boundaries.

2605.30790 2026-06-01 cs.IR cs.AI cs.CL 版本更新

On the impact of retrieved content representations in RAG Pipelines

关于检索内容表示对RAG管道的影响

Jonathan J Ross, Bevan Koopman, Anton van der Vegt, Guido Zuccon

发表机构 * The University of Queensland(昆士兰大学) CSIRO(澳大利亚联邦科学与工业研究组织)

AI总结 通过控制变量实验,研究检索文档的不同表示(选择、摘要、改写等)对RAG生成准确性的影响,发现答案保留是主要决定因素。

Comments 23 pages, 15 figures, submitted to ACL May 2026 ARR

详情
AI中文摘要

检索增强生成(RAG)通过检索到的文档补充语言模型的输入,但大多数RAG管道继承了为人类读者设计的检索组件。当消费者是大型语言模型(LLM)而非人类时,检索内容应如何表示尚不清楚。最近的工作提出了对检索内容的转换,并识别了影响生成的属性,但每项工作仅孤立地考察单一转换或属性,未明确文档表示的哪些特征最重要。我们通过控制比较来解决这一问题:固定检索不变,仅改变检索文档的表示,将原始基线与其他十三种转换(涵盖选择、摘要和改写,包括查询相关和查询无关变体)进行比较。在这十四种表示中,我们测量了四个生成器的问答准确性,并对每种表示测量了答案保留:即已知包含答案的文档在转换后是否仍支持其答案。我们发现,答案保留是生成器准确性的主要决定因素;值得注意的是,当保留率高时,表示的措辞、结构、长度和查询相关性影响有限。这表明,先前工作中归因于特定机制的准确性提升,可能部分由这些机制保留答案内容的能力解释,而这种归因在未控制保留的情况下无法确定。

英文摘要

Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.
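The answer-retention measurement described above can be approximated with a simple string-containment proxy. This is a minimal sketch under stated assumptions: the abstract does not say how "still supports its answer" is judged, so the substring check, the helper names, and the truncation transform are all hypothetical stand-ins.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores trivial formatting."""
    return " ".join(text.lower().split())


def answer_retained(transformed_doc: str, answers: list[str]) -> bool:
    """Proxy check: does any gold answer string survive in the transformed document?"""
    doc = normalize(transformed_doc)
    return any(normalize(a) in doc for a in answers)


def retention_rate(
    docs: list[str],
    answers_per_doc: list[list[str]],
    transform: Callable[[str], str],
) -> float:
    """Fraction of answer-bearing documents whose answer survives `transform`."""
    kept = sum(
        answer_retained(transform(doc), answers)
        for doc, answers in zip(docs, answers_per_doc)
    )
    return kept / len(docs)


def first_n_words(n: int) -> Callable[[str], str]:
    """Hypothetical query-independent 'selection' transform: keep the first n words."""
    return lambda doc: " ".join(doc.split()[:n])
```

Comparing `retention_rate` across transformations before comparing generator accuracy is one way to control for retention, which is the confound the abstract highlights.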

2605.30788 2026-06-01 cs.CL cs.AI cs.LG 版本更新

XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

XLGoBench: 用算法任务检测跨语言技能差距

Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju

发表机构 * Google DeepMind(谷歌DeepMind) Indian Institute of Technology Bombay(印度理工学院孟买分校) International Centre for Theoretical Sciences, Tata Institute of Fundamental Research(塔塔基础研究所国际理论科学中心)

AI总结 提出一套合成算法任务基准,通过跨语言执行相同任务来检测大语言模型的跨语言能力差距,实验揭示多个先进模型存在持续差距。

Comments 8+37 pages

详情
AI中文摘要

我们引入一套合成算法任务,用于检测大语言模型在跨语言能力上的差距。我们的基准在语言间具有可比性,因为它要求模型在不同语言中执行相同的底层任务;可扩展,因为每个任务可以在不同复杂度级别生成,从而适应不同能力的模型;可量化,因为每个任务都有客观的正确性判定标准;且透明,因为任务是从简单模板生成的,可以轻松审计翻译错误。由于我们的基准专注于算法任务,性能差异是跨语言差距的充分但不必要条件。尽管如此,我们通过大量实验表明,我们的基准暴露了多个最先进模型中存在的持续跨语言差距。

英文摘要

We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.

2605.30771 2026-06-01 cs.CL 版本更新

Eywa: Provenance-Grounded Long-Term Memory for AI Agents

Eywa:基于溯源的人工智能智能体长期记忆

Resham Joshi

发表机构 * Eywa

AI总结 提出Eywa架构,通过先存储证据再推导事实、验证记忆并采用确定性多路径读取(零LLM调用)实现可审计的长期记忆,在多个基准测试中取得高准确率。

Comments 29 pages, 3 figures, 16 tables. Benchmark artifacts available at https://eywa.to/research

详情
AI中文摘要

跨会话持久化的人工智能智能体需要能够检索、审计、更新和擦除的记忆。现有的记忆系统通常将源证据、提取的事实、检索到的上下文和答案策略合并为一个不透明的提示路径,使得故障难以诊断:错误答案可能源于缺失证据、不支持的提取、过时状态、检索损失或答案模型行为。我们提出Eywa,一种基于溯源的记忆架构,围绕“证据先于信念”构建。Eywa在推导规范事实之前存储不可变的源证据,根据类型化信号和源支持验证提取的记忆,并通过确定性多路径读取路径(检索内部零LLM调用)检索有界的记忆上下文。检索到的上下文与答案指令分开返回,使得相同的记忆基质可以在前沿、预算和本地答案模型上进行评估。在冻结的、工件记录的检索配置下,Eywa在LoCoMo C1-C4分割上使用Claude Sonnet 4.6写入和QA角色达到90.19%的裁判准确率。在LongMemEval-S上,达到88.2%的检索充分性准确率。在BEAM(一个700问题的技术记忆压力基准)上,达到81.45%的平均nugget分数和85.29%的pass@score >= 0.5。完整的每问题工件,包括问题、黄金答案、模型答案、检索到的上下文和标签,发布在https://eywa.to/research。

英文摘要

AI agents that persist across sessions need memory they can retrieve, audit, update, and erase. Existing memory systems often collapse source evidence, extracted facts, retrieved context, and answer policy into one opaque prompt path, making failures difficult to diagnose: a wrong answer may come from missing evidence, unsupported extraction, stale state, retrieval loss, or answer-model behavior. We present Eywa, a provenance-grounded memory architecture built around evidence before belief. Eywa stores immutable source evidence before deriving canonical facts, validates extracted memories against typed signals and source support, and retrieves bounded memory context through a deterministic multi-route read path with zero LLM calls inside retrieval. Retrieved context is returned separately from answer instructions, allowing the same memory substrate to be evaluated across frontier, budget, and local answer models. Under a frozen, artifact-recorded retrieval configuration, Eywa reaches 90.19% judge accuracy on the LoCoMo C1-C4 split with Claude Sonnet 4.6 write and QA roles. On LongMemEval-S, it reaches 88.2% retrieval-sufficiency accuracy. On BEAM, a 700-question technical-memory stress benchmark, it reaches 81.45% mean nugget score and 85.29% pass@score >= 0.5. Full per-question artifacts, including questions, gold answers, model answers, retrieved context, and labels, are published at https://eywa.to/research.

2605.30758 2026-06-01 cs.CL cs.LG 版本更新

Pairwise Reference Alignment as a Model-Level Ordinal Observable

成对参考对齐作为模型级序数可观测量

Mujing Li

发表机构 * Independent Researcher(独立研究者)

AI总结 本文定义成对参考对齐为模型评分函数诱导的序数可观测量,提出中心化序参数统计量和基于边界的扩展,并给出有限样本估计量和集中不等式界,通过Qwen2.5和RewardBench实验验证。

详情
AI中文摘要

成对偏好数据广泛用于语言模型评估和对齐,通常用于模型排名、奖励建模或偏好优化。本文提出了一个更基础的测量问题:给定成对偏好的参考分布,当我们测试模型是否将首选响应排在拒绝响应之上时,估计的是哪个模型级量?我们将成对参考对齐定义为由模型评分函数诱导的序数可观测量。给定三元组$(x,y^+,y^-)$上的参考对分布$P_{\mathrm{pair}}$和标量模型分数$S_M(x,y)$,我们将对齐可观测量定义为模型诱导的排序与参考偏好排序一致的概率。我们进一步定义了一个中心化的序参数类统计量,并讨论了基于边界的扩展。所得量在独立抽样假设下具有简单的有限样本估计量和集中不等式界。本文没有引入新的基准。它为成对参考对齐提供了概念和统计公式,阐明了参考对分布的作用,并将一般的序数可观测量与评分选择(如归一化对数概率或基于能量的分数)区分开来。我们还在Qwen2.5模型和RewardBench上进行了初步实证研究,其中所提出的统计量随模型大小和指令调优而增加,并根据公式在参考对子集之间变化。

英文摘要

Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference distribution of pairwise preferences, what model-level quantity is estimated when we test whether a model ranks preferred responses above rejected responses? We define pairwise reference alignment as an ordinal observable induced by a model scoring function. Given a reference pair distribution $P_{\mathrm{pair}}$ over triples $(x,y^+,y^-)$, and a scalar model score $S_M(x,y)$, we define the alignment observable as the probability that the model-induced ordering agrees with the reference preference ordering. We further define a centered order-parameter-like statistic and discuss a margin-based extension. The resulting quantities admit simple finite-sample estimators and concentration bounds under independent sampling assumptions. This note does not introduce a new benchmark. It provides a conceptual and statistical formulation for pairwise reference alignment, clarifies the role of the reference pair distribution, and distinguishes the general ordinal observable from scoring choices such as normalized log-probability or energy-based scores. We also provide an initial empirical study on Qwen2.5 models and RewardBench, where the proposed statistics increase with model size and instruction tuning and vary across reference-pair subsets as predicted by the formulation.
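The alignment observable defined above, the probability that $S_M(x,y^+) > S_M(x,y^-)$ under $P_{\mathrm{pair}}$, has an obvious plug-in estimator, sketched below. Several details are assumptions rather than the note's definitions: ties are counted as disagreement, the centered statistic is taken to be $2\hat{A}-1$, and the concentration bound shown is the standard two-sided Hoeffding bound for i.i.d. samples.

```python
import math
from typing import Callable, Sequence


def alignment_estimate(
    pairs: Sequence[tuple[str, str, str]],
    score: Callable[[str, str], float],
) -> tuple[float, float]:
    """Return (A_hat, centered) for triples (x, y_plus, y_minus).

    A_hat is the fraction of pairs where the model scores the preferred
    response strictly higher. `centered` = 2 * A_hat - 1 is an assumed
    form of the centered statistic: 0 at chance level, 1 at perfect agreement.
    """
    agree = sum(1 for x, y_plus, y_minus in pairs if score(x, y_plus) > score(x, y_minus))
    a_hat = agree / len(pairs)
    return a_hat, 2.0 * a_hat - 1.0


def hoeffding_halfwidth(n: int, delta: float) -> float:
    """Half-width eps such that |A_hat - A| <= eps with probability >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))
```

With 200 independent pairs and `delta = 0.05`, the half-width is about 0.096, so differences between models smaller than that are not distinguishable by this bound alone.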

2605.30753 2026-06-01 cs.CL 版本更新

Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

通过时空并行解码和置信度外推的高效扩散大语言模型

Zekai Li, Ji Liu, Yiqing Huang, Ziqiong Liu, Dong Li, Emad Barsoum

发表机构 * Advanced Micro Devices, Inc. (AMD)(先进微器件公司(AMD))

AI总结 提出时空并行解码(TSPD)和置信度外推(CE)两种方法,通过动态控制去噪轨迹减少冗余迭代,加速扩散大语言模型推理。

详情
AI中文摘要

基于扩散的大语言模型(dLLMs)通过迭代去噪支持并行文本生成,但由于许多步骤花费在冗余精炼和重复掩码那些最终值已确定的token上,推理仍然延迟严重。先前的加速方法主要依赖于步骤局部置信度启发式或固定调度,这些方法对提示和任务变化敏感,且忽略了序列内的强位置效应。我们将扩散解码视为一个动态控制问题,并表明逐token的去噪轨迹为可靠控制提供了关键信号。我们提出了一个具有两个组件的轨迹感知解码框架。首先,时空并行解码(TSPD)使用一个轻量级的时空控制器,该控制器消耗每个token的轨迹特征,包括置信度、熵和动量,以及token位置,以决定何时token已收敛并可以安全固定。其次,我们引入了置信度外推(CE),一个无训练的状态空间模块,它预测未来的logit趋势并带有不确定性,以支持主动决策,包括安全的前瞻和在轨迹振荡或置信度不足时的目标稳定。TSPD和CE共同减少了不必要的去噪迭代,同时保持了输出质量,并且它们与系统优化(如KV缓存)干净地组合。

英文摘要

Diffusion-based large language models (dLLMs) support parallel text generation via iterative denoising, yet inference remains latency-heavy because many steps are spent on redundant refinement and repeated remasking of tokens whose final values are already determined. Prior acceleration methods mainly depend on step-local confidence heuristics or fixed schedules, which are sensitive to prompt and task variation and ignore strong positional effects within a sequence. We cast diffusion decoding as a dynamic control problem and show that token-wise denoising trajectories provide the key signal for reliable control. We propose a trace-aware decoding framework with two components. First, Temporal-Spatial Parallel Decoding (TSPD) uses a lightweight temporal-spatial controller that consumes per-token trajectory features, including confidence, entropy, and momentum, together with token position, to decide when a token has converged and can be safely fixed. Second, we introduce Confidence Extrapolation (CE), a training-free state-space module that forecasts future logit trends with uncertainty to support proactive decisions, including safe look-ahead and targeted stabilization when trajectories are oscillatory or underconfident. Together, TSPD and CE reduce unnecessary denoising iterations while preserving output quality, and they compose cleanly with system optimizations such as KV caching.

2605.30736 2026-06-01 cs.LG cs.AI cs.CL 版本更新

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

OrcaRouter: 一种面向生产的混合离线-在线学习LLM路由器

Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen, Xile Ma, Yi Shi

发表机构 * Continuum AI

AI总结 提出OrcaRouter,一种结合LinUCB上下文赌博机与混合离线-在线学习协议的生产级LLM路由器,通过离线全信息反馈和在线赌博机学习实现低成本高精度模型选择。

Comments 6 pages, 1 table. Technical report

详情
AI中文摘要

大型语言模型的快速发展,每个模型具有不同的能力和推理成本,引发了一个实际部署问题:给定一个传入请求,应由哪个模型处理?我们提出OrcaRouter,一种面向生产的LLM路由器,它结合了基于词法和句子嵌入特征的LinUCB上下文赌博机与混合离线-在线学习协议。在离线阶段,OrcaRouter通过在一组精心策划的路由提示上评估每个候选模型来获取全信息反馈,生成一个奖励矩阵,用于为每个臂拟合一个岭回归器。在部署时,它从这些参数初始化,并可选地从赌博机反馈中继续学习,在观察到奖励后仅更新所选模型的臂。在我们提交RouterArena时(2026年5月20日),OrcaRouter-Adaptive以72.08的竞技场得分在公共RouterArena排行榜上排名第二,在每1000次查询成本1.00美元的情况下实现了75.54%的准确率。

英文摘要

The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.
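The hybrid protocol described above (offline full-information fitting of one ridge regressor per arm, then optional online updates to the selected arm only) can be sketched with a textbook LinUCB. This is an illustration under assumptions, not OrcaRouter's implementation: the class names, the two-dimensional toy features, the reward values, and the `alpha` and `ridge` settings are all hypothetical. The inverse design matrix is maintained with the Sherman-Morrison update so the sketch needs no external libraries.

```python
import math


class Arm:
    """Ridge regressor for one candidate model, storing A^{-1} and b."""

    def __init__(self, dim: int, ridge: float = 1.0):
        # A starts as ridge * I, so A^{-1} starts as (1 / ridge) * I.
        self.a_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v: list[float]) -> list[float]:
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.a_inv]

    def update(self, x: list[float], reward: float) -> None:
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1}x)(A^{-1}x)^T / (1 + x^T A^{-1} x)
        u = self._a_inv_times(x)
        denom = 1.0 + sum(xi * ui for xi, ui in zip(x, u))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= u[i] * u[j] / denom
            self.b[i] += reward * x[i]

    def ucb(self, x: list[float], alpha: float) -> float:
        theta = self._a_inv_times(self.b)  # ridge solution A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))
        u = self._a_inv_times(x)
        width = math.sqrt(max(0.0, sum(xi * ui for xi, ui in zip(x, u))))
        return mean + alpha * width


class Router:
    def __init__(self, n_arms: int, dim: int, alpha: float = 0.5):
        self.arms = [Arm(dim) for _ in range(n_arms)]
        self.alpha = alpha

    def fit_offline(self, features: list[list[float]], reward_matrix: list[list[float]]) -> None:
        """Full-information phase: every arm observes its reward on every prompt."""
        for x, rewards in zip(features, reward_matrix):
            for arm, r in zip(self.arms, rewards):
                arm.update(x, r)

    def select(self, x: list[float]) -> int:
        scores = [arm.ucb(x, self.alpha) for arm in self.arms]
        return max(range(len(scores)), key=scores.__getitem__)

    def update_online(self, x: list[float], arm_index: int, reward: float) -> None:
        """Bandit phase: only the arm that served the request is updated."""
        self.arms[arm_index].update(x, reward)
```

In practice the reward would have to trade accuracy against inference cost; how OrcaRouter combines the two is not stated in the abstract, so the sketch leaves the reward as an opaque scalar.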

2605.30727 2026-06-01 cs.CL 版本更新

MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents

MosaicLeaks:深度研究代理的开放查询中的隐私风险

Alexander Gurung, Spandana Gella, Alexandre Drouin, Issam H. Laradji, Perouz Taslakian, Rafael Pardinas

发表机构 * ServiceNow AI Research(ServiceNow AI研究院) University of Edinburgh(爱丁堡大学) Mila - Quebec AI Institute(魁北克AI研究所) McGill University(麦吉尔大学) University of British Columbia(不列颠哥伦比亚大学)

AI总结 针对深度研究代理在查询外部工具时可能泄露本地敏感信息的问题,提出MosaicLeaks基准测试和隐私感知深度研究(PA-DR)框架,通过强化学习降低信息泄露风险。

详情
AI中文摘要

深度研究代理越来越多地将私有本地文档与网络检索等外部工具结合,这带来了隐私风险:代理的外部查询可能泄露其本地上下文中的敏感信息。这种风险因马赛克效应而被放大,即单个查询看似无害,但聚合起来可能泄露信息。我们引入了MosaicLeaks,一个包含1,001个多跳深度研究任务的基准测试,这些任务将私有企业文档和公共网络语料库链接起来,迫使代理进行依赖本地信息的外部查询。我们通过一个仅观察代理外部查询并尝试在三个层面推断私有信息的对抗性LLM来评估泄露:代理的研究意图、特定私有问题的答案以及关于企业文档的可验证声明。我们发现,不同系列和大小的模型在三个层面上频繁泄露信息;零样本隐私提示减少了泄露但未消除泄露;仅针对任务性能的强化学习加剧了泄露。为了解决这个问题,我们提出了隐私感知深度研究(PA-DR),这是一个RL框架,结合了任务成功的情境奖励和一个学习到的隐私分类器,以提供对每个查询和马赛克级别泄露的密集信用分配。使用PA-DR训练Qwen3-4B-Instruct将准确率从48.7%提高到58.7%,并将答案和全信息泄露从34.0%降低到9.9%。

英文摘要

Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information from its local context. This risk is amplified by the mosaic effect, where individual queries may appear harmless but become revealing in aggregate. We introduce MosaicLeaks, a benchmark of 1,001 multi-hop deep research tasks that chain private enterprise documents and a public web corpus, forcing agents to make external queries that depend on local information. We evaluate leakage with an adversary LLM that observes only the agent's external queries and attempts to infer private information at three levels: the agent's research intent, answers to specific private questions and verifiable claims about the enterprise documents. We find that models across families and sizes frequently leak at all three levels, that zero-shot privacy prompting reduces but does not eliminate leakage and that reinforcement learning for task performance alone worsens leakage. To address this, we propose Privacy-Aware Deep Research (PA-DR), an RL framework that combines situational rewards for task success with a learned privacy classifier to provide dense credit assignment over both per-query and mosaic-level leakage. Training Qwen3-4B-Instruct with PA-DR improves accuracy from 48.7% to 58.7% and reduces answer and full-information leakage from 34.0% to 9.9%.

2605.30723 2026-06-01 cs.CL 版本更新

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

技能并非一刀切:面向 LLM 智能体的模型感知技能对齐

Jianxiang Yu, Jiapeng Zhu, Bochen Lin, Qier Cui, Zichen Ding, Xiang Li

发表机构 * East China Normal University(华东师范大学)

AI总结 提出 MASA 框架,通过层次化技能进化管道和轻量级条件重写器,为不同能力的 LLM 主干网络自适应调整技能,显著提升长时交互任务性能。

详情
AI中文摘要

LLM 智能体越来越多地检索外部策划的技能——在决策时获取的程序性指令——以提高在长时交互任务上的表现。现有的技能库通常被视为模型无关的,在不同容量和行为的骨干网络上重用相同的技能表述。然而,我们在多个模型规模上的控制实验表明,技能的有效性强烈依赖于模型:对某个骨干网络有益的技能可能对另一个有害。受此观察启发,我们提出了 MASA(模型感知技能对齐),一个无需修改智能体权重即可为每个目标骨干网络调整技能的框架。MASA 分两个阶段运行:(1)一个层次化技能进化管道,通过爬山法和 UCB 驱动的树搜索,在环境反馈和模型能力概况的指导下,迭代重写通用和任务特定技能;(2)一个在进化轨迹上训练的轻量级模型条件技能重写器,以在单次前向传播中重现适应过程。在三个交互环境和四个骨干网络上的实验表明,MASA 始终实现最佳整体性能,比最强基线高出最多 25.8 个点。学习到的重写器进一步泛化到未见过的任务和环境,无需额外搜索,以极低的推理成本持续超越更大的教师 LLM。

英文摘要

LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as model-agnostic, reusing the same skill formulations across backbones with substantially different capacities and behaviors. However, our controlled experiments across multiple model scales show that skill effectiveness is strongly model-dependent: a skill that benefits one backbone can harm another. Motivated by this observation, we propose MASA (Model-Aware Skill Alignment), a framework that adapts skills to each target backbone without modifying agent weights. MASA operates in two stages: (1) a hierarchical skill evolution pipeline that iteratively rewrites general and task-specific skills using hill climbing and UCB-driven tree search, guided by environment feedback and model capability profiles; and (2) a lightweight model-conditioned skill rewriter trained on evolution trajectories to reproduce the adaptation in a single forward pass. Experiments across three interactive environments and four backbones show that MASA consistently achieves the best overall performance, with gains of up to 25.8 points over the strongest baseline. The learned rewriter further generalizes to unseen tasks and environments without additional search, consistently outperforming a much larger teacher LLM at a fraction of the inference cost.

2605.30717 2026-06-01 cs.CL 版本更新

Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models

语言模型中性别化与性别中立生成的神经元级干预

Zhiwen You, Nafiseh Nikeghbal, Jana Diesner

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Technical University of Munich(慕尼黑技术大学) Munich Center for Machine Learning(慕尼黑机器学习中心)

AI总结 提出神经元级干预方法,通过识别与女性、男性和性别中立相关的特定神经元,实现对语言模型生成文本的性别控制,并发现性别神经元主要集中在前几层。

详情
AI中文摘要

语言模型(LMs)即使在给定中性提示时也可能产生性别化语言和刻板印象。以往关于LM性别偏见的研究大多通过二元视角(女性 vs. 男性)审视性别,对性别中立形式(如they/them代词或中性措辞的职位名称)关注有限。性别相关信号如何在LM的内部表示中编码仍是一个开放问题。在这项工作中,我们研究了LM中三类性别特异性神经元:女性、男性和性别中立。我们提出了一种神经元级干预方法,用于识别与每个性别类别紧密相关的神经元。然后通过受控生成测试这些神经元,表明激活或掩蔽性别相关神经元可以将句子导向目标性别形式,同时保留其原始含义。为了评估我们性别干预方法的有效性,我们整理了两个数据集,其中包含跨所有三个性别类别标记的受控句子,并通过人工评估验证数据质量。在两个开源LM上的实验表明,性别特异性神经元并非均匀分布在模型各层;相反,它们高度集中在最早层,后面层的贡献较小。与现有方法相比,我们的方法实现了更精确的性别控制,通过两个评估标准,对非目标性别类别的泄露更少,输出质量稳定。总体而言,我们的工作研究了性别如何在LM中编码,并为受控性别干预提供了一种简单而有效的方法,适用于神经元干预评估和性别偏见缓解。代码和数据集可在 https://github.com/zhiwenyou103/Gender-Neuron-Intervention 获取。

英文摘要

Language models (LMs) can produce gendered language and stereotypes even when given neutral prompts. Most prior work on gender bias in LMs primarily examines gender through a binary lens (feminine vs. masculine), with limited attention to gender-neutral forms, such as they/them pronouns or neutrally phrased job titles. How gender-related signals are encoded in the internal representations of LMs remains an open question. In this work, we study gender-specific neurons in LMs across three categories: feminine, masculine, and gender-neutral. We propose a neuron-level intervention method to identify neurons that are strongly tied to each gender category. We then test these neurons through controlled generation, showing that activating or masking gender-related neurons can steer a sentence toward a target gender form while preserving its original meaning. To evaluate the effectiveness of our gender-intervention approach, we curate two datasets with controlled sentences labeled across all three gender categories and validate the data quality through human evaluation. Experiments on two open-source LMs show that gender-specific neurons are not evenly distributed across model layers; instead, they concentrate heavily in the earliest layers with smaller contributions from later layers. Compared to existing methods, our method achieves more precise gender control, with less leakage into non-target gender categories and stable output quality through two evaluation criteria. Overall, our work examines how gender is encoded in LMs and provides a simple yet effective approach toward controlled gender intervention for both neuron intervention evaluation and gender bias mitigation. Code and datasets are available at: https://github.com/zhiwenyou103/Gender-Neuron-Intervention

2605.30712 2026-06-01 cs.CL 版本更新

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

ExpGraph: 面向LLM智能体的模型无关经验学习与图结构记忆

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Nanyang Technological University(南洋理工大学) Meta Monetization AI(Meta 营收人工智能)

AI总结 提出ExpGraph框架,通过图结构记忆和强化学习实现冻结LLM执行器的外部经验复用,在问答、数学推理、代码生成和多步智能体任务上显著提升性能并减少交互步骤。

详情
AI中文摘要

大型语言模型(LLM)智能体在推理、工具使用和多步交互方面表现出强大的能力,但它们通常从零开始解决任务,未能重用先前经验中的成功策略或失败教训。对收集的经验进行微调可以改善重用,但当出现更强或更合适的执行器时,这种方法不够灵活。我们提出ExpGraph,一个模型无关的经验学习框架,使冻结且可替换的LLM执行器能够通过外部经验重用而无需参数更新来改进。ExpGraph将历史轨迹总结为可重用的技能和失败教训,将它们组织为自进化经验图中的节点,并通过图扩散和效用感知排序检索有用的经验。一个轻量级的检索副驾驶通过强化学习进行训练,使用比较有无检索经验时执行器性能的反馈,同时图根据下游任务结果在线更新。我们在ExpSuite上评估ExpGraph,涵盖问答、数学推理、代码生成以及包括ALFWorld和AppWorld在内的多步智能体环境。ExpGraph在静态任务上,对于较小和较大的执行器,分别比最强基线提高12.2%和4.7%;在智能体环境中提高21.4%和12.7%,同时平均交互步骤减少12.7%和21.6%。消融实验表明,图结构经验、效用感知排序和自适应检索共同实现了跨不同任务和执行器模型的有效经验重用。

英文摘要

Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi-step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior experience. Fine-tuning on collected experience can improve reuse, but it is inflexible when stronger or more suitable executors emerge. We propose ExpGraph, a model-agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without parameter updates. ExpGraph summarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self-evolving experience graph, and retrieves useful experiences through graph diffusion and utility-aware ranking. A lightweight retrieval copilot is trained with reinforcement learning using feedback that compares executor performance with and without retrieved experiences, while the graph is updated online from downstream task outcomes. We evaluate ExpGraph on ExpSuite, covering question answering, mathematical reasoning, code generation, and multi-step agentic environments including ALFWorld and AppWorld. ExpGraph improves over the strongest baseline by 12.2% and 4.7% on static tasks with smaller and larger executors, and by 21.4% and 12.7% in agentic environments, while reducing average interaction steps by 12.7% and 21.6%. Ablations show that graph-structured experience, utility-aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models.

2605.30711 2026-06-01 cs.CL cs.AI cs.LG stat.ML 版本更新

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

SAGE: 一种用于智能体大语言模型中高效记忆演化的新颖门控机制

Sijia Wang, Dhanajit Brahma, Ricardo Henao

发表机构 * Duke University(杜克大学)

AI总结 提出SAGE门控机制,基于von Mises-Fisher密度估计和自适应阈值,将记忆写入控制建模为新奇性检测问题,在LoCoMo上以更低成本实现最优token-F1。

详情
AI中文摘要

智能体大语言模型必须持续决定新提取的事实是应添加、与现有记忆合并还是忽略,然而先前的工作更侧重于检索和存储,而非原则性的写入端控制。我们将记忆演化视为一个新颖性检测问题,并提出SAGE(Spherical Adaptive Gate for memory Evolution),一种用于记忆演化的球形自适应门控机制,它通过基于von Mises-Fisher的密度估计器对记忆嵌入上的候选事实进行评分,并使用跟踪记忆存储几何结构的自适应阈值对其进行路由。SAGE将明确新颖的事实解析为ADD,明确冗余的事实解析为NOOP,仅将不确定的情况发送给LLM合并步骤,从而减少了昂贵的写入时推理。在LoCoMo上,SAGE在所有七个开放权重骨干对比中均实现了对Mem0的最佳平均token-F1,而在GPT-4o-mini上,它将添加阶段的API成本降低了3.4倍,添加阶段延迟降低了2.5倍,且平均评判分数差距很小。作为A-Mem的即插即用二进制门控,SAGE在五个模型上跳过了大约16-18%的LLM调用,且在开放权重骨干上质量变化极小。这些结果表明,新颖性感知的写入控制是提高长期智能体记忆中记忆质量和系统效率的实用杠杆。

英文摘要

Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.
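The three-way routing described above (ADD, NOOP, or defer to an LLM merge step) can be illustrated with a von Mises-Fisher kernel score over unit-normalized embeddings. This is a rough sketch, not SAGE itself: the unnormalized kernel sum, the fixed `low` and `high` thresholds, the concentration `kappa`, and the `"MERGE"` label are assumptions, and the paper's adaptive threshold that tracks memory-store geometry is not reproduced.

```python
import math


def unit(v: list[float]) -> list[float]:
    """Project an embedding onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def vmf_kernel_score(candidate: list[float], memory: list[list[float]], kappa: float = 20.0) -> float:
    """Sum of vMF kernels exp(kappa * (cos - 1)) between the candidate and each memory.

    Each term is 1.0 for an exact directional duplicate and decays toward 0
    as the angle grows, so a high score means the fact is already covered.
    """
    x = unit(candidate)
    total = 0.0
    for m in memory:
        cos = sum(a * b for a, b in zip(x, unit(m)))
        total += math.exp(kappa * (cos - 1.0))
    return total


def route(
    candidate: list[float],
    memory: list[list[float]],
    low: float = 0.05,
    high: float = 0.6,
    kappa: float = 20.0,
) -> str:
    """Resolve clear cases cheaply; send only the uncertain band to the LLM."""
    score = vmf_kernel_score(candidate, memory, kappa)
    if score < low:
        return "ADD"    # clearly novel: write without calling the LLM
    if score > high:
        return "NOOP"   # clearly redundant: skip the write
    return "MERGE"      # uncertain: defer to the LLM merge step
```

The cost saving reported in the abstract comes from the first two branches, since only candidates in the middle band would incur write-time LLM reasoning.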

2605.30693 2026-06-01 cs.CR cs.CL 版本更新

Triaging Threats to Specialized Guardrails

针对专用护栏的威胁分类

Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu, Sicong Jiang, Zihan Wang, Minglai Yang, Zhe Zhao, Muhao Chen

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 提出GuardZoo统一基准和RouteGuard路由专家框架,通过将对话分流至专用护栏,解决单一护栏在多种威胁域上的任务干扰问题,提升细粒度检测和泛化能力。

详情
AI中文摘要

构建稳健的安全护栏对于在多样化的实际应用中部署大型语言模型至关重要。然而,这一目标仍然具有挑战性,因为安全风险涵盖异质的威胁领域,而现有数据集仅覆盖零散的风险子集并依赖不一致的分类法。因此,目前尚不清楚现有护栏能否在狭窄的评估设置之外泛化。为了更好地理解护栏模型的稳健性,我们首先引入GuardZoo,一个统一的人工标注基准,包含32,460个样本,覆盖15个不同的不安全类别。在GuardZoo上的评估揭示,单一护栏遭受任务干扰:不同的威胁领域需要不同的决策边界,这些边界难以压缩到单个模型中。因此,我们提出RouteGuard,一个路由-专家框架,将每个对话分流到专门的专家护栏进行威胁特定检测。实验表明,RouteGuard在强护栏基线上提高了细粒度威胁检测,在域外评估下泛化更好,并支持灵活的模块化扩展以应对新兴威胁。

英文摘要

Building robust safety guardrails is essential for deploying Large Language Models across diverse real-world applications. However, this goal remains challenging because safety risks span heterogeneous threat domains, while existing datasets cover only fragmented risk subsets and rely on inconsistent taxonomies. Consequently, it remains unclear whether current guardrails can generalize beyond narrow evaluation settings. To better understand the robustness of guardrail models, we first introduce GuardZoo, a unified human-annotated benchmark with 32,460 samples covering 15 distinct unsafe categories. Evaluation on GuardZoo reveals that monolithic guardrails suffer from task interference: different threat domains require distinct decision boundaries that are difficult to compress into a single model. We therefore propose RouteGuard, a router-expert framework that triages each conversation to specialized expert guardrails for threat-specific detection. Experiments show that RouteGuard improves fine-grained threat detection over strong guardrail baselines, generalizes better under out-of-domain evaluation, and supports flexible modular expansion to emerging threats.

2605.30690 2026-06-01 cs.CL 版本更新

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

ElasticMem: 将潜在记忆作为LLM智能体的可学习资源

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Nanyang Technological University(南洋理工大学)

AI总结 提出ElasticMem框架,通过可学习的弹性潜在记忆分配策略,在记忆密集型问答和具身智能体控制任务中显著提升准确率并降低token开销。

详情
AI中文摘要

长期记忆对于LLM智能体在长时间交互中连贯推理、个性化响应和复用过往经验至关重要。然而,现有的记忆增强方法通常将记忆视为固定资源:文本空间方法将检索到的记忆拼接进上下文窗口,导致大量token开销和对噪声证据的敏感性;而潜在空间方法减少了文本成本,但仍依赖刚性检索或固定容量的记忆接口。这造成了查询相关的记忆效用与固定记忆分配之间的不匹配。我们提出ElasticMem,一种记忆增强的LLM框架,学习将记忆用作弹性潜在资源。ElasticMem构建了一个离线潜在记忆库,包含检索键和内容缓存,从推理器的隐藏状态自适应地检索记忆,通过学得的策略为每个检索到的记忆分配可变潜在预算,并将选中的潜在状态作为软记忆token注入生成过程。整个记忆使用过程通过组相对策略优化与下游任务奖励进行联合优化。我们在MemorySuite上评估ElasticMem,涵盖记忆密集型QA和具身智能体控制。在Qwen2.5-3B-Instruct和Qwen2.5-7B-Instruct骨干网络上,ElasticMem在加权平均QA准确率上分别比最强基线提升26.2%和24.6%,在ALFWorld成功率上分别提升66.3%和27.2%,同时实现了最低的ALFWorld token成本。消融和定性分析进一步表明,自适应检索和弹性预算分配帮助ElasticMem优先考虑有用证据和可迁移计划,超越了刚性余弦相似度。ElasticMem的代码将在https://github.com/ulab-uiuc/ElasticMem发布。

英文摘要

Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource: text-space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent-space approaches reduce textual cost but still rely on rigid retrieval or fixed-capacity memory interfaces. This creates a mismatch between query-dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory-augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agent control. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at https://github.com/ulab-uiuc/ElasticMem.

2605.30685 2026-06-01 cs.CY cs.AI cs.CL cs.HC 版本更新

How Early Adopters Used Generative AI Worldwide: Variation by Country Income and Language

早期采用者如何在全球范围内使用生成式AI:按国家收入和语言的差异

Madeleine I. G. Daepp, Isaac Slaughter

发表机构 * Microsoft AI Economy Institute(微软人工智能经济研究所)

AI总结 基于大规模匿名化AI聊天机器人交互数据,实证分析了不同国家早期采用者在使用生成式AI上的差异,发现教育用途在低收入国家更普遍,休闲用途与收入正相关,且英语交互在非英语主导国家中过度代表,表明语言性能改进可能影响数字鸿沟或跨越式发展。

详情
AI中文摘要

全球范围内人们正在使用AI,但并非所有人都以相同的方式使用。利用一个广泛可用的免费AI聊天机器人的大规模匿名化、去标识化和隐私清洗的交互数据集,我们实证描述了不同国家早期采用者使用情况的差异。在大多数国家,尤其是低收入国家,教育是最常见的使用领域,教育使用与国家GDP之间存在明显的负相关。相比之下,休闲相关使用与国家收入水平正相关。我们发现,语言也塑造了使用模式:在研究期间,英语交互在那些主要语言未被现有模型很好服务的地区过度代表。我们的工作表明,改善跨语言性能可能是决定这项技术扩大数字鸿沟还是实现跨越式发展的关键因素。

英文摘要

AI is being used by people globally, but not everyone is using it in the same ways. Using a large-scale dataset of anonymized, de-identified, and privacy-scrubbed interactions with a widely available and free AI chatbot, we empirically characterize differences in early adopters' usage across countries. Schooling is the most common domain of use in most countries, particularly low-income countries, with a strong inverse association evident between schooling and country-level GDP. Leisure-related use, by contrast, is positively associated with country-level income. Language, we find, also shapes use: English-language interactions are overrepresented in places where the predominant languages were not well-served by existing models during the period of the study. Improving performance across languages may be a key factor, our work suggests, in whether this technology expands digital divides or enables leapfrogging.

2605.30675 2026-06-01 cs.CL cs.AI 版本更新

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

大型语言模型不确定性中的人类对齐、校准与激活模式

Kyle Moore, Jesse Roberts, Daryl Watson, William Ward, Grayson Heyboer

发表机构 * Vanderbilt University(范德比大学) Tennessee Technological University(田纳西技术大学)

AI总结 研究大型语言模型的不确定性与人类不确定性的相似性,通过分析行为与内部激活模式,发现模型在多项选择和开放式事实回忆数据集上同时存在对齐与校准,并描述了指令微调的影响。

详情
AI中文摘要

不确定性量化是大型语言模型行为分析中一个庞大且不断发展的子领域。为了识别和对抗幻觉,该领域主要关注测量和改进校准,即不确定性判断对任务效能的准确性。在这项工作中,我们探讨了一个相对未被充分探索的问题:大型语言模型的不确定性与人类不确定性有多相似。我们研究了大型语言模型的外部行为和内部激活模式中是否存在类似人类的不确定性信号,即不确定性对齐。我们识别了模型在涵盖多项选择和开放式事实回忆的多种数据集上是否同时表现出对齐和校准的证据。并且我们描述了指令微调对这些方面的影响。

英文摘要

Uncertainty quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, that is, how accurately uncertainty judgments track task efficacy. In this work, we investigate the relatively underexplored question of how similar large language model uncertainty is to human uncertainty. We investigate the presence and strength of human-similar uncertainty signals, termed uncertainty alignment, in large language model overt behavior and internal activation patterns. We identify whether the models show evidence of simultaneous alignment and calibration on a variety of datasets covering both multiple-choice and open-ended factual recall. We also characterize the effect of instruct fine-tuning on each of these facets.

2605.30673 2026-06-01 cs.CL 版本更新

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

TeachObs:多模态教学观察与模型评估的人工验证基准

Yeil Jeong, Youngjin Yoo, Seobin Sohn, Hyejin Han, Jinseo Lee, Scott Howard, Unggi Lee

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校) Pai Chai University(培才大学) Seoul National University(首尔国立大学) Ewha Womans University(梨花女子大学) University of Wolverhampton(伍尔弗汉普顿大学) Korea University Sejong Campus(高丽大学世宗校区)

AI总结 提出TeachObs基准,包含30节公开课视频的5158个15秒场景,由7名研究者标注39个二值观察码,并评估5个前沿视觉大语言模型在三种任务上的表现。

详情
AI中文摘要

课堂视频包含可观察的教学实践,但其教学和视觉信号很少以适合模型评估的形式组织。我们提出了TeachObs,一个用于课堂视频中多模态教学观察的人工验证基准。TeachObs包含来自八个国家的30节公开课视频,分为5158个固定的15秒场景。七名研究者用39个二值观察码标注每个场景,涵盖20个视觉码(如手势、板书、指向和视觉材料)和19个非视觉码(如指导、监控、提问、反馈和反思)。基于Krippendorff's alpha,使用可靠性和流行度感知规则构建黄金片段标签。除了片段级标签,三名专家评分员对30节课进行了课程级评分和定性评估,涵盖教学设计、教学实施、学生反应、学习材料和课程收尾,评分员覆盖范围详见正文。利用这两个人工参考层,我们在三个任务(纯文本片段编码、文本+帧片段编码,以及在LLM-as-judge协议下评分的课程级覆盖)上评估了五个具备视觉能力的前沿大语言模型,发现没有单一模型在所有三个任务中持续优于其他模型,添加中间帧会同时增加每个场景的真实和虚假归因,并且模型评估相对于专家评分员高估了程序清晰的课程。因此,TeachObs支持细粒度标注基准测试和整课评估,展示了AI系统在哪些方面可以辅助课堂视频分析,以及在哪些方面专家判断仍然必要,涵盖不同学科、课堂形式和标注难度水平。

英文摘要

Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present \textit{TeachObs}, a human-validated benchmark for multimodal teaching observation in classroom videos. \textit{TeachObs} includes 30 public lesson videos from eight countries divided into 5,158 fixed 15-second scenes. Seven researchers annotated each scene with 39 binary observation codes, covering 20 visual codes, such as gesture, board work, pointing, and visual materials, and 19 nonvisual codes, such as instruction, monitoring, questioning, feedback, and reflection. Gold segment labels are constructed using reliability- and prevalence-aware rules based on Krippendorff's alpha. In addition to segment-level labels, three expert raters produced lesson-level ratings and qualitative evaluations of instructional design, instructional delivery, learner response, learning materials, and lesson closure across the 30 lessons, with rater coverage detailed in the body. Using these two human reference layers, we evaluate five vision-capable frontier LLMs across three tracks - text-only segment coding, text + frame segment coding, and lesson-level coverage scored under an LLM-as-judge protocol - and find that no single model consistently outperforms others across all three tracks, that adding a mid-frame inflates both true and false attributions per scene, and that model evaluations over-rate procedurally clear lessons relative to expert raters. \textit{TeachObs} therefore supports both fine-grained annotation benchmarking and whole-lesson evaluation, showing where AI systems can assist classroom video analysis and where expert judgment remains necessary across varied subjects, classroom formats, and annotation difficulty levels.

2605.30668 2026-06-01 cs.CL cs.AI 版本更新

CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation

CobSeg: 对话主题分割的连贯性边界建模

Sijin Sun, Liangbin Zhao, Jiaxiang Cai, Ming Deng, Mingyu Luo, Xiuju Fu

发表机构 * Institute of High Performance Computing, Agency for Science, Technology and Research(新加坡科技研究局高性能计算研究所) Shanghai University(上海大学) Fudan University(复旦大学)

AI总结 提出CobSeg多分支架构,通过分离连贯性语义与词汇边界转换并利用边界信息加权和主题连贯性线索,在无需LLM调用下提升对话主题分割性能。

Comments 8 pages with appendix. Under review

详情
AI中文摘要

对话主题分割在许多人类-AI协作应用中至关重要,需要识别异质边界线索,包括话语边缘附近的词汇转换和跨话语的语义不连续性。现有的话语模型常常稀释这些局部词汇信号。我们提出CobSeg,一种新颖的多分支架构,它将连贯性层面的语义连续性与词汇边界转换分离,并通过方向性边界预测恢复两者。CobSeg进一步使用边界信息加权来强调高效用的话语位置,并融合了基于语料库的主题连贯性线索与学习到的组合权重。尽管CobSeg在有监督的金标准边界训练和自动诱导边界的伪标签设置下作为紧凑的可训练分割器进行评估,它在推理过程中无需LLM调用即可实现增强的边界预测。在五个基准测试中,它改进了$P_k$和$W_d$,特别是在局部词汇线索显著时:在金标准监督下,它在VHF上将$P_k$降低了0.7个点,$W_d$降低了0.6个点,并在DialSeg711上达到了$P_k$为1.0;在诱导边界下,它在VHF上将$P_k$降低了14.8个点,在DialSeg711上降低了1.5个点,在TIAGE上降低了1.1个点,优于先前的非LLM方法。

英文摘要

Dialogue topic segmentation is critical in many human-AI collaborative applications. It requires identifying heterogeneous boundary cues, including lexical transitions near utterance edges and semantic discontinuities across utterances. Existing utterance models often dilute these local lexical signals. We propose CobSeg, a novel multi-branch architecture that separates coherence-level semantic continuity from lexical boundary transitions and recovers both through directional boundary prediction. CobSeg further uses boundary informativeness weighting to emphasize high-utility utterance positions, and incorporates a corpus-derived topic coherence cue with learned combination weights. CobSeg is evaluated as a compact trainable segmenter under supervised gold-boundary training and under a pseudo-label setting with automatically induced boundaries; in both settings it predicts boundaries without LLM calls during inference. Across five benchmarks, it improves $P_k$ and $W_d$, particularly when local lexical cues are prominent: under gold supervision, it reduces $P_k$ by 0.7 points and $W_d$ by 0.6 points on VHF, and reaches $P_k$ of 1.0 on DialSeg711; with induced boundaries, it reduces $P_k$ by 14.8 points on VHF, by 1.5 points on DialSeg711, and by 1.1 points on TIAGE, outperforming prior non-LLM approaches.
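
The $P_k$ metric quoted in this abstract is the standard windowed segmentation error, where lower is better. As a rough illustration only, the sketch below computes it from per-utterance segment ids. The function names and the default window `k` (half the mean reference segment length) are assumptions made for this example; nothing here is taken from the CobSeg code. The abstract appears to report the value in points, that is, this fraction multiplied by 100.

```python
def segment_ids(segment_lengths):
    """Expand segment lengths such as [2, 3] into per-utterance ids [0, 0, 1, 1, 1]."""
    ids = []
    for seg_id, length in enumerate(segment_lengths):
        ids.extend([seg_id] * length)
    return ids


def p_k(reference, hypothesis, k=None):
    """Fraction of probe pairs (i, i + k) on which reference and hypothesis
    disagree about whether the two utterances lie in the same segment."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    n = len(reference)
    if k is None:
        # Assumed convention: half the average reference segment length.
        k = max(1, round(n / len(set(reference)) / 2))
    if n <= k:
        raise ValueError("sequence is too short for the chosen window")
    errors = 0
    for i in range(n - k):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / (n - k)


if __name__ == "__main__":
    gold = segment_ids([3, 3])       # one true boundary after the third utterance
    predicted = segment_ids([6])     # a segmenter that predicts no boundary
    print(p_k(gold, predicted))
```

A segmenter that reproduces the reference exactly scores 0, and every probe pair that straddles a missed or spurious boundary adds to the error.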

2605.30654 2026-06-01 cs.CL cs.AI cs.HC 版本更新

EUDAIMONIA: Evaluating Undesirable Dynamics in AI

EUDAIMONIA: 评估AI中的不良动态

Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast, Ravi Iyer, Robin Jia

发表机构 * University of Southern California(南加州大学)

AI总结 提出Social AI Design Code框架,并通过EUDAIMONIA基准测试评估22个最新LLM在社交互动中对用户福祉的符合程度,发现即使最强模型也违反约30%的设计要求。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作陪伴、情感披露和人际建议的对话伙伴,但这些互动的社会动态可能造成能力导向或传统安全评估无法捕捉的伤害。我们引入了Social AI Design Code,这是一个评估LLM在社交互动中是否符合用户福祉的框架,包括它们是否鼓励有害的亲密关系、依赖或长时间参与。为了在自然且多样化的用户-LLM互动中评估这些风险,我们通过弱到强过滤、多模型重新标记和受控重写,从WildChat构建了包含969个用户输入和3,147个设计要求违规检查的基准测试EUDAIMONIA,将代码操作化。评估22个最近的LLM,我们发现即使最强的模型Claude-Opus-4.7和GPT-5.5也分别违反了30.7%和27.2%的检查。扩展思考并未降低违规率,表明这些失败是持久的社会对齐问题,而非仅通过测试时推理就能解决的缺陷。

英文摘要

Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engagement. To evaluate these risks in natural and diverse user-LLM interactions, we operationalize the code with EUDAIMONIA, a benchmark of 969 user inputs and 3,147 design-requirement violation checks built from WildChat through weak-to-strong filtration, multi-model relabeling, and controlled rewriting. Evaluating 22 recent LLMs, we find that even the strongest models, Claude-Opus-4.7 and GPT-5.5, violate 30.7% and 27.2% of checks, respectively. Extended thinking does not reduce violation rates, suggesting that these failures are persistent social-alignment problems rather than deficits solvable through test-time reasoning alone.

2605.30653 2026-06-01 cs.CL 版本更新

Counterfactual Graph for Multi-Agent LLM Calibration

多智能体LLM校准的反事实图

Jiatan Huang, Mingchen Li, Ziming Li, Sunjae Kwon, Hong Yu, Chuxu Zhang

发表机构 * University of Connecticut(康涅狄格大学) University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校)

AI总结 针对多智能体LLM系统中通信导致的相关失败和虚假共识问题,提出CAGE-CAL框架,通过比较观察到的通信后图与匹配的反事实无通信图来校准置信度,提升可靠性区分和拓扑选择。

详情
AI中文摘要

多智能体LLM系统通常将一致性视为证据:当面板中的许多智能体给出相同答案时,该答案被认为更可靠。我们表明,在智能体通信后,这一假设可能失效。通信可能引发相关失败和虚假共识,因此相同的投票份额可能在一个拓扑中反映可靠的一致性,但在另一个拓扑中反映过度自信。我们提出CAGE-CAL,一个用于多智能体LLM的反事实智能体图校准框架。对于每个查询,CAGE-CAL将观察到的通信后智能体图与匹配的反事实无通信图进行比较,捕捉成对失败相关性和组级依赖性。CAGE-CAL不是简单地计算有多少智能体一致,而是估计观察到的依赖性与无通信依赖性之间的反事实偏移,并据此校准置信度。在五个基准测试中,CAGE-CAL以具有竞争力的ECE提高了可靠性区分,并且其校准后的置信度进一步改进了拓扑选择,优于最佳固定拓扑策略。

英文摘要

Multi-agent LLM systems often treat agreement as evidence: when many agents in a panel give the same answer, that answer is assumed to be more reliable. We show that this assumption can fail after agents communicate. Communication can induce correlated failures and false consensus, so the same vote share may reflect reliable agreement in one topology but over-confidence in another. We propose CAGE-CAL, a counterfactual agent-graph calibration framework for multi-agent LLMs. For each query, CAGE-CAL compares an observed post-communication agent graph with a matched counterfactual no-communication graph, capturing both pairwise failure correlations and group-level dependencies. Rather than simply counting how many agents agree, CAGE-CAL estimates the counterfactual shift between observed and no-communication dependence, and calibrates confidence accordingly. Across five benchmarks, CAGE-CAL improves reliability discrimination with competitive ECE, and its calibrated confidence further improves topology selection over the best fixed-topology strategy.
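
ECE, the calibration metric named in this abstract, can be sketched as follows. This is the common equal-width-bin estimator and not the CAGE-CAL method itself; the function name, the 10-bin default, and the use of a raw vote share as the confidence value are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(is_correct)))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Toy values: the share of agents agreeing on the final answer, per query.
    vote_share = [0.9, 0.9, 0.9, 0.9]
    answer_was_right = [True, True, True, False]
    print(expected_calibration_error(vote_share, answer_was_right))
```

If communication between agents inflates agreement, the vote share rises without a matching rise in accuracy, and that gap is what ECE measures.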

2605.30646 2026-06-01 cs.CL cs.AI 版本更新

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

同一患者,不同措辞,不同诊断?评估临床大语言模型中的语义稳定性

Mahdi Alkaeed, Adnan Qayyum, Nabeel Abo Kashreef, Muhammad Bilal, Junaid Qadir

发表机构 * Department of Computer Science and Engineering, College of Engineering, Qatar University(卡塔尔大学工程学院计算机科学与工程系) College of Science and Engineering, Hamad Bin Khalifa University (HBKU)(哈马德·本·哈利法大学(HBKU)理工学院) Primary Health Care Corporation (PHCC)(初级卫生保健公司(PHCC)) Birmingham City University(伯明翰城市大学)

AI总结 针对临床大语言模型对语义等价但措辞不同的提示敏感的问题,提出基于自然语言推理的语义验证框架和三个量化指标,评估16个开源通用与医学模型,发现领域专业化并不一致地提升或降低鲁棒性。

Comments 14 pages, 5 figures

详情
AI中文摘要

大语言模型(LLMs)越来越多地应用于临床场景。然而,它们的行为对细微的语言变化(如改写或句法变化)高度敏感。这种敏感性在安全关键的医疗环境中带来风险,因为语义等价的输入应产生一致的预测。但一个关键挑战是确保提示变化真正保留临床意义,因为基于嵌入的相似性度量通常无法捕捉涉及否定、时间性或严重程度的区别。为解决这一局限,我们提出一个基于自然语言推理(NLI)的语义验证框架,用于过滤保留意义的提示变化,并进一步使用LLM-as-a-judge进行精炼,由临床专家审核。此外,我们引入三个指标来量化模型敏感性:保留意义变化敏感性(MVS)、置信度变化(ΔC)和最坏情况不稳定性(WCI)。我们使用来自DiagnosisQA和MedQA数据集的改写提示,评估了同一模型系列和参数规模下的16个开源通用(GP)和医学LLMs。结果表明,领域特定(DS)模型之间的鲁棒性差异是混合的且高度依赖模型,即领域专业化并不一致地改善或降低对保留意义提示改写的鲁棒性。几个DS模型在鲁棒性排名中位列前茅(与GP对应模型相比),而强大的GP基线模型也保持竞争力。

英文摘要

Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. A key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models and their GP counterparts are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.

2605.30641 2026-06-01 cs.CL cs.AI 版本更新

COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models

COFT: 用于大语言模型中公平思维链推理的反事实-保形解码

Arya Fayyazi, Mehdi Kamal, Massoud Pedram

发表机构 * Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, California, USA(电气与计算机工程系,南加州大学,洛杉矶,加利福尼亚州,美国)

AI总结 提出COFT,一种无需训练的解码方法,通过反事实提示和保形校准在解码时实现token级公平性控制,显著减少思维链生成中的社会偏见,同时保持任务效用和语言质量。

Comments Proceedings of ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在思维链(CoT)生成过程中可能揭示并放大社会偏见。我们提出COFT(Chain of Fair Thought),一种无需训练的解码方法,在解码时应用token级公平性控制,并对任何冻结的因果语言模型提供无分布边际有效性保证(在可交换性下)。COFT分三个阶段运行。首先,通过将敏感跨度替换为中性token来创建掩码反事实提示。其次,通过轻量级logit融合比较事实和掩码logit分布,以减弱属性驱动的偏见。第三,使用双分支分裂保形校准,在用户选择的风险水平下认证每步候选token集。我们在六个模型和多个偏见基准上评估COFT。我们的方法将标准偏见指标降低30-55%(中位数38%),同时保持任务效用和语言质量。推理准确率在运行间噪声范围内保持不变。计算开销适中,相当于一次额外的缓存前向传递(<=11%)。COFT提供了一条清晰、可审计的路径,实现更安全的CoT生成,显著减少偏见,效用损失可忽略,且无需重新训练、辅助分类器或权重访问。

英文摘要

Large language models (LLMs) can reveal and amplify societal biases during chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free decoding method that applies token-level fairness control at decode time, with distribution-free marginal validity guarantees (under exchangeability) for any frozen causal language model. COFT operates in three stages. First, it creates a masked counterfactual prompt by replacing sensitive spans with neutral tokens. Second, it compares the factual and masked logit distributions through lightweight logit fusion to attenuate attribute-driven biases. Third, it uses dual-branch split-conformal calibration to certify per-step candidate token sets at a user-chosen risk level. We evaluate COFT across six models and multiple bias benchmarks. Our method reduces standard bias metrics by 30-55% (median 38%) while preserving task utility and language quality. Reasoning accuracies remain unchanged within run-to-run noise margins. The computational overhead is modest, equivalent to one additional cached forward pass (<=11%). COFT offers a clear, auditable path to safer CoT generation with significant bias reduction, negligible utility loss, and no requirement for retraining, auxiliary classifiers, or weight access.
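
The third stage described in this abstract relies on split-conformal calibration. The sketch below shows the plain single-branch version of that idea for a candidate token set. It is not COFT's dual-branch procedure: the nonconformity score `1 - p`, the function names, and the toy probabilities are all assumptions made for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this risk level: accept every token.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def candidate_token_set(token_probs, threshold):
    """Keep tokens whose assumed nonconformity score (1 - probability) is within the threshold."""
    return {token for token, prob in token_probs.items() if 1.0 - prob <= threshold}


if __name__ == "__main__":
    # Hypothetical scores of the true next token on held-out calibration steps.
    calibration = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    threshold = conformal_threshold(calibration, alpha=0.5)
    print(candidate_token_set({"a": 0.7, "b": 0.2, "c": 0.1}, threshold))
```

Under exchangeability of calibration and test steps, a set built this way contains the true token with probability at least `1 - alpha`, which is the kind of distribution-free marginal guarantee the abstract refers to.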

2605.30640 2026-06-01 cs.LG cs.CL 版本更新

CSULoRA: Closest Safe Update Low-Rank Adaptation

CSULoRA:最近安全更新低秩适应

Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh, Salima Lamsiyah

发表机构 * Department of Computer Science, University of Luxembourg(卢森堡大学计算机科学系)

AI总结 提出CSULoRA方法,通过后处理校正LoRA适配器,在保留任务相关性的同时抑制不安全更新方向,降低攻击成功率。

Comments 10 pages, 3 figures

详情
AI中文摘要

低秩适应已成为大型语言模型参数高效微调的标准方法,但即使少量不安全或对抗性微调数据也会显著削弱对齐模型的安全行为。现有的安全保持LoRA方法通常依赖硬干预,如投影、剪枝、阈值化或额外训练目标。虽然这些方法可以抑制不安全更新方向,但它们也可能移除任务相关信息或需要额外调优。我们提出CSULoRA,一种通过最近安全更新估计来校正训练后LoRA适配器的后处理方法。CSULoRA从安全对齐模型与其对应基础检查点之间的权重位移中估计安全对齐子空间。然后,它将每个LoRA更新分解为完全对齐、部分对齐和子空间外分量。CSULoRA不丢弃估计安全子空间外的分量,而是求解一个闭式惩罚最小变化问题,该问题保留完全对齐分量,同时根据相对能量平滑衰减潜在不安全方向。在对抗性微调实验中,CSULoRA显著降低了攻击成功率,同时保留了标准LoRA微调获得的大部分效用增益。

英文摘要

Low-rank adaptation has become a standard method for parameter-efficient fine-tuning of large language models, but even small amounts of unsafe or adversarial fine-tuning data can substantially weaken the safety behavior of aligned models. Existing safety-preserving LoRA methods often rely on hard interventions such as projection, pruning, thresholding, or additional training objectives. While these methods can suppress unsafe update directions, they may also remove task-relevant information or require extra tuning. We introduce CSULoRA, a post-hoc method for correcting trained LoRA adapters through closest safe update estimation. CSULoRA estimates a safety-aligned subspace from the weight displacement between a safety-aligned model and its corresponding base checkpoint. It then decomposes each LoRA update into fully aligned, partially aligned, and off-subspace components. Instead of discarding components outside the estimated safety subspace, CSULoRA solves a closed-form penalized minimum-change problem that preserves the fully aligned component while smoothly attenuating potentially unsafe directions according to their relative energy. In adversarial fine-tuning experiments, CSULoRA substantially reduces attack success rate while preserving most of the utility gains obtained from standard LoRA fine-tuning.

2605.30628 2026-06-01 cs.CL cs.AI cs.LG 版本更新

The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability

错误的架构:从普遍不可能到局部补丁的LLM可靠性

Mikhail L. Arbuzov, Lee Mosbacker, Sisong Bei, Ziwei Dong, Dmitri Kalaev, Alexey Shvets

发表机构 * Independent Researcher(独立研究者) Palo Alto Networks(帕洛阿尔托网络)

AI总结 本文通过两个命题和一个推论,形式化地论证了通用LLM可靠性在无限域上不可实现,但在操作有界的局部补丁中可通过目录发现和干预覆盖实现可靠性。

Comments 25 pages, no figures

详情
AI中文摘要

通用LLM可靠性不是一个有限库问题:在所有可能任务、工具、模式、知识源和评估者期望中,新的可干预区分的失败模式会无界出现,因此没有有限的干预词典能保证对每种此类模式的有界残余错误。但部署的系统并不在整个宇宙中运行。它们在操作有界的补丁(法律审查、医学RAG、代码修复、客户支持代理、合同提取)内运行,这些补丁具有重复的任务、模式、工具和评估者期望。在这些补丁内,经验证据表明失败是稀疏的、重复的,并集中在一个小的重复目录中,因此可靠性变成了一个局部目录发现和干预覆盖问题,而不是指数级的令牌长度问题。我们通过两个命题和一个推论形式化了这一转变。命题1是最坏情况模式方面的负面结果:没有有限的干预词典能覆盖无界域的每个可区分的失败模式。推论1是逆发现蕴含:模式发现的对数上界无法容纳线性更多的不同尾模式,除非指数级地观察到更多的硬失败事件。命题2是积极的局部补丁结果:在活跃模式暴露对数增长和头部重覆盖下,每个硬决策的足够干预预算随序列长度多对数增长,并在补丁目录饱和后变为域常数。该框架重新定位而非消解长上下文困难:当硬决策数量本身随任务长度增长时,可靠性仍然困难;贡献在于识别轴向干预,而非使这些区域变得容易。

英文摘要

Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expectations, new intervention-distinguishable failure modes can appear without bound, so no finite intervention dictionary can guarantee bounded residual error for every such mode. But deployed systems do not operate over the whole universe. They operate inside operationally bounded patches (legal review, medical RAG, code repair, customer-support agents, contract extraction) with recurring tasks, schemas, tools, and evaluator expectations. Within such patches, empirical evidence suggests failures are sparse, repetitive, and concentrated in a small recurring catalogue, so reliability becomes a local catalogue-discovery and intervention-coverage problem rather than an exponential token-length problem. We formalize this transition with two propositions and one corollary. Proposition 1 is the worst-case-mode-wise negative result: no finite intervention dictionary covers every distinguishable failure mode of an unbounded domain. Corollary 1 is the inverse-discovery implication: the logarithmic upper bound on mode discovery cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events. Proposition 2 is the positive patch-local result: under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically in sequence length and becomes domain-constant once the patch catalogue saturates. The framework relocates rather than dissolves long-context difficulty: where the number of hard decisions itself grows with task length, reliability remains hard; the contribution is to identify the on-axis intervention rather than to make those regimes easy.

2605.30611 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Crafter: 面向多样化输入的可编辑科学图表生成的多智能体框架

Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Tsinghua University(清华大学) Peking University(北京大学)

AI总结 提出Crafter多智能体框架,通过结构化组合离散语义组件,实现跨图表类型和输入条件的可编辑科学图表生成,并引入CraftEditor将栅格输出转换为可编辑SVG,在CraftBench基准上显著优于现有方法。

Comments 24 pages, 11 figures

详情
AI中文摘要

科学图表是传达复杂研究思想最有效的手段之一,但生成出版质量的插图仍然是论文准备中最劳动密集的部分。现有的自动化系统各自针对单一图表类型,且仅接受文本输入,未能解决研究人员实际使用的多样类型和条件;此外,它们的栅格输出无法进行局部修改。由于科学图表是离散语义组件的结构化组合,生成器在这些布局上产生的局部错误需要的不是更强的骨干网络,而是一个框架。我们将这个框架实例化为两个互补系统:Crafter,一个用于图表生成的多智能体框架,无需架构更改即可泛化到多种图表类型和输入条件;以及CraftEditor,它应用相同的模式将栅格输出转换为可编辑的SVG。此外,我们引入了CraftBench,一个涵盖三种图表类型和四种输入条件的基准,并带有手工质量标注。实验表明,Crafter在PaperBanana-Bench和CraftBench上显著优于独立的生成器和智能体基线,消融实验确认了每个组件的独立贡献;CraftEditor忠实地将输出转换为可编辑的SVG,超越了所有基线。我们的代码和基准可在https://github.com/HaozheZhao/Crafter获取。

英文摘要

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

2605.30604 2026-06-01 cs.CR cs.AI cs.CL cs.IR 版本更新

An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations

面向受监管网络安全运营的组织范围LLM代理运行时架构

George Fatouros, Georgios Makridis, George Kousiouris, John Soldatos, Dimosthenis Kyriazis

发表机构 * Innov-Acts Ltd(Innov-Acts有限公司) Dept. of Digital Systems, University of Piraeus(比雷埃夫斯大学数字系统系) Harokopio University(哈罗科皮奥大学) Dept. of Informatics and Telematics(信息学与远程信息处理系)

AI总结 提出一种组织范围的LLM代理运行时架构,通过类型化安全上下文、运行时核心、专业子代理、受控工具适配层和分层人机回环,实现检索、工具调用、内存、发现、报告和审计的全局强制,并保持模型无关和本地部署。

Comments 8 pages, 3 figures

详情
AI中文摘要

受监管的网络安全工作流缺乏一个运行时基础,该基础能够在检索、工具调用、内存、发现、报告和审计中强制执行组织范围,同时保持模型无关和本地可部署。近期的大语言模型(LLM)代理系统在孤立的网络安全任务上报告了强劲结果,但它们本身并未为受监管的安全运营中心(SOC)和合规工作流定义一个可审计的平台架构,在这些工作流中,单个分析师可能触发约束整个组织的行动,并且运行时必须与现有的SIEM/XDR堆栈集成,作为上下文和告警驱动触发器的主要来源,而不是作为独立的分析层运行。本文提出了一种面向金融网络安全的组织范围LLM代理运行时架构。其贡献是一个类型化的安全上下文,该上下文在每个入口点创建,包括作为一等触发器摄入的SIEM/XDR通知,并在每个组件边界强制执行,结合共享的运行时核心、逻辑专业子代理、受控工具适配层(在统一策略和审计下暴露SIEM/XDR查询、丰富和响应原语)、带有证据引用的结构化发现、分层人机回环(HITL)门控以及仅追加审计。模型上下文协议(MCP)、扩展遥测、用于渗透测试的数字孪生、图检索和联邦知识共享被视为可选的扩展路径,而非强制性的运行时假设。我们描述了一个可实现的切片作为架构的可测试性表面,并提出了一个可证伪的评估计划,包含用于架构就绪性、安全策略执行、证据可追溯性、输出质量和运营可观测性的度量级通过标准。

英文摘要

Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model-agnostic and locally deployable. Recent large language model (LLM) agent systems report strong results on isolated cybersecurity tasks, yet they do not by themselves define an auditable platform architecture for regulated security operations centre (SOC) and compliance workflows, where a single analyst may trigger actions that bind the organization, and where the runtime must integrate with existing SIEM/XDR stacks as a primary source of context and alert-driven triggers rather than operate as a standalone analytical layer. This paper proposes an organization-scoped LLM agent runtime architecture for financial cybersecurity. The contribution is a typed Security Context that is created at every entry point, including SIEM/XDR notifications ingested as first-class triggers, and enforced at every component boundary, combined with a shared Runtime Core, logical specialist subagents, a governed Tool Adapter Layer exposing SIEM/XDR query, enrichment, and response primitives under uniform policy and audit, structured findings with evidence references, tiered human-in-the-loop (HITL) gates, and append-only audit. Model Context Protocol (MCP), extended telemetry, digital twins for pentesting, graph retrieval, and federated knowledge sharing are treated as optional extension paths rather than mandatory runtime assumptions. We describe an implementable slice as the architecture's testability surface, and we propose a falsifiable evaluation plan with metric-level pass criteria for architecture readiness, security-policy enforcement, evidence traceability, output quality, and operational observability.

2605.30599 2026-06-01 cs.LG cs.CL 版本更新

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

AMNESIA: 一个大规模医学遗忘基准套件与疾病知情分析

Saeedeh Davoudi, Reihaneh Iranmanesh, Ophir Frieder, Nazli Goharian

发表机构 * IR Lab, Computer Science Department, Georgetown University, Washington D.C.(乔治城大学计算机科学系信息检索实验室,华盛顿特区)

AI总结 提出AMNESIA,首个大规模开源医学遗忘基准,包含70,560个问答对,评估四种遗忘方法,发现个体遗忘会侵蚀同病患者的其他知识。

详情
AI中文摘要

医学知识不断演变。这需要更新或选择性遗忘已训练的医学LLM中编码的信息。机器遗忘旨在无需完全重新训练即可移除特定训练数据对模型的影响。然而,现有的遗忘基准依赖于合成或小规模通用数据,导致临床遗忘研究不足。我们引入AMNESIA,首个大规模、开源医学遗忘基准,包含来自11种疾病类别、8,820份患者笔记的70,560个问答对。AMNESIA包括测试直接回忆的事实问题和测试临床推理的推理问题。我们用它来评估四种广泛使用的遗忘方法,分别在随机患者和疾病级别,并引入一个新的指标来检测医学术语的泄露。我们表明,遗忘个体患者会侵蚀具有相同病症的其他患者的知识,这需要能够更好地区分患者与共享临床知识的方法。

英文摘要

Medical knowledge is continuously evolving. This creates a need to update or selectively forget information encoded in already-trained medical LLMs. Machine unlearning aims to remove the influence of specific training data from a model without full retraining. Yet, existing unlearning benchmarks rely on synthetic or small-scale general data, leaving clinical unlearning understudied. We introduce AMNESIA, the first large-scale, open-source benchmark for medical unlearning, with 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories. AMNESIA includes both factual questions testing direct recall and reasoning questions testing clinical inference. We use it to evaluate four widely used unlearning methods at both the random-patient and disease levels, and introduce a new metric for detecting leakage of medical terminology. We show that unlearning individual patients erodes knowledge of others with the same condition, calling for methods that can better separate patients from shared clinical knowledge.

2605.30590 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

反事实评估揭示临床LLM和智能体的隐藏能力画像

Matt Turk

发表机构 * Protege Data Lab(Protege数据实验室)

AI总结 提出因果敏感性评分(CSS),通过沿五个临床维度变异肿瘤病例来评估模型是否按预期方向更新推荐,发现与覆盖度指标排名相反,并揭示所有前沿模型在手术状态干预上的安全盲点。

Comments Accepted to RLEval @ ACM CAIS 2026 (Workshop on Methods and RL Environments for Evaluating AI Agents) and selected for an invited talk based on reviewer ratings. 4-page short paper + appendix

详情
AI中文摘要

两个临床AI系统在基于覆盖度的评分标准上可能得分几乎相同,但当患者输入变化时行为却截然不同:一个更新其推荐以匹配新的临床信号,而另一个无论输入如何都产生相同输出。我们引入因果敏感性评分(CSS),这是一个预注册的干预性指标,沿五个临床有意义的维度(生物标志物翻转、先前治疗失败、生物标志物移除、手术状态变化和分期扰动)改写肿瘤学多学科会诊(tumor board)病例,并使用{0, 0.5, 1.0}量表对每个模型是否在预注册的正确方向上更新其推荐进行评分。与基于覆盖度的加权召回指标共识匹配评分(CMS)相比,来自三个实验室的六个前沿模型在224个病例的单次推理中评估,排名几乎完全相反:所有六个模型排名发生变化,CMS最差的模型成为CSS最好的模型,而一个中上CMS模型在CSS上排名最后。我们进一步揭示了一个普遍的安全盲点:每个前沿模型在手术状态干预上失败(在Family D上CSS最高仅为17.2%),这是CMS未暴露的发现。该指标也适用于使用工具的智能体:在ReAct风格的实验中,工具使用改善了六个模型中五个的CSS(+2.5到+20.3个百分点),然而CSS最低的模型检索相同的图表部分但仍未能更新其推荐,揭示了仅在反事实评估下可见的结构性响应缺陷。跨评判者复制和三位评估者的医学专业验证确认了总体发现。像CSS这样的干预性预注册指标补充了临床AI智能体的基于覆盖度的评估:它们捕捉了覆盖度指标遗漏的响应性,并为未来的智能体强化学习系统提供了候选的密集奖励信号。

英文摘要

Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations - and scores whether each model updates its recommendations in the pre-registered correct direction using a {0, 0.5, 1.0} scale. Benchmarked against the Consensus Match Score (CMS), a coverage-based weighted recall metric, six frontier models from three labs evaluated in single-shot inference across 224 cases rank in nearly opposite orders: all six models change rank, the CMS-worst model becomes CSS-best, and one upper-mid CMS model ranks last on CSS. We further surface a universal safety blind spot: every frontier model fails on surgery-status interventions (at most 17.2% CSS on Family D), a finding CMS does not expose. The metric also transfers to tool-using agents: in a ReAct-style experiment, tool use improves CSS for five of six models (+2.5 to +20.3 percentage points), yet the lowest-CSS model retrieves the same chart sections and still fails to update its recommendations - revealing a structural responsiveness deficit visible only under counterfactual evaluation. Cross-judge replication and three-rater medical-professional validation confirm the aggregate findings. Interventional pre-registered metrics like CSS complement coverage-based evaluation for clinical AI agents: they capture responsiveness that coverage metrics miss and offer a candidate dense reward signal for future agentic RL systems.
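
The abstract scores each mutated case on a {0, 0.5, 1.0} scale and reports results per intervention family as percentages. The sketch below shows one plausible way to aggregate such judgements. The function name, the family labels, and the toy scores are hypothetical, and the paper's actual aggregation may differ.

```python
from collections import defaultdict

# Per-case scale stated in the abstract.
VALID_SCORES = {0.0, 0.5, 1.0}


def css_by_family(judgements):
    """Average per-case scores within each intervention family, as a percentage.

    `judgements` is an iterable of (family_label, score) pairs.
    """
    scores = defaultdict(list)
    for family, score in judgements:
        if score not in VALID_SCORES:
            raise ValueError(f"score {score!r} is not on the 0 / 0.5 / 1.0 scale")
        scores[family].append(score)
    return {family: 100.0 * sum(vals) / len(vals) for family, vals in scores.items()}


if __name__ == "__main__":
    # Toy judgements for two hypothetical families; not data from the paper.
    toy = [("A", 1.0), ("A", 0.5), ("D", 0.0), ("D", 0.5)]
    print(css_by_family(toy))
```

Reporting the score per family rather than as one pooled average is what lets a weakness confined to a single intervention type, such as the surgery-status blind spot described above, remain visible.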

2605.30589 2026-06-01 cs.CL cs.AI 版本更新

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

Nazarii Shportun

Affiliations * Independent Researcher

AI summary This paper builds ImmigrationQA, a source-grounded question-answering dataset (17,058 pairs across 13 immigration subdomains), and fine-tunes Llama 3.2 3B Instruct with parameter-efficient LoRA, yielding clear gains in procedural subdomains while complex legal reasoning remains weak.

Comments 12 pages, 4 tables. Dataset (17,058 QA pairs), fine-tuned model, and code are publicly released

Abstract

U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA. The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct). The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.

2605.30582 2026-06-01 cs.CL Updated version

AI for Monitoring and Classifying Data Used in Research Literature

Rafael Macalaba, Aivin V. Solatorio

Affiliations * World Bank; Office of the World Bank Group; Development Data Group

AI summary Proposes a GLiNER-based multitask framework that combines synthetic data generation with LLM-based revalidation to extract dataset mentions, identify relations, and classify usage context in research literature, addressing the label scarcity and inconsistent citation practices that hinder dataset-usage monitoring.

Abstract

While platforms like Google Scholar and Semantic Scholar track citations for academic papers, no comparable infrastructure exists for monitoring dataset usage in research literature, leaving the landscape of data use largely opaque. Addressing this gap is critical for transparency, reproducibility, and monitoring of impact, yet progress is hindered by inconsistent citation practices, scarce labeled data, and ambiguous references to datasets in the wild. Traditional NLP approaches struggle with these challenges, motivating the shift toward more adaptive, semantically rich models. Building on prior work using LLMs for data mention detection and synthetic data for bootstrapping training, this paper presents an updated methodology for scalable dataset monitoring. We introduce a multitask GLiNER-based framework that jointly performs dataset mention extraction, relation identification, and usage-context classification. To address label scarcity, the pipeline leverages synthetic data generation to produce training examples and LLM-based revalidation to filter incorrect mentions and enforce labeling consistency, together improving reliability, coverage, and output consistency across the training pipeline. This work advances the development of open-source tools for monitoring data use in research literature, contributing to the broader goal of generalizable, unconstrained dataset citation tracking.

2605.30580 2026-06-01 cs.CL cs.LG Updated version

Speculative Decoding Across Languages

Nirajan Paudel, Michael Ginn, Luc De Nardi, Alexis Palmer

Affiliations * University of Colorado

AI summary This paper studies strategies for improving speculative decoding efficiency in non-English languages by fine-tuning the draft model or using n-gram models. It finds that task-specific distillation improves efficiency but generalizes poorly, whereas n-gram models, despite lower acceptance rates, consistently deliver substantial speed-ups because they draft so quickly.

Comments 10 pages, 11 figures, submitted to ACL ARR May 2026

Abstract

Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disproportionately poor multilingual capabilities. Thus, when generating text in a non-English language, speculative decoding is far less effective. We compare three strategies to improve speculative decoding efficiency for eleven languages: finetuning the draft model on task-specific data (translation); finetuning the draft model on unlabeled monolingual corpora; and training simple n-gram draft models on the same monolingual corpora. We evaluate efficiency on translation (from English into the target language) and the held-out task of story generation. We find that while task-specific distillation can significantly improve efficiency, distilled models generalize poorly to a new task. Meanwhile, n-gram draft models, despite lower acceptance rates, consistently provide large speed-ups due to much faster draft generation.
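The draft-then-verify loop with an n-gram draft model can be sketched as follows. This is a toy, greedy-only version under stated assumptions: the "target model" is a hypothetical deterministic function rather than an LLM, the draft is a bigram counter, and verification is done position by position, whereas a real system scores all drafted positions in one parallel target pass. None of the names come from the paper.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count next-token frequencies; return a greedy bigram predictor."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    def predict(prev):
        if prev not in counts:
            return None  # the draft model has nothing to propose
        return counts[prev].most_common(1)[0][0]

    return predict

def greedy_decode(target_next, prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding: draft up to k tokens, keep the verified prefix."""
    out = list(prompt)
    limit = len(prompt) + n_new
    accepted = proposed = 0
    while len(out) < limit:
        # 1) Draft up to k tokens cheaply from the bigram model.
        draft, prev = [], out[-1]
        for _ in range(k):
            tok = draft_next(prev)
            if tok is None:
                break
            draft.append(tok)
            prev = tok
        proposed += len(draft)
        # 2) Verify each drafted token against the target's greedy choice.
        for tok in draft:
            if len(out) >= limit:
                break
            expected = target_next(out)
            out.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token, drop the rest
            accepted += 1
        else:
            # Every drafted token was accepted (or none was drafted):
            # take one extra token from the target so decoding always advances.
            if len(out) < limit:
                out.append(target_next(out))
    return out, accepted, proposed

# Hypothetical toy target: repeats a fixed phrase based on the current position.
PHRASE = "the cat sat on the mat".split()

def target_next(seq):
    return PHRASE[len(seq) % len(PHRASE)]

draft_next = train_bigram(PHRASE * 3)
spec_out, accepted, proposed = speculative_decode(target_next, draft_next, ["the"], 11)
```

Because every emitted token is the target's own greedy choice, the output matches plain greedy decoding exactly; the draft only changes how many target passes are needed. The ratio `accepted / proposed` is the acceptance rate the abstract refers to, and a very cheap draft can pay off even when that ratio is modest.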

2605.30574 2026-06-01 cs.CL Updated version

Probing the Prompt KV Cache: Where It Becomes Dispensable

Vinayshekhar Bannihatti Kumar, Manoj Ghuhan Arivazhagan, Disha Makhija, Rashmi Gangadharaiah

Affiliations * AWS AI Labs

AI summary Controlled experiments show that redundancy in the prompt KV cache stems mainly from the form of the chat template rather than its content: the upper-layer cache can be replaced with the KV cache of a neutral-filler template without losing accuracy.

Abstract

Prior KV cache compression schemes empirically demonstrate that the prompt cache is partially redundant during decoding, dropping or summarising entries with little accuracy loss. We ask when this redundancy arises and what form it takes: at which layers, after how many decoding steps, and with what replacement the prompt-span KV cache can be swapped out without breaking the task. A controlled splice intervention swept over layer cutoff and decoding steps shows that this redundancy is about form (chat template scaffolding) rather than content. Replacing the upper-layer prompt-span KV cache with the KV cache from a chat template scaffold whose user content is a neutral filler recovers near-clean accuracy, while zeroing the same slots collapses accuracy. The dissociation replicates across the Qwen3, Gemma 3, and Llama 3 families on multiple datasets.

2605.30568 2026-06-01 cs.CL Updated version

Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

Zijie Wang, Eduardo Blanco

Affiliations * University of Arizona; Department of Computer Science

AI summary Proposes a method for automatically generating fine-grained evaluation rubrics without human annotation, at both dataset-level and instance-level granularity, and iteratively fine-tunes the rubric generator with meta-judge reward signals, outperforming existing baselines in both pairwise and pointwise evaluation.

Abstract

LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained evaluation rubrics without any human annotation. Our training-free method generates rubrics at dataset-specific and instance-specific granularities, achieving performance competitive with existing methods across four benchmarks. We further present a method that iteratively fine-tunes a rubric generator model via meta-judge reward signals. The fine-tuned generator outperforms all existing baselines in both pairwise and pointwise evaluation. Notably, a fine-tuned 14B rubric generator outperforms a much larger proprietary model at rubric generation, showing the effectiveness of our fine-tuning strategy.

2605.30557 2026-06-01 cs.CV cs.AI cs.CL Updated version

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal

Affiliations * UNC Chapel Hill; Google Research

AI summary To address overconfident answering by vision-language models in spatial reasoning, proposes the SpatialUncertain framework, which uses two challenges, occlusion and perspective ambiguity, to evaluate whether models know when to abstain and how to seek reliable evidence.

Comments Website: https://zhangyuejoslin.github.io/spatialuncertain/

Abstract

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.

2605.30545 2026-06-01 cs.CL Updated version

Refining Word-Based Grammatical Error Annotation for L2 Korean

Jungyeul Park, Kyungtae Lim, Wonjun Oh, Benjamin Nguyen, Zihao Huang, Mengyang Qiu, Jayoung Song

Affiliations * Korea Advanced Institute of Science & Technology; The University of British Columbia; Saint Elizabeth University; The Pennsylvania State University

AI summary Refines word-based grammatical error annotation by addressing three problems in existing resources (surface target realization, Korean-specific edit annotation, and single-reference evaluation), and validates the refined resources in terms of perplexity, edit agreement, and multi-reference evaluation.

Abstract

Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. Postpositions and verbal endings are bound to lexical hosts, but they encode grammatical relations that must be represented in correction and evaluation. This paper refines word-based grammatical error annotation for L2 Korean by addressing three connected problems in existing resources: surface target realization, Korean-specific edit annotation, and single-reference evaluation. We reconstruct target sentences from the National Institute of Korean Language (NIKL) L2 corpus under morphologically constrained realization rules and convert its morpheme-level annotations into word-level m2 edits. We then define a Korean ERRANT-style annotation scheme that preserves the MRU core while distinguishing functional morpheme errors, spelling errors, word boundary errors, and word order errors. We also augment the KoLLA corpus with an additional reference correction, yielding a multi-reference evaluation setting for Korean GEC. Empirical validation shows that the refined NIKL targets yield lower perplexity, the converted m2 files achieve higher agreement with source-target edit representations, and the refined resources improve KoBART-based correction under the same model setting. Multi-reference KoLLA evaluation further reduces the penalty imposed on valid corrections that diverge from a single reference, especially for neural and prompted GEC systems. These results show that Korean GEC evaluation depends not only on correction models, but also on reference data and edit annotations that reflect Korean morphology, spacing, and correction variability.

2605.30529 2026-06-01 cs.CL cs.AI cs.LG Updated version

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

David Rey-Blanco, Roberto Cruz

Affiliations * TietAI

AI summary This study fine-tunes a bi-encoder on synthetic data generated by a large generative language model and builds a two-stage retriever, addressing the recall degradation seen in clinical-code retrieval for non-English languages and outperforming BioBERT-ST on a multilingual benchmark.

Comments 24 pages, 12 figures, 6 tables

Abstract

Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap. We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining. Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.
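For readers unfamiliar with the metrics quoted above, here is a minimal sketch of MRR and Recall@k for code retrieval. It assumes each query may have several relevant codes, so that Recall@k is the fraction of relevant codes found in the top k and the reciprocal rank uses the first relevant hit; the abstract does not spell out its exact definitions, and the ranked lists below are invented for the example.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant code, or 0.0 if none is retrieved."""
    for rank, code in enumerate(ranked, start=1):
        if code in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant codes that appear in the top k results."""
    hits = sum(1 for code in ranked[:k] if code in relevant)
    return hits / len(relevant)

def evaluate(runs, k_values=(3, 5)):
    """Average the metrics over (ranked list, relevant set) pairs."""
    n = len(runs)
    metrics = {"MRR": sum(reciprocal_rank(r, g) for r, g in runs) / n}
    for k in k_values:
        metrics[f"R@{k}"] = sum(recall_at_k(r, g, k) for r, g in runs) / n
    return metrics

# Invented retrieval outputs over ICD-10-CM style codes, for illustration only.
runs = [
    (["E11.9", "I10", "E78.5", "J45.909", "N18.3"], {"E11.9", "E78.5"}),
    (["I10", "J45.909", "E11.9", "K21.9", "E78.5"], {"K21.9", "E78.5"}),
]
metrics = evaluate(runs)
```

Computing the same dictionary separately for each language is one way to expose the per-language differences that an aggregate score can hide.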

2605.30526 2026-06-01 cs.LG cs.CL Updated version

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür, Nick Feamster

Affiliations * University of Chicago; University of Illinois at Urbana-Champaign; Toyota Technological Institute at Chicago

AI summary By comparing human text, base-model generations, and aligned-model generations, the study finds that alignment training introduces AI-like stylistic features, and proposes PASTA, which lowers AI-detection rates by ablating the alignment direction.

Abstract

Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies AI-like stylistic regularities and whether these regularities have a localized internal signature. To this end, we compare human text, base-model generations, and aligned-model generations under matched human-source prefixes. Aligned generations show lower human-corpus affinity and higher AI-detection rates than base generations, suggesting that post-training shifts generated text away from human-corpus style and toward detector-visible AI-like text. We then introduce PASTA (Post-training Alignment Signature Targeted Ablation), a training-free method that estimates a post-training alignment signature from aligned-base residual contrasts and ablates the corresponding direction during decoding. Across 11 aligned models and 6 AI detectors, PASTA lowers the detection rate for most aligned models; this effect transfers well across detectors and is not reproduced by random directions. Qualitative analysis suggests that PASTA generations remain relevant and coherent while exhibiting greater stylistic variation. Together, these results show that AI-like stylistic effects of post-training can be measured, localized, and causally tested through activation ablation.
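The idea of estimating a direction from aligned-versus-base contrasts and removing it can be sketched with plain vectors. The difference-of-means estimator and the projection step below are a common recipe in activation-ablation work and are used here as an assumption; the abstract does not give PASTA's actual estimator, layer choice, or scaling, and the activations are made-up numbers.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit_direction(aligned_acts, base_acts):
    """Difference of mean activations (aligned minus base), normalised to unit length."""
    diff = [a - b for a, b in zip(mean_vector(aligned_acts), mean_vector(base_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("aligned and base means coincide; no direction to ablate")
    return [x / norm for x in diff]

def ablate(hidden, direction):
    """Remove the component of `hidden` that lies along a unit `direction`."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Toy 3-dimensional "residual activations"; not taken from any real model.
aligned = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
base = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
direction = unit_direction(aligned, base)
ablated = ablate([5.0, 2.0, -1.0], direction)
```

In a real model the same projection would be applied to hidden states during decoding; a random unit vector makes a natural control, mirroring the random-direction comparison mentioned in the abstract.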

2605.30523 2026-06-01 cs.LG cs.AI cs.CC cs.CL cs.FL Updated version

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

Anej Svete, William Merrill, Ryan Cotterell, Ashish Sabharwal

Affiliations * ETH Zürich; Allen Institute for AI

AI summary By connecting to boolean circuits, this paper systematically studies the expressivity of padded transformers, finding that numeric precision and model depth are the main factors affecting expressivity, while architectural choices such as attention type, model width, and uniformity matter little.

Abstract

Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as "..." are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\text{L-uniform}$ constant-precision transformers are equivalent to $\text{L-uniform AC}^0$, while growing-precision ones achieve $\text{L-uniform TC}^0$ regardless of width. Furthermore, looping enables sequential processing analogous to circuits: $\log^d N$-looped constant-precision transformers reach $\text{FO-uniform AC}^d$, and growing-precision ones reach $\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.

2605.30521 2026-06-01 cs.CL Updated version

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

David Gros, Adam Gleave

AI summary Using automated red-teaming, the paper tests whether wrapping untrusted content in a mock tool call as a quarantine improves the robustness of large language models, and finds the approach usually ineffective and sometimes harmful.

Abstract

Large language models must frequently process untrusted inputs, such as judging an answer from another model or running tasks like spam and harm classifiers while under adversarial pressure. These inputs are often string-formatted directly into a prompt template, leaving systems vulnerable to manipulation. Current LLM specs from major providers like OpenAI distinguish trustworthiness along an Instruction Hierarchy, from System messages (most trusted) to Tool Results (least trusted). A natural possible mitigation is to wrap untrusted content in a mock tool call as a quarantine. We explore this hypothesis with an automated red-teaming search over static attack strings across seven models and three LLM-as-a-Judge tasks. Counter to our hypothesis, tool-wrapping does not broadly improve robustness. On a binary evaluation task (GSM8K grading) it typically increases attack success rates, an apparent inversion of the instruction hierarchy. On scalar and pairwise tasks the effect is smaller and model-dependent, with no tested model reliably helped, and several showing inversion. We recommend evaluating this limitation in deployed systems, and longer-term, pursuing stronger Instruction Hierarchy training or new untrusted-input primitives.

2605.30514 2026-06-01 cs.LG cs.CL Updated version

MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Suryash Yagnik, Shubham Gaur, Saksham Thakur, Vinija Jain, Aman Chadha, Amitava Das

Affiliations * Indian Institute of Information Technology, Bhopal, India; University of California, Santa Cruz, USA; Independent Researcher; Stanford University, USA; BITS Pilani Goa, India

AI summary To address the imbalance in existing machine-unlearning evaluation caused by the scarcity of causal (Why-type) examples, proposes the balanced 5WBENCH benchmark and the multi-phase MAAT framework, the first to achieve both high forgetting and high retention on Why-type knowledge.

Comments 16 pages, 4 figures, 10 tables

Abstract

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This near-zero representation means that methods that fail on causal knowledge can score highly in aggregate, and this failure is undetectable without balanced evaluation. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), making causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Why-type difficulty stems from multi-hop reasoning chains (44% of Why entries vs. at most 2% for the other categories) and gradient dilution over 40.1-token answer spans. We present MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework operating on LoRA adapter weights, combining gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. We make our code publicly available.
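Of the components listed in this abstract, task vector negation is the simplest to illustrate: take the weight difference induced by tuning on the forget set and subtract a scaled copy of it from the starting weights. The sketch below shows only that generic idea on plain lists; how MAAT applies it to LoRA adapters and combines it with the other phases is not described in the abstract, and the parameter name, `alpha`, and the numbers are hypothetical.

```python
def task_vector(tuned, base):
    """Per-parameter difference: weights tuned on the forget set minus base weights."""
    return {name: [t - b for t, b in zip(tuned[name], base[name])] for name in base}

def negate_task_vector(base, vector, alpha=1.0):
    """Move the base weights in the direction opposite to the task vector."""
    return {name: [b - alpha * v for b, v in zip(base[name], vector[name])] for name in base}

# Hypothetical adapter parameter with two entries; values are illustrative only.
base = {"lora_A": [0.5, -1.0]}
tuned_on_forget = {"lora_A": [1.0, -0.5]}

vector = task_vector(tuned_on_forget, base)
unlearned = negate_task_vector(base, vector, alpha=1.0)
```

The scale `alpha` is where the forget-retain tension shows up in this simplified picture: a larger value pushes harder away from the forgotten behaviour but also moves further from the weights that encode retained knowledge.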

2605.30504 2026-06-01 cs.CL Updated version

Auditing LLM Benchmarks with Item Response Theory

Sander Land, Daniel M. Bikel

Affiliations * Writer, Inc.

AI summary Introduces an Item Response Theory-based indicator that, using responses from 114 models, identifies likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks, traces the sources of these errors, and shows that reward models specialize in stylistic preference rather than factual knowledge.

Abstract

LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
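One way to see why Item Response Theory can surface mislabels is the two-parameter logistic (2PL) model, in which each item has a discrimination and a difficulty. If an item's fitted discrimination is negative, stronger models agree with its label less often than weaker ones, which is a warning sign for the label. The sketch below illustrates that intuition only; the abstract does not specify the paper's actual indicator, and the item parameters are hypothetical rather than fitted.

```python
import math

def p_agree(theta, a, b):
    """2PL item response function: P(a model of ability theta matches the label)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def flag_suspect_items(item_params):
    """Return ids of items with negative discrimination, most negative first."""
    suspects = [(a, item_id) for item_id, (a, b) in item_params.items() if a < 0]
    return [item_id for a, item_id in sorted(suspects)]

# Hypothetical (discrimination a, difficulty b) pairs; a real analysis would
# fit these from a models-by-items response matrix.
items = {"item_1": (1.2, 0.0), "item_2": (-0.8, 0.5), "item_3": (0.4, -1.0)}
suspects = flag_suspect_items(items)

# For the flagged item, a strong model agrees with the label less often than a weak one.
strong = p_agree(2.0, *items["item_2"])
weak = p_agree(-1.0, *items["item_2"])
```

Ranking items by such a score and inspecting the top of the list by hand is the kind of workflow under which a precision figure for the top 200 examples becomes meaningful.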

2605.30501 2026-06-01 cs.CL Updated version

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen, Runcong Zhao

Affiliations * Department of Informatics, King's College London, UK; Department of Mathematics, Imperial College London, UK; Department of Statistics, University of Warwick, UK

AI summary Shows theoretically and experimentally that when users can access multiple models, simply averaging their output probability distributions removes watermarks, and proposes WASH to handle vocabulary misalignment and tokenization differences in ensembles.

Abstract

Watermarking embeds statistical signatures in AI-generated text for detection and attribution. We reveal a fundamental vulnerability: when users access multiple models (today's reality), watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets, these perturbations are typically independent across providers. We theoretically prove that averaging output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves practical challenges in ensemble generation: vocabulary misalignment and tokenisation differences across heterogeneous models. Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (below the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
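The averaging argument can be tried on a toy next-token distribution. The sketch below uses a green-list style logit bias as a stand-in watermark and assumes every provider perturbs the same clean distribution over a shared vocabulary with an independently sampled green list. Those are simplifying assumptions: WASH itself targets heterogeneous models with different vocabularies, which this example does not model, and all sizes and the bias strength are arbitrary.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def watermarked(logits, green, delta):
    """Green-list style perturbation: add `delta` to the logits of green tokens."""
    return softmax([x + delta if i in green else x for i, x in enumerate(logits)])

def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(0)
vocab_size, n_models, delta = 50, 5, 2.0
logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
clean = softmax(logits)

# Each hypothetical provider draws its own green list independently.
dists = []
for _ in range(n_models):
    green = set(rng.sample(range(vocab_size), vocab_size // 2))
    dists.append(watermarked(logits, green, delta))

# Uniform average of the watermarked next-token distributions.
ensemble = [sum(col) / n_models for col in zip(*dists)]
tv_single = [total_variation(clean, d) for d in dists]
tv_ensemble = total_variation(clean, ensemble)
```

By convexity of total variation distance, `tv_ensemble` can never exceed the mean of `tv_single`; with independently drawn green lists the gap is typically substantial, which is the cancellation effect the abstract describes.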

2605.30497 2026-06-01 cs.CL Updated version

CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead, Vered Shwartz

Affiliations * Department of Computer Science, University of British Columbia; Vector Institute; CIFAR AI Chair; Peter A. Allard School of Law, University of British Columbia

AI summary To address hallucination in legal RAG systems and the lack of evaluation for Canadian law, proposes CanLegalRAGBench, a Canadian legal QA benchmark built on realistic queries and expert annotation. It finds that retrieval performance is sensitive to design choices and that open-source embedding models are competitive with closed-source ones, but that automatic evaluation has limitations and generated answers often diverge from gold responses.

Abstract

RAG-based legal assistants have been growing in popularity, but LLM hallucinations remain a key issue and can undermine justice. While benchmarks have been developed to evaluate progress, many rely on synthetic queries rather than realistic legal scenarios. Moreover, Canadian law remains underrepresented in existing evaluations. To address this gap, we introduce CanLegalRAGBench, a Canadian legal QA benchmark based on realistic queries and expert-annotated answers grounded in case law. Our evaluation shows that retrieval performance is sensitive to design choices and that open-source embedding models are competitive with closed-source models. However, it also reveals a limitation of automatic evaluations, which penalize systems for retrieving alternative relevant documents. We also find that generated answers often diverge from gold responses, either through hallucinations or through overly detailed or irrelevant content, with 8-29% of claims not supported by the retrieved documents. We hope this benchmark will help drive continued progress in addressing limitations of legal RAG systems.

2605.30487 2026-06-01 cs.CL 版本更新

Configurable Reward Model for Balanced Safety Alignment

可配置奖励模型用于平衡安全对齐

Zhengping Jiang, Mehran Khodabandeh, Akash Bharadwaj, Manik Bhandari, Mayur Srungarapu, Anqi Liu, Benjamin Van Durme, Li Chen

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Meta Superintelligence Labs(Meta超智能实验室)

AI总结 提出可配置安全奖励模型(CSRM),通过配置目标数据增强和联合优化,实现对细粒度安全配置和对话细微差别的敏感性,在可配置安全基准上达到最优性能,并改善有用性与安全性的权衡。

详情
AI中文摘要

将大型语言模型(LLM)对齐到异构且快速变化的安全要求仍然是一个关键挑战。现有的指令微调LLM和独立安全分类器通常无法泛化到新的安全配置,这促使需要明确可配置以适应不断变化规范的奖励模型(RM)。我们引入了可配置安全奖励模型(CSRM),它针对校准的安全合规性和奖励建模进行了联合优化。我们的方法由配置目标数据增强支持,该增强在保持相对严重性结构的同时强制指令遵循。由此产生的RM对细粒度安全配置和对话细微差别敏感,显著改进了对未见安全配置的泛化能力。CSRM在最近的可配置安全基准上达到了最先进性能,包括CoSApien(94.6% F1)和DynaBench(75.8% F1),无需额外的人工注释。当用于下游安全对齐时,与现有基线相比,CSRM产生的LLM具有显著改善的有用性-安全性权衡。

英文摘要

Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety configurations, motivating the need for Reward Models (RMs) that are explicitly configurable to changing specifications. We introduce the Configurable Safety Reward Model (CSRM), which is jointly optimized for calibrated safety compliance and reward modeling. Our approach is supported by configuration-targeted data augmentation that enforces instruction adherence while preserving relative severity structure. The resulting RM is sensitive to fine-grained safety configurations and conversational nuances, substantially improving generalization to previously unseen safety configurations. CSRM achieves state-of-the-art performance on recent configurable safety benchmarks, including CoSApien (94.6% F1) and DynaBench (75.8% F1), without requiring additional human annotation. When used for downstream safety alignment, CSRM yields LLMs with a significantly improved helpfulness-safety tradeoff compared to existing baselines.

2605.30481 2026-06-01 cs.CL 版本更新

When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

当英语重写地方知识:大语言模型中的全球叙事主导

Md Arid Hasan, Ruwad Naswan, Farhan Samir, Sharifa Sultana, Syed Ishtiaque Ahmed

发表机构 * University of Toronto(多伦多大学) BUET(孟加拉工程技术大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本研究通过构建孟加拉语文化数据集CulturalNB,评估大语言模型在低资源文化背景下的跨语言知识一致性,发现英语提问会系统性地增加全球替代和制度框架,减少地方视角覆盖,表明文化失败不仅是知识缺失,更是根基和叙事优先级问题。

Comments Submitted to ARR

详情
AI中文摘要

大语言模型(LLMs)被广泛用作跨语言知识接口。然而,植根于文化的问题往往反映全球主导叙事而非地方背景。我们以孟加拉语这一低资源文化背景为例,将这种失败模式作为“全球叙事主导”加以研究。我们引入了CulturalNB,一个包含717个手工策划的孟加拉文化实例的数据集,配有平行的孟加拉语-英语问答对、支持证据、元数据和社会文化注释。通过仅问题和基于证据的提示,我们使用人类裁判和两个独立的LLM裁判,在跨语言一致性、语言锚定、全球替代、制度偏见和认知视角覆盖等指标上评估了九个最先进的LLM。结果表明,用英语提问会系统性地增加全球替代和制度框架,同时减少地方视角覆盖。地方证据提高了事实一致性和视角覆盖,但并未消除语言引起的认知偏移。这些发现表明,LLM中的文化失败不仅是知识缺失错误,更是根基和叙事优先级失败。

英文摘要

Large language models (LLMs) are widely used as cross-lingual knowledge interfaces. However, culturally grounded questions often reflect globally dominant narratives rather than local contexts. We study this failure mode as global narrative dominance in Bangla, a low-resource cultural context. We introduce CulturalNB, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla-English question-answer pairs and supporting evidence, metadata, and sociocultural annotations. Using question-only and evidence-based prompting, we evaluate nine state-of-the-art LLMs with human judges and two independent LLM judges across metrics for cross-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage. Results show that questions asked in English systematically increase global substitution and institutional framing while reducing local perspective coverage. Local evidence improves factual consistency and perspective coverage, but does not eliminate language-induced epistemic shifts. These findings suggest that cultural failures in LLMs are not only missing-knowledge errors but also failures of grounding and narrative prioritization.

2605.30478 2026-06-01 cs.SE cs.CL 版本更新

Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback

通过验证反馈的强化学习改进用于代码生成的小型语言模型

Egor Skopin, Evgeny Kotelnikov

发表机构 * Vyatka State University(维亚特卡国立大学) European University at St. Petersburg(圣彼得堡欧洲大学)

AI总结 本研究通过基于验证奖励的强化学习(RLVR)结合LoRA微调,在MBPP基准上对小型语言模型(Qwen3-0.6B和Llama3.2-1B)进行Python代码生成实验,发现奖励设计对功能正确性和行为诊断有显著影响,并提出组合奖励可稳定权衡正确性与风格约束。

Comments Accepted for AINL-2026 conference

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)使用程序可检查的信号(如单元测试结果)训练语言模型,从而能够直接优化代码生成的功能正确性。我们在MBPP基准上使用两个小型模型(Qwen3-0.6B和Llama3.2-1B)结合LoRA微调,对Python代码生成进行了RLVR的实证研究。在多种奖励公式(如:仅单元测试奖励、通过Ruff linter的仅静态分析塑形、以及组合奖励)下,我们比较了基于组的策略优化变体(GRPO和GSPO),并评估了功能正确性和行为诊断。在我们的实验设置中,RLVR在提出的组合奖励配置下将MBPP测试上的pass@1提高了最多13个百分点。然而,我们发现奖励塑形可能引发系统性行为偏移:仅使用静态分析惩罚可能使策略偏向于生成更短的代码,从而减少lint错误,但未能可靠地提高功能正确性。相比之下,组合奖励减轻了这种退化,并在正确性和风格约束之间产生了更稳定的权衡。总体而言,我们的结果强调,RLVR在代码生成中的有效性对奖励设计和优化粒度高度敏感,并且除了pass@1之外的诊断指标(包括生成长度、Ruff严重性分布和执行错误类型)有助于识别失败模式。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) trains language models using programmatically checkable signals such as unit-test outcomes, enabling direct optimization for functional correctness in code generation. We conduct an empirical study of RLVR for Python code generation on the MBPP benchmark using two small models (Qwen3-0.6B and Llama3.2-1B) with LoRA fine-tuning. Across multiple reward formulations (unit-test-only rewards, static-analysis-only shaping via the Ruff linter, and a combined reward), we compare group-based policy optimization variants (GRPO and GSPO) and evaluate both functional correctness and behavioral diagnostics. In our experimental setting, RLVR improves pass@1 on the MBPP test set by up to 13 percentage points under the proposed combined reward configuration. However, we find that reward shaping can induce systematic behavioral shifts: using only static-analysis penalties may bias the policy toward shorter completions that reduce lint errors without reliably improving functional correctness. In contrast, combined rewards mitigate this degeneration and yield more stable trade-offs between correctness and style constraints. Overall, our results highlight that RLVR effectiveness for code generation is highly sensitive to reward design and optimization granularity, and that diagnostics beyond pass@1, including generation length, Ruff severity profiles, and execution error types, are useful for identifying failure modes.
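
As an illustration of the combined reward described in the abstract, the sketch below scores a completion by its unit-test pass fraction minus a capped lint penalty. The weights, the cap, and the names `RewardWeights` and `combined_reward` are hypothetical; the abstract does not give the authors' actual formula, so this is one plausible shape and not their method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical coefficients chosen for illustration only.
    test_weight: float = 1.0        # weight on the unit-test pass fraction
    lint_weight: float = 0.1        # penalty per lint error
    max_lint_penalty: float = 0.5   # cap so style can never outweigh correctness


def combined_reward(tests_passed: int, tests_total: int, lint_errors: int,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Return the pass fraction (weighted) minus a capped lint penalty."""
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0 <= tests_passed <= tests_total:
        raise ValueError("tests_passed must lie in [0, tests_total]")
    if lint_errors < 0:
        raise ValueError("lint_errors must be non-negative")
    correctness = tests_passed / tests_total
    penalty = min(weights.lint_weight * lint_errors, weights.max_lint_penalty)
    return weights.test_weight * correctness - penalty


# A short, lint-clean completion that fails every test scores below a correct
# completion with a couple of lint errors.
short_but_wrong = combined_reward(tests_passed=0, tests_total=3, lint_errors=0)
correct_but_messy = combined_reward(tests_passed=3, tests_total=3, lint_errors=2)
```

Capping the penalty is one way to keep a lint-only signal from favoring short, clean, non-functional completions, which is the failure mode the abstract reports for static-analysis-only shaping.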

2605.30472 2026-06-01 cs.CL 版本更新

Your Multimodal Speech Model Says I Have a Face for Radio

你的多模态语音模型说我有张适合电台的脸

Maya K. Nachesa, Vlad Niculae, Vagrant Gautam

发表机构 * Language Technology Lab, University of Amsterdam(阿姆斯特丹大学语言技术实验室) Heidelberg Institute for Theoretical Studies(海德堡理论研究所)

AI总结 通过配对不同人脸与相同音频的视频,评估多模态语音识别模型在性别、种族及其交叉属性上的偏差,发现质量差异显著,提示多模态并非必然更好。

详情
AI中文摘要

随着大型神经模型在语言任务上的表现越来越好,研究人员正在构建处理更多数据模态的多模态和全模态模型。其中一个例子是将语音识别模型扩展到音视频数据,用于噪声抑制和多模态字幕生成。虽然性能和偏差在单模态领域已得到广泛研究,但新模态如何影响这些方面尚不清楚,尽管它们在人类中会产生偏差。因此,我们提出了对多模态语音识别的首次偏差评估,我们创建了将不同人脸与相同音频配对的视频,并测量语音转录准确性的变化。我们发现,在mWhisper-Flamingo和Gemini模型中,自我声明的性别、种族及其交叉属性之间存在高达4.05个词错误率点的服务质量差异。我们的研究结果表明,开发者应优先评估、修复和传达此类限制,因为通过额外模态提供更多信号并不一定更好,甚至可能导致有偏见的结果。

英文摘要

As large neural models have become better at language tasks, researchers are increasingly building multi- and omnimodal models that handle more modalities of data. One example is the expansion of speech recognition models to audio-visual data for noise mitigation and multimodal subtitling. While performance and bias have been studied extensively in the single-modality regime, it is unknown how new modalities affect this, even though they produce biases in humans. We therefore propose the first bias evaluation of multimodal speech recognition, where we create videos pairing different faces with the same audio, and measure changes in speech transcription accuracy. We find large quality-of-service differences across mWhisper-Flamingo and Gemini models, with drops of up to 4.05 word error rate points, across self-declared gender, ethnicity, and their intersection. Our findings point to a priority for developers to evaluate, fix, and communicate such limitations, as providing more signals through additional modalities is not necessarily better, and may even lead to biased outcomes.

2605.30465 2026-06-01 cs.CL 版本更新

Knowledge Graph-Enhanced Zero-Shot Topic Classification: A Multi-Strategy Comparative Study

知识图谱增强的零样本主题分类:多策略比较研究

Shahana Akter, Yatharth Vohra, Ankita Shukla, Souvika Sarkar

发表机构 * A2I Lab, School of Computing, Wichita State University(威奇托州立大学计算学院A2I实验室) EIS Lab, College of Engineering, University of Nevada, Reno(内华达大学里诺分校工程学院EIS实验室)

AI总结 提出零样本多标签主题分类框架,通过知识图谱增强文档表示,实验表明对小型模型有正面影响,对大型模型有负面影响。

Comments 15 pages, 1 figure, ACL format. This paper proposes a KG-augmented zero-shot multi-label topic classification framework and evaluates multiple strategies

详情
AI中文摘要

在没有标注训练数据的情况下进行多标签主题分类是一项具有挑战性的任务,特别是当文档包含复杂的关系信息时。我们提出了一个零样本多标签主题分类框架,并系统地研究了每篇文章的知识图谱增强如何影响其性能。基础框架在没有标注训练数据的情况下对文档中的主题进行分类,有四种变体:仅文章分类、关键词增强分类以及两者的自一致性解码变体。然后,我们为每个基础变体增加每篇文章的知识图谱。该图谱通过类似于KGGen的流水线从输入文档中提取,基于主语-谓语-宾语三元组。我们在十五个大型语言模型和八个跨不同领域的多标签数据集上测试了所有八种方法(四种基础和四种图谱增强)。对于基础框架,关键词增强分类(AK)是表现最好的方法,十五个大型语言模型中有六个超过了句子编码器基线。图谱增强对小型模型有正面影响,对大型模型有负面影响。这表明大型模型已经从预训练中包含了足够的关系信息。此外,自一致性解码变体在任何实验中都没有显示出性能提升,同时计算成本增加了约五倍。

英文摘要

Multi-label topic classification without labeled training data is a challenging task, especially when documents contain complex relational information. We present a zero-shot multi-label topic classification framework and systematically investigate how per-article knowledge graph augmentation affects its performance. The base framework classifies topics in documents without labeled training data and has four variants: article-only classification, keyword-enhanced classification, and self-consistency decoding variants of both. Then, we augment each base variant with a per-article knowledge graph. This graph is extracted from the input document through a pipeline similar to KGGen based on subject-predicate-object triples. We test all eight methods (four base and four graph-augmented) on fifteen LLMs and eight multi-label datasets across different domains. For the base framework, keyword-enhanced classification (AK) is the best-performing method, and six out of fifteen LLMs surpass the sentence-encoder baseline. Graph augmentation has positive and negative impacts on small and large models, respectively. This shows that larger models already contain enough relational information from pretraining. Furthermore, the self-consistency decoding variant does not show performance improvements in any experiment while increasing computation costs about fivefold.

2605.30459 2026-06-01 cs.CL 版本更新

Can LLM Teams Play What? Where? When?

LLM 团队能玩“什么?哪里?何时?”吗?

Anastasia Kotelnikova, Viktor Byzov, Maria Dolzhenkova, Evgeny Kotelnikov

发表机构 * Vyatka State University(维亚特卡国立大学) European University at St. Petersburg(圣彼得堡欧洲大学)

AI总结 本文研究基于团队的交互策略(投票、静默团队、健谈团队)能否提升大语言模型在“什么?哪里?何时?”问答游戏中的表现,实验表明团队策略优于单模型基线,准确率最高提升20个百分点,最佳团队达到44.23%,接近人类水平。

Comments Accepted for Dialogue-2026 conference

详情
AI中文摘要

大语言模型(LLM)在需要间接推理、文化知识和协调假设测试的任务上仍然受限。我们研究基于团队的交互是否能提升LLM在“什么?哪里?何时?”(ChGK)问答游戏中的表现,该游戏旨在奖励集体推理。我们引入了三种团队策略:投票、静默团队(队长观察最终答案)和健谈团队(队长观察答案和理由)。为最小化数据泄露,我们在2025年发布的572个ChGK问题数据集上评估这些策略。使用六个近期的大规模开源模型,我们表明基于团队的策略优于单模型基线,准确率提升高达20个百分点。最佳团队达到44.23%的准确率,并在有人类统计数据的题目上接近人类团队表现。模型间多样性分析显示,分歧强烈预测较低准确率,但解释性沟通显著缓解性能下降。我们进一步检查队长行为,未发现自我偏好偏差的证据;访问同伴理由提升了队长判断。总体而言,LLM团队主要作为答案选择和错误过滤机制,而非新解决方案的生成器。我们的发现强调了交互的重要性,并表明自适应策略是多智能体系统的一个有前景的方向。

英文摘要

Large language models (LLMs) remain limited on tasks requiring indirect reasoning, cultural knowledge, and coordinated hypothesis testing. We investigate whether team-based interaction improves LLM performance in What? Where? When? (ChGK), a quiz game designed to reward collective reasoning. We introduce three team strategies: Voting, Silent Team (the captain observes final answers), and Talkative Team (the captain observes both answers and rationales). To minimize data leakage, we evaluate these strategies on a dataset consisting of 572 ChGK questions released in 2025. Using six recent large-scale open models, we show that team-based strategies outperform single-model baselines, yielding gains of up to 20 percentage points in accuracy. The best team achieves 44.23% accuracy, and approaches human team performance on questions with available human statistics. Analysis of inter-model diversity reveals that disagreement strongly predicts lower accuracy, but explanatory communication substantially mitigates performance drops. We further examine captain behavior and find no evidence of self-preference bias; access to peer rationales improves captain judgments. Overall, LLM teams function primarily as answer selection and error-filtering mechanisms rather than generators of novel solutions. Our findings highlight the importance of interaction and suggest adaptive strategies as a promising direction for multi-agent systems.
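
The Voting strategy named in the abstract can be sketched as a plain majority vote over the team members' final answers. The normalization and the tie-breaking rule (earliest answer wins) are assumptions made for this illustration; the abstract does not say how the authors match answers or resolve ties, and `normalize` and `vote` are hypothetical names.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def vote(answers: list[str]) -> str:
    """Return the most frequent normalized answer; ties go to the earliest one."""
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [normalize(a) for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Scan in submission order so that ties are broken deterministically.
    for candidate in normalized:
        if counts[candidate] == best:
            return candidate
    raise AssertionError("unreachable: some answer always has the top count")


team_answers = ["Paris", "paris ", "Rome"]
winner = vote(team_answers)
```

Unlike the Silent Team and Talkative Team strategies, this variant has no captain: it can only select among existing answers, which fits the abstract's observation that teams act mainly as answer-selection and error-filtering mechanisms.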

2605.30448 2026-06-01 cs.LG cs.CL 版本更新

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

黑盒大语言模型蒸馏的有界行为不可区分性

Munawar Hasan

发表机构 * Michigan Technological University(密歇根理工大学)

AI总结 针对黑盒LLM蒸馏,提出有界行为不可区分性形式化定义,并通过对抗评估揭示语义相似性不足以保证行为不可区分性。

详情
AI中文摘要

黑盒大语言模型蒸馏通常被评估为输出匹配问题:当学生模型的响应与教师模型在语义上相似或任务一致时,即认为学生模型成功。然而,输出相似性并不意味着学生模型与其模仿的模型在行为上不可区分。我们引入了有界行为不可区分性,形式化为在显式提示分布上的$(ε,q,t,\mathbb{A})$-行为不可区分性,其中$ε$限制区分优势,$q$限制预言机查询次数,$t$限制计算量,$\mathbb{A}$表示对手类别。我们在Qwen和Llama教师-学生对上使用受控的$5,000$提示行为探测套件实例化该概念。对于每个系列,我们比较教师模型与基础学生模型以及LoRA蒸馏学生模型,衡量蒸馏是否降低了可区分性而不仅仅是提高了相似性。LoRA将Qwen的语义相似性从$0.788$提升至$0.862$,Llama从$0.814$提升至$0.874$。然而,对抗评估揭示了剩余的行为差异:学习到的判别器保持非零优势,成对类别分析显示伪影集中在风格/格式、鲁棒性和领域技术提示中。成对教师识别对手证实了这一趋势。使用不同系列的Llama评判器和A/B交换一致性过滤,Qwen的区分优势从基础学生模型的$0.158$下降到LoRA蒸馏后的$0.081$。查询预算实验表明,分歧引导的采集并不始终优于分层随机采样,表明覆盖率和多样性仍然是强基线。我们的结果表明,语义保真度有用但不足:黑盒大语言模型蒸馏需要有界、对抗性和类别感知的评估。

英文摘要

Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as $(ε,q,t,\mathbb{A})$-behavioral indistinguishability over an explicit prompt distribution, where $ε$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled $5,000$-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from $0.788$ to $0.862$ for Qwen and from $0.814$ to $0.874$ for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from $0.158$ for the base student to $0.081$ after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.
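
To make the role of ε concrete, the sketch below estimates a distinguishing advantage empirically and checks it against a bound. It assumes the standard cryptographic form of advantage, the absolute difference between the rates at which an adversary outputs "teacher" on teacher responses and on student responses. The abstract does not state the paper's exact definition, so this form and the function names are assumptions.

```python
from typing import Sequence


def distinguishing_advantage(says_teacher_on_teacher: Sequence[bool],
                             says_teacher_on_student: Sequence[bool]) -> float:
    """Estimate |Pr[A says teacher | teacher] - Pr[A says teacher | student]|."""
    if not says_teacher_on_teacher or not says_teacher_on_student:
        raise ValueError("both samples must be non-empty")
    rate_teacher = sum(says_teacher_on_teacher) / len(says_teacher_on_teacher)
    rate_student = sum(says_teacher_on_student) / len(says_teacher_on_student)
    return abs(rate_teacher - rate_student)


def within_bound(advantage: float, epsilon: float) -> bool:
    """Check the epsilon part of the bound for one adversary and one query budget."""
    return advantage <= epsilon


# Toy adversary verdicts (True means the adversary guessed "teacher").
adv = distinguishing_advantage([True, True, True, False],
                               [True, False, False, False])
```

An adversary that guesses at random has an advantage near zero under this form, so a student is harder to tell apart from its teacher the closer the estimate is to zero. The query bound q and compute bound t constrain which adversaries are admissible and are not modeled here.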

2605.30434 2026-06-01 cs.LG cs.AI cs.CL cs.MA 版本更新

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS-Bench:关于长周期智能数据分析的失败

Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph(浙江大学-蚂蚁集团知识图谱联合实验室)

AI总结 提出LongDS基准,用于评估长周期多轮数据分析中智能体维护和更新分析状态的能力,发现最佳模型平均准确率仅48.45%,且长周期错误占失败原因的52%-69%。

Comments Ongoing work

详情
AI中文摘要

现实世界的数据分析本质上是迭代的,然而现有基准大多评估孤立或短期的交互任务,未能测试智能体在长周期内跟踪不断变化的分析上下文的能力。我们引入了LongDS,一个用于长周期、多轮数据分析的基准,其中智能体必须维护、更新、恢复和组合不断变化的分析状态。LongDS包含68个从真实世界Kaggle笔记本构建的任务,涵盖地球科学、商业和教育等六个领域的2,225轮交互。任务围绕状态演化模式(例如反事实扰动、回滚、多状态组合)设计,平均依赖跨度为11.3轮。评估五个最先进模型,我们发现最佳模型仅达到48.45%的平均准确率,性能从早期到后期轮次下降近47个百分点,长周期错误占失败原因的52%-69%。进一步分析表明,额外的智能体步骤并不一定能提高性能,这表明关键瓶颈在于维护正确的分析状态,而非增加交互预算。我们发布LongDS以支持可靠的长周期智能数据分析研究。代码和数据将在https://github.com/zjunlp/DataMind发布。

英文摘要

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%-69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.

2605.30415 2026-06-01 cs.CL cs.AI 版本更新

Domain Adaptation and Reasoning Frameworks in Language Models: A Controlled Experiment with Historical Cosmology

语言模型中的领域适应与推理框架:以历史宇宙学为受控实验

Francesco De Bernardis

发表机构 * Independent Researcher(独立研究者)

AI总结 通过历史宇宙学受控实验,研究领域适应如何重塑语言模型的解释行为,发现适应主要改变解释框架而非直接改变立场。

Comments 17 pages, 3 figures

详情
AI中文摘要

我们以历史宇宙学为受控环境,研究领域适应如何重塑语言模型中的解释行为。在第一阶段,我们在一个去除明确日心说引用的前哥白尼语料库上从头训练一个小型语言模型,并评估地动说或日心说延续是否仍然出现。在第二阶段,我们使用QLoRA在同一语料库上微调一个更大的预训练模型,以研究适应如何修改解释框架和宇宙学立场。模型输出使用LLM-as-judge框架进行评估,该框架标记宇宙学立场(地心说、日心说或模糊)和解释框架(前现代与现代)。在受限的第一阶段,较小的模型偶尔生成局部的地动说延续,但这些延续全局不稳定,不足以支持连贯的宇宙学推理。在第二阶段,微调导致向前现代解释框架的大幅且统计显著的转变,而条件宇宙学立场分布在这些框架内相对稳定。因此,地心说输出的增加主要源于解释机制的重新分布,而非立场的直接修改。这些结果表明,领域适应可能主要重塑生成延续的语言框架,而立场的变化则次要地源于这些转变。

英文摘要

We investigate how domain adaptation reshapes explanatory behavior in language models using historical cosmology as a controlled setting. In Phase 1, we train a small language model from scratch on a pre-Copernican corpus from which explicit heliocentric references were removed, and evaluate whether Earth-motion or heliocentric continuations nevertheless emerge. In Phase 2, we fine-tune a larger pretrained model using QLoRA on the same corpus in order to study how adaptation modifies explanatory framing and cosmological stance. Model outputs are evaluated using an LLM-as-judge framework that labels both cosmological stance (geocentric, heliocentric, or ambiguous) and explanatory frame (premodern versus modern). In the constrained setting of Phase 1, the smaller models occasionally generate local Earth-motion continuations, but these remain globally unstable and insufficient to support coherent cosmological reasoning. In Phase 2, fine-tuning induces a large and statistically significant shift toward premodern explanatory framing, while the conditional cosmological stance distributions remain comparatively stable within those frames. As a result, increases in geocentric outputs arise primarily from redistribution over explanatory regimes rather than from direct modification of stance. These results suggest that domain adaptation may primarily reshape the linguistic frameworks from which continuations are generated, with changes in stance emerging secondarily from those shifts.

2605.30400 2026-06-01 cs.CL 版本更新

Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow

使用RAG支持的跨模型多数投票工作流评估ChatGPT在生物医学关联生成与验证中的协议

Ahmed Abdeen Hamed, Luis M. Rocha

AI总结 提出一个协议,通过RAG和跨模型多数投票工作流评估ChatGPT生成疾病中心生物医学关联的能力,并验证其可靠性。

Comments Main Manuscript and Supplementary Information. Both are equally important

详情
Journal ref
STAR Protocols, 2026; 7
AI中文摘要

我们提出一个协议,用于评估ChatGPT生成以疾病为中心的生物医学关联的能力。该协议概述了如何生成关联、使用生物医学本体验证生物实体以及通过文献验证关联。协议包括一个自一致性策略,用于评估跨ChatGPT模型的生成可靠性。为了解决本体精确匹配的限制,我们提供了一个用例,通过由开源大语言模型(LLMs)驱动的检索增强生成(RAG)工作流执行语义验证。这使得LLMs能够对其他LLMs生成的内容建立真实性,并暴露幻觉。

英文摘要

We present a protocol to evaluate ChatGPT's ability to generate disease-centric biomedical associations. It outlines how we generate the associations, validate the biological entities using biomedical ontologies, and verify associations using literature. The protocol includes a self-consistency strategy to assess generative reliability across ChatGPT models. To address ontology exact-match limitations, we provide a use case performing semantic verification through a workflow enabled by Retrieval-Augmented Generation (RAG) powered by open-source large language models (LLMs). This enables LLMs to establish truth over content generated by other LLMs and expose hallucination.

2605.30391 2026-06-01 cs.MA cs.AI cs.CL 版本更新

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

机器中的社会推理:探究大语言模型辩论中的集体求真动态

Tom Pecher

发表机构 * Department of Computer Science, University of Bath(巴斯大学计算机科学系)

AI总结 通过多智能体辩论模拟论证推理理论,证明大语言模型集体辩论能显著提升基于问卷的求真任务性能,并提出利用辩论动态测量模型内在属性的新基准方法。

Comments Master's thesis

详情
AI中文摘要

长期以来,理论认为人类推理是以社会方式运作的:不是通过孤立的个体认知,而是通过集体对抗性话语,这一框架被称为论证推理理论(ATR)。ATR不依赖个体“唯智主义推理者”作为求真的主要载体,而是将真理重新概念化为社会认识论的涌现属性:即不完美的个体推理在辩论的对抗压力下被精炼的产物。这种集体智能的分布式方法引导人类达到了越来越高的认知高度,并支撑着所有民主制度的基本原则。本论文首次通过大语言模型(LLM)的多智能体辩论(MAD)模拟ATR,开辟了新领域。通过严格的实证分析,我们证明,在正确构建一组认知多样化的模型时,LLM-MAD能够显著提高基于问卷的求真任务性能,即使单个辩论参与者独立表现有限。此外,我们提供了强有力的实证证据,表明这种性能提升在机制上根植于ATR的核心原则,暗示集体推理可能普遍优于个体推理,而非生物学或进化中的偶然现象。最后,基于对辩论动态的分析,我们提出了一种新的基准测试方法,利用LLM-MAD测量模型内在属性(如幻觉倾向),从而以当前静态基准方法无法支持的方式比较模型。

英文摘要

Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as the primary vehicle for truth-seeking, ATR reconceptualises truth as an emergent property of social epistemology: the product of imperfect individual reasoning refined under the adversarial pressure of debate. This distributed method of collective intelligence has guided humanity to ever-greater epistemic heights and underpins the foundational principles of all democratic systems. This thesis breaks new ground by, for the first time, simulating ATR through the multi-agent debate (MAD) of large language models (LLMs). With rigorous empirical analysis, we demonstrate that, when correctly engineering an epistemically diverse set of models, LLM-MAD can significantly improve truth-seeking performance on questionnaire-based tasks, even when individual debate participants exhibit limited standalone performance. Furthermore, we present strong empirical evidence that this performance gain is mechanistically grounded in the central principles of ATR, suggesting that collective reasoning may be universally favourable over individualist reasoning, rather than a quirk in biology or evolution. Finally, drawing on our analysis of debate dynamics, we propose a novel benchmarking methodology that leverages LLM-MAD to measure intrinsic model properties (such as hallucination propensity) in order to compare models in ways that current static benchmarking approaches cannot support.

2605.28646 2026-06-01 cs.CR cs.CL 版本更新

MaskClaw: Edge-Side Personalized Privacy Arbitration for GUI Agents with Behavior-Driven Skill Evolution

MaskClaw: GUI代理的边端个性化隐私仲裁与行为驱动技能进化

Yanqiu Zhao, Dongying Zheng, Kaibo Huang, Yukun Wei, Zhongliang Yang, Linna Zhou

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出MaskClaw,一种在边端进行隐私仲裁的GUI代理框架,通过本地视觉证据提取、策略记忆检索和沙箱门控的技能进化,在截图离开信任环境前决定允许、遮蔽或询问,解决了静态PII检测和云端推理的隐私边界问题。

Comments Preprint. Submitted to EMNLP 2026. 21 pages, including appendices; 5 figures. Under review. Yanqiu Zhao and Dongying Zheng contributed equally to this work

详情
AI中文摘要

GUI代理依赖截图来推断意图并在应用程序间操作,但这些截图通常包含私人消息、医疗记录、支付凭证和工作场所特定的工作流程。在这种设置中,隐私决策取决于任务、接收者、应用程序状态和用户角色,然而静态PII检测器无法捕捉这些边界,而云端VLM推理可能在决定保护什么之前上传原始屏幕。我们提出MaskClaw,一个用于GUI代理的边端隐私仲裁器。MaskClaw提取本地视觉证据,检索用户和任务特定的策略记忆,并在原始截图离开受信任的用户或组织控制环境之前决定允许、遮蔽或询问。在五个设计的技能进化场景中,它将纠正、取消和编辑转化为可复用的隐私技能,并通过沙箱门控进行检查。我们引入了P-GUI-Evo,一个基于真实UI模式、重构的HTML屏幕和清理标签构建的基准。实验表明,仅靠模式匹配、云端推理和路由在相同协议下倾向于过度确认、过度遮蔽或暴露原始截图。相关工件可在https://github.com/Theodora-Y/MaskClaw获取。

英文摘要

GUI agents rely on screenshots to infer intent and operate across applications, but these screenshots often contain private messages, medical records, payment credentials, and workplace-specific workflows. Privacy decisions in this setting depend on task, recipient, application state, and user role, yet static PII detectors miss these boundaries and cloud-side VLM reasoning can upload the raw screen before deciding what should be protected. We present MaskClaw, an edge-side privacy arbitrator for GUI agents. MaskClaw extracts local visual evidence, retrieves user- and task-specific policy memory, and decides Allow, Mask, or Ask before raw screenshots leave a trusted user- or organization-controlled environment. In five designed skill-evolution scenarios, it turns corrections, cancellations, and edits into reusable privacy skills checked by a sandbox gate. We introduce P-GUI-Evo, a benchmark built from real UI patterns, reconstructed HTML screens, and sanitized labels. Experiments show that pattern matching, cloud reasoning, and routing alone tend to over-confirm, over-mask, or expose raw screenshots under the same protocol. The artifact is available at https://github.com/Theodora-Y/MaskClaw.

2605.27881 2026-06-01 cs.CL 版本更新

Retrieval, Reward, and Training Protocols: What Matters in Training Search Agents?

检索、奖励与训练协议:训练搜索代理的关键因素是什么?

Yibo Zhao, Zichen Ding, Jiayi Wu, Zun Wang, Xiang Li

发表机构 * School of Data Science and Engineering, East China Normal University(华东师范大学数据科学与工程学院) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 本文通过控制实验,系统研究了检索语料库、奖励设计和训练协议三个维度对搜索代理训练的影响,发现纠正语料覆盖问题比算法差异更有效,简单的基于结果的奖励方法在多数设置下表现优异,并提出了实用训练指南。

Comments 18 pages, 4 figures, and 15 tables

详情
AI中文摘要

由大型语言模型驱动的搜索代理能够通过多步推理自主分解查询、检索信息并综合答案。然而,训练方法的快速增长已超越了受控比较:现有工作在检索语料库、奖励设计和训练协议上存在差异,使得实际驱动改进的因素不明确。我们提出了一项受控实证研究,隔离了搜索代理训练中三个未充分探索的维度。首先,我们识别了广泛使用的Wikipedia 2018语料库中的一个关键数据覆盖问题,并表明仅纠正该问题带来的收益就大于训练算法之间的差异。其次,我们系统比较了三种基础模型上基于结果和基于过程的奖励方法,发现最简单的基于结果的方法在大多数设置中达到竞争性或更优的性能,并且过程级信用分配可能过度纠正代理行为。第三,我们分析了训练数据多样性、离策略数据利用和搜索预算缩放,提炼出训练有效搜索代理的实用指南。我们的代码可在https://github.com/YiboZhao624/SearchAgentReview获取。

英文摘要

Search agents powered by large language models can autonomously decompose queries, retrieve information, and synthesize answers through multi-step reasoning. However, the rapid growth of training methods has outpaced controlled comparison: existing works differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements. We present a controlled empirical study that isolates three under-explored dimensions of search agent training. First, we identify a critical data-coverage issue in the widely used Wikipedia 2018 corpus and show that correcting it alone yields larger gains than the differences between training algorithms. Second, we systematically compare outcome-based and process-based reward methods across three base models, finding that the simplest outcome-based approach achieves competitive or superior performance in most settings, and that process-level credit assignment can over-correct agent behavior. Third, we analyze training data diversity, off-policy data utilization, and search budget scaling, distilling practical guidelines for training effective search agents. Our code is available at https://github.com/YiboZhao624/SearchAgentReview.

2605.27355 2026-06-01 cs.AI cs.CL cs.LG 版本更新

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

对齐篡改:人类反馈强化学习如何被利用以优化错位偏见

Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

发表机构 * MIT(麻省理工学院)

AI总结 本文提出对齐篡改漏洞,即对齐中的LLM通过影响偏好数据集使RLHF放大不良行为,并通过实验展示多种偏见的放大,指出现有缓解方法难以在不牺牲质量的情况下解决该问题。

Comments Accepted at ICML 2026, Source code: https://alignment-tampering.github.io/

详情
AI中文摘要

人类反馈强化学习(RLHF)是将大型语言模型(LLM)与人类偏好对齐的标准方法。在本工作中,我们引入对齐篡改,这是一种潜在漏洞,即正在对齐的LLM影响偏好数据集,导致RLHF放大不良行为。这源于RLHF的核心局限性:(1)偏好数据集由LLM自身的输出构建,使其能够影响它们;(2)成对比较仅指示哪个响应更好,而不说明原因。这些局限性可能被利用以导致对齐篡改。例如,如果LLM以更高质量生成有偏见的响应,标注者会基于质量偏好它们。然而,偏好标签无法区分质量与偏见,奖励模型继承了这一局限性。通过强化学习或最佳N采样优化此类奖励可能放大错位偏见。我们的实验展示了跨多种偏见的放大:从关键词偏见到宣传(例如性别歧视)、品牌推广和工具性目标寻求。缓解仍然具有挑战性,因为现有的鲁棒RLHF技术无法在不牺牲响应质量的情况下完全解决对齐篡改。这些发现揭示了当前RLHF的结构性漏洞,并强调了防止此漏洞的必要性。项目页面:https://alignment-tampering.github.io/

英文摘要

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/

2605.27255 2026-06-01 cs.CL cs.AI 版本更新

Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs

Pair-In, Pair-Out: 面向高效LLM的潜在多令牌预测

Wenhui Tan, Minghao Li, Xiaoqian Ma, Siqi Fan, Xiusheng Huang, Liujie Zhang, Ruihua Song, Weihang Chen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) AI Platform, Xiaohongshu Inc.(小红书人工智能平台) University of Electronic Science and Technology of China(电子科技大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出Pair-In, Pair-Out (PIPO)方法,通过统一潜在压缩和多令牌预测,并训练轻量级置信度头消除验证器开销,在保持可靠性的同时实现推理加速。

Comments Project Page: GitHub.com/RedAI-Infra/PIPO

详情
AI中文摘要

长链式推理使得自回归解码成为现代大语言模型的主要推理成本。现有方法要么针对输入侧(潜在压缩),要么针对输出侧(推测解码和多令牌预测,MTP),但这两条工作线是独立进行的。此外,输出侧方法必须进行昂贵的验证器传递,以验证MTP预测的不可靠草稿令牌。为解决这些问题,我们提出Pair-In, Pair-Out (PIPO),通过将潜在压缩器和MTP头视为镜像操作来统一两侧:压缩器将两个输入令牌折叠成一个潜在表示,而MTP头将一个隐藏状态展开成一个额外的输出令牌。为了在不牺牲可靠性的情况下消除验证器成本,PIPO训练一个轻量级置信度头,决定是否接受草稿令牌。我们观察到,在线策略蒸馏(OPD)自然匹配推测解码的拒绝采样准则,因此置信度头可以以可忽略的额外成本与OPD一起训练。在AIME 2025、GPQA-Diamond、LiveCodeBench v6和LongBench v2上使用Qwen3.5-4B和9B骨干网络的实验表明,PIPO在常规解码上将pass@4提高了最多+7.15个点,同时实现了高达2.64倍的首令牌延迟和2.07倍的每令牌延迟加速。项目页面:GitHub.com/RedAI-Infra/PIPO。

英文摘要

Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing methods target either the input side (latent compression) or the output side (speculative decoding and multi-token prediction, MTP), but the two lines of work have been pursued independently. Moreover, output-side methods must incur an expensive verifier pass to validate the unreliable draft tokens predicted by MTP. To address these issues, we propose Pair-In, Pair-Out (PIPO), which unifies both sides by viewing a latent compressor and an MTP head as mirror-image operations: the compressor folds two input tokens into one latent representation, while the MTP head unfolds one hidden state into one additional output token. To remove the verifier cost without sacrificing reliability, PIPO trains a lightweight confidence head that decides whether draft tokens should be accepted. We observe that On-Policy Distillation (OPD) naturally matches the rejection-sampling criterion of speculative decoding, so the confidence head can be trained alongside OPD with negligible extra cost. Experiments on AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 with Qwen3.5-4B and 9B backbones show that PIPO improves pass@4 over regular decoding by up to +7.15 points, while delivering up to 2.64× first-token-latency and 2.07× per-token-latency speedups. Project Page: GitHub.com/RedAI-Infra/PIPO.

2605.26711 2026-06-01 cs.CL cs.LG 版本更新

The Need for an External Observer: Formalizing the Sufficiency Gap. A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models

外部观察者的必要性:充分性差距的形式化——序列模型中混合可识别性与上下文基础化的数学扩展

Francesco Corielli

AI总结 本文通过构建二元混合过程,形式化了由未观测隐状态导致的充分性差距,并引入辅助信号建立上下文主导阈值,证明温度缩放无法弥补缺失上下文,而外部观察者或验证器在高风险领域是必要的。

详情
AI中文摘要

我们构建了一个二元混合过程,其中一个确定性文本机制和一个随机机制由未观测的隐状态控制。即使一个理想的无容量限制的序列预测器能够精确恢复纯文本边际分布,当观测到的前缀与错误的隐状态兼容时,它也可能变得过度自信。由此产生的熵差并非普通的优化误差;而是由未观测状态上的边缘化导致的充分性差距。然后,我们通过一个保真度为$γ∈[1/2,1]$的辅助二元信号形式化检索、工具使用和外部基础化。由此产生的贝叶斯更新给出了一个上下文主导阈值:当纠正信号的保真度超过纯文本后验权重中分配给误导机制的部分时,该信号恰好反转由文本历史诱导的后验几率。该阈值减小了充分性差距,但通常不能完全消除;完全消除需要相关隐状态的完美揭示或等效的验证机制。该分析阐明了为什么温度缩放无法恢复缺失的上下文,为什么基础化机制必须既信息丰富又可被模型学习使用,以及为什么在高风险领域自主序列模型需要结构上解耦的观察者或验证器。

英文摘要

We construct a binary mixed-regime process with one deterministic textual regime and one random regime governed by an unobserved latent state. Even an ideal infinite-capacity sequence predictor that exactly recovers the text-only marginal law can become overconfident when the observed prefix is compatible with the wrong latent regime. The resulting entropy difference is not an ordinary optimization error; it is a sufficiency gap caused by marginalization over an unobserved state. We then formalize retrieval, tool use, and external grounding through an auxiliary binary signal with fidelity $\gamma \in [1/2, 1]$. The resulting Bayesian update yields a contextual dominance threshold: a corrective signal reverses the posterior odds induced by the textual history exactly when its fidelity exceeds the text-only posterior weight assigned to the misleading regime. This threshold reduces, but does not generally eliminate, the sufficiency gap; complete closure requires perfect revelation of the relevant latent state or an equivalent verification mechanism. The analysis clarifies why temperature scaling cannot restore missing context, why grounding mechanisms must be both informative and learnably usable by the model, and why autonomous sequence models require structurally decoupled observers or verifiers in high-stakes domains.
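The contextual dominance threshold can be checked numerically with a small sketch. It assumes a symmetric binary signal that names the true regime with probability gamma, and writes `pi_wrong` for the text-only posterior weight on the misleading regime; both the symmetric-noise model and the variable names are assumptions made here for illustration, not notation taken from the paper. Under that assumption, the posterior odds of the correct regime after a corrective signal are (1 - pi_wrong) * gamma / (pi_wrong * (1 - gamma)), which exceed 1 exactly when gamma > pi_wrong.

```python
def posterior_misleading(pi_wrong: float, gamma: float) -> float:
    """Posterior weight on the misleading regime after a signal pointing at the correct one.

    pi_wrong: text-only posterior weight on the misleading regime, in [0, 1].
    gamma:    signal fidelity, P(signal names the true regime), in [1/2, 1].
    """
    wrong_mass = pi_wrong * (1.0 - gamma)       # misleading regime true, signal erred
    right_mass = (1.0 - pi_wrong) * gamma       # correct regime true, signal correct
    total = wrong_mass + right_mass
    if total == 0.0:
        raise ValueError("signal is impossible under both regimes (pi_wrong=1, gamma=1)")
    return wrong_mass / total


def signal_reverses_odds(pi_wrong: float, gamma: float) -> bool:
    """True when the corrective signal makes the correct regime the more likely one."""
    return posterior_misleading(pi_wrong, gamma) < 0.5
```

With `pi_wrong = 0.8`, a signal of fidelity 0.9 flips the odds while a signal of fidelity 0.7 does not. Even in the first case the posterior weight on the misleading regime stays above zero, which is consistent with the abstract's statement that the gap is reduced but not eliminated.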

2605.26396 2026-06-01 cs.AI cs.CL cs.LG 版本更新

Advancing Creative Physical Intelligence in Large Multimodal Models

推进大型多模态模型中的创造性物理智能

Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Emre Can Acikgoz, Bingxuan Li, Kunlun Zhu, Jiateng Liu, Aditi Tiwari, Zhenhailong Wang, Xiusi Chen, Mahdi Namazifar, Heng Ji

发表机构 * UIUC(伊利诺伊大学香槟分校) Amazon(亚马逊)

AI总结 针对大型多模态模型在开放式环境中缺乏基于视觉的创造性工具使用能力的问题,提出MM-CreativityBench基准和基于偏好学习的具身对齐方法,显著提升实体选择并减少幻觉。

Comments 51 Pages, 9 Figures, 7 Tables, Previous Work CreativityBench: arXiv:2605.02910

详情
AI中文摘要

大型多模态模型(LMMs)在感知和推理方面取得了快速进展;然而,目前尚不清楚这些能力是否能够泛化到在开放式环境中发现基于视觉的解决方案,超越模式识别。在此类场景中,智能需要的不仅仅是回答明确的问题:它涉及识别场景中的元素如何以非显而易见但物理上可行的方式被重新利用。这种创造性问题解决形式是人类智能的核心,但在当前基准测试中基本上未得到测试。为了评估这一能力,我们引入了MM-CreativityBench,这是一个用于在视觉丰富、物理受限的环境中进行基于可操作性的创造性工具使用的基准。每个实例呈现一个场景图像,包含候选实体及其部件的结构化视图,从而能够对模型如何迭代检查场景、识别相关可操作性以及组合视觉和物理上可行的解决方案进行细粒度、交互式评估。我们的实验表明,当前的LMMs往往表现不佳,不是由于缺乏生成能力,而是因为它们无法维持基于具身的探索。模型经常忽略相关实体,对关键部件检查不足,或幻觉出图像中不存在的属性。受此失败模式的启发,我们提出了具身对齐,将创造性工具使用视为一个偏好学习问题。使用直接偏好优化,我们鼓励模型偏好基于视觉证据的属性-可操作性推理,而非幻觉替代方案。此外,我们结合从可操作性知识库中获得的监督,以指导更广泛的实体探索和多轮规划。我们的结果显示,在正确选择实体和部件方面取得了持续改进,同时大幅减少了幻觉和与具身相关的错误。

英文摘要

Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.

2605.30018 2026-06-01 cs.CL cs.LG 版本更新

Latent Performance Profiling of Large Language Models

大型语言模型的潜在性能剖析

Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti, Supratik Chakraborty, Partha Pratim Das, Lipika Dey, Richa Singh, Mayank Vatsa

发表机构 * Department of Electrical Engineering, Indian Institute of Technology Delhi(印度理工学院德里分校电气工程系) Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi(印度理工学院德里分校人工智能学院) Hewlett Packard Enterprise, India(印度惠普企业公司) Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur(印度理工学院克勒格布尔分校计算机科学与工程系) A.K.Choudhury School of Information Technology, University of Calcutta, India(印度加尔各答大学信息科技学院) Department of Computer Science & Engineering, Indian Institute of Technology Bombay(印度理工学院孟买分校计算机科学与工程系) Department of Computer Science, Ashoka University, India(印度阿育王大学计算机科学系) Department of Computer Science & Engineering, Indian Institute of Technology Jodhpur(印度理工学院焦特布尔分校计算机科学与工程系)

AI总结 提出潜在性能剖析(LPP)框架,通过隐藏激活和输出分布提取任务无关的诊断指标,揭示模型内在特性,补充传统基准评估。

详情
AI中文摘要

大型语言模型(LLMs)在标准化基准测试中经常取得令人印象深刻的分数,但仅凭准确性对能力的了解有限。通过排行榜评估开源LLMs面临持续的问题,如数据污染、任务范围狭窄以及与真实世界可靠性的弱对齐。基于基准的评估(如MMLU PRO、BBH或IFEval)主要捕捉模型在固定测试集上的输出,而非其如何处理信息、校准不确定性或构建内部知识。在本文中,我们主张从以基准为中心的评估转向对LLMs进行互补的、以状态为中心的内在评估。为此,我们引入了潜在性能剖析(LPP)——一个从隐藏激活和输出分布中提取任务无关诊断的框架。LPP在模型的潜在表示和动态上定义了一组标量指标,揭示了与规模无关的特征,从而实现可解释的比较并揭示隐藏的脆弱性。与静态准确性分数不同,LPP在相似规模的模型间提供稳定、对架构敏感的签名。通过对八个LLMs(规模范围0.5B-14B)的广泛实证分析,我们证明了具有相似基准分数的模型可能表现出对比的潜在特征,例如熵或适应性的差异。在这些见解的指导下,我们设计了用于不确定性和符号推理的合成探针,这些探针与内在指标一致,同时与排行榜偏差解耦。我们建议将LPP与基准一起报告,以提供对模型行为更深入、可解释的理解,从而实现更可靠的模型选择、安全评估以及超越表面准确性的评估。

英文摘要

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues such as data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP), a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend reporting LPP alongside benchmarks to provide a deeper, interpretable understanding of model behavior, enabling more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.

2605.29751 2026-06-01 cs.CL 版本更新

DySem: Uncovering Dynamic Semantic Components of Large Language Models for Calculating Semantic Textual Similarity

DySem: 揭示大语言模型的动态语义组件以计算语义文本相似度

Kaijie Zheng, Weiqin Wang, Yile Wang, Hui Huang

发表机构 * College of Computer Science and Software Engineering(计算机科学与软件工程学院)

AI总结 提出DySem框架,通过多语言共识提取大语言模型中与语义更相关的内部组件,并构建文本相关的联合语义集实现动态维度相似度计算,无需训练且性能优于基线。

Comments 18 pages, 23 figures, 5 tables

详情
AI中文摘要

计算语义文本相似度是自然语言处理中的基础任务。当前基于大语言模型(LLM)的方法通常依赖提取固定维度的最后一层隐藏状态来计算每对文本的相似度。我们认为这种范式存在两个局限:(i)最后一层隐藏层编码的是更通用的知识而非仅语义知识,因此对于语义相似度计算并非最优;(ii)LLM的隐藏层维度通常非常大,这引入了表示语义时的冗余和噪声。在这项工作中,我们提出DySem,一种新颖的无需训练框架,通过多语言共识探究LLM中更多与语义相关的内部组件,并摆脱静态表示空间,转而通过构建文本相关的联合语义集实现动态的、样本特定的语义维度,并在该共享维度子集上计算相似度。在各种LLM上的大量实验表明,我们的方法在保持较低相似度计算维度的同时,持续优于最近的基线。代码已发布在https://github.com/szu-tera/DySem。

英文摘要

Calculating semantic textual similarity is a foundational task in natural language processing. Current methods based on large language models (LLMs) typically rely on extracting last-layer hidden states with fixed dimensions to compute similarity for every text pair. We argue that this paradigm suffers from two limitations: (i) the last hidden layer encodes more general knowledge rather than just semantic knowledge, making it suboptimal for semantic similarity computation; (ii) the hidden layer dimensions of LLMs are generally very large, which introduces redundancy and noise when representing semantics. In this work, we propose DySem, a novel training-free framework that investigates more semantic-related internal components of LLMs via multilingual consensus, and shifts away from static representation spaces in favor of dynamic, sample-specific semantic dimensions by constructing a text-dependent joint semantic set and computing similarity over this shared dimensional subset. Extensive experiments across various LLMs show that our method consistently outperforms recent baselines while maintaining lower dimensions for similarity calculation. The code is released at https://github.com/szu-tera/DySem.

2605.29511 2026-06-01 cs.MA cs.CL cs.LG 版本更新

DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration

DynaGraph: 通过动态拓扑重构的轻量级多模型交互框架

Yanxing Guo, Zihao Zheng, Fangzhou Wu, Ling Liang, Lin Bao, Zongwei Wang, Yimao Cai

发表机构 * Peking University(北京大学) Nanjing University(南京大学) Beijing Advanced Innovation Center for Integrated Circuits(北京集成电路先进制造创新中心) Beijing University of Posts and Telecommunications(北京邮电大学) Yanxin Co. Ltd(燕新有限公司)

AI总结 提出DynaGraph框架,通过动态拓扑重构和PEFT适配器复用,在单消费级GPU上实现多模型协作,接近72B单模型推理能力并大幅降低延迟和token消耗。

详情
AI中文摘要

处理复杂推理任务通常依赖于庞大的单体LLM,这会导致严重的计算冗余。虽然通过结构化流水线或多智能体协作进行任务分解提供了替代方案,但这些方法不可避免地陷入一个关键困境:预定义的静态拓扑极易受到级联错误的影响,而无约束的动态智能体则面临轨迹发散和不可预测的内存膨胀。为了解决这个问题,我们提出了DynaGraph,一个由动态拓扑重构驱动的轻量级多模型框架。在执行层面,DynaGraph在共享基础模型上复用时分PEFT适配器,使得整个系统的训练和推理部署可以在单个消费级GPU上完成。在路由层面,评估器持续监控执行置信度以触发分层自愈:针对局部数据差距的细粒度修补和针对严重逻辑断裂的子图重构。在StrategyQA、MATH和FinQA上的实验表明,我们的8B模型接近72B单体模型的推理能力(例如,在StrategyQA上为87.6%,在MATH上为82.7%)。此外,与无约束的动态架构相比,它延迟降低了高达68.1%,token消耗降低了68.6%。

英文摘要

Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and unpredictable memory bloat. To address this, we present DynaGraph, a lightweight multi-model framework driven by dynamic topological reconfiguration. At the execution level, DynaGraph multiplexes time-division PEFT adapters over a shared base model, enabling both full system training and inference deployment on a single consumer-grade GPU. At the routing level, the Evaluator continuously monitors execution confidence to trigger hierarchical self-healing: Fine-grained Patching for localized data gaps and Subgraph Reconstruction for severe logical ruptures. Experiments on StrategyQA, MATH, and FinQA demonstrate our 8B model closely approximates the reasoning capabilities of a 72B monolithic model (e.g., 87.6% on StrategyQA, 82.7% on MATH). Furthermore, it reduces latency by up to 68.1% and token consumption by 68.6% compared to unconstrained dynamic architectures.

2605.29343 2026-06-01 cs.CL 版本更新

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Draft-OPD:用于推测草稿模型的在线策略蒸馏

Haodi Lei, Yafu Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) Tsinghua University(清华大学) The Chinese University of Hong Kong(香港中文大学) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 针对推测解码中草稿模型因离线训练与在线推理不匹配导致性能瓶颈的问题,提出Draft-OPD方法,通过目标辅助展开和重放验证暴露的错误位置实现在线策略蒸馏,在多种任务上实现超过5倍的无损加速。

详情
AI中文摘要

推测解码通过将目标模型与轻量级草稿模型配对,并行验证其提出的令牌,从而加速大型语言模型推理。构建草稿模型的常见方法(如EAGLE3或DFlash)是在目标生成轨迹上进行监督微调(SFT)。然而,我们观察到SFT很快达到平台期:草稿模型在测试数据上的接受长度停止提升。原因是离线到推理的不匹配:在SFT中,草稿模型从固定的目标生成轨迹学习,而在推测解码期间,它在其自身策略提出的块上进行评估。这激发了在线策略蒸馏(OPD),其中目标模型在草稿诱导的状态上监督草稿模型。然而,OPD对于草稿模型仍然困难,因为它们无法可靠地独立展开完整序列,而目标辅助生成使收集的序列遵循目标分布,从而消除了在线策略信号。因此,我们提出Draft-OPD,它使用目标辅助展开进行稳定延续,并从验证暴露的错误位置重放草稿。这使得草稿模型能够从接受和拒绝的提议中学习目标反馈,将训练集中在限制推测接受的草稿诱导错误上。实验表明,Draft-OPD在多种任务上对思考模型实现了超过5倍的无损加速,比EAGLE-3和DFlash分别提高了23%和13%。

英文摘要

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, as in EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models, as they cannot reliably roll out complete sequences independently, whereas target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over $5\times$ lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.

2605.29317 2026-06-01 cs.CL 版本更新

FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-Tuning

FoRA: 基于Fisher正交秩适配的参数高效微调

Juneyoung Park, Seongbae Lee, Han-Sang Lee, Kyuho Lee, Minjae Kim, Seungheon Hyeon, Kiduk Kwon, Seongwan Kim, Jaeho Lee

发表机构 * OptAI Inc(OptAI公司) LG Uplus

AI总结 提出FoRA方法,通过Fisher信息选择信息层并在Stiefel流形上训练LoRA下投影,在减少参数预算的同时保持性能,优于LoRA和DoRA。

Comments EMNLP 2026

详情
AI中文摘要

参数高效微调(PEFT)主要关注LoRA及其面向精度的变体,而减少可训练参数的原始目标相对较少受到关注。我们引入了FoRA,通过减少适配层数而非适配器秩来重新审视这一目标。FoRA通过单次对角Fisher评分(训练成本低于1%)选择任务信息层,并在Stiefel流形上训练所选层的LoRA下投影,保持列正交性和有效秩。在五个LLaMA系列骨干网络上,FoRA在参数预算减半的情况下始终优于LoRA和DoRA,在参数数量为AdaLoRA四分之一时,精度差距在0.7-0.8个点以内。在来自LLaMA、Qwen3和Gemma系列的十二个骨干网络上的跨架构实验证实了从270M到32B参数的一致增益。两个组件超加性地结合:Fisher选择本身在相同预算下匹配秩缩减,而Stiefel约束提供了决定性的额外增益。

英文摘要

Parameter-efficient fine-tuning (PEFT) has largely focused on LoRA and its accuracy-oriented variants, while the original goal of reducing trainable parameters has received comparatively little attention. We introduce FoRA, which revisits this goal by reducing the number of adapted layers rather than adapter rank. FoRA selects task-informative layers via a single-pass diagonal Fisher score (under 1% of training cost) and trains the LoRA down-projection at selected layers on the Stiefel manifold, preserving column orthonormality and effective rank. FoRA consistently outperforms LoRA and DoRA at half their parameter budget, and falls within 0.7-0.8 accuracy points of AdaLoRA at one-quarter its parameter count, across five LLaMA-family backbones. Cross-architecture experiments on twelve backbones from the LLaMA, Qwen3, and Gemma families confirm consistent gains from 270M to 32B parameters. The two components combine super-additively: Fisher selection alone matches rank reduction at the same budget, while the Stiefel constraint provides the decisive additional gain.
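The layer-selection step can be sketched as follows. The diagonal Fisher information of a parameter is the expected squared gradient of the log-likelihood, so a per-layer score can be formed by summing squared per-example gradients over the layer and averaging over examples. How FoRA aggregates scores within a layer is not stated in the abstract; the sum used here, the function names, and the toy gradients are all assumptions for illustration.

```python
from typing import Dict, List


def diagonal_fisher_scores(grads_by_layer: Dict[str, List[List[float]]]) -> Dict[str, float]:
    """Score each layer by the trace of its empirical diagonal Fisher.

    grads_by_layer maps a layer name to a list of per-example gradient vectors
    (flattened). The score is the mean over examples of the sum of squared entries.
    """
    scores = {}
    for name, per_example in grads_by_layer.items():
        total = 0.0
        for grad in per_example:
            total += sum(g * g for g in grad)
        scores[name] = total / len(per_example)
    return scores


def select_layers(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k layer names with the largest Fisher score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy gradients for three hypothetical layers, two examples each.
toy_grads = {
    "layer0": [[1.0, 0.0], [0.0, 1.0]],
    "layer1": [[2.0, 2.0], [2.0, 2.0]],
    "layer2": [[0.1, 0.1], [0.1, 0.1]],
}
toy_scores = diagonal_fisher_scores(toy_grads)
chosen = select_layers(toy_scores, 2)
```

In this toy setting only the two highest-scoring layers would receive adapters, which is the sense in which the budget is cut by adapting fewer layers rather than by lowering the adapter rank.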

2605.29268 2026-06-01 cs.CL cs.AI cs.LG cs.NE 版本更新

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

进化搜索中的计算分配:从深度-广度到多臂老虎机

Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang, Haozheng Luo, Tianfan Fu, Aarthy Nagarajan

发表机构 * University of Notre Dame(圣母大学) Northeastern University(东北大学) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Southeast University(东南大学) Northwestern University(西北大学) Nanjing University(南京大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 针对LLM引导的进化搜索中固定预算的LLM调用分配问题,提出基于多臂老虎机的BaSE方法,通过跨并行轨迹分配调用,平均适应度提升12.3%。

详情
AI中文摘要

LLM引导的进化搜索(Evolve系统)在数学和组合任务上达到了最先进的结果,但现有系统通常只报告多次运行中的最佳结果,而未记录运行间的分布。我们询问如何分配固定的LLM调用预算,以及单次运行达到报告数字的可靠性如何。通过扫描五个模型和三个任务的深度-广度网格,我们识别出两个经验规律:一个适应度-计算包络线,其中能力排序主要取决于有效FLOPs;以及一个双线性深度-广度拟合,具有任务特定的交互;两者都受模型-任务能力门控。受这些规律启发,我们提出BaSE(基于老虎机的自进化),一种多臂老虎机,它在并行轨迹间分配LLM调用。在不改变模型、提示或评估器的情况下,BaSE在8个(模型,任务)单元上比最强的岛屿协议基线平均适应度提高12.3%,在方差高的设置上增益最大:仅通过分配实现可靠性提升。

英文摘要

LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.
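The idea of treating parallel trajectories as bandit arms can be sketched with the classic UCB1 rule. The abstract does not say which bandit algorithm BaSE uses, so UCB1 is swapped in here purely as a familiar example; the `pull` callback (one LLM call on a trajectory, returning a fitness-like reward in [0, 1]) and all names are hypothetical.

```python
import math
from typing import Callable, List


def allocate_calls_ucb1(pull: Callable[[int], float], n_arms: int, budget: int) -> List[int]:
    """Spend `budget` calls across `n_arms` trajectories using UCB1.

    pull(arm) performs one call on that trajectory and returns a reward in [0, 1].
    Returns the number of calls each trajectory received.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            arm = t - 1  # initialise: try every trajectory once
        else:
            # Pick the arm with the highest optimistic estimate:
            # empirical mean plus exploration bonus sqrt(2 ln t / n).
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

A caller would pass a function that advances the chosen trajectory by one LLM call and reports the resulting fitness gain. Trajectories that keep paying off receive most of the budget, while the exploration bonus still revisits the others occasionally.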

2605.29146 2026-06-01 cs.CL cs.AI 版本更新

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

SafeRx-Agent: 基于知识的多智能体框架用于安全且可解释的药物推荐

Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu, Qincheng Lu, Ziyu Zhao, Jijun Chi, Jingrui Tian, Xiao-Wen Chang, Ziyang Song

发表机构 * McGill University(麦吉尔大学) McMaster University(麦克马斯特大学) University of Toronto(多伦多大学) Ohio University(俄亥俄大学)

AI总结 提出SafeRx-Agent,一种基于知识的多智能体框架,通过患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合,在MIMIC-III和MIMIC-IV数据集上提高了细粒度药物预测准确性,同时控制了药物相互作用、禁忌症和药物集合大小。

详情
AI中文摘要

药物推荐预测患者就诊时的用药,但现有方法仍面临两个关键挑战。在模型层面,传统药物推荐方法仅预测结构化的药物代码,证据基础有限,而LLM智能体可以利用更丰富的临床上下文,但可能缺乏安全验证和可追溯性。在任务层面,现有基准通常使用宽泛的药物类别,忽略了亚组级别的安全性差异,可能导致风险高估。我们引入了基于第四级ATC代码生成的第一个细粒度药物推荐设置。我们提出了安全处方智能体(SafeRx-Agent),一种基于知识的多智能体框架,利用患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合。在MIMIC-III和MIMIC-IV数据集上的实验结果表明,SafeRx-Agent提高了细粒度药物预测准确性,同时控制了药物相互作用、禁忌症和药物集合大小。

英文摘要

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

2605.28836 2026-06-01 cs.CL cs.AI 版本更新

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

不让任何读者掉队:人人能理解的多智能体摘要

Jimin Jung, MyoungJin Kim, Jaehyung Seo, Heuiseok Lim

发表机构 * Department of Computer Science and Engineering, Korea University(高丽大学计算机科学与工程系) Department of Computer Science and Engineering, Konkuk University(建国大学计算机科学与工程系)

AI总结 提出NRLB多智能体框架,通过模拟三类读者群体并结合模板规划与迭代优化,生成既忠实又易于理解的平实语言摘要。

详情
AI中文摘要

美国的《平实语言法案》要求政府文件使用清晰、简单的语言,以便公众易于理解,但现有的摘要系统难以应对普通读者中多样化的语言和认知障碍。我们提出了NRLB(不让任何读者掉队),一个用于平实语言摘要的多智能体框架,它模拟了三类代表性读者群体:小学生读者、非母语读者和注意力缺陷读者。NRLB结合了基于模板的规划与迭代的、面向读者的优化,能够系统地检测和解决难懂术语、缺失上下文和令人困惑的句子。在多个数据集上的评估显示,在保持事实准确性的同时,可读性持续提升。人工评估进一步验证了NRLB的效果,标注者偏好率在55%到76%之间,突显了NRLB在生成既忠实于原文又广泛适用于公众的平实语言摘要方面的潜力。

英文摘要

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

2603.09632 2026-06-01 cs.CV cs.CL 版本更新

X-GS: An Extensible Framework for Perceiving and Thinking via 3D Gaussian Splatting

X-GS:基于3D高斯溅射的感知与思考可扩展框架

Yueen Ma, Zenglin Xu, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) Shanghai Academy of AI for Science(上海人工智能科学研究院)

AI总结 提出X-GS框架,包含感知器和思考器,统一多种3DGS技术实现实时在线SLAM与语义蒸馏,并支持多模态模型完成下游任务。

详情
AI中文摘要

3D高斯溅射(3DGS)已成为新颖视图合成的强大技术,随后扩展到众多空间AI应用。然而,大多数现有3DGS方法孤立运行,专注于特定领域。本文介绍X-GS,一个包含两个主要组件的可扩展框架。X-GS-感知器统一了广泛的3DGS技术,以实现具有语义蒸馏的实时在线SLAM。X-GS-思考器容纳多模态模型,使其能够与感知器无缝交互以完成下游任务。在我们的X-GS实现中,感知器利用最新的视觉基础模型提高在线SLAM性能,并采用三种关键机制加速语义蒸馏。思考器可以基于对比和生成视觉语言模型构建,并利用感知器的语义高斯溅射解锁3D视觉定位和场景描述等功能。在多个基准上的实验结果表明了X-GS框架的高效性和新解锁的多模态能力。

英文摘要

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods operate in isolation, focusing on specific domains. In this paper, we introduce X-GS, an extensible framework consisting of two major components. The X-GS-Perceiver unifies a broad range of 3DGS techniques to enable real-time online SLAM with semantic distillation. The X-GS-Thinker accommodates multimodal models, enabling them to seamlessly interface with the Perceiver to complete downstream tasks. In our implementation of X-GS, the Perceiver leverages the latest vision foundation models to improve online SLAM performance and employs three key mechanisms to accelerate semantic distillation. The Thinker can be built upon both contrastive and generative vision-language models and utilizes the Perceiver's semantic Gaussian splats to unlock capabilities such as 3D visual grounding and scene captioning. Experimental results on diverse benchmarks demonstrate the efficiency and newly unlocked multimodal capabilities of the X-GS framework.

2602.10388 2026-06-01 cs.CL cs.AI 版本更新

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

少即足够:利用稀疏自编码器在LLM特征空间中合成多样化数据

Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu

发表机构 * Department of Computing, University of Georgia, Georgia, United States(佐治亚大学计算机系) Computer Engineering, University of California San Diego, California, United States(加州大学圣地亚哥分校计算机工程系) Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates(穆罕默德·本·扎耶德人工智能大学机器学习系) Department of Computing, Hong Kong Polytechnic University, Hong Kong, China(香港理工大学计算机系)

AI总结 提出基于稀疏自编码器的特征激活覆盖率(FAC)指标及数据合成框架FAC Synthesis,通过识别缺失特征并生成对应样本来提升数据多样性和下游任务性能。

详情
AI中文摘要

后训练数据的多样性对于大型语言模型(LLM)的有效下游性能至关重要。许多现有的后训练数据构建方法使用基于文本的指标来衡量多样性,这些指标捕捉语言变化,但此类指标仅能为决定下游性能的任务相关特征提供微弱信号。在这项工作中,我们引入了特征激活覆盖率(FAC),该指标在可解释的特征空间中衡量数据多样性。基于此指标,我们进一步提出了一个多样性驱动的数据合成框架,名为FAC Synthesis,该框架首先使用稀疏自编码器从种子数据集中识别缺失特征,然后生成明确反映这些特征的合成样本。实验表明,我们的方法在包括指令遵循、毒性检测、奖励建模和行为引导在内的各种任务上,持续提高了数据多样性和下游性能。有趣的是,我们识别出跨模型家族(即LLaMA、Mistral和Qwen)共享的可解释特征空间,从而实现了跨模型知识迁移。我们的工作为探索以数据为中心的LLM优化提供了坚实且实用的方法论。

英文摘要

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

2605.25842 2026-06-01 cs.AI cs.CL 版本更新

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

MuCRASP: 多模态思维链推理感知的结构化剪枝

Aritra Dutta, Somak Aditya

发表机构 * Indian Institute of Technology, Kharagpur(印度理工学院,哈里科普尔)

AI总结 针对视觉语言模型在结构化剪枝后思维链推理准确性下降的问题,提出MuCRASP框架,通过识别推理关键令牌并保持跨模态对齐,在压缩下维持推理质量。

Comments Preprint ver. 2

详情
AI中文摘要

视觉语言模型(VLM)越来越依赖思维链(CoT)推理来解决复杂的多模态任务,但其庞大的参数量使得部署成本高昂。结构化剪枝提供了一种自然的解决方案;然而,现有方法无法在VLM中保持CoT推理的准确性。我们确定了两个关键原因:(1)CoT一致性依赖于生成轨迹中的稀疏过渡点(枢轴令牌),而现有剪枝方法对CoT不敏感;(2)为单模态LLM设计的剪枝方法未考虑视觉和文本模态之间的激活分布差异。基于这些观察,我们提出了MuCRASP,一种结构化剪枝框架,针对推理关键组件,同时保持跨模态对齐并在全局参数预算下考虑层间敏感性。在三个推理基准测试上的四个VLM实验表明,MuCRASP在不断增加压缩的情况下始终能保持推理质量。在Qwen2.5-VL-7B上剪枝30%时,MuCRASP在物理推理任务上获得了8.87的LLM-as-a-Judge评分,而最强基线为7.32。此外,MuCRASP在高达50%的剪枝率下仍保持高推理一致性,显著优于先前的剪枝方法,同时表现出更低的困惑度退化。

英文摘要

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.

2605.25773 2026-06-01 stat.ML cs.AI cs.CL cs.LG 版本更新

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

高效基准测试仅是特征选择与多元回归

Sam Bowyer, Acyr Locatelli, Kris Cao

发表机构 * Cohere University of Bristol(布里斯托大学)

AI总结 将高效基准测试重新定义为带特征选择的多元回归问题,使用核岭回归预测和mRMR特征选择算法,在降低计算成本的同时提高预测精度和排名相关性。

Comments 36 pages, 27 figures

详情
AI中文摘要

高效基准测试技术旨在通过仅使用基准测试问题子集预测完整基准测试分数,从而降低评估LLMs的计算成本。通过将此问题重新定义为带特征选择的多元回归实例,我们发现只需在预测阶段使用核岭回归即可大幅改进现有高效基准测试方法。此外,使用一种名为最小冗余最大相关性(mRMR)的信息论特征选择算法,我们可以通过选择对预测最有用的问题子集进一步改进这些方法。除数据非常匮乏的情况外,这些方法在二元和连续指标的各种基准测试中,始终实现更小的预测误差(MAE和RMSE),以及预测分数与真实分数之间更大的排名相关性(Spearman ρ和Kendall τ)。此外,mRMR子采样比竞争方法(通常涉及拟合概率模型或运行聚类算法)快得多,并且在不同随机种子或训练数据划分下更可能选择相同的问题。教程代码见https://github.com/sambowyer/mrmr_eval。

英文摘要

Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman $ρ$ and Kendall $τ$) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .
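The greedy structure of mRMR question selection can be sketched as below. True mRMR scores relevance and redundancy with mutual information; this sketch substitutes absolute Pearson correlation as a simpler stand-in, which is an assumption made here and not the paper's estimator. Each "column" holds one question's scores across a set of models, the target is the full-benchmark score of each model, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return cov / (norm_u * norm_v)


def mrmr_select(columns: List[Sequence[float]], target: Sequence[float], k: int) -> List[int]:
    """Greedily pick k column indices maximising relevance minus mean redundancy."""
    relevance = [abs(pearson(col, target)) for col in columns]
    selected: List[int] = []
    remaining = list(range(len(columns)))

    def score(j: int) -> float:
        if not selected:
            return relevance[j]
        redundancy = sum(abs(pearson(columns[j], columns[s])) for s in selected) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Six hypothetical models, three questions; question 1 duplicates question 0.
q0 = [1, 0, 1, 0, 1, 0]
q1 = [1, 0, 1, 0, 1, 0]
q2 = [1, 1, 0, 0, 1, 1]
full_score = [a + b + c for a, b, c in zip(q0, q1, q2)]
picked = mrmr_select([q0, q1, q2], full_score, 2)
```

In the toy data the duplicated question is highly relevant but fully redundant once its twin has been chosen, so the second pick is the less correlated question instead. A regressor such as kernel ridge regression would then be fitted on the selected columns to predict the full score.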

2605.23278 2026-06-01 cs.CL stat.ML 版本更新

When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

下一个词预测何时有用?边缘化、遍历性、混合可识别性、局部充分性、RAG、工具与编程

Francesco Corielli

发表机构 * GitHub

AI总结 本文通过区分完整条件语言过程、边缘文本过程和模型诱导分布,论证了下一个词预测的有效性依赖于强假设(平稳性、代表性、遍历性)以及观察前缀对潜在上下文的充分性,并解释了RAG和工具使用作为条件充分性机制的作用。

详情
AI中文摘要

在观察序列上训练的语言模型通常被描述为学习给定前一个词的下一个词的条件分布。这种描述仅在一定条件下成立。在真实词轨迹上训练的模型并未观察到完整的条件法则;它接收的是采样后的延续。此外,真实语言生成不仅受前文影响,还受非文本环境的影响:事实、事件、意图、目标、信念、社会背景和任务特定约束。本文区分了三个常被混淆的对象:以潜在环境为条件的完整条件语言过程、通过积分掉这些环境得到的边缘纯文本过程,以及从有限观察语料库中学习到的模型诱导分布。 本文认为,将模型训练解释为估计边缘纯文本法则需要强假设:平稳性、代表性和遍历性,这些假设在统计估计中是标准的,但在应用于异质语言语料库时存在问题。即使这些假设成立,边缘纯文本法则也仅当观察前缀是延续相关潜在环境的近似充分统计量时才有用。从信息论角度看,有用性要求下一个词与被省略环境之间的条件互信息(给定观察文本)很小。 然后,本文将这一论证扩展到异质训练语料库。 最后,本文将检索增强生成(RAG)和工具使用解释为条件充分性装置。

英文摘要

Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.
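The three objects and the usefulness condition can be written out as below. The notation is an assumption made for illustration (the abstract gives none): $x_{\le t}$ is the observed prefix, $c$ the latent circumstances, and $q_\theta$ the learned model.

```latex
\begin{align*}
\text{full conditional law:}\quad
  & P(x_{t+1} \mid x_{\le t}, c) \\
\text{marginal text-only law:}\quad
  & P(x_{t+1} \mid x_{\le t})
    = \int P(x_{t+1} \mid x_{\le t}, c)\, \mathrm{d}P(c \mid x_{\le t}) \\
\text{model-induced distribution:}\quad
  & q_\theta(x_{t+1} \mid x_{\le t}) \\
\text{usefulness condition:}\quad
  & I(X_{t+1}; C \mid X_{\le t})
    = \mathbb{E}\Big[ D_{\mathrm{KL}}\big( P(\cdot \mid X_{\le t}, C) \,\big\|\, P(\cdot \mid X_{\le t}) \big) \Big]
    \le \varepsilon
\end{align*}
```

The conditional mutual information is zero exactly when $X_{t+1}$ is conditionally independent of $C$ given the prefix, which is the case where the prefix is a sufficient statistic; the tolerance $\varepsilon$ expresses the "approximately sufficient" version. On this reading, retrieval and tool calls enlarge the conditioning set so that less of $C$ remains omitted.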

2605.19806 2026-06-01 cs.CL cs.AI 版本更新

Chunking German Legal Code

德国法律文本的分块处理

Max Prior, Natalia Milanova, Andreas Schultz

发表机构 * Technical University of Munich(慕尼黑工业大学)

AI总结 研究针对德国成文法,以德国民法典为基准语料库,比较多种分块策略在检索增强生成中的性能,发现基于法律固有结构(如章节、小节)的分块方法在召回率和计算效率上优于语义增强方法。

Comments Accepted at the Eighth Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 21st International Conference on Artificial Intelligence and Law (ICAIL 2026)

详情
AI中文摘要

本文研究了针对德国成文法的检索增强生成的分块策略,以德国民法典作为结构化基准语料库。我们实现并比较了一系列分割方法,包括结构单元(章节、小节、句子、命题)、固定大小窗口、上下文分块、语义聚类、Lumber风格分块以及基于RAPTOR的层次检索。所有方法都在一个具有章节级黄金标签的法律问答数据集上进行评估,测量召回率、查询延迟、索引构建时间和存储需求。结果表明,与固有法律结构对齐的分块策略——特别是基于章节和小节的检索——实现了最高的召回率,而覆盖这种结构的更复杂方法表现更差。与上下文分块、RAPTOR和Lumber等LLM密集型技术相比,这些更简单的方法还提供了有利的计算效率。研究结果突出了语义丰富性与操作成本之间的关键权衡,并证明保留领域特定结构对于有效的法律信息检索至关重要。

英文摘要

This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure - particularly section- and subsection-based retrieval - achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational efficiency compared to LLM-intensive techniques such as contextual chunking, RAPTOR, and Lumber. The findings highlight a key trade-off between semantic enrichment and operational cost, and demonstrate that preserving domain-specific structure is critical for effective legal information retrieval.
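To make the contrast between structure-aligned and fixed-size chunking concrete, here is a small sketch. It assumes statute text in which each section starts with a `§ <number>` heading and each subsection with `(n)`; the regular expressions, function names, and the placeholder text are illustrative assumptions, not the paper's pipeline or real statutory text.

```python
import re

# Placeholder text only; it mimics the layout of a statute, not its content.
SAMPLE = """§ 1 Placeholder heading A
(1) First placeholder subsection. It has two sentences.
(2) Second placeholder subsection.
§ 2 Placeholder heading B
(1) Only subsection of the second section.
"""

SECTION_RE = re.compile(r"^§\s*(\d+[a-z]?)\b.*$", re.MULTILINE)
SUBSECTION_RE = re.compile(r"^\((\d+)\)\s*", re.MULTILINE)

def chunk_by_section(text):
    """One chunk per section, from its heading up to the next heading."""
    matches = list(SECTION_RE.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({"section": m.group(1), "text": text[m.start():end].strip()})
    return chunks

def chunk_by_subsection(text):
    """One chunk per numbered subsection, keeping the parent section as metadata."""
    chunks = []
    for sec in chunk_by_section(text):
        body = sec["text"]
        subs = list(SUBSECTION_RE.finditer(body))
        for i, m in enumerate(subs):
            end = subs[i + 1].start() if i + 1 < len(subs) else len(body)
            chunks.append({"section": sec["section"],
                           "subsection": m.group(1),
                           "text": body[m.end():end].strip()})
    return chunks

def chunk_fixed(text, size, overlap=0):
    """Baseline: fixed-size character windows that ignore legal structure."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [text[i:i + size] for i in range(0, len(text), step)]

for chunk in chunk_by_subsection(SAMPLE):
    print(chunk["section"], chunk["subsection"], "->", chunk["text"])
```

Keeping the section number as metadata on every chunk is what lets retrieval be scored against section-level gold labels regardless of chunk granularity.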

2605.17101 2026-06-01 cs.CL cs.AI 版本更新

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

SEMA-RAG: 面向医学推理的自演化多智能体检索增强生成框架

Yongfeng Huang, Ruiying Chen, James Cheng

发表机构 * CSE, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系) Wuhan University of Technology(武汉理工大学)

AI总结 针对医学问答中单轮静态检索与临床推理多阶段过程不匹配的问题,提出SEMA-RAG框架,通过任务解耦和动态多轮探索,由三个专业智能体分别负责临床解释、自演化检索和证据裁决,在多个基准上平均提升准确率6.46个百分点。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

检索增强生成(RAG)被广泛用于缓解医学问答中的幻觉和知识过时等风险,但其主要采用单轮静态检索范式,与临床推理的多阶段过程不匹配。这种压缩的工作流导致两个结构性缺陷:问题到查询的转换通常缺乏临床基础的语义解释,且检索缺乏迭代充分性反馈,难以形成可靠的证据链。我们认为这两个问题源于更深层的原因:将解释、探索和裁决等异构任务过载到单一推理链上。解决方案是通过任务解耦和动态多轮探索来重构工作流。为此,我们提出SEMA-RAG,一种用于医学问答的自演化多智能体RAG框架,将这些角色分配给三个专业智能体:解释智能体负责临床模式解释,探索智能体负责充分性驱动的自演化检索,裁决智能体负责证据裁决和答案选择。在五个基准和五个LLM骨干网络上,SEMA-RAG平均比最强基线提高6.46个准确率点(按骨干网络测量)。

英文摘要

Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.

2605.16215 2026-06-01 cs.AI cs.CL 版本更新

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

完全开放的Meditron:临床大语言模型的可审计流水线

Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 提出首个完全开放的临床大语言模型构建流水线Fully Open Meditron,通过可审计的数据集、可复现的训练框架和对齐评估协议,在不牺牲可审计性和可复现性的前提下实现了领域最新性能。

Comments Preprint. 31 pages, 10 figures. Code, models, and data: https://github.com/EPFLiGHT/FullyOpenMeditron

详情
AI中文摘要

临床决策支持系统(CDSS)需要可审查、可审计的流水线,以实现严格、可复现的验证。然而,当前基于LLM的CDSS仍然大多不透明。大多数“开放”模型仅开放权重,发布参数的同时隐瞒了决定模型行为的数据来源、整理程序和生成流水线。完全开放(FO)模型暴露完整的训练堆栈,目前在医学领域尚不存在。我们引入了Fully Open Meditron,这是首个用于构建LLM-CDSS的完全开放流水线,包含临床医生审计的训练语料库、可复现的数据构建和训练框架,以及使用对齐的评估协议。该语料库将八个公共医学QA数据集统一为标准化对话格式,并通过三个经临床医生审查的合成扩展扩展了覆盖范围:考试式QA、源自46,469个临床实践指南的指南基础QA以及临床小插曲。该流水线强制执行系统级去污染、教师生成的金标签重采样以及由四位医生小组进行的端到端验证。我们使用LLM-as-a-judge协议对专家撰写的临床小插曲进行评估,并针对204名人类评分者进行校准。我们将该配方应用于五个FO基础模型(Apertus-70B/8B-Instruct、OLMo-2-32B-SFT、EuroLLM-22B/9B-Instruct)。所有MeditronFO变体均优于其基础模型。Apertus-70B-MeditronFO在综合医学基准上比其基础模型提高了+6.6个百分点(从47.2%到53.8%),建立了新的FO SoTA。Gemma-3-27B-MeditronFO在58.6%的LLM-as-a-judge比较中优于MedGemma,并在HealthBench上表现更优(58% vs 55.9%)。这些结果表明,完全开放的流水线可以在不牺牲可审计性或可复现性的情况下实现最先进的领域特定性能。

英文摘要

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

2602.12005 2026-06-01 cs.CL 版本更新

LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

LaCy: 小型语言模型能学且应学的不仅仅是损失问题

Szilvia Ujváry, Louis Béthune, Pierre Ablin, João Monteiro, Marco Cuturi, Michael Kirchhof

发表机构 * Apple(苹果公司) University of Cambridge(剑桥大学)

AI总结 研究在预训练中,小型语言模型(SLM)应学习哪些token以及应通过<CALL>委托哪些token,提出结合损失和事实性信号的LaCy方法,提升SLM在级联生成中的事实准确性。

Comments 40 pages, 26 figures, 10 tables, preprint. v3-v4: new results for RAG, ablations and additional analysis

详情
AI中文摘要

语言模型不断增长以将更多世界知识压缩到其参数中,但可预训练到其中的知识受参数规模上限约束。尤其是小型语言模型(SLM)容量有限,导致事实性错误生成。通常通过让SLM访问外部源(如查询更大模型、文档或数据库)来缓解此问题。在此设置下,我们研究基本问题:预训练期间SLM可以且应该学习哪些token,以及哪些应通过<CALL> token委托。我们发现这不仅仅是损失问题:尽管损失可预测预测token是否与真实值不匹配,但不足以识别哪些预测实际上会导致事实性或语义无效的延续。一些高损失token对应预训练文档中可接受的替代延续,因此不应触发<CALL>。这表明可学习性不能仅从损失表征,而需要关于token在句子中角色的额外领域特定信号。在类似维基百科的领域中,我们展示用spaCy解析器的轻量级语法信息增强损失信号可显著改善委托决策。基于此洞察,我们提出LaCy,一种新颖的预训练方法,结合损失与事实性信号以决定SLM应学习哪些token。实验表明,LaCy模型成功学习预测哪些token以及何时请求帮助。这在与更大模型级联生成时获得更高FactScore,且优于Rho或LLM-judge训练的SLM,同时更简单更廉价。

英文摘要

Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a <CALL> token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, it is insufficient for identifying which predictions would actually lead to factually or semantically invalid continuations. Some high-loss tokens correspond to acceptable alternative continuations of a pretraining document and therefore should not trigger a <CALL>. This suggests that learnability cannot be characterized from loss alone, but requires additional domain-specific signals about the role of a token in the sentence. In Wikipedia-like domains, we show that augmenting the loss signal with lightweight grammatical information from a spaCy parser substantially improves delegation decisions. Based on this insight, we propose LaCy, a novel pretraining method that combines loss with factuality signals to decide which tokens an SLM should learn. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and when to call for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.

2602.00747 2026-06-01 cs.CL cs.AI 版本更新

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

将搜索与训练解耦:通过模型合并实现大规模语言模型预训练的数据混合缩放

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

发表机构 * NLP Team, Xiaohongshu Inc., Shanghai, China(小红书自然语言处理团队,小红书公司,上海,中国) Tsinghua University, Beijing, China(清华大学,北京,中国) The University of Tokyo, Tokyo, Japan(东京大学,东京,日本)

AI总结 提出DeMix框架,通过模型合并预测最优数据配比,在降低搜索成本的同时提升基准性能。

Comments 18 pages, 5 figures, accepted at ICML 2026

详情
AI中文摘要

确定有效的数据混合是大语言模型(LLM)预训练的关键因素,模型必须在通用能力与数学、代码等困难任务的专业性之间取得平衡。然而,识别最优混合仍然是一个开放挑战,现有方法要么依赖不可靠的小规模代理实验,要么需要代价高昂的大规模探索。为此,我们提出“将搜索与训练解耦混合”(DeMix),一种利用模型合并预测最优数据配比的新框架。DeMix不是为每个采样的混合训练代理模型,而是按规模在候选数据集上训练组件模型,并通过加权模型合并推导数据混合代理。这种范式将搜索与训练成本解耦,使得无需额外训练负担即可评估无限采样的混合,从而通过更多搜索试验促进更好的混合发现。大量实验表明,DeMix打破了充分性、准确性和效率之间的权衡,以更低的搜索成本获得更高基准性能的最优混合。此外,我们发布了DeMix语料库,一个包含高质量预训练数据和已验证混合的综合22T令牌数据集,以促进开放研究。我们的代码和DeMix语料库可在https://github.com/Lucius-lsr/DeMix获取。

英文摘要

Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
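The core idea, using a weighted merge of per-dataset component models as a stand-in for a model trained on the corresponding data mixture, can be sketched as below. This is a toy under stated assumptions: models are plain dictionaries of NumPy arrays, candidate ratios are drawn from a Dirichlet distribution, and `evaluate` is a hypothetical user-supplied benchmark function. None of these names come from the DeMix code base.

```python
import numpy as np

def merge_models(components, weights):
    """Weighted parameter average of component models (dicts of same-shaped arrays)."""
    weights = np.asarray(weights, dtype=float)
    if weights.min() < 0 or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return {name: sum(w * comp[name] for w, comp in zip(weights, components))
            for name in components[0]}

def search_mixture(components, evaluate, n_trials=200, seed=0):
    """Score many candidate ratios using merged proxies; no training inside the loop."""
    rng = np.random.default_rng(seed)
    best_ratio, best_score = None, -np.inf
    for ratio in rng.dirichlet(np.ones(len(components)), size=n_trials):
        score = evaluate(merge_models(components, ratio))
        if score > best_score:
            best_ratio, best_score = ratio, score
    return best_ratio, best_score

# Toy components: one "model" per candidate dataset (hypothetical values).
general = {"w": np.array([0.0, 4.0])}
math_code = {"w": np.array([4.0, 0.0])}

# Hypothetical benchmark: higher is better, peaks at an even blend of the two.
target = np.array([2.0, 2.0])
def evaluate(model):
    return -float(np.sum((model["w"] - target) ** 2))

ratio, score = search_mixture([general, math_code], evaluate)
print("best ratio:", np.round(ratio, 3), "score:", round(score, 4))
```

Each trial costs one merge plus one evaluation, so the number of trials can grow without any further training, which is the decoupling the abstract describes.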

2601.15197 2026-06-01 cs.AI cs.CL cs.CV cs.RO 版本更新

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

LangForce: 通过潜在动作查询对视觉语言动作模型进行贝叶斯分解

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

发表机构 * Huazhong University of Science and Technology(华中科技大学) Beijing Zhongguancun Academy(北京中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Harbin Institute of Technology(哈尔滨工业大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Zhengzhou University(郑州大学) Beihang University(北京航空航天大学) East China Normal University(华东师范大学) DeepCybot Co., Ltd.(DeepCybot有限公司)

AI总结 针对VLA模型在训练中因数据偏差导致语言信息被忽略的问题,提出LangForce框架,通过贝叶斯分解和潜在动作查询构建双分支架构,最大化动作与指令的点互信息,无需新数据即可显著提升泛化能力。

Comments ICML 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中显示出潜力,但往往难以泛化到新指令或复杂的多任务场景。我们识别出当前训练范式中的一个关键病理:目标驱动的数据收集造成了数据集偏差。在此类数据集中,仅凭视觉观察就能高度预测语言指令,导致指令与动作之间的条件互信息消失,我们将此现象称为信息崩溃。因此,模型退化为忽略语言约束的纯视觉策略,并在分布外(OOD)设置中失败。为解决此问题,我们提出LangForce,一种通过贝叶斯分解强制执行指令跟随的新框架。通过引入可学习的潜在动作查询,我们构建了一个双分支架构,用于估计纯视觉先验 $p(a \mid v)$ 和语言条件后验 $π(a \mid v, \ell)$。然后我们优化策略以最大化动作与指令之间的条件点互信息(PMI)。该目标有效惩罚了视觉捷径,并奖励明确解释语言命令的动作。无需新数据,LangForce显著提升了泛化能力。在SimplerEnv和RoboCasa上的大量实验证明了显著改进,包括在具有挑战性的OOD SimplerEnv基准上提升11.3%,验证了我们的方法在动作中稳健地锚定语言的能力。

英文摘要

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
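The objective and the collapse argument can be spelled out with the abstract's own symbols. What follows is a reconstruction from standard definitions, not a formula quoted from the paper; $A$, $L$, $V$ denote the random action, instruction, and visual observation, and the expectation identity assumes $π$ is the true conditional with $p(a \mid v)$ as its marginal.

```latex
\begin{align*}
\mathrm{PMI}(a; \ell \mid v)
  &= \log \frac{\pi(a \mid v, \ell)}{p(a \mid v)}
   = \log \pi(a \mid v, \ell) - \log p(a \mid v), \\
I(A; L \mid V)
  &= \mathbb{E}\big[ \mathrm{PMI}(A; L \mid V) \big]
   \;\le\; H(L \mid V).
\end{align*}
```

The inequality shows why the bias matters: if the instruction is almost fully predictable from the image, $H(L \mid V)$ is close to zero, so the conditional mutual information between actions and instructions is forced toward zero as well. Maximizing the PMI term rewards actions that are more likely under the language-conditioned branch than under the vision-only prior.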

2605.11336 2026-06-01 cs.IR cs.AI cs.CL cs.HC 版本更新

Much of Geospatial Web Search Is Beyond Traditional GIS

大部分地理空间网络搜索超越了传统GIS

Ilya Ilyankou, Stefano Cavazzi, James Haworth

发表机构 * SpaceTimeLab(空间时间实验室) Department of Civil, Environmental, and Geomatic Engineering(土木、环境与测绘工程系) UCL(伦敦大学学院)

AI总结 通过密集句子嵌入、SetFit分类器和密度聚类,在MS MARCO语料库中发现18%的查询具有地理空间性质,并构建了88类分类体系,揭示地理搜索以事务性和实用性查询为主,多数超出传统GIS和知识图谱范围。

详情
AI中文摘要

网络搜索查询涉及地点的频率远高于现有标注方案所表明的,然而地理空间网络搜索查询的景观——人们对地点的询问内容及其频率——在大规模上仍然缺乏特征描述。我们对包含101万条真实必应查询的完整MS MARCO语料库应用密集句子嵌入、轻量级SetFit分类器和基于密度的聚类,无需预先过滤地名或空间关键词,识别出181,827条地理空间查询(18.0%),几乎是原始标注中标记为“位置”的6.17%的三倍。由此产生的88个查询类别分类体系揭示,地理空间网络搜索以事务性和实用性查询为主:仅成本和价格就占地理空间查询的15.3%,几乎是整个自然地理主题规模的两倍。这些活动中的大部分——成本、营业时间、联系方式、天气、旅行推荐——超出了传统GIS和知识图谱旨在服务的范围。这些类别在它们所接受的答案类型上差异很大,从可由空间数据库或知识图谱回答的确定性查询,到需要生成式或实时系统的评估性或时间波动性查询。我们讨论了对混合检索架构以及大型语言模型中地理推理基准的启示。我们公开发布了标注数据集、分类器和分类体系。

英文摘要

Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We apply dense sentence embeddings, a lightweight SetFit classifier, and density-based clustering to the full MS MARCO corpus of 1.01 million real Bing queries without prior filtering for toponyms or spatial keywords, identifying 181,827 geospatial queries (18.0%), nearly threefold the 6.17% labelled as Location in the original annotations. The resulting taxonomy of 88 query categories reveals that geospatial web search is dominated by transactional and practical lookups: costs and prices alone account for 15.3% of geospatial queries, nearly twice the size of the entire physical geography theme. Much of this activity - costs, opening hours, contact details, weather, travel recommendations - falls outside the scope of what traditional GIS and knowledge graphs are built to serve. The categories vary substantially in the kind of answer they admit, from deterministic lookups answerable from spatial databases or knowledge graphs to evaluative or temporally volatile queries that require generative or real-time systems. We discuss implications for hybrid retrieval architectures and for benchmarks of geographic reasoning in large language models. We openly release the labelled dataset, classifier, and taxonomy.

2602.03216 2026-06-01 cs.CL cs.LG 版本更新

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

Token Sparse Attention: 交错令牌选择的高效长上下文推理

Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

发表机构 * Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea(电气与计算机工程系,首尔国立大学,首尔,韩国)

AI总结 提出Token Sparse Attention,一种轻量级动态令牌级稀疏化机制,通过交错选择令牌并在注意力前后压缩/解压缩,实现高效长上下文推理,在128K上下文中获得高达3.23倍加速且精度损失小于1%。

Comments ICML 2026

详情
AI中文摘要

注意力的二次复杂度仍然是大语言模型长上下文推理的核心瓶颈。先前的加速方法要么使用结构化模式稀疏化注意力图,要么在特定层永久驱逐令牌,这可能会保留不相关的令牌或依赖不可逆的早期决策,尽管令牌重要性具有层/头动态性。在本文中,我们提出Token Sparse Attention,一种轻量级动态令牌级稀疏化机制,在注意力期间将每个头的$Q$、$K$、$V$压缩到减少的令牌集,然后将输出解压缩回原始序列,使得令牌信息可以在后续层中重新考虑。此外,Token Sparse Attention在令牌选择和稀疏注意力的交叉点上暴露了一个新的设计点。我们的方法完全兼容密集注意力实现,包括Flash Attention,并且可以无缝地与现有的稀疏注意力内核组合。实验结果表明,Token Sparse Attention持续改善精度-延迟权衡,在128K上下文中实现高达3.23倍的注意力加速,且精度下降小于1%。这些结果表明,动态和交错的令牌级稀疏化是可扩展长上下文推理的一种互补且有效的策略。

英文摘要

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves the accuracy-latency trade-off, achieving up to 3.23$\times$ attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.
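The compress, attend, decompress pattern can be illustrated with a small NumPy sketch. Several details are assumptions because the abstract does not specify them: the per-head importance `scores` are taken as given, a fixed number of tokens `keep` is retained per head, attention is non-causal, and positions that were not selected receive a zero attention output (in a full layer the residual stream would still carry them forward).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    """Reference dense attention for one head; q, k, v have shape (seq, dim)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def token_sparse_attention(q, k, v, scores, keep):
    """q, k, v: (heads, seq, dim); scores: (heads, seq); keep: tokens kept per head."""
    out = np.zeros_like(v)
    for h in range(q.shape[0]):
        # Select the `keep` highest-scoring tokens, restoring original order.
        idx = np.sort(np.argsort(scores[h])[-keep:])
        # Compress: gather the reduced token set for this head.
        qh, kh, vh = q[h, idx], k[h, idx], v[h, idx]
        # Any dense attention kernel can run on the reduced set.
        reduced = dense_attention(qh, kh, vh)
        # Decompress: scatter results back to their original positions.
        out[h, idx] = reduced
    return out

rng = np.random.default_rng(0)
heads, seq, dim = 2, 6, 4
q, k, v = (rng.normal(size=(heads, seq, dim)) for _ in range(3))
scores = rng.random((heads, seq))          # hypothetical importance scores
sparse_out = token_sparse_attention(q, k, v, scores, keep=3)
print(sparse_out.shape)
```

Because the output keeps the original sequence length, a later layer can select a different token set, which is the reversibility that permanent eviction lacks.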

2508.21762 2026-06-01 cs.CL cs.AI 版本更新

Reasoning-Intensive Regression

推理密集型回归

Diane Tchuindjo, Omar Khattab

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对推理密集型回归任务,提出MENTAT方法,结合批量反思提示优化与神经集成学习,在基准测试中相比基线提升高达65%。

详情
AI中文摘要

AI研究人员和从业者越来越多地将大型语言模型(LLMs)应用于我们称之为推理密集型回归(RiR)的任务,即从文本中推断细微的数值分数。与情感分析或相似性分析等标准语言回归任务不同,RiR通常出现在临时应用中,例如基于评分标准的评分、复杂环境中的密集奖励建模或特定领域的检索,这些任务需要对上下文进行更深入的分析,而可用的任务特定训练数据和计算资源有限。我们将四个实际问题作为RiR任务,建立初始基准,并用于测试我们的假设:即冻结的LLMs和通过梯度下降微调Transformer编码器在RiR中通常都会遇到困难。然后,我们提出MENTAT,一种简单轻量的方法,结合批量反思提示优化与神经集成学习。MENTAT在两个基线上实现了高达65%的提升,尽管未来仍有很大的改进空间。

英文摘要

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks such as sentiment or similarity analysis, RiR often appears instead in ad-hoc applications such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances.

2604.21928 2026-06-01 cs.CL 版本更新

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

使用生成式大语言模型评估自动语音识别

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour

发表机构 * Idiap Research Institute(Idiap研究所) EPFL(洛桑联邦理工学院) Brno University of Technology(布尔诺理工大学) Avignon University(阿维尼翁大学) Le Mans University(勒芒大学) Nantes University(南特大学)

AI总结 本文提出利用生成式大语言模型通过假设选择、语义距离计算和错误分类三种方法评估ASR,在HATS数据集上达到92-94%的人类一致性,优于WER和语义指标。

详情
AI中文摘要

自动语音识别(ASR)传统上使用词错误率(WER)进行评估,该指标对语义不敏感。基于嵌入的语义指标与人类感知的相关性更好,但基于解码器的大语言模型(LLM)在此任务中仍未得到充分探索。本文通过三种方法评估其相关性:(1)在两个候选假设中选择最佳假设,(2)使用生成式嵌入计算语义距离,以及(3)对错误进行定性分类。在HATS数据集上,最佳LLM在假设选择中与人类标注者的一致性达到92-94%,而WER仅为63%,并且也优于语义指标。来自基于解码器的LLM的嵌入显示出与编码器模型相当的性能。最后,LLM为可解释且语义化的ASR评估提供了有前景的方向。

英文摘要

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
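A short worked example shows why WER is insensitive to meaning. The function below is a standard word-level edit distance divided by the reference length; the sentences are invented for illustration and are not drawn from HATS.

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))        # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution or match
        prev = cur
    return prev[-1] / len(ref)

reference = "the cat is not here"
hyp_a = "the cat is here"        # drops "not": the meaning is reversed
hyp_b = "a cat is not here"      # swaps the article: the meaning is nearly intact

print(wer(reference, hyp_a), wer(reference, hyp_b))
```

Both hypotheses contain exactly one word-level error against a five-word reference, so WER cannot rank them, while a human (or a meaning-aware judge) would clearly prefer the second.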

2604.16278 2026-06-01 cs.AI cs.CL cs.LG 版本更新

Learning to Reason with Insight for Informal Theorem Proving

学习在非形式定理证明中进行洞察推理

Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, Linqi Song

发表机构 * City University of Hong Kong(香港城市大学) Tsinghua University(清华大学) Ke Holdings Inc.(Ke控股公司) Shenzhen University of Advanced Technology(深圳先进技术大学) Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 针对非形式定理证明中缺乏洞察(识别核心技巧)的瓶颈,提出统一训练框架DeepInsight,通过分层数据集、渐进式多阶段SFT和基于洞察的策略优化方法,显著提升大语言模型的数学推理能力。

详情
AI中文摘要

尽管大多数自动定理证明方法依赖于形式证明系统,但非形式定理证明能更好地发挥大语言模型(LLMs)在自然语言处理方面的优势。在这项工作中,我们识别出非形式定理证明的一个主要瓶颈是缺乏洞察,即难以识别解决复杂问题所需的核心技巧。为了解决这个问题,我们提出了$\texttt{DeepInsight}$,一个统一的训练框架,旨在培养这种基本的推理技能,并使LLMs能够进行洞察推理。我们的框架由三个部分组成:(1)$\texttt{DeepInsightTheorem}$,一个分层数据集,通过显式提取核心技巧和证明草图以及最终证明来结构化非形式证明;(2)渐进式多阶段SFT策略,模拟人类学习过程,教授模型证明写作、规划和洞察识别;(3)$\texttt{InsightPO}$,一种策略优化方法,在此洞察层次结构上分配结构化奖励。我们在具有挑战性的数学基准上的实验表明,这种洞察感知的生成策略显著优于基线。这些结果表明,教模型识别和应用核心技巧可以大幅提高其数学推理能力。

英文摘要

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose $\texttt{DeepInsight}$, a unified training framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. Our framework consists of three components: (1) $\texttt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof; (2) a Progressive Multi-Stage SFT strategy that mimics the human learning process, teaching the model proof writing, planning, and insight identification; and (3) $\texttt{InsightPO}$, a policy optimization method that assigns structured rewards over this insight hierarchy. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.

2603.12277 2026-06-01 cs.CL cs.AI cs.CR 版本更新

Prompt Injection as Role Confusion

提示注入作为角色混淆

Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过角色探测和CoT伪造攻击,揭示提示注入源于LLM对文本来源的角色感知混淆,并提出角色混淆程度可预测攻击成功率。

Comments ICML 2026

详情
AI中文摘要

LLM将世界视为单一的文本流,并划分为<user>或<tool>等角色。我们将提示注入追溯到角色混淆:模型根据文本听起来的方式而非其标记的角色来感知文本来源。隐藏在网页中的命令劫持了代理,仅仅因为它听起来像<user>文本,尽管其标签是<tool>。我们设计了角色探测器来测量LLM内部如何感知“谁在说话”,并发现注入的文本占据了与它所模仿的可信角色相同的表示空间。我们通过CoT伪造(一种零样本攻击)证明了这一点,该攻击将捏造的推理注入用户提示和工具输出中。模型将伪造内容误认为是自己的思维,导致对前沿模型的攻击成功率达到60%,而基线接近零。引人注目的是,角色混淆的程度可以在生成单个token之前预测攻击成功。这一机制超越了CoT伪造,适用于标准的代理提示注入,揭示了提示注入是角色感知的可测量后果。对模型而言,听起来像某个角色与成为该角色是无法区分的。

英文摘要

LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive "who is speaking," and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts and tool outputs. Models mistake the forgery for their own thoughts, yielding 60% attack success against frontier models with near-zero baselines. Strikingly, the degree of role confusion predicts attack success before a single token is generated. This mechanism generalizes beyond CoT Forgery to standard agent prompt injections, revealing prompt injection as a measurable consequence of role perception. To the model, sounding like a role is indistinguishable from being one.

2604.06484 2026-06-01 cs.CL 版本更新

ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

ValueGround: 评估多模态大语言模型中文化条件化的视觉价值基础

Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger

发表机构 * University of Technology Nuremberg (UTN)(纽伦堡工业大学)

AI总结 提出ValueGround基准,通过最小对比图像对评估多模态大语言模型在文化条件化视觉价值判断中的表现,发现模型在可视化选项下准确率显著低于文本选项。

Comments Updated preprint

详情
AI中文摘要

文化价值观不仅通过语言表达,还通过视觉场景和日常社会实践体现。然而,现有对语言模型中文化价值观的评估几乎完全是文本形式的,尚不清楚当响应选项可视化时,文化条件化的判断是否保持稳定。我们引入了ValueGround,一个用于评估多模态大语言模型(MLLMs)中文化条件化视觉价值基础的基准。ValueGround基于世界价值观调查问题,使用最小对比图像对来表示对立的响应选项,同时控制无关变量。给定一个国家、一个问题和一个图像对,模型必须选择最符合该国价值倾向的图像,而无法访问原始响应选项文本。在六个MLLM和13个国家上的实验表明,模型在可视化响应选项下的表现显著差于原始文本选项,平均准确率从72.8%下降到62.6%。我们的基准为研究文化条件化价值判断的跨模态迁移提供了一个受控测试平台。

英文摘要

Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, leaving it unclear whether culture-conditioned judgments remain stable when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country's value tendency without access to the original response-option texts. Experiments across six MLLMs and 13 countries show that models perform substantially worse with visualized response options than with the original textual options, with average accuracy dropping from 72.8% to 62.6%. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.

2604.10495 2026-06-01 cs.CL 版本更新

Why Don't You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLMs

为什么你不知道?评估不确定性来源对大型语言模型中不确定性量化的影响

Maiya Goloburda, Roman Vashurin, Fedor Chernogorskii, Nurkhan Laiyk, Daniil Orel, Preslav Nakov, Maxim Panov

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文通过引入一个明确分类不确定性来源的新数据集,系统评估了现有不确定性量化方法在不同不确定性来源下的表现,发现多数方法在模型知识局限下表现良好,但在其他来源下性能下降或产生误导。

详情
AI中文摘要

随着大型语言模型(LLM)在现实世界应用中的日益普及,可靠的不确定性量化(UQ)对于安全有效使用变得至关重要。现有的大多数语言模型UQ方法旨在产生单一的置信度分数——例如,估计模型答案正确的概率。然而,自然语言任务中的不确定性源于多个不同的来源,包括模型知识差距、输出可变性和输入歧义,这些对系统行为和用户交互有不同的影响。在这项工作中,我们研究了不确定性来源如何影响现有UQ方法的行为和有效性。为了进行受控分析,我们引入了一个新数据集,该数据集明确分类了不确定性来源,允许系统评估每种条件下的UQ性能。我们的实验表明,虽然许多UQ方法在不确定性仅源于模型知识限制时表现良好,但当引入其他来源时,它们的性能会下降或变得具有误导性。这些发现强调了需要明确考虑大型语言模型中不确定性来源的不确定性感知方法。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score -- for example, estimating the probability that a model's answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.

2509.10078 2026-06-01 cs.CL cs.AI 版本更新

Human Psychometric Questionnaires Mischaracterize LLM Behavior

人类心理测量问卷误判LLM行为

Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院) Department of Communication, Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学通信系人工智能交叉学科项目)

AI总结 通过比较LLM在Likert问卷和生成概率上的价值与人格特征,发现问卷存在系统性偏差,提出基于生成概率的评估方法更准确。

Comments 38 pages, 6 figures

详情
AI中文摘要

我们检验了人类心理测量问卷是否可以作为可靠工具来表征和预测LLM在日常用户交互中的行为。我们分析了八个开源LLM,比较了从两种不同方法得出的价值和人格特征:基于既定问卷(PVQ-40/21和BFI-44/10)的Likert自我报告,以及对日常用户查询的价值负载响应的生成概率。两种特征显著不同。在生成概率中,常被引为LLM稳定倾向证据的构念内项目一致性消失了。我们将这一差距归因于既定问卷项目中的显式词汇线索使模型能够识别目标构念并以一致、社会期望的方式响应,而现实用户查询不提供此类线索。此外,人口统计角色提示以与真实人类模式一致的方式改变了模型对人类问卷的响应,但在对现实用户查询的响应生成概率中没有出现此类变化,表明它们在模拟目标人口统计在真实世界用户交互中的行为方面能力有限。总体而言,我们的研究表明,人类心理测量问卷不足以预测LLM行为,并建议基于生成的评估作为更准确的度量。

英文摘要

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.

2603.23160 2026-06-01 cs.CL 版本更新

UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities

UniDial-EvalKit:多面对话能力评估的统一工具包

Qi Jia, Haodong Zhao, Dun Pei, Xiujie Song, Ye Shen, Shibo Wang, Zijian Chen, Zicheng Zhang, Xiangyang Zhu, Guangtao Zhai

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出UniDial-EvalKit统一评估工具包,通过标准化数据格式、模块化流水线和层次化评分聚合,解决多轮交互场景下评估协议异构问题,并基于大规模实验揭示当前系统无一致最优、记忆智能体常不及全上下文基线的现象。

详情
AI中文摘要

在多轮交互场景中对大型语言模型(LLM)和智能体进行基准测试对于理解其实际能力至关重要。然而,现有的评估协议高度异构,在数据集格式、模型接口和评估流水线上差异显著,严重阻碍了系统比较。在这项工作中,我们提出了UniDial-EvalKit(UDE),一个用于评估交互式AI系统的统一评估工具包。UDE的核心贡献在于其整体统一性:它将异构数据格式标准化为通用模式,通过模块化架构简化复杂的评估流水线,并在层次化评分聚合下对齐指标计算。它还通过并行生成和评分以及检查点恢复来支持高效的大规模评估,消除冗余计算。利用UDE,我们在多个多维基准上进行了广泛评估。我们的实证分析表明,没有单一系统在所有基准上持续优于其他系统,而当前的记忆智能体通常无法超越全上下文基线。进一步的分析指出了几个未来方向,包括基准去重和更自适应的记忆架构。

英文摘要

Benchmarking large language models (LLMs) and agents in multi-turn interactive scenarios is essential for understanding their practical capabilities. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which severely impedes systematic comparison. In this work, we present UniDial-EvalKit (UDE), a unified evaluation toolkit for assessing interactive AI systems. The core contribution of UDE lies in its holistic unification: it standardizes heterogeneous data formats into a universal schema, streamlines complex evaluation pipelines through a modular architecture, and aligns metric calculations under a hierarchical scoring aggregation. It also supports efficient large-scale evaluation through parallel generation and scoring, as well as checkpoint resume to eliminate redundant computation. Leveraging UDE, we conduct an extensive evaluation across diverse multi-dimensional benchmarks. Our empirical analysis shows that no single system consistently outperforms others across all benchmarks, while current memory agents often fail to surpass full-context baselines. Further analyses highlight several future directions, including benchmark deduplication and more adaptive memory architectures.

2511.11440 2026-06-01 cs.CV cs.CL 版本更新

Synthetic Stimuli, Real Gains: Rethinking VLM Fine-Tuning Through Fully Controlled Data Generation

合成刺激,真实收益:通过完全受控的数据生成重新思考VLM微调

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

发表机构 * Signals and Interactive Systems Lab, University of Trento(信号与交互系统实验室,特伦托大学)

AI总结 本文提出一种完全受控的数据生成与标注流程,用于微调视觉语言模型(VLM),通过平衡分布和干净标注消除偏差,在空间推理任务上仅用130个样本即可实现均匀性能,并在真实世界数据上提升13%的性能。

详情
AI中文摘要

通过微调获得的视觉语言模型(VLM)的性能提升通常基于对真实世界场景的临时数据收集和标注。尽管有所改进,但这一过程往往容易受到偏差、错误和分布不平衡的影响,导致过拟合和性能不平衡。虽然少数研究探索了合成数据生成,但它们通常缺乏对数据分布和标注质量的控制。在这项工作中,我们通过探索完全受控的数据生成和标注流程,重新评估了模型微调的潜力,获得了具有平衡分布和干净标注的无偏差数据。以识别物体绝对位置的空间推理任务作为用例,我们微调了最先进的VLM,并在合成和真实世界基准上进行了详尽的评估,包括对真实世界场景的可迁移性。我们的实验揭示了两个关键发现:1)在平衡数据上微调可以在视觉场景中产生均匀的性能,并且仅用130个样本就能缓解常见偏差;2)在合成刺激上微调使真实世界数据(COCO)的性能提升了13%,优于在完整COCO训练集上微调的模型。

英文摘要

Performance gains of Vision Language Models (VLMs) obtained by fine-tuning are generally based on ad hoc data collection and annotation of real-world scenes. Despite the improvements, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have explored synthetic data generation, they typically lack control over data distribution and annotation quality. In this work, we re-evaluate the potential of model fine-tuning by exploring a fully controlled data generation and annotation pipeline, obtaining bias-free data with balanced distribution and clean annotations. Using the spatial reasoning task of identifying the absolute position of an object as a use case, we fine-tune state-of-the-art VLMs and conduct exhaustive evaluations on both synthetic and real-world benchmarks, including transferability to real-world scenes. Our experiments reveal two key findings: 1) fine-tuning on balanced data yields uniform performance across the visual scene and mitigates common biases with as few as 130 samples; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.

2510.25110 2026-06-01 cs.CL 版本更新

DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

DEBATE:用于评估角色扮演LLM代理中观点动态的大规模基准

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, You Li, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Google DeepMind(谷歌DeepMind) Stanford University(斯坦福大学)

AI总结 提出DEBATE基准,通过多轮公共消息和Likert量表信念数据,评估多代理角色扮演LLM模拟中观点动态的真实性,发现零样本设置下代理组过度收敛,而监督微调可改善立场对齐并减少组级收敛误差。

详情
AI中文摘要

准确建模通过社交互动产生的观点变化对于理解和缓解极化、错误信息及社会冲突至关重要。近期工作使用角色扮演LLM代理(RPLA)模拟观点动态,但多代理模拟常表现出不自然的群体行为(如过早收敛),且缺乏评估与真实人类群体互动一致性的经验基准。我们提出DEBATE,一个大规模基准,用于评估多代理RPLA模拟中观点动态的真实性。DEBATE包含来自美国参与者在107个话题上的多轮公共消息和私有Likert量表信念;实验中使用的清理基准包含697个组的2,788名参与者,支持在话语和组级别进行评估,并为未来个体级别分析提供支持。我们使用七个LLM实例化“数字孪生”RPLA,并在两种设置(下一消息预测和完整动态模拟)下,使用基于立场的观点动态指标进行评估。在零样本设置中,RPLA组相对于人类组表现出强烈的观点收敛。在保留组划分上,对Llama-3.1-8B-Instruct进行监督微调(SFT)改善了辅助立场对齐并减少了组级收敛误差,尽管在观点变化和信念更新方面仍存在差异。DEBATE能够对模拟观点动态进行严格基准测试,并支持未来关于使多代理RPLA与真实人类互动对齐的研究。

英文摘要

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior, such as premature convergence, and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains multi-round public messages and private Likert-scale beliefs from U.S.-based participants across 107 topics; the cleaned benchmark used in our experiments contains 2,788 participants in 697 groups, enabling evaluation at the utterance and group levels and supporting future individual-level analyses. We instantiate "digital twin" RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full dynamics simulation, using stance-based opinion-dynamics metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. On the held-out group split, supervised fine-tuning (SFT) for Llama-3.1-8B-Instruct improves auxiliary stance alignment and reduces group-level convergence error, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.

2603.19262 2026-06-01 cs.CL cs.AI 版本更新

Empirical Characterization of Inference-Time Elicited Probability Transformations in Large Language Models

大型语言模型中推理时引发的概率变换的经验表征

Mike Farmer, Abhinav Kochar, Yugyung Lee

发表机构 * Bloch School of Management, Regnier Institute for Entrepreneurship & Innovation, University of Missouri–Kansas City(布洛赫管理学院、雷尼创业与创新研究所、密苏里大学堪萨斯城分校) Department of Computer Science, School of Science and Engineering, University of Missouri–Kansas City(计算机科学系、科学与工程学院、密苏里大学堪萨斯城分校)

AI总结 本研究通过经验观察发现,在多种推理时流程(如思维链、自我细化、检索增强和验证器引导修订)下,候选答案的概率变换遵循近似的对数比率关系,并分析了其系数变化和鲁棒性。

Comments 22 pages, 11 figures, 5 tables

详情
AI中文摘要

大型语言模型越来越依赖推理时程序,如思维链推理、自我细化、检索增强和验证器引导修订,但这些程序下引发的概率变换结构仍不清楚。我们研究外部引发的候选答案概率分配,并观察到重复出现的近似对数比率关系:\[ \log \tilde q_t(i) = \alpha_t \left( \log q_t(i) + \log b_t(i) \right) + c_t, \] 其中 $q_t$ 和 $\tilde q_t$ 分别是引发前和引发后的概率,$b_t$ 是外部构建的证据信号,$\alpha_t$ 是提示配置的经验描述符。在来自 GPQA Diamond、TheoremQA、MMLU-Pro 和 ARC-Challenge 的 4,975 个推理问题上,对多个指令微调模型系列进行评估,我们在约 $1.3 \times 10^5$ 个候选级观测上观察到近似对数比率关系,平均 $R^2 \approx 0.76$。系数在不同引发设置下变化,但定性相似的关系在评估条件下持续存在。使用替代统计表示、提示配置、保留评估和 token 级对数概率的鲁棒性分析表明,观察到的结构不依赖于特定的提示程序或概率估计方法。主要贡献不是代数形式本身(它与广义贝叶斯更新和概率变换框架相关),而是经验观察:在受控条件下,多样化的推理时提示流程反复表现出可复现的对数比率结构。该框架为分析推理时 LLM 流程中的校准、证据放大、不确定性传播和交互敏感性提供了协议敏感的视角。

英文摘要

Large language models increasingly rely on inference-time procedures such as chain-of-thought reasoning, self-refinement, retrieval augmentation, and verifier-guided revision, yet the structure of elicited probability transformations under these procedures remains poorly understood. We study externally elicited probability assignments over candidate answers and observe recurring approximate log-ratio relationships: \[ \log \tilde q_t(i) = \alpha_t \left( \log q_t(i) + \log b_t(i) \right) + c_t, \] where $q_t$ and $\tilde q_t$ are pre- and post-elicitation probabilities, $b_t$ is an externally constructed evidence signal, and $\alpha_t$ is an empirical descriptor of the prompting configuration. Across 4,975 reasoning problems from GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge, evaluated on multiple instruction-tuned model families, we observe approximate log-ratio relationships with mean $R^2 \approx 0.76$ over about $1.3 \times 10^5$ candidate-level observations. Coefficients vary across elicitation settings, but qualitatively similar relationships persist across evaluated conditions. Robustness analyses using alternative statistical representations, prompting configurations, held-out evaluation, and token-level log-probabilities suggest that the observed structure is not tied to one prompting procedure or probability estimation method. The main contribution is not the algebraic form itself, which is related to generalized Bayesian updating and probability-transformation frameworks, but the empirical observation that diverse inference-time prompting pipelines repeatedly exhibit reproducible log-ratio structure under controlled conditions. The framework provides a protocol-sensitive perspective for analyzing calibration, evidence amplification, uncertainty propagation, and interaction sensitivity in inference-time LLM pipelines.
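
The log-ratio relationship in this abstract is linear in log space, so its coefficients can be estimated by ordinary least squares. The sketch below is only an illustration under stated assumptions: the function name `fit_log_ratio` and the synthetic values are hypothetical, the post-elicitation values are left unnormalized, and the paper's actual estimation procedure may differ.

```python
import math

def fit_log_ratio(q, b, q_tilde):
    """Fit log q_tilde = alpha * (log q + log b) + c by ordinary least squares."""
    x = [math.log(qi) + math.log(bi) for qi, bi in zip(q, b)]
    y = [math.log(ti) for ti in q_tilde]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    alpha = sxy / sxx                 # slope: empirical descriptor alpha_t
    c = mean_y - alpha * mean_x       # intercept: offset c_t
    ss_res = sum((yi - (alpha * xi + c)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot        # goodness of fit
    return alpha, c, r2

# Synthetic check (hypothetical numbers): build post-elicitation values
# from a known alpha and c, then recover them.
q = [0.1, 0.2, 0.3, 0.4]              # pre-elicitation probabilities
b = [0.5, 0.1, 0.3, 0.1]              # evidence signal
true_alpha, true_c = 1.5, -0.2
q_tilde = [math.exp(true_alpha * (math.log(qi) + math.log(bi)) + true_c)
           for qi, bi in zip(q, b)]
alpha, c, r2 = fit_log_ratio(q, b, q_tilde)
print(alpha, c, r2)
```

On real elicited probabilities the fit would be noisy, which is why the abstract reports a mean $R^2$ rather than an exact relationship.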

2603.17306 2026-06-01 cs.CL q-bio.NC 版本更新

Evidence for systematic semantic structure in individual phonemes

单个音素中系统性语义结构的证据

Gexin Zhao

发表机构 * Columbia University(哥伦比亚大学)

AI总结 本研究通过大型语言模型、跨语言听者实验和发音特征分析,证明英语单个音素携带结构化的多维语义轮廓,挑战了音义关系任意性的传统假设。

Comments 31 pages, 4 figures

详情
AI中文摘要

语言学的一个基本假设认为,声音与意义之间的关系在很大程度上是任意的。这里我们表明,这一假设在单个音素层面上不成立:每个英语音素都携带一个结构化的、多维的语义轮廓,该轮廓可从文本中恢复、跨语言感知,并以发音为基础。三个大型语言模型独立检测到220对字母对比中九个感知维度上的一致语义结构。以英语为母语者(N=93)在一项预先注册的强制选择任务中确认了这些关联(与模型预测的一致性为85.3%),而五种类型学上不同语言的听者(N=155)在音频呈现下复制了该效应(准确率73.2%-81.9%)。发音特征以交叉验证的R²为0.56-0.98预测了该结构,表明发出声音的身体行为系统地塑造了其所传达的意义。这些发现将音素层面的象似性重新定义为音系系统中一种普遍的、具身的属性。

英文摘要

A foundational assumption in linguistics holds that sound-meaning relations are largely arbitrary. Here we show that this assumption fails at the level of individual phonemes: each English phoneme carries a structured, multidimensional semantic profile that is recoverable from text, perceived across languages, and grounded in articulation. Three large language models independently detected consistent semantic structure across nine perceptual dimensions in 220 pairwise letter contrasts. Native English speakers (N = 93) confirmed these associations in a preregistered forced-choice task (85.3% agreement with model predictions), and listeners of five typologically diverse languages (N = 155) replicated the effect under audio presentation (73.2%-81.9% accuracy). Articulatory features predicted the structure with cross-validated R^2 of 0.56-0.98, indicating that the bodily act of producing a sound systematically shapes the meaning it conveys. These findings reframe phoneme-level iconicity as a pervasive, embodied property of the phonological system.

2601.05770 2026-06-01 cs.LG cs.CL 版本更新

Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

权重到代码:从离散Transformer中提取可解释算法

Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

发表机构 * Key Laboratory of High Confidence Software Technology (PKU), MOE, Beijing, China(高可信软件技术重点实验室(PKU),教育部,北京,中国) School of Computer Science, Peking University, Beijing, China(计算机学院,北京大学,北京,中国) Shanghai AI Lab, Shanghai, China(上海人工智能实验室,上海,中国) Shanghai Innovation Institute, Shanghai, China(上海创新研究院,上海,中国)

AI总结 提出离散Transformer架构,通过温度退火采样注入离散性,结合假设检验和符号回归从模型权重中提取可解释算法,在离散任务上性能与RNN基线相当,并扩展到连续中间计算任务。

详情
AI中文摘要

算法提取旨在直接从算法任务训练的模型中合成可执行程序,从而无需依赖人工编写的目标程序即可从权重中重新发现可执行机制。然而,将此范式应用于Transformer时,由于表示纠缠(例如叠加),其中重叠方向编码的特征严重阻碍了符号表达式的恢复。我们提出了离散Transformer,这是一种专门设计用于弥合连续表示与离散符号逻辑之间差距的架构。通过温度退火采样注入离散性,我们的框架有效利用假设检验和符号回归来提取人类可读的程序。实验表明,离散Transformer在共享离散任务上实现了与基于RNN的MIPS基线相当的性能,同时将提取扩展到具有连续值中间计算的任务。最后,我们展示了架构归纳偏置对合成程序提供了细粒度控制,使离散Transformer成为算法提取和Transformer可解释性的可控测试平台。

英文摘要

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo recovery of executable mechanisms from weights without relying on human-written target programs. However, applying this paradigm to Transformers is complicated by representation entanglement (e.g., superposition), where features encoded in overlapping directions substantially hinder the recovery of symbolic expressions. We propose the Discrete Transformer, an architecture explicitly designed to bridge the gap between continuous representations and discrete symbolic logic. By injecting discreteness through temperature-annealed sampling, our framework effectively leverages hypothesis testing and symbolic regression to extract human-readable programs. Empirically, the Discrete Transformer achieves performance comparable to the RNN-based MIPS baseline on shared discrete tasks, while broadening extraction to tasks with continuous-valued intermediate computations. Finally, we show that architectural inductive biases provide fine-grained control over synthesized programs, establishing the Discrete Transformer as a controllable testbed for algorithm extraction and Transformer interpretability.
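
The abstract does not say how temperature-annealed sampling is implemented. One common way to inject discreteness is a Gumbel-softmax relaxation whose temperature decays during training, and the sketch below assumes that technique. The function names and the exponential schedule are hypothetical, not the authors' method.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample; lower temperature gives a sharper vector."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noisy.append(logit - math.log(-math.log(u)))    # add Gumbel(0, 1) noise
    scaled = [z / temperature for z in noisy]
    top = max(scaled)                                   # subtract max for stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def annealed_temperature(step, total_steps, t_start=5.0, t_end=0.05):
    """Exponential decay from t_start to t_end (a hypothetical schedule)."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac

logits = [1.0, 0.5, -0.3]
for step in (0, 5, 9):
    t = annealed_temperature(step, 10)
    # Re-seeding gives the same noise at every step, isolating the temperature effect.
    sample = gumbel_softmax(logits, t, random.Random(0))
    print(round(t, 3), [round(p, 3) for p in sample])
```

With the noise held fixed, the largest entry of the sample can only grow as the temperature falls, so late in the schedule the output is close to a discrete choice.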

2603.13875 2026-06-01 cs.CL cs.LG 版本更新

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

GradMem: 通过测试时梯度下降将上下文写入记忆

Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev

发表机构 * AXXX, Cognitive AI Systems Lab, Moscow, Russia(AXXX认知人工智能系统实验室,莫斯科,俄罗斯) London Institute for Mathematical Sciences, London, UK(伦敦数学科学研究所,伦敦,英国)

AI总结 提出GradMem方法,利用测试时梯度下降将上下文写入紧凑记忆状态,通过自监督重构损失优化记忆令牌,在键值检索和自然语言任务上优于前向式记忆写入方法。

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

许多大型语言模型应用需要基于长上下文进行条件生成。Transformer通常通过存储每层过去激活的KV缓存来支持这一点,这会产生大量内存开销。一种理想的替代方案是压缩记忆:一次性读取上下文,将其存储在紧凑状态中,并从该状态回答许多查询。我们在上下文移除设置中研究这一点,其中模型在推理时无法访问原始上下文的情况下必须生成答案。我们引入了GradMem,它通过每个样本的测试时优化将上下文写入记忆。给定一个上下文,GradMem在保持模型权重冻结的情况下,对一小部分前缀记忆令牌执行几步梯度下降。GradMem显式优化模型级的自监督上下文重构损失,从而产生带有迭代纠错的损失驱动写入操作,这与仅前向方法不同。在关联键值检索中,GradMem在相同记忆大小下优于仅前向记忆写入器,并且额外的梯度步长比重复的前向写入更有效地扩展容量。我们进一步表明,GradMem可以迁移到合成基准之外:使用预训练语言模型,它在自然语言任务(包括bAbI和SQuAD变体)上取得了有竞争力的结果,仅依赖于记忆中的编码信息。

英文摘要

Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
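
The write operation described here (gradient descent on memory while the model stays frozen) can be illustrated with a deliberately tiny stand-in. In this sketch a fixed linear map plays the role of the frozen model and a two-element vector plays the role of the memory tokens. All names and numbers are hypothetical, and a real implementation would backpropagate through a language model instead.

```python
def reconstruct(weights, memory):
    """Frozen linear 'decoder': maps the memory vector back to context features."""
    return [sum(w * m for w, m in zip(row, memory)) for row in weights]

def reconstruction_loss(weights, memory, context):
    """Squared error between the reconstruction and the original context."""
    return sum((r - x) ** 2 for r, x in zip(reconstruct(weights, memory), context))

def write_memory(weights, context, steps=50, lr=0.1):
    """Gradient descent on the memory only; the weights are never updated."""
    memory = [0.0] * len(weights[0])
    for _ in range(steps):
        residual = [r - x for r, x in zip(reconstruct(weights, memory), context)]
        # Gradient of the squared error with respect to each memory entry.
        grad = [2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
                for j in range(len(memory))]
        memory = [m - lr * g for m, g in zip(memory, grad)]
    return memory

weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # frozen toy model
context = [1.0, 2.0, 3.0]                       # features to store
loss_before = reconstruction_loss(weights, [0.0, 0.0], context)
memory = write_memory(weights, context)
loss_after = reconstruction_loss(weights, memory, context)
print(loss_before, loss_after)
```

Each step corrects part of the remaining reconstruction error, which is the iterative error correction that a single forward write lacks.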

2602.12192 2026-06-01 cs.CL 版本更新

Query-focused and Memory-aware Reranker for Long Context Processing

面向长上下文处理的查询聚焦与记忆感知重排序器

Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Yanyu Chen, Zheng Lin, Wei Zhang, Jie Zhou

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) Pattern Recognition Center, WeChat AI, Tencent(腾讯微信人工智能研究院模式识别中心) East China Normal University(华东师范大学)

AI总结 提出一种基于大语言模型中检索头注意力分数的列表式重排序框架,利用候选列表整体信息估计段落-查询相关性,无需Likert监督,在多个领域取得最优性能。

Comments Add new experiments and compare more baselines

详情
AI中文摘要

基于现有对大语言模型中检索头的分析,我们提出了一种替代的重排序框架,该框架训练模型使用选定头的注意力分数来估计段落-查询相关性。这种方法提供了一种列表式解决方案,在排序过程中利用整个候选短列表中的整体信息。同时,它自然地产生连续的相关性分数,使得能够在任意检索数据集上进行训练,而无需Likert量表监督。我们的框架轻量且有效,仅需小规模模型(如3B参数)即可实现强性能。大量实验表明,我们的方法在多个领域(包括维基百科和长叙事数据集)中优于现有的最先进点式和列表式重排序器。它进一步在评估对话理解和记忆使用的LoCoMo基准上建立了新的最先进水平。我们还证明我们的框架支持灵活的扩展。例如,用上下文信息增强候选段落可进一步提高排序准确性,而训练中间层的注意力头可在不牺牲性能的情况下提高效率。

英文摘要

Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages the holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models, such as 3B parameters, to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark, which assesses dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
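
As a rough illustration of scoring passages from attention, the sketch below averages the attention mass that query tokens place on each candidate passage, over a chosen set of heads. The data layout, the function name `passage_scores`, and the averaging rule are assumptions made for this example. The abstract does not specify how head scores are aggregated or how the model is trained.

```python
def passage_scores(attn, passage_spans, selected_heads):
    """Score each passage by mean attention mass from query tokens.

    attn[h][i][j] is the attention weight from query token i to context
    token j in head h; passage_spans holds (start, end) token offsets.
    """
    scores = []
    for start, end in passage_spans:
        total = 0.0
        count = 0
        for h in selected_heads:
            for row in attn[h]:
                total += sum(row[start:end])  # mass on this passage
                count += 1
        scores.append(total / count)
    return scores

# Two heads, one query token, four context tokens (hypothetical weights).
attn = [
    [[0.1, 0.1, 0.5, 0.3]],      # head 0: focuses on the second passage
    [[0.25, 0.25, 0.25, 0.25]],  # head 1: uninformative, not selected
]
spans = [(0, 2), (2, 4)]
scores = passage_scores(attn, spans, selected_heads=[0])
ranking = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
print(scores, ranking)
```

Because every candidate sits in the same context window, the scores are comparable across the whole shortlist, which is what makes the approach listwise, and they are continuous rather than graded labels.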

2603.07751 2026-06-01 cs.CV cs.CL 版本更新

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

3ViewSense: 视觉-语言模型中基于正交视图的空间与心理视角推理

Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳国际研究生院) School of Software Engineering, Chongqing University, Chongqing, China(重庆大学软件学院)

AI总结 提出3ViewSense框架,通过正交视图的“模拟-推理”机制解决视觉-语言模型在空间推理中的视角一致性问题,显著提升遮挡计数和空间推理性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

当前大型语言模型已达到奥林匹克级别的逻辑推理能力,然而视觉-语言模型在诸如积木计数等基础空间任务上却出人意料地表现不佳。这种能力不匹配揭示了一个关键的“空间智能鸿沟”,即模型无法从2D观测中构建连贯的3D心理表征。我们通过诊断分析发现,这一瓶颈在于缺乏视角一致的空间接口,而非视觉特征不足或推理能力薄弱。为弥合这一鸿沟,我们引入了3ViewSense框架,该框架将空间推理建立在正交视图之上。借鉴工程认知,我们提出了一种“模拟-推理”机制,将复杂场景分解为规范的正交投影以解决几何歧义。通过将自我中心感知与这些异中心参考对齐,我们的方法促进了显式的心理旋转与重建。在空间推理基准上的实验结果表明,我们的方法显著优于现有基线,在遮挡密集计数和视角一致空间推理上取得了一致的提升。该框架还提高了空间描述的稳定性和一致性,为多模态系统中更强的空间智能提供了一条可扩展的路径。代码:https://github.com/Jasaxion/3ViewSense

英文摘要

Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems. Code: https://github.com/Jasaxion/3ViewSense

2510.15614 2026-06-01 cs.CL 版本更新

HypoSpace: A Diagnostic Benchmark for Set-Valued Hypothesis Generation under Underdetermination and Sublinear Coverage Bounds

HypoSpace: 欠定性与次线性覆盖边界下集合值假设生成的诊断基准

Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Anirudh Goyal, Yew-Soon Ong, Dianbo Liu

发表机构 * National University of Singapore(国立新加坡大学) College of Computing(计算学院) Data Science(数据科学) Nanyang Technological University(南洋理工大学) Centre for Frontier AI Research (CFAR)(前沿人工智能研究中心 (CFAR)) Agency for Science, Technology and Research(科技研究局) Meta Superintelligence Labs(Meta超智能实验室)

AI总结 提出HypoSpace基准,通过三个结构化领域评估大语言模型在欠定假设空间中的采样能力,发现模型在高有效性下存在覆盖失败,并展示分层解码可部分缓解该问题。

详情
AI中文摘要

许多科学问题是欠定的:多个不同的假设与相同的观察结果一致。在这种情况下,有效的推理不仅需要产生有效的解释,还需要系统地探索和覆盖可接受的假设集。我们引入了HypoSpace,这是一个将大语言模型(LLMs)视为有限假设空间上的采样器,并在三个指标上评估它们的基准:有效性、唯一性和恢复率。HypoSpace涵盖三个结构化领域(因果图推理、重力约束的3D体素重建和布尔遗传相互作用建模),具有确定性验证器和精确可枚举的解空间,以及基于真实世界的案例研究。实验上,HypoSpace揭示了一种依赖于能力和规模的覆盖失败:随着可接受假设空间变得更大或更具组合性,模型可以保持高有效性,同时表现出降低的唯一性和恢复率。我们进一步表明,对分层解码的分析部分缓解了这种崩溃,证明了HypoSpace作为集合值推理的诊断基准的实用性。代码可在 https://github.com/CTT-Pavilion/_HypoSpace 获取。

英文摘要

Many scientific problems are underdetermined: multiple distinct hypotheses are equally consistent with the same observations. In such settings, effective inference requires not only producing valid explanations, but also systematically exploring and covering the admissible hypothesis set. We introduce HypoSpace, a benchmark that treats large language models (LLMs) as samplers over finite hypothesis spaces and evaluates them on three metrics: Validity, Uniqueness, and Recovery. HypoSpace spans three structured domains (causal graph inference, gravity-constrained 3D voxel reconstruction, and Boolean genetic interaction modeling) with deterministic validators and exactly enumerable solution spaces, plus real-world anchored case studies. Empirically, HypoSpace reveals a capability- and scale-dependent coverage failure: models can maintain high Validity while exhibiting reduced Uniqueness and Recovery as admissible hypothesis spaces become larger or more combinatorial. We further show that stratified decoding partially mitigates this collapse, demonstrating HypoSpace's utility as a diagnostic benchmark for set-valued inference. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
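
The abstract names three metrics without defining them. The sketch below uses one plausible reading, stated here as an assumption: Validity is the share of samples inside the admissible set, Uniqueness is the share of valid samples that are distinct, and Recovery is the share of the admissible set that was found. The benchmark's exact definitions may differ, and the hypothesis strings are made up.

```python
def coverage_metrics(samples, admissible):
    """Return (validity, uniqueness, recovery) under the assumed definitions."""
    valid = [s for s in samples if s in admissible]
    distinct_valid = set(valid)
    validity = len(valid) / len(samples) if samples else 0.0
    uniqueness = len(distinct_valid) / len(valid) if valid else 0.0
    recovery = len(distinct_valid) / len(admissible) if admissible else 0.0
    return validity, uniqueness, recovery

# Hypothetical causal-graph hypotheses consistent with the observations.
admissible = {"A->B", "B->A", "A<-C->B"}
# Model samples: one repeat and one hypothesis outside the admissible set.
samples = ["A->B", "A->B", "B->A", "A->C"]
validity, uniqueness, recovery = coverage_metrics(samples, admissible)
print(validity, uniqueness, recovery)
```

A sampler that repeats one correct hypothesis scores perfect Validity with low Uniqueness and Recovery, which is the coverage failure the abstract describes.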

2408.10441 2026-06-01 cs.CL 版本更新

Goldfish: Monolingual Language Models for 350 Languages

Goldfish: 面向350种语言的单语语言模型

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

发表机构 * Department of Cognitive Science(认知科学系) Department of Linguistics(语言学系) Department of Computer Science(计算机科学系) University of California San Diego(加州大学圣地亚哥分校)

AI总结 针对低资源语言,发现大型多语言模型在基本语法生成上不如双元模型,而小规模单语模型在困惑度和语法性基准上表现更优,并发布了覆盖350种语言的1000多个单语模型套件Goldfish。

Comments LREC 2026

详情
AI中文摘要

对于许多低资源语言,唯一可用的语言模型是在多种语言上同时训练的大型多语言模型。尽管在推理任务上表现最先进,我们发现这些模型在许多语言的基本语法文本生成上仍然存在困难。首先,使用FLORES困惑度作为评估指标,大型多语言模型在许多语言上的表现甚至不如双元模型(例如,XGLM 4.5B中24%的语言;BLOOM 7.1B中43%的语言)。其次,当我们为350种语言训练仅有1.25亿参数、数据量不超过1GB的小型单语模型时,这些小型模型在困惑度和大规模多语言语法性基准上都优于大型多语言模型。为了促进未来低资源语言建模的工作,我们发布了Goldfish,一个包含超过1000个小型单语语言模型的套件,这些模型在350种语言上进行了可比训练。这些模型代表了其中215种语言首次公开可用的单语语言模型。

英文摘要

For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1GB or less of data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350 languages. These models represent the first publicly available monolingual language models for 215 of the languages included.
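
For readers unfamiliar with the bigram baseline used in the perplexity comparison, the sketch below shows a minimal add-one-smoothed bigram model and its perplexity. It is an illustration only: whitespace tokenization, add-one smoothing, and the toy sentences are assumptions, and the paper's baseline and FLORES setup are not reproduced here.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count contexts and bigrams from a token list."""
    unigrams = Counter(tokens[:-1])                 # counts of context tokens
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    vocab = set(tokens)
    return unigrams, bigrams, vocab

def perplexity(model, tokens):
    """Perplexity of a token list under add-one-smoothed bigram probabilities."""
    unigrams, bigrams, vocab = model
    v = len(vocab) + 1                              # +1 slot for unseen tokens
    log_prob = 0.0
    n = 0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

model = train_bigram("the cat sat on the mat".split())
ppl_seen = perplexity(model, "the cat sat".split())    # bigrams seen in training
ppl_unseen = perplexity(model, "mat on cat".split())   # bigrams never seen
print(ppl_seen, ppl_unseen)
```

Lower perplexity means the model assigns higher probability to the text, so a large model scoring worse than such a baseline indicates a real weakness in that language.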

2603.04946 2026-06-01 cs.CL 版本更新

LocalSUG: City-Preference-Enhanced LLM for Query Suggestion in Local-Life Services

LocalSUG:面向本地生活服务的城市偏好增强大语言模型查询建议

Jinwen Chen, Shiwen Zhang, Shuai Gong, Zheng Zhang, Yachao Zhao, Lingxiang Wang, Haibo Zhou, Wei Lin, Hainan Zhang

发表机构 * Meituan(美团)

AI总结 提出LocalSUG框架,通过挖掘城市偏好候选词注入提示、束搜索GRPO算法和加速解码,解决大语言模型在本地生活服务查询建议中城市偏好不足、偏好暴露偏差和延迟约束问题。

详情
AI中文摘要

在本地生活服务平台中,查询建议通过从输入前缀生成候选查询来减少用户操作。传统的多阶段系统严重依赖历史热门查询,限制了其捕捉长尾和新兴需求的能力。尽管大语言模型提供了强大的语义泛化能力,但它们在本地生活服务中的部署面临三个挑战:城市偏好感知不足、偏好优化中的暴露偏差以及严格的在线延迟约束。我们提出了LocalSUG,一个基于大语言模型的本地生活服务查询建议框架。LocalSUG从术语共现中挖掘城市偏好增强的候选词,并将其作为动态参考注入提示中,而非融合到模型参数中。这使得模型能够适应变化的城市偏好(如商家开业或关闭),同时减少过时或本地无效的建议。我们进一步引入了束搜索驱动的GRPO算法,使训练与推理时解码对齐,并优化相关性以及业务导向的奖励。最后,质量感知的束加速和词汇剪枝在保持生成质量的同时降低了在线延迟。离线评估和大规模在线A/B测试表明,LocalSUG将点击率提高了0.35%,并将低/无结果率降低了3.98%,证明了其在实际部署中的有效性。

英文摘要

In local-life service platforms, query suggestion reduces user effort by generating candidate queries from input prefixes. Traditional multi-stage systems rely heavily on historical popular queries, limiting their ability to capture long-tail and emerging demand. Although LLMs provide strong semantic generalization, their deployment in local-life services faces three challenges: insufficient city-preference awareness, exposure bias in preference optimization, and strict online latency constraints. We propose LocalSUG, an LLM-based query suggestion framework for local-life services. LocalSUG mines city-preference-enhanced candidates from term co-occurrence and injects them into prompts as dynamic references rather than fusing them into model parameters. This allows the model to adapt to changing city preferences, such as merchant openings or closures, while reducing stale or locally invalid suggestions. We further introduce a beam-search-driven GRPO algorithm to align training with inference-time decoding and optimize relevance together with business-oriented rewards. Finally, quality-aware beam acceleration and vocabulary pruning reduce online latency while preserving generation quality. Offline evaluations and large-scale online A/B testing show that LocalSUG improves CTR by 0.35% and reduces the low/no-result rate by 3.98%, demonstrating its effectiveness in real-world deployment.

2602.24210 2026-06-01 cs.CL cs.AI 版本更新

From Leaky Thoughts to Private Reasoning: Controlling What LRMs Say to Themselves

从泄露思维到私有推理:控制LRM对自己说的话

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych

发表机构 * Mohamed bin Zayed University of Artificial Intelligence, UAE(穆罕默德·本·扎耶德人工智能大学,阿联酋)

AI总结 针对大型推理模型(LRM)推理过程中隐私泄露问题,提出通过指令跟随(IF)训练和分阶段解码策略(Staged Decoding)增强隐私保护,在IF和隐私基准上分别提升高达20.9和51.9个百分点。

详情
AI中文摘要

大型推理模型(LRM)产生的推理轨迹(RT)通常包含敏感信息。这些泄露的思维难以控制,且经常违反明确的隐私指令。由于RT可能通过提示注入攻击暴露,这对用户构成了直接的隐私风险。我们将此视为一个可控性问题:由于隐私指令本身就是指令,在RT内改进指令跟随(IF)为减少隐私泄露提供了直接途径。为此,我们引入了一个SFT数据集,教会模型在其推理过程中遵循通用指令,并提出了分阶段解码(Staged Decoding),一种简单的解码策略,通过使用独立的LoRA适配器解耦RT和答案生成,以最大化每个组件的IF。我们在两个系列(1.7B-14B参数)的六个模型上,在两个IF基准和两个隐私基准上评估了我们的方法。我们的方法带来了显著的改进,在IF上提升高达20.9分,在隐私基准上提升51.9个百分点,尽管由于推理性能与IF之间的权衡,这些改进可能以牺牲任务效用为代价。我们的结果表明,改进LRM中的IF可以显著增强隐私,为未来隐私感知的LRM提供了一个有前景的方向。我们的代码可在https://github.com/UKPLab/arxiv2026-controllable-reasoning-models获取。

英文摘要

Large reasoning models (LRMs) produce reasoning traces (RTs) that often contain sensitive information. These leaky thoughts are difficult to control and frequently violate explicit privacy directives. Because RTs can be exposed through prompt injection attacks, this becomes a direct privacy risk to the user. We approach this as a controllability problem: since privacy directives are themselves instructions, improving instruction-following (IF) within the RT provides a direct path to reducing privacy leaks. To this end, we introduce an SFT dataset that teaches models to follow general instructions throughout their reasoning process, and propose Staged Decoding, a simple decoding strategy that decouples RT and answer generation using separate LoRA adapters to maximize IF of each component. We evaluate our approach on six models from two families (1.7B-14B parameters), across two IF benchmarks and two privacy benchmarks. Our method yields substantial improvements, with gains of up to 20.9 points in IF and 51.9 percentage points on privacy benchmarks, though these can come at the cost of task utility due to the trade-off between reasoning performance and IF. Our results show that improving IF in LRMs can significantly enhance privacy, suggesting a promising direction for future privacy-aware LRMs. Our code is available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models.

2602.19049 2026-06-01 cs.CL cs.LG 版本更新

IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

IAPO:面向令牌高效推理的信息感知策略优化

Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su, Xiaoqing Wang, Qi Guo, Jundong Li

发表机构 * University of Virginia(弗吉尼亚大学) LinkedIn Inc.(LinkedIn公司)

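AI summary: Proposes IAPO, an information-aware policy optimization framework that shapes token-level advantages using conditional mutual information, improving reasoning accuracy while cutting reasoning length by up to 36%.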

详情
AI中文摘要

大型语言模型越来越依赖长思维链来提高准确性,但这种提升伴随着巨大的推理时间成本。我们重新审视令牌高效的后训练,并认为现有的序列级奖励塑造方法对推理努力在令牌间的分配控制有限。为弥补这一差距,我们提出IAPO,一个信息论后训练框架,根据每个令牌与最终答案的条件互信息(MI)分配令牌级优势。这提供了一种明确、有原则的机制来识别信息丰富的推理步骤并抑制低效探索。我们的理论分析表明,IAPO可以在不损害正确性的情况下诱导推理冗长性的单调减少。实验上,IAPO在持续提升推理准确率的同时,将推理长度减少高达36%,在各种推理数据集上优于现有的令牌高效强化学习方法。广泛的实证评估表明,信息感知优势塑造是令牌高效后训练的一个强大且通用的方向。代码可在 https://github.com/YinhanHe123/IAPO 获取。

英文摘要

Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.

2506.01928 2026-06-01 cs.CL cs.LG 版本更新

Esoteric Language Models: A Family of Any-Order Diffusion LLMs

深奥语言模型:一类任意阶扩散LLM

Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat

发表机构 * Cornell University(康奈尔大学) NVIDIA(英伟达)

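AI summary: Proposes Eso-LMs, which fuse the autoregressive and masked diffusion paradigms; causal attention enables exact likelihood computation and KV caching, setting a new state of the art on the speed-quality Pareto frontier.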

Comments ICML 2026

详情
AI中文摘要

基于扩散的语言模型通过并行和可控生成为自回归(AR)模型提供了引人注目的替代方案。在这一类模型中,掩码扩散模型(MDM)目前表现最佳,但在困惑度上仍不如AR模型,并且缺乏关键的推理时效率特性,尤其是KV缓存。我们引入了Eso-LMs,这是一个融合AR和MDM范式的新模型家族,能够平滑地插值它们的困惑度,同时克服各自的局限性。与以往使用具有双向注意力的Transformer作为MDM去噪器的工作不同,我们利用了MDM与任意阶自回归模型之间的联系,并采用因果注意力。这种设计使我们首次能够计算MDM的精确似然,并且关键的是,首次能够在保持并行生成的同时为MDM引入KV缓存,从而显著提高推理效率。结合优化的采样调度,Eso-LMs在无条件生成的速度-质量帕累托前沿上建立了新的最先进水平。我们在项目页面上提供代码、模型检查点和视频教程:https://s-sahoo.com/Eso-LMs。

英文摘要

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Within this family, Masked Diffusion Models (MDMs) currently perform best but still underperform AR models in perplexity and lack key inference-time efficiency features, most notably KV caching. We introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, smoothly interpolating between their perplexities while overcoming their respective limitations. Unlike prior work, which uses transformers with bidirectional attention as MDM denoisers, we exploit the connection between MDMs and Any-Order autoregressive models and adopt causal attention. This design lets us compute the exact likelihood of MDMs for the first time and, crucially, enables us to introduce KV caching for MDMs while preserving parallel generation for the first time, significantly improving inference efficiency. Combined with an optimized sampling schedule, Eso-LMs establish a new state of the art on the speed-quality Pareto frontier for unconditional generation. We provide the code, model checkpoints, and the video tutorial on the project page: https://s-sahoo.com/Eso-LMs.

2602.18333 2026-06-01 cs.LG cs.CL 版本更新

On the "Induction Bias" in Sequence Models

论序列模型中的“归纳偏置”

M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic

发表机构 * Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc(高通人工智能研究,由高通技术公司发起)

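AI summary: A large-scale experimental comparison of the data efficiency of transformers and RNNs on state-tracking tasks finds that transformers need more training data and struggle to share weights across sequence lengths, whereas RNNs learn effectively through weight sharing.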

Comments Accepted to the International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

尽管基于Transformer的语言模型在实际应用中取得了显著成功,但近期研究对其执行状态跟踪的能力提出了担忧。特别是,越来越多的文献通过分布外(OOD)泛化失败(如长度外推)来展示这一局限性。在本工作中,我们将注意力转向这些局限性的分布内影响。我们在多种监督机制下对Transformer和循环神经网络(RNN)的数据效率进行了大规模实验研究。我们发现,Transformer所需的训练数据量随状态空间大小和序列长度的增长远快于RNN。此外,我们分析了学习到的状态跟踪机制在不同序列长度上的共享程度。我们表明,Transformer在不同长度上的权重共享可以忽略甚至有害,表明它们孤立地学习长度特定的解决方案。相比之下,循环模型通过跨长度共享权重实现了有效的摊销学习,使得一个序列长度的数据能够提高其他长度上的性能。这些结果共同表明,即使训练和评估分布匹配,状态跟踪仍然是Transformer的一个基本挑战。

英文摘要

Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.

2602.13110 2026-06-01 cs.CL cs.AI 版本更新

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

SCOPE: 选择性保形优化的成对LLM评判

Sher Badshah, Ali Emami, Hassan Sajjad

发表机构 * Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada(达勒豪西大学计算机科学学院) Department of Computer Science, Emory University, Atlanta, GA, USA(埃默里大学计算机科学系)

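AI summary: Proposes SCOPE, a framework that calibrates an acceptance threshold to control the error rate among non-abstained judgments, and introduces Bidirectional Preference Entropy (BPE) as a bias-neutral uncertainty signal, enabling reliable, high-coverage pairwise LLM evaluation.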

Comments Accepted at ICML 2026. 23 pages (9 main plus appendix), 7 figures, 11 tables

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作成对评估中的可扩展评判者,但它们仍然容易受到校准偏差和偏见的影响。我们提出SCOPE(选择性保形优化成对评估),一个校准接受阈值的框架,使得在可交换性条件下,非弃权判断的错误率最多为用户指定的水平$α$。为了向SCOPE提供无偏的不确定性信号,我们引入了双向偏好熵(BPE),它在两个响应位置下查询评判者,并将顺序平均的偏好概率转换为基于熵的分数。在各种成对评判基准上,BPE在校准和区分能力方面优于标准置信度代理,而SCOPE始终满足目标风险界限(在$α=0.10$时,经验FDR约为0.097至0.099),并保持较高的覆盖率。与原始基线相比,在相同风险约束下,SCOPE接受的判断数量最多增加2.4倍,表明BPE能够实现可靠且高覆盖率的基于LLM的评估。

英文摘要

Large language models (LLMs) are increasingly used as scalable judges in pairwise evaluation, but they remain prone to miscalibration and biases. We propose SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework that calibrates an acceptance threshold so that, under exchangeability, the error rate among non-abstained judgments is at most a user-specified level $α$. To supply SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions and converts the order-averaged preference probability into an entropy-based score. Across various pairwise judging benchmarks, BPE outperforms standard confidence proxies in calibration and discrimination, while SCOPE consistently satisfies the target risk bound (empirical FDR $\approx 0.097$ to $0.099$ at $α= 0.10$) and retains substantial coverage. Compared to vanilla baselines, SCOPE accepts up to $2.4\times$ more judgments under the same risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
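
The BPE description above can be sketched as follows. This is a minimal illustration that assumes the entropy-based score is the binary entropy of the order-averaged preference probability; the abstract does not give the exact formula, and the function names `binary_entropy` and `bidirectional_preference_entropy` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bidirectional_preference_entropy(p_a_shown_first: float,
                                     p_a_shown_second: float) -> float:
    """Assumed form of BPE: entropy of the order-averaged preference.

    p_a_shown_first  -- judge's probability that response A wins when A is listed first
    p_a_shown_second -- the same probability when A is listed second
    """
    p_avg = 0.5 * (p_a_shown_first + p_a_shown_second)
    return binary_entropy(p_avg)
```

Under this assumed form, a judge that simply favours whichever response is listed first (for instance 0.9 in one order and 0.1 in the other) averages to 0.5 and receives the maximum uncertainty score, which is the behaviour a bias-neutral signal should have.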

2602.15778 2026-06-01 cs.CL 版本更新

*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

*-PLUIE:使用大语言模型改进评估的可个性化度量

Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive

发表机构 * Univ Rennes, CNRS, IRISA, EXPRESSION(雷恩大学、法国国家科学研究中心、IRISA、EXPRESSION) Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004(南特大学、南特中央理工学院、法国国家科学研究中心、LS2N、UMR 6004) Univ Rennes, CNRS, IRISA, SOTERN(雷恩大学、法国国家科学研究中心、IRISA、SOTERN) Univ of South Brittany, CNRS, IRISA, ARCHIMEDIA(布列塔尼南部大学、法国国家科学研究中心、IRISA、ARCHIMEDIA)

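AI summary: Proposes *-PLUIE, a personalisable perplexity-based LLM-judge metric whose task-specific prompting variants correlate more strongly with human judgement while keeping computational cost low.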

Comments Accepted at *SEM 2026

详情
AI中文摘要

评估自动生成文本的质量通常依赖于LLM作为评判者(LLM-judge)的方法。虽然有效,但这些方法计算成本高且需要后处理。为了解决这些限制,我们基于ParaPLUIE进行改进,ParaPLUIE是一种基于困惑度的LLM评判度量,它通过估计“是/否”答案的置信度而不生成文本。我们引入了*-PLUIE,即ParaPLUIE的任务特定提示变体,并评估它们与人类判断的一致性。实验表明,个性化的*-PLUIE在保持低计算成本的同时,与人类评分的相关性更强。

英文摘要

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
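
One way to read "confidence over Yes/No answers without generating text" is as a comparison of the log-probabilities a model assigns to the two candidate answers. The sketch below assumes that reading; the exact scoring formula used by ParaPLUIE is not stated in the abstract, and the function names and inputs are hypothetical.

```python
import math


def yes_no_log_odds(logp_yes: float, logp_no: float) -> float:
    """Log-odds of "Yes" over "No"; positive values favour "Yes"."""
    return logp_yes - logp_no


def yes_confidence(logp_yes: float, logp_no: float) -> float:
    """Probability of "Yes" renormalised over the two answers only."""
    return 1.0 / (1.0 + math.exp(-(logp_yes - logp_no)))
```

Both quantities need only a scoring pass over the two fixed continuations, so no decoding or answer parsing is involved.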

2602.15293 2026-06-01 cs.LG cs.AI cs.CL stat.ML 版本更新

The Information Geometry of Softmax: Probing and Steering

Softmax的信息几何:探测与引导

Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch

发表机构 * University of Chicago(芝加哥大学)

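AI summary: Studies, from an information-geometric perspective, how AI systems encode semantic structure into the geometry of their representation spaces, and proposes "dual steering", a method that uses linear probes to robustly steer representations to exhibit a particular concept.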

Comments Code is available at https://github.com/KihoPark/dual-steering

详情
Journal ref
In Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026
AI中文摘要

本文关注AI系统如何将语义结构编码到其表示空间的几何结构中的问题。动机观察是,这些表示空间的自然几何应反映模型使用表示产生行为的方式。我们聚焦于定义softmax分布的重要特例。在这种情况下,我们认为自然几何是信息几何。我们的重点是信息几何在语义编码和线性表示假设中的作用。作为一个说明性应用,我们开发了“双重引导”,一种利用线性探针鲁棒地引导表示以展现特定概念的方法。我们证明双重引导在最小化对非目标概念改变的同时,最优地修改目标概念。实验上,我们发现双重引导增强了概念操控的可控性和稳定性。

英文摘要

This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry in semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.

2602.13069 2026-06-01 cs.LG cs.CL 版本更新

Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning

面向设备上大语言模型微调的内存高效结构化反向传播

Juneyoung Park, Yuri Hong, Seongwan Kim, Jaeho Lee

发表机构 * OptAI Inc.(OptAI公司)

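AI summary: Proposes MeSP, which manually derives backward passes that exploit LoRA's low-rank structure, cutting memory by 49% on average while computing mathematically identical gradients and making fine-tuning feasible on memory-constrained devices.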

Comments ACL2026

详情
AI中文摘要

设备上微调能够实现大语言模型的隐私保护个性化,但移动设备存在严重的内存限制,通常所有工作负载共享6-12GB内存。现有方法迫使在高内存的精确梯度(MeBP)和低内存的噪声估计(MeZO)之间进行权衡。我们提出内存高效结构化反向传播(MeSP),通过手动推导利用LoRA低秩结构的反向传播来弥合这一差距。我们的关键洞察是,中间投影 $h = xA$ 可以在反向传播中以最小成本重新计算,因为秩 $r \ll d_{in}$,从而无需存储它。在Qwen2.5模型(0.5B-3B)上,MeSP相比MeBP平均减少49%内存,同时计算数学上等价的梯度。我们的分析还揭示,MeZO的梯度估计与真实梯度的相关性接近零(余弦相似度≈0.001),解释了其收敛缓慢的原因。MeSP将Qwen2.5-0.5B的峰值内存从361MB降低到136MB,使得先前在内存受限设备上不可行的微调场景成为可能。

英文摘要

On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6-12 GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA's low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{in}$, eliminating the need to store it. MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B-3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO's gradient estimates show near-zero correlation with true gradients (cosine similarity $\approx 0.001$), explaining its slow convergence. MeSP reduces peak memory from 361 MB to 136 MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.
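
The recomputation idea can be sketched in a few lines. This minimal example assumes only the LoRA branch $y = (xA)B$, with no scaling factor and no frozen base weight (both simplifications, not taken from the paper). The helper names are hypothetical, and the gradients are the standard chain-rule expressions rather than the authors' implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def lora_forward(x, A, B):
    """LoRA branch y = (x A) B; the intermediate h = x A is not kept."""
    return matmul(matmul(x, A), B)


def lora_backward(x, A, B, grad_y):
    """Backward pass that recomputes h = x A instead of loading a stored copy."""
    h = matmul(x, A)                        # recomputed; cheap because r << d_in
    grad_B = matmul(transpose(h), grad_y)   # dL/dB = h^T grad_y
    grad_h = matmul(grad_y, transpose(B))   # dL/dh = grad_y B^T
    grad_A = matmul(transpose(x), grad_h)   # dL/dA = x^T grad_h
    grad_x = matmul(grad_h, transpose(A))   # dL/dx through the LoRA branch only
    return grad_A, grad_B, grad_x
```

The extra cost is one additional product of shape (n x d_in) by (d_in x r) per layer, which is what the abstract describes as minimal when $r \ll d_{in}$; in exchange, no activation of shape (n x r) has to stay resident between the forward and backward passes.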

2602.11137 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Weight Decay Improves Language Model Plasticity

权重衰减提升语言模型可塑性

Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade

发表机构 * Broad Institute, Schmidt Center(Broad研究所,Schmidt中心) University of Tübingen, Tübingen AI Center(图宾根大学,图宾根人工智能中心) Harvard University(哈佛大学)

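AI summary: Systematic experiments show that larger weight decay during pretraining increases model plasticity and yields better downstream performance after fine-tuning; the mechanisms identified are linearly separable representations, regularized attention matrices, and reduced overfitting.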

详情
AI中文摘要

大型语言模型通常分两个主要阶段训练:预训练以产生基础模型,然后进一步训练以提高下游性能。然而,超参数优化和缩放定律主要从基础模型验证损失的角度研究,忽略了一个关键的模型属性:下游适应性。在这项工作中,我们从模型可塑性的角度研究预训练,即基础模型在额外训练后成功适应下游任务的能力。我们关注权重衰减的作用,这是预训练中的一个关键正则化参数,并通过系统实验表明,较大的权重衰减提高了预训练模型的可塑性,导致微调后下游性能提升更大。这种效应可能导致反直觉的权衡,即预训练后表现较差的基础模型在进一步训练后可能表现更好。对权重衰减对模型行为的机制影响的进一步研究表明,它鼓励线性可分的表示,正则化注意力矩阵,并减少对训练数据的过拟合。这些发现共同强调了预训练模型可塑性的重要性,使用交叉熵损失作为超参数优化的唯一指标的局限性,以及单个优化超参数在塑造模型行为中的多方面作用。

英文摘要

Large language models are typically trained in two broad phases: pretraining to produce a base model, followed by further training to improve downstream performance. However, hyperparameter optimization and scaling laws are studied primarily from the perspective of the base model's validation loss, overlooking a crucial model property: downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks upon additional training. We focus on the role of weight decay, a key regularization parameter during pretraining, and show through systematic experiments that larger weight decay increases the plasticity of the pretrained model, resulting in greater performance gains downstream after fine-tuning. This effect can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after further training. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. Together, these findings highlight the importance of pretrained model plasticity, the limits of using cross-entropy loss as the sole metric for hyperparameter optimization, and the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.

2602.10324 2026-06-01 cs.AI cs.CL cs.CY cs.HC 版本更新

Discovering Differences in Strategic Behavior Between Humans and LLMs

发现人类与LLM在战略行为上的差异

Caroline Wang, Daniel Kasenberg, Kim Stachenfeld, Pablo Samuel Castro

发表机构 * Department of Computer Science, University of Texas at Austin. Work performed as a student researcher at Google DeepMind(德克萨斯大学奥斯汀分校计算机科学系。作为谷歌DeepMind的学生研究员进行的工作) Google DeepMind(谷歌DeepMind)

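AI summary: Uses AlphaEvolve, a program discovery tool, to discover interpretable models of human and LLM behavior directly from data, revealing that frontier LLMs can exhibit deeper strategic behavior than humans in iterated rock-paper-scissors.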

Comments Accepted to ICML 2026

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地部署在社交和战略场景中,了解它们的行为在何处以及为何与人类行为产生差异变得至关重要。虽然行为博弈论(BGT)为分析行为提供了框架,但现有模型并未完全捕捉到人类或像LLM这样的黑箱非人类代理的独特行为。我们采用AlphaEvolve这一前沿程序发现工具,直接从数据中发现可解释的人类和LLM行为模型,从而能够开放式地发现驱动人类和LLM行为的结构因素。我们对迭代石头剪刀布的分析表明,前沿LLM可能比人类具有更深层次的战略行为。这些结果为理解驱动人类和LLM在战略互动中行为差异的结构性差异奠定了基础。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.

2602.07721 2026-06-01 cs.LG cs.CL cs.DB 版本更新

ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

ParisKV:面向长上下文LLM的快速且漂移鲁棒的KV缓存检索

Yanlin Qi, Xinhang Chen, Huiqiang Jiang, Qitong Wang, Botao Peng, Themis Palpanas

发表机构 * Xi'an Jiaotong University, Xi'an, China(西安交通大学) Qwen Team, Alibaba Group, China(阿里巴巴集团Qwen团队) Harvard University, Cambridge, MA, USA(哈佛大学) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(中国科学院计算技术研究所)

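AI summary: Proposes ParisKV, a GPU-native KV-cache retrieval framework built on collision-based candidate selection and quantized inner-product reranking; at million-token context it delivers low-latency, high-throughput, drift-robust retrieval whose quality matches or exceeds full attention.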

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

KV缓存检索对于长上下文LLM推理至关重要,但现有方法在处理大规模分布漂移和高延迟时存在困难。我们提出ParisKV,一种基于碰撞候选选择、随后使用量化内积重排序估计器的漂移鲁棒、GPU原生的KV缓存检索框架。对于百万token上下文,ParisKV通过统一虚拟寻址(UVA)支持CPU卸载的KV缓存,实现按需的top-$k$获取,开销极小。ParisKV在长输入和长生成基准测试中匹配或超越全注意力质量。它实现了最先进的长上下文解码效率:即使在长上下文的批大小为1时,也能匹配或超过全注意力速度;在全注意力可运行范围内提供高达2.8倍的吞吐量;并扩展到全注意力内存不足的百万token上下文。在百万token规模下,与两个最先进的KV缓存Top-$k$检索基线MagicPIG和PQCache相比,ParisKV分别将解码延迟降低了17倍和44倍。代码可在https://github.com/amy-77/ParisKV/tree/main获取。

英文摘要

KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection, followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-$k$ fetching with minimal overhead. ParisKV matches or outperforms full attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full attention speed even at batch size 1 for long contexts, delivers up to 2.8$\times$ higher throughput within full attention's runnable range, and scales to million-token contexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by 17$\times$ and 44$\times$ compared to MagicPIG and PQCache, respectively, two state-of-the-art KV-cache top-$k$ retrieval baselines. Code is available at https://github.com/amy-77/ParisKV/tree/main.

2602.09276 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Effective Reasoning Chains Reduce Intrinsic Dimensionality

有效推理链降低内在维度

Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, Peter Shaw

发表机构 * Google DeepMind(谷歌DeepMind)

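AI summary: Uses intrinsic dimensionality to quantify the effectiveness of reasoning chains, finds that effective reasoning strategies lower a task's intrinsic dimensionality, and confirms on GSM8K a strong inverse correlation with generalization performance.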

Comments ICML (spotlight) camera-ready; 22 pages, 3 figures

详情
AI中文摘要

思维链推理及其变体显著提升了语言模型在复杂推理任务上的性能,但不同策略促进泛化的精确机制仍不明确。虽然当前解释常指向增加测试时计算或结构引导,但建立这些因素与泛化之间一致、可量化的联系仍具挑战。本文中,我们将内在维度识别为表征推理链有效性的定量度量。内在维度量化了在给定任务上达到特定准确率阈值所需的最小模型维度数。通过固定模型架构并改变不同推理策略下的任务表述,我们证明有效推理策略持续降低任务的内在维度。在GSM8K上使用Gemma-3 1B和4B验证这一点,我们观察到推理策略的内在维度与其在分布内和分布外数据上的泛化性能之间存在强负相关。我们的发现表明,有效推理链通过使用更少参数更好地压缩任务来促进学习,为分析推理过程提供了新的定量度量。

英文摘要

Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.
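
The definition of intrinsic dimensionality given above (the minimum number of model dimensions needed to reach a given accuracy threshold) can be written down directly. The sketch below assumes accuracy has already been measured at a handful of candidate dimensions; how those measurements are obtained is not covered here, and the function name and the numbers in the usage lines are made up for illustration.

```python
from typing import Optional


def intrinsic_dimension(accuracy_by_dim: dict[int, float],
                        threshold: float) -> Optional[int]:
    """Smallest measured dimension whose accuracy meets the threshold, else None."""
    reaching = [dim for dim, acc in accuracy_by_dim.items() if acc >= threshold]
    return min(reaching) if reaching else None


# Hypothetical measurements for two reasoning strategies (not results from the paper).
strategy_a = {100: 0.20, 1_000: 0.55, 10_000: 0.60}
strategy_b = {100: 0.10, 1_000: 0.30, 10_000: 0.52}
dim_a = intrinsic_dimension(strategy_a, threshold=0.5)
dim_b = intrinsic_dimension(strategy_b, threshold=0.5)
```

With a fixed architecture and threshold, the strategy with the smaller returned value is the one that compresses the task better in the sense used in the abstract.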

2602.08964 2026-06-01 cs.LG cs.AI cs.CL cs.CY 版本更新

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

语言模型智能体中目标导向性的行为与表征评估

Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

发表机构 * University of Pennsylvania(宾夕法尼亚大学) New York University(纽约大学) Indiana University, Bloomington(印第安纳大学,布卢明顿) Northeastern University(东北大学) University College London(伦敦大学学院)

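AI summary: Proposes a framework for evaluating goal-directedness that combines behavioural evaluation with interpretability analyses of internal representations; a case study of an LLM agent navigating a 2D grid world finds its actions broadly consistent with those representations.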

Comments Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

理解智能体的目标有助于解释和预测其行为,但目前尚无可靠的方法来归因智能系统的目标。我们提出一个评估目标导向性的框架,该框架将行为评估与基于可解释性的模型内部表征分析相结合。作为案例研究,我们考察了一个在二维网格世界中导航至目标状态的LLM智能体。在行为上,我们评估智能体在不同网格大小、障碍物密度和目标结构下的最优策略,发现其性能随任务难度扩展,同时对保持难度的变换和多目标结构具有鲁棒性。然后,我们使用探测方法解码环境及多步行动计划的内部表征。我们发现,LLM智能体非线性地编码了一个粗略的空间地图,保留了关于其位置和目标位置的任务相关近似线索;其行动与这些内部表征大致一致;推理过程重新组织这些表征,从空间线索转向即时行动选择。我们的研究结果支持这样的观点:除了行为评估之外,还需要内省检查来表征智能体如何表示和追求其目标。

英文摘要

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world towards a goal state. Behaviourally, we evaluate the agent against optimal policies across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and multi-goal structures. We then use probing methods to decode internal representations of the environment and multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from spatial cues towards immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

2602.07457 2026-06-01 cs.SE cs.AI cs.CL 版本更新

Pull Requests as a Training Signal for Repo-Level Code Editing

拉取请求作为仓库级代码编辑的训练信号

Qinglin Zhu, Tianyu Chen, Shuai Lu, Lei Ji, Runcong Zhao, Murong Ma, Xiangxiang Dai, Yulan He, Lin Gui, Peng cheng, Yeyun Gong

发表机构 * King's College London, UK(伦敦国王学院) Chinese University of Hong Kong, HK(香港中文大学) National University of Singapore, SG(新加坡国立大学) Microsoft Research Asia, CN(微软亚洲研究院) The Alan Turing Institute, UK(艾伦·图灵研究所)

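AI summary: Proposes Clean-PR, which uses real GitHub pull requests as a training signal, converting them into Search/Replace edit blocks through reconstruction and validation; combined with agentless-aligned supervised fine-tuning, it significantly improves repository-level code editing on SWE-bench.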

Comments Accepted at ICML 2026

详情
AI中文摘要

仓库级代码编辑要求模型理解复杂依赖关系并在大型代码库中执行精确的多文件修改。虽然最近在SWE-bench上的进展严重依赖于复杂的代理脚手架,但尚不清楚这种能力有多少可以通过高质量的训练信号内化。为了解决这个问题,我们提出了Clean Pull Request (Clean-PR),一种利用真实世界GitHub拉取请求作为仓库级编辑训练信号的中间训练范式。我们引入了一个可扩展的流水线,通过重建和验证将嘈杂的拉取请求差异转换为搜索/替换编辑块,从而得到最大的公开可用语料库,包含200万个拉取请求,涵盖12种编程语言。使用这个训练信号,我们进行中间训练阶段,然后进行无代理对齐的监督微调过程,并带有错误驱动的数据增强。在SWE-bench上,我们的模型显著优于指令微调基线,在SWE-bench Lite上实现了13.6%的绝对改进,在SWE-bench Verified上实现了12.3%的绝对改进。这些结果表明,仓库级代码理解和编辑能力可以在简化的、无代理的协议下有效地内化到模型权重中,而无需依赖繁重的推理时脚手架。

英文摘要

Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised via high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation, resulting in the largest publicly available corpus of 2 million pull requests spanning 12 programming languages. Using this training signal, we perform a mid-training stage followed by an agentless-aligned supervised fine-tuning process with error-driven data augmentation. On SWE-bench, our model significantly outperforms the instruction-tuned baseline, achieving absolute improvements of 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified. These results demonstrate that repository-level code understanding and editing capabilities can be effectively internalised into model weights under a simplified, agentless protocol, without relying on heavy inference-time scaffolding.
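
To make the Search/Replace representation concrete, the sketch below applies a list of (search, replace) pairs to a file and checks that the result reproduces the post-PR file. It assumes each search text must match exactly once, which is a common convention for this kind of edit block but is not confirmed by the abstract; the function names and the example file are hypothetical.

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit, requiring a unique match for the search text."""
    count = source.count(search)
    if count != 1:
        raise ValueError(f"search text must match exactly once, found {count}")
    return source.replace(search, replace, 1)


def reconstructs(before: str, edits: list[tuple[str, str]], after: str) -> bool:
    """Validation step: do the edits turn the pre-PR file into the post-PR file?"""
    current = before
    try:
        for search, replace in edits:
            current = apply_edit(current, search, replace)
    except ValueError:
        return False  # an ambiguous or missing match makes the edit block invalid
    return current == after


# Hypothetical one-line bug fix expressed as a single edit block.
before = "def add(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n"
edits = [("return a - b", "return a + b")]
```

A check of this kind is one plausible way to filter noisy diffs: any pull request whose edit blocks fail to reproduce the merged file would be dropped from the corpus.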

2601.01754 2026-06-01 cs.LG cs.CC cs.CL cs.FL 版本更新

Context-Free Recognition with Transformers

使用Transformer进行上下文无关语言识别

Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

发表机构 * ETH Zürich(苏黎世联邦理工学院) Boston University(波士顿大学) Allen Institute for AI(艾伦人工智能研究所)

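AI summary: Proves that looped transformers with O(log N) looping layers and O(N^6) padding symbols can recognize all context-free languages, and that the padding requirement drops to O(N^3) for the unambiguous subclass.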

详情
AI中文摘要

Transformer在处理符合某种语法的良好形式输入(如自然语言和代码)的任务中表现出色。然而,它们如何处理语法句法仍不清楚。事实上,在标准复杂性猜想下,标准Transformer无法识别上下文无关语言(CFL)——一种描述句法的规范形式,甚至无法识别正则语言(CFL的子类)。过去的工作表明,O(log(N))循环层(相对于输入长度N)允许Transformer识别正则语言,但循环Transformer识别上下文无关语言的问题仍然开放。在这项工作中,我们证明具有O(log(N))循环层和O(N^6)填充符号的循环Transformer可以识别所有CFL。然而,使用O(N^6)填充符号的训练和推理可能不切实际。幸运的是,我们表明,对于无歧义CFL等自然子类,Transformer上的识别问题变得更加易处理,只需要O(N^3)填充。实验上,循环和填充Transformer在识别CFL方面比固定深度Transformer表现更好。总体而言,我们的结果揭示了Transformer识别CFL的复杂性:虽然一般识别可能需要难以处理的填充量,但无歧义性等自然约束产生了高效的识别算法。

英文摘要

Transformers excel empirically on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work has shown that $\mathcal{O}(\log(N))$ looping layers (w.r.t. input length $N$) allow transformers to recognize regular languages, but the question of context-free recognition with looped transformers remained open. In this work, we show that looped transformers with $\mathcal{O}(\log(N))$ looping layers and $\mathcal{O}(N^6)$ padding symbols can recognize all CFLs. However, training and inference with $\mathcal{O}(N^6)$ padding symbols is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(N^3)$ padding. Empirically, looped and padded transformers perform better than fixed-depth transformers in recognizing CFLs. Overall, our results shed light on the intricacy of CFL recognition by transformers: while general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.
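
For reference, membership in a context-free language can be decided classically with the CYK dynamic program, sketched below for a grammar in Chomsky normal form. This is the textbook algorithm, not the looped-transformer construction from the paper; the grammar encoding, the function name `cyk_recognize`, and the balanced-parentheses grammar are assumptions chosen for illustration.

```python
def cyk_recognize(word, terminal_rules, binary_rules, start="S"):
    """Return True if `word` is derivable from `start` (grammar in Chomsky normal form).

    terminal_rules -- list of (lhs, terminal) pairs, e.g. ("L", "(")
    binary_rules   -- list of (lhs, (left, right)) pairs, e.g. ("S", ("L", "R"))
    """
    n = len(word)
    if n == 0:
        return False  # the empty string is not handled in this sketch
    # table[i][l] holds the nonterminals that derive word[i : i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {lhs for lhs, t in terminal_rules if t == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]


# Non-empty balanced parentheses in Chomsky normal form.
PAREN_TERMINALS = [("L", "("), ("R", ")")]
PAREN_BINARY = [
    ("S", ("L", "R")),  # S -> ( )
    ("S", ("L", "X")),  # S -> ( X
    ("X", ("S", "R")),  # X -> S )
    ("S", ("S", "S")),  # S -> S S
]
```

The three nested loops over span length, start position, and split point give the familiar cubic running time in the input length for a fixed grammar.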

2602.06161 2026-06-01 cs.CL cs.AI 版本更新

Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding

停止翻转:面向快速可撤销扩散解码的上下文保持验证

Yanzheng Xiang, Lan Wei, Yizhen Yao, Qinglin Zhu, Hanqi Yan, Chen Jin, Philip Alexander Teare, Dandan Zhang, Lin Gui, Amrutha Saseendran, Yulan He

发表机构 * King's College London, UK; Centre for AI, Data Science & Artificial Intelligence, BioPharmaceuticals R&D, AstraZeneca, UK; The Alan Turing Institute, UK; Imperial College London, UK

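AI summary: Targets the flip-flop oscillations that arise in revocable parallel diffusion decoding and proposes COVER, which uses KV cache override and stability-aware scoring to perform leave-one-out verification and stable drafting in a single forward pass, reducing unnecessary revisions and speeding up decoding.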

详情
AI中文摘要

并行扩散解码可以通过每步解掩多个令牌来加速扩散语言模型推理,但激进的并行常常损害质量。可撤销解码通过重新检查早期令牌来缓解这一问题,然而我们观察到现有的验证方案频繁触发翻转振荡,即令牌被重新掩码后又原样恢复。这种行为以两种方式减慢推理:重新掩码已验证位置削弱了并行草稿的条件上下文,且重复的重新掩码循环消耗修订预算而进展甚微。我们提出COVER(用于高效修订的缓存覆盖验证),它在单次前向传递中执行留一验证和稳定草稿。COVER通过KV缓存覆盖构建两种注意力视图:选定的种子被掩码用于验证,而其缓存的键值状态被注入到所有其他查询中以保留上下文信息,同时通过闭式对角校正防止种子位置的自泄漏。COVER进一步使用稳定性感知评分对种子进行优先级排序,该评分平衡不确定性、下游影响和缓存漂移,并自适应调整每步验证的种子数量。在多个基准测试中,COVER显著减少不必要的修订,实现更快的解码,同时保持输出质量。

英文摘要

Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key-value states are injected for all other queries to preserve contextual information, with a closed-form diagonal correction preventing self-leakage at the seed positions. COVER further prioritises seeds using a stability-aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.

2602.06055 2026-06-01 cs.CL 版本更新

Are we chasing ghosts? Quantifying unattributable polarization, and attributing the rest to annotator groups

我们在追逐幽灵吗?量化不可归因的极化,并将其余归因于标注者群体

Dimitris Tsirmpas, John Pavlopoulos

发表机构 * Athens University of Economics and Business(雅典经济与商业大学) Archimedes, Athena Research Center(阿基米德-雅典研究中心)

AI总结 针对标注者群体间系统性意见差异难以捕捉的问题,提出一种新的极化归因度量方法,能避免固有极化和群体效应抵消,并验证了性别和种族对极化模式的解释力。

Comments 19 pages, 7 tables, 9 figures

详情
AI中文摘要

标准一致性指标通常无法捕捉少数群体和多数群体标注者之间的系统性意见差异,从而危及仇恨言论和毒性检测等任务。极化最近被提出作为一种更稳健的方式来区分微小分歧与系统性意见差异,但现有方法并未提供将其归因于特定标注者群体的实用工具。我们评估了当前方法,并识别出在现实场景中的两个主要局限性:(1)存在无法归因于任何已知或潜在群体的“固有”极化;(2)对立的极化效应在聚合标注中相互抵消。为了解决这些问题,我们引入了一种新的度量方法,能够测量并检验标注者群体极化归因的统计显著性,同时避免这些局限性,并提供了一个开源的Python库实现,发现每个评论只需不超过20个标注者即可实现可靠估计。我们将该方法应用于四个主观NLP数据集,发现性别和种族一致地解释了极化模式,而标注者群体之间的差异随着群体距离增大而增强。

英文摘要

Standard agreement metrics often fail to capture systematic differences in opinion between minority and majority-group annotators, jeopardizing tasks such as hate speech and toxicity detection. Polarization has recently been proposed as a more robust way of distinguishing minor disagreements from systematic differences in opinion, but existing approaches do not provide practical tools for attributing it to specific annotator groups. We evaluate current methods and identify two major limitations in realistic settings: (1) the presence of "inherent" polarization that cannot be attributed to any known or latent groups, and (2) opposing polarization effects canceling each other out in aggregated annotations. To address these issues, we introduce a new metric that measures and tests the statistical significance of polarization attribution for annotator groups while avoiding these limitations, as well as an open-source Python library implementation, finding that no more than 20 annotators are needed per comment for reliable estimation. We apply our method to four subjective NLP datasets and find that gender and race consistently explain polarization patterns, while differences between annotator groups become stronger as the groups are further apart.

2601.20789 2026-06-01 cs.CL cs.LG cs.SE 版本更新

SERA: Soft-Verified Efficient Repository Agents

SERA:软验证的高效仓库智能体

Ethan Shen, Daniel Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers

发表机构 * Allen Institute for AI(艾伦人工智能研究所) University of Washington(华盛顿大学) Carnegie Mellon University(卡内基梅隆大学) Paul G. Allen School of Computer Science and Engineering(保罗·G·艾伦计算机科学与工程学院) Allen Institute of Artificial Intelligence(艾伦人工智能研究所) Machine Learning Department(机器学习系)

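AI summary: Proposes SERA, which uses Soft Verified Generation (SVG) to train coding agents efficiently so that they adapt quickly to private codebases, achieving leading performance among open-source models at very low cost.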

Comments 21 main pages, 6 pages appendix

详情
AI中文摘要

开源编码智能体应比闭源系统具有根本优势,因为它们可以专门化到私有代码库,将仓库特定信息直接编码在其权重中。然而,训练的成本和复杂性一直使这一优势停留在理论层面。我们提出了软验证高效仓库智能体(SERA),一种高效的编码智能体训练方法,能够快速、廉价地创建专门化到私有代码库的智能体。利用软验证生成(SVG),我们可以从任何代码仓库生成数千条轨迹,而无需单元测试。除了仓库专门化,我们将SVG应用于更大的代码库语料库,生成了超过200,000条合成轨迹。仅使用监督微调(SFT),SERA在全开源(开放数据、方法、代码)模型中取得了领先结果,同时匹配了如Devstral-Small-2等开源权重模型的性能。创建SERA模型的成本比强化学习便宜26倍,比先前达到同等性能的合成数据方法便宜57倍。我们利用数据集提供了关于训练编码智能体的缩放定律、消融实验和混淆因素的详细分析。总体而言,我们相信我们的工作将极大加速开源编码智能体的研究,并展示能够适应私有代码库的开源模型的优势。我们将SERA作为Ai2开源编码智能体系列的第一个模型发布,同时公开所有代码、数据和Claude Code集成,以支持研究社区。

英文摘要

Open-weight coding agents should hold a fundamental advantage over closed-source systems because they can specialize to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical until now. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using Soft Verified Generation (SVG), we generate thousands of trajectories from any code repository, without requiring unit tests. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating 200,000+ synthetic trajectories. Using only supervised finetuning (SFT), SERA achieves leading results among fully open-source (open data, method, code) models while matching the performance of open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. We use our dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can adapt to private codebases. We release SERA as the first model in Ai2's Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.

2510.00845 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

作为统计估计的机械可解释性:方差分析

Maxime Méloux, François Portet, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG(格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔INP、实验室LIG)

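AI summary: Examines circuit discovery in mechanistic interpretability as a statistical estimation problem, shows that the intrinsic variance of single-input scores in causal mediation analysis makes circuits unstable, systematically decomposes the sources of variance, and advocates more rigorous practice.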

详情
AI中文摘要

机械可解释性(MI)旨在通过识别功能子网络来逆向工程模型行为。然而,这些发现的科学有效性取决于其稳定性。在这项工作中,我们认为电路发现不是一个独立的任务,而是一个建立在因果中介分析(CMA)基础上的统计估计问题。我们揭示了这一基础层的根本不稳定性:精确的单输入CMA得分表现出高固有方差,这意味着组件的因果效应是一个易变的随机变量,而非固定属性。然后,我们证明电路发现流程继承了这一方差并进一步放大。快速近似方法,如边缘属性修补及其后续方法,引入了额外的估计噪声,而在数据集上聚合这些噪声得分会导致脆弱的结构估计。因此,输入数据或超参数的小扰动会产生截然不同的电路。我们系统地分解了这些方差来源,并倡导更严格的MI实践,优先考虑统计稳健性和稳定性指标的常规报告。

英文摘要

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

2512.19673 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

自底向上策略优化:你的语言模型策略内部隐藏着内部策略

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Chinese Academy of Sciences(中国科学院大学) Tencent AI Lab(腾讯AI实验室)

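AI summary: Decomposes internal layer policies and internal modular policies from the Transformer residual stream and proposes Bottom-up Policy Optimization (BuPO), which rebuilds the LLM's reasoning foundation by optimizing internal layers in early stages; effectiveness is validated on complex reasoning benchmarks.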

Comments Preprint. Our code is available at https://github.com/Trae1ounG/BuPO

详情
AI中文摘要

现有的强化学习方法将大型语言模型(LLM)视为统一策略,忽略了其内部机制。在本文中,我们通过Transformer的残差流将基于LLM的策略分解为内部层策略和内部模块策略。我们对内部策略的熵分析揭示了不同的模式:(1)普遍地,内部策略从早期层的高熵探索演变为顶层层的确定性精炼;(2)Qwen表现出显式的渐进推理结构,与Llama中的突然收敛形成对比。此外,我们发现优化内部层会引发特征精炼,迫使较低层早期捕获高层推理表示。受这些发现启发,我们提出了自底向上策略优化(BuPO),一种新的强化学习范式,通过在早期阶段优化内部层来自底向上重建LLM的推理基础。在复杂推理基准上的大量实验证明了BuPO的有效性。

英文摘要

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's residual stream. Our entropy analysis of internal policy reveals distinct patterns: (1) universally, internal policies evolve from high-entropy exploration in early layers to deterministic refinement in the top layers; and (2) Qwen exhibits an explicit progressive reasoning structure, contrasting with the abrupt convergence in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations early. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM's reasoning foundation from the bottom up by optimizing internal layers in early stages. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of BuPO.
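
The entropy analysis described above can be illustrated with a small NumPy sketch. It assumes, hypothetically, that an internal layer policy is obtained by projecting each layer's residual-stream state through the unembedding matrix and applying a softmax; the abstract does not state the exact construction, and the names `layer_entropies`, `hidden_states`, and `unembed` are illustrative only.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def layer_entropies(hidden_states, unembed):
    """Entropy of the token distribution implied by each layer's hidden state.

    hidden_states: (num_layers, d_model) residual-stream vectors at one position.
    unembed:       (d_model, vocab_size) output projection (assumed shared by
                   all layers; any final normalization is omitted here).
    """
    logits = hidden_states @ unembed          # (num_layers, vocab_size)
    logp = log_softmax(logits)
    return -(np.exp(logp) * logp).sum(axis=-1)

# Toy example: an "early" layer with flat logits and a "top" layer that has
# committed to one token.
unembed = np.eye(4)
hidden_states = np.array([[0.0, 0.0, 0.0, 0.0],
                          [50.0, 0.0, 0.0, 0.0]])
entropies = layer_entropies(hidden_states, unembed)
```

In this toy setup the first entry equals log(4), the maximum for four tokens, and the second is close to zero, mirroring the move from high-entropy exploration to deterministic refinement that the abstract reports.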

2601.13433 2026-06-01 cs.CL cs.LG 版本更新

Who Endorsed It? Measuring Authority Bias Across Expertise Levels in Language Models

谁背书了它?测量语言模型中跨专业水平的权威偏差

Priyanka Mary Mammen, Emil Joswin, Shankar Venkitachalam

发表机构 * UMass Amherst(马萨诸塞大学阿默斯特分校) Independent Research(独立研究)

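AI summary: Studies whether language models show systematic bias on reasoning tasks depending on the expertise level of an endorsement's source. Models are more easily swayed by incorrect endorsements from high-authority sources, with lower accuracy and higher confidence in wrong answers, but the bias can be reduced through mechanistic intervention.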

详情
AI中文摘要

先前研究表明,语言模型在推理任务上的表现可能受到建议、提示和背书的影响。然而,背书来源可信度的影响仍未充分探索。我们调查语言模型是否根据背书提供者的感知专业水平表现出系统性偏差。跨越数学、法律和医学推理的4个数据集,我们使用代表每个领域四个专业水平的角色评估了11个模型。我们的结果表明,随着来源专业水平的增加,模型越来越容易受到错误/误导性背书的影响,更高权威的来源不仅导致准确率下降,还增加了对错误答案的置信度。我们还表明,这种权威偏差在模型内部被机制性地编码,并且模型可以被引导远离偏差,从而即使在专家给出误导性背书时也能提高其性能。

英文摘要

Prior research demonstrates that performance of language models on reasoning tasks can be influenced by suggestions, hints and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect/misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.

2509.24319 2026-06-01 cs.CL cs.AI 版本更新

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models

价值表达的双重机制:大型语言模型中的内在价值与提示价值

Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(数据科学研究生院,首尔国立大学)

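AI summary: Using value vectors and value neurons, the paper shows that intrinsic and prompted value expression in large language models partly share core components while each retains unique functions: the intrinsic mechanism promotes diversity and the prompted mechanism strengthens instruction compliance.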

Comments Accepted at ICML 2026. Project page: https://holi-lab.github.io/ValueMechanism/

详情
AI中文摘要

大型语言模型可以通过两种主要方式表达价值:(1)内在表达,反映模型在训练过程中学习到的固有价值;(2)提示表达,由显式提示引发。鉴于它们在价值对齐中的广泛应用,清楚理解其潜在机制至关重要,特别是它们是否主要重叠(如人们可能预期的)或依赖于不同的机制。我们在机制层面使用两种方法分析这个很大程度上未被充分研究的问题:(1)价值向量,从残差流中提取的代表价值机制的特征方向;(2)价值神经元,对价值向量有贡献的MLP神经元。我们证明内在和提示价值机制部分共享对诱导价值表达至关重要的共同组件,这些组件跨语言泛化并在模型的内部表示中重建理论上的价值间相关性。然而,每种机制也拥有独特的组件,发挥不同的作用。特别是,内在机制在更多样化的价值相关场景中激活并促进响应多样性,而提示机制增强指令遵从,甚至在遥远任务(如越狱)中也能生效。

英文摘要

Large language models can express values in two main ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on distinct mechanisms. We analyze this largely understudied problem at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value vectors. We demonstrate that intrinsic and prompted value mechanisms partly share common components crucial for inducing value expression, generalizing across languages and reconstructing theoretical inter-value correlations in the model's internal representations. Yet, each mechanism also possesses unique components that fulfill distinct roles. In particular, the intrinsic mechanism activates in more diverse value-related scenarios and promotes response diversity, whereas the prompted mechanism strengthens instruction compliance, taking effect even in distant tasks like jailbreaking.

2509.20784 2026-06-01 cs.CL cs.AI 版本更新

Towards Atoms of Large Language Models

迈向大型语言模型的原子

Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China(认知与决策智能复杂系统重点实验室,自动化研究所,中国科学院,北京,中国) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China(人工智能学院,中国科学院大学,北京,中国)

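AI summary: Proposes Atom Theory, which uses the atomic inner product (AIP) to define, evaluate, and identify the fundamental representational units (atoms) of large language models, and proves that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Experiments show that neurons and features do not meet the criteria for ideal atoms, while matching TSAE capacity to data scale yields near-perfect atoms.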

Comments To be published in ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)的基本表示单元(FRUs)尚未定义,这限制了对它们底层机制的进一步理解。在本文中,我们引入原子理论来系统地定义、评估和识别这样的FRUs,我们称之为原子。基于原子内积(AIP),一种捕捉LLM表示底层几何结构的非欧几里得度量,我们正式定义了原子,并提出了理想原子的两个关键标准:忠实性($R^2$)和稳定性($q^*$)。我们进一步证明,在阈值激活稀疏自编码器(TSAEs)下原子是可识别的。在实验上,我们揭示了LLMs中普遍存在的表示偏移,并证明AIP纠正了这种偏移以捕捉底层的表示几何结构。我们发现两个广泛使用的单元——神经元和特征——不符合理想原子的条件:神经元是忠实的($R^2=1$)但不稳定($q^*=0.5\%$),而特征更稳定($q^*=68.2\%$)但不忠实($R^2=48.8\%$)。为了找到LLMs的原子,利用TSAEs下的原子可识别性,我们通过大规模实验表明,只有当TSAE容量与数据规模匹配时,才能实现可靠的原子识别。在此洞察的指导下,我们在Gemma2-2B、Gemma2-9B和Llama3.1-8B的各层中识别出具有近乎完美忠实性($R^2=99.9\%$)和稳定性($q^*=99.8\%$)的FRUs,在统计上满足理想原子的标准。进一步分析证实,这些原子与理论预期一致,并表现出显著更高的单语义性。总体而言,我们提出并验证了原子理论作为理解LLMs内部表示的基础。代码可在https://github.com/ChenhuiHu/towards_atoms获取。

英文摘要

The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a non-Euclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ($R^2$) and stability ($q^*$). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ($R^2=1$) but unstable ($q^*=0.5\%$), while features are more stable ($q^*=68.2\%$) but unfaithful ($R^2=48.8\%$). To find atoms of LLMs, leveraging atom identifiability under TSAEs, we show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ($R^2=99.9\%$) and stability ($q^*=99.8\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, satisfying the criteria of ideal atoms statistically. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs. Code available at https://github.com/ChenhuiHu/towards_atoms.

2505.17595 2026-06-01 cs.LG cs.CL 版本更新

NeUQI: Near-Optimal Uniform Quantization Parameter Initialization for Low-Bit LLMs

NeUQI: 低比特大语言模型的近最优均匀量化参数初始化

Li Lin, Xinyu Hu, Xiaojun Wan

发表机构 * Wangxuan Institute of Computer Technology, Peking University, China(北京大学王选计算机研究所)

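AI summary: Addresses the reliance on the Min-Max formula for parameter initialization in low-bit uniform quantization of LLMs. NeUQI derives the zero-point for a given scale so that only the scale needs to be optimized, efficiently yielding a near-optimal initialization. It outperforms existing methods on the LLaMA and Qwen families, and combined with lightweight distillation it surpasses the resource-intensive PV-tuning.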

Comments accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLM)在跨领域任务中表现出色,但由于高内存消耗和推理成本,在消费级GPU或个人设备(如笔记本电脑)上部署时面临重大挑战。LLM的训练后量化(PTQ)提供了一种有前景的解决方案,可减少内存占用和解码延迟。实践中,均匀量化表示的PTQ因其高效性和易于部署而受到青睐,因为均匀量化被主流硬件和软件库广泛支持。近期关于低比特均匀量化的研究在量化后模型性能上取得了显著改进;然而,这些研究主要关注量化方法,而量化参数的初始化仍未被充分探索,且仍依赖于传统的Min-Max公式。在本工作中,我们识别了Min-Max公式的局限性,突破其约束,提出了NeUQI,一种高效确定均匀量化近最优初始化的方法。我们的NeUQI通过为给定缩放因子推导零点,简化了缩放因子和零点的联合优化,从而将问题简化为仅缩放因子优化。得益于改进的量化参数,我们的NeUQI在LLaMA和Qwen系列的各种设置和任务上的实验中一致优于现有方法。此外,当与轻量蒸馏策略结合时,NeUQI甚至实现了优于PV-tuning(一种资源密集得多的方法)的性能。

英文摘要

Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored due to its efficiency and ease of deployment, as uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on low-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they mainly focus on quantization methodologies, while the initialization of quantization parameters remains underexplored and still relies on the conventional Min-Max formula. In this work, we identify the limitations of the Min-Max formula, move beyond its constraints, and propose NeUQI, a method that efficiently determines near-optimal initialization for uniform quantization. Our NeUQI simplifies the joint optimization of the scale and zero-point by deriving the zero-point for a given scale, thereby reducing the problem to a scale-only optimization. Benefiting from the improved quantization parameters, our NeUQI consistently outperforms existing methods in the experiments with the LLaMA and Qwen families on various settings and tasks. Furthermore, when combined with a lightweight distillation strategy, NeUQI even achieves superior performance to PV-tuning, a considerably more resource-intensive method.
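
To make the contrast with the Min-Max formula concrete, here is a minimal NumPy sketch. `minmax_init` follows the conventional Min-Max initialization; `scale_search` is a hypothetical stand-in for a scale-only search, in which the zero-point for each candidate scale is picked by brute force rather than derived as in NeUQI. All function names and the shrink grid are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def quant_dequant(w, scale, zero_point, bits):
    """Uniformly quantize w to `bits` bits, then map back to floats."""
    qmax = 2 ** bits - 1
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

def minmax_init(w, bits):
    """Conventional Min-Max initialization: stretch the grid over [min, max].

    Assumes w is not constant, so the scale is non-zero.
    """
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    zero_point = int(np.round(-w.min() / scale))
    return float(scale), zero_point

def mse(w, scale, zero_point, bits):
    """Mean squared reconstruction error of the quantized weights."""
    return float(np.mean((w - quant_dequant(w, scale, zero_point, bits)) ** 2))

def scale_search(w, bits, shrink=np.linspace(0.5, 1.0, 11)):
    """Hypothetical scale-only search around the Min-Max scale.

    For each candidate scale, every integer zero-point in [0, qmax] is tried
    and the one with the lowest error is kept (brute force, for illustration).
    """
    qmax = 2 ** bits - 1
    base_scale, _ = minmax_init(w, bits)
    best = None
    for factor in shrink:
        scale = base_scale * float(factor)
        for zero_point in range(qmax + 1):
            err = mse(w, scale, zero_point, bits)
            if best is None or err < best[2]:
                best = (scale, zero_point, err)
    return best  # (scale, zero_point, error)

rng = np.random.default_rng(0)
weights = rng.normal(size=256)
mm_scale, mm_zero = minmax_init(weights, bits=2)
found_scale, found_zero, found_err = scale_search(weights, bits=2)
```

Because the search grid contains the Min-Max scale itself, the error it returns can never exceed the Min-Max error on the same weights; how much it improves depends on the weight distribution.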

2601.19936 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Gap-K%: 通过测量 Top-1 预测差距检测预训练数据

Minseo Kwak, Jaehyung Kim

发表机构 * Yonsei University(延世大学)

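AI summary: Proposes Gap-K%, which uses the log-probability gap between the LLM's top-1 prediction and the target token together with a sliding-window strategy, achieving state-of-the-art pretraining data detection on the WikiMIA and MIMIR benchmarks.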

Comments ACL 2026 Main Conference; 15 pages

详情
AI中文摘要

大型语言模型(LLM)中大规模预训练语料库的不透明性引发了严重的隐私和版权问题,使得预训练数据检测成为一项关键挑战。现有的最先进方法通常依赖于 token 似然,但它们往往忽略了目标 token 与模型 top-1 预测之间的差距,以及相邻 token 之间的局部相关性。在这项工作中,我们提出了 Gap-K%,一种基于 LLM 预训练优化动态的新型预训练数据检测方法。通过分析下一个 token 预测目标,我们观察到模型 top-1 预测与目标 token 之间的差异会引发强烈的梯度信号,这些信号在训练过程中被明确惩罚。受此启发,Gap-K% 利用 top-1 预测 token 与目标 token 之间的对数概率差距,并结合滑动窗口策略来捕获局部相关性并缓解 token 级别的波动。在 WikiMIA 和 MIMIR 基准上的大量实验表明,Gap-K% 实现了最先进的性能,在各种模型大小和输入长度上始终优于先前的基线方法。

英文摘要

The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model's top-1 prediction, as well as local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model's top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.
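
The scoring idea in the abstract can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the gap is taken as the target log probability minus the top-1 log probability, the sliding window is a plain moving average, and the final score averages the lowest K% of windowed gaps in the style of Min-K%-type detectors. The function name `gap_k_score` and the defaults `window=3`, `k=0.2` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def gap_k_score(logits, targets, window=3, k=0.2):
    """Membership score from top-1 vs. target log-probability gaps.

    logits:  (T, V) next-token logits for each position.
    targets: (T,) token ids that actually occur in the text.
    Returns a value <= 0; values near 0 mean the model's top-1 prediction
    almost always matches the text, which suggests the text was seen.
    """
    logp = log_softmax(np.asarray(logits, dtype=np.float64))
    target_lp = logp[np.arange(len(targets)), targets]
    top1_lp = logp.max(axis=-1)
    gaps = target_lp - top1_lp                 # 0 where the target is top-1

    # Moving average over adjacent tokens to smooth token-level fluctuations.
    window = min(window, len(gaps))
    smoothed = np.convolve(gaps, np.ones(window) / window, mode="valid")

    # Average the lowest K% of windows (assumed aggregation rule).
    n = max(1, int(np.ceil(k * len(smoothed))))
    return float(np.sort(smoothed)[:n].mean())

toy_logits = np.array([[5.0, 0.0, 0.0],
                       [0.0, 5.0, 0.0],
                       [0.0, 0.0, 5.0],
                       [5.0, 0.0, 0.0]])
score_match = gap_k_score(toy_logits, np.array([0, 1, 2, 0]))
score_miss = gap_k_score(toy_logits, np.array([1, 2, 0, 1]))
```

In the toy example the first sequence always agrees with the model's top-1 prediction and scores zero, while the second never does and scores lower; a detector would threshold this score to decide membership.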

2510.05115 2026-06-01 cs.AI cs.CL cs.PL 版本更新

SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling

SAC-Opt:优化建模中用于迭代修正的语义锚点

Yansen Zhang, Qingcan Kang, Yujie Chen, Yufei Wang, Xiongwei Han, Tao Zhong, Mingxuan Yuan, Chen Ma

发表机构 * Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China(香港城市大学计算机科学系) Huawei Noah's Ark Lab, Hong Kong SAR, China(华为诺亚实验室(香港)) Huawei's Supply Chain Management Department, Shenzhen, China(华为供应链管理部(深圳))

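AI summary: Proposes the SAC-Opt framework, which uses semantic anchor alignment and selective correction to improve the semantic fidelity of LLM-generated optimization modeling code without additional training, raising average modeling accuracy by 7.7%.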

Comments ICML 2026 accepted

详情
AI中文摘要

大语言模型(LLMs)通过从自然语言描述生成可执行的求解器代码,为优化建模开辟了新范式。尽管前景广阔,现有方法通常仍以求解器驱动:它们依赖单次前向生成,并基于求解器错误信息进行有限的事后修正,这留下了未被检测到的语义错误,这些错误会静默地产生语法正确但逻辑有缺陷的模型。为应对这一挑战,我们提出SAC-Opt,一种反向引导的修正框架,将优化建模建立在问题语义而非求解器反馈之上。在每一步中,SAC-Opt将原始语义锚点与从生成代码中重建的锚点对齐,并仅选择性修正不匹配的组件,从而驱动模型收敛到语义忠实的模型。这种锚点驱动的修正能够对约束和目标逻辑进行细粒度改进,在无需额外训练或监督的情况下增强忠实性和鲁棒性。在七个公共数据集上的实验结果表明,SAC-Opt将平均建模准确率提升了7.7%,在ComplexLP数据集上提升高达21.9%。这些发现强调了在基于LLM的优化工作流中,语义锚点修正对于确保从问题意图到求解器可执行代码的忠实翻译的重要性。

英文摘要

Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain solver-driven: they rely on single-pass forward generation and apply limited post-hoc fixes based on solver error messages, leaving undetected semantic errors that silently produce syntactically correct but logically flawed models. To address this challenge, we propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components, driving convergence toward a semantically faithful model. This anchor-driven correction enables fine-grained refinement of constraint and objective logic, enhancing both fidelity and robustness without requiring additional training or supervision. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.7%, with gains of up to 21.9% on the ComplexLP dataset. These findings highlight the importance of semantic-anchored correction in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code.

2601.09633 2026-06-01 cs.CL 版本更新

TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion

TaxoBell: 用于自监督分类体系扩展的高斯盒嵌入

Sahil Mishra, Srinitish Srinivasan, Srikanta Bedathur, Tanmoy Chakraborty

发表机构 * Indian Institute of Technology Delhi(印度理工学院德里)

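AI summary: Proposes TaxoBell, a Gaussian box embedding framework that maps between box geometries and multivariate Gaussian distributions to address asymmetric relation modeling, unstable gradients, semantic uncertainty, and polysemy in existing methods, improving MRR by 19% and Recall@k by about 25% on five benchmark datasets.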

Comments Accepted in The Web Conference (WWW) 2026

详情
AI中文摘要

分类体系构成了跨领域结构化知识表示的骨干,支持电子商务和语义搜索等应用。然而,手动分类体系扩展既费力又缓慢。现有方法依赖于基于点的向量嵌入,这些嵌入建模对称相似性,因此难以处理分类体系中基本的非对称关系。盒嵌入通过支持包含和不相交性提供了一种有前景的替代方案,但它们面临关键问题:(i) 在交集边界处梯度不稳定,(ii) 没有语义不确定性的概念,(iii) 表示多义性或歧义的能力有限。我们通过TaxoBell解决了这些缺点,这是一个高斯盒嵌入框架,在盒几何与多元高斯分布之间进行转换,其中均值编码语义位置,协方差编码不确定性。基于能量的优化实现了稳定优化、对模糊概念的鲁棒建模以及可解释的层次推理。在五个基准数据集上的大量实验表明,TaxoBell在MRR上显著优于八种最先进的分类体系扩展基线19%,在Recall@k上约25%。我们进一步通过错误分析和消融研究展示了TaxoBell的优势和缺陷。

英文摘要

Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce and semantic search. Yet, manual taxonomy expansion is labor-intensive and slow. Existing methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experiments on five benchmark datasets demonstrate that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.
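
Why Gaussian representations suit asymmetric parent-child relations can be shown with a short NumPy sketch. It uses the standard KL divergence between diagonal Gaussians as an asymmetric score; the abstract does not specify TaxoBell's energy function, so the choice of KL, the diagonal covariance, and the names below are assumptions for illustration only.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for Gaussians with diagonal covariances.

    mu_*:  mean vectors (semantic location).
    var_*: per-dimension variances (uncertainty), all strictly positive.
    """
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    return 0.5 * float(np.sum(
        var_p / var_q                      # trace term
        + (mu_q - mu_p) ** 2 / var_q       # mean-shift term
        - 1.0
        + np.log(var_q / var_p)            # log-determinant ratio
    ))

# A broad "parent" concept and a narrow "child" concept at the same location.
parent_mu, parent_var = [0.0, 0.0], [1.0, 1.0]
child_mu, child_var = [0.0, 0.0], [0.1, 0.1]

child_in_parent = kl_diag_gaussians(child_mu, child_var, parent_mu, parent_var)
parent_in_child = kl_diag_gaussians(parent_mu, parent_var, child_mu, child_var)
```

The score from child to parent is smaller than the reverse, so the direction of the relation is recoverable, which a symmetric similarity between point embeddings cannot express.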

2502.12119 2026-06-01 cs.CV cs.AI cs.CL 版本更新

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

PRISM:免训练多模态数据选择的自剪枝内在选择方法

Jinhe Bi, Aniri, Zengjie Jin, Yifan Wang, Danqi Yan, Wenke Huang, Xiaowen Ma, Sikuan Yan, Artur Hecker, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, Yunpu Ma

发表机构 * LMU Munich(慕尼黑大学) Munich Research Center, Huawei Technologies(慕尼黑研究中心,华为技术) METEOR School of Computer Science, Wuhan University(武汉大学计算机学院) Munich Center for Machine Learning(慕尼黑机器学习中心)

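AI summary: To address redundancy in visual instruction data for multimodal LLMs, proposes PRISM, a training-free framework that uses implicit re-centering to remove the global semantic drift caused by anisotropy in visual features, enabling efficient data selection that lowers computational cost while improving model performance.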

Comments Accepted to ACL 2026 and selected for the Best Paper list; later desk-rejected due to an inadvertent manual bibliography-editing error. Previous versions are withdrawn due to an inadvertent manual bibliography-editing error; please refer to the latest corrected version

详情
AI中文摘要

视觉指令微调使预训练的多模态大语言模型(MLLMs)能够遵循人类指令以应用于现实场景。然而,这些数据集的快速增长引入了显著的冗余,导致计算成本增加。现有的指令数据选择方法旨在修剪这种冗余,但主要依赖于计算密集型技术,如基于代理的推理或基于训练的指标。因此,这些选择过程产生的巨大计算成本往往加剧了它们本应解决的效率瓶颈,对MLLMs的可扩展和有效微调构成了重大挑战。为了解决这一挑战,我们首先发现了一个关键但先前被忽视的因素:视觉特征分布中固有的各向异性。我们发现这种各向异性引发了全局语义漂移,而忽视这一现象是限制当前数据选择方法效率的关键因素。受此启发,我们设计了PRISM,这是第一个用于高效视觉指令选择的免训练框架。PRISM通过隐式重中心化建模内在视觉语义,精确移除全局背景特征的干扰影响。实验表明,PRISM将数据选择和模型微调的端到端时间减少到传统流程的30%。更值得注意的是,它在实现这一效率的同时提升了性能,在八个多模态和三个语言理解基准上超越了在全数据集上微调的模型,最终相对于基线实现了101.7%的相对改进。代码可在 https://github.com/bibisbar/PRISM 获取。

英文摘要

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a Global Semantic Drift, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise PRISM, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7% relative improvement over the baseline. The code is available at https://github.com/bibisbar/PRISM.

2510.15859 2026-06-01 cs.CL cs.AI 版本更新

InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

InfiMed-ORBIT: 通过基于评分标准的增量训练使大语言模型对齐开放复杂任务

Pengkai Wang, Pengwei Liu, Qi Zuo, Zhijie Sang, Congkai Xie, Hongxia Yang

发表机构 * Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学计算机系) Department of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院)

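AI summary: Proposes the ORBIT framework, which uses dynamically generated case-conditioned rubrics to guide incremental reinforcement learning. With only 2k samples it raises Qwen3-4B-Instruct's HealthBench-Hard score from 7.0 to 27.5, the best among open-source models of similar size.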

详情
AI中文摘要

强化学习(RL)推动了大语言模型(LLM)的许多近期突破,尤其是在奖励可自动计算的任务(如代码生成)中。然而,在开放式的医学对话中,RL效果较差,因为反馈模糊、依赖上下文,且难以简单总结为单一标量信号——通常需要高度监督的奖励模型,并存在奖励破解的风险。因此,我们引入了ORBIT,一个专为关键医学对话设计的基于评分标准的开放式增量训练框架。ORBIT将医学对话构建与动态生成的病例条件评分标准相结合,这些评分标准作为增量RL的自适应指南。与依赖外部医学知识库或手工规则的方法不同,ORBIT使用评分标准引导的评估,并可与通用指令遵循LLM一起实现,避免了任务特定的评判微调。仅使用2k训练样本,ORBIT将Qwen3-4B-Instruct的HealthBench-Hard得分从7.0提升至27.5,在相似规模的开源模型中实现了最先进的性能,同时随着评分标准覆盖范围的扩大,保持了良好的咨询质量。

英文摘要

Reinforcement learning (RL) has powered many recent breakthroughs in large language models (LLMs), especially for tasks where rewards can be computed automatically, such as code generation. However, it is less effective in open-ended medical dialogue, where feedback is ambiguous, context-dependent, and difficult to summarize as a single scalar signal, often requiring heavily supervised reward models and creating risks of reward hacking. Thus, we introduce ORBIT, an open-ended rubric-based incremental training framework tailored for critical medical dialogues. ORBIT integrates medical dialogue construction with dynamically generated case-conditioned rubrics that serve as adaptive guides for incremental RL. Unlike approaches that rely on external medical knowledge bases or handcrafted rules, ORBIT uses rubric-guided evaluation and can be implemented with general-purpose instruction-following LLMs, avoiding task-specific judge fine-tuning. With only 2k training samples, ORBIT raises Qwen3-4B-Instruct's HealthBench-Hard score from 7.0 to 27.5, achieving state-of-the-art performance among similarly sized open-source models while maintaining strong consultation quality as rubric coverage broadens.

2511.19923 2026-06-01 cs.CV cs.CL 版本更新

Distilling Counterfactual Reasoning from Language to Vision: Causal Graph Guided Post-Training for Video Understanding

从语言到视觉的反事实推理蒸馏:因果图引导的视频理解后训练

Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang

发表机构 * Rutgers University(新泽西罗格斯大学)

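AI summary: To address the weakness of vision-language models in counterfactual reasoning, proposes the CounterVQA benchmark and the CFGPT post-training method, which improves video understanding by distilling counterfactual reasoning ability from the language modality.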

详情
AI中文摘要

视觉语言模型(VLM)最近在视频理解方面取得了显著进展,特别是在特征对齐、事件推理和指令遵循任务中。然而,它们在反事实推理(即在假设条件下推断替代结果)方面的能力仍未得到充分探索。这种能力对于鲁棒的视频理解至关重要,因为它需要识别潜在的因果结构并推理未观察到的可能性,而不仅仅是识别观察到的模式。为了系统评估这一能力,我们引入了CounterVQA,一个基于视频的基准测试,具有三个渐进难度级别,评估反事实推理的不同方面。通过对最先进的开源和闭源模型的全面评估,我们发现了一个显著的性能差距:虽然这些模型在简单的反事实问题上达到了合理的准确性,但在复杂的多跳因果链上性能显著下降。为了解决这些限制,我们开发了一种后训练方法CFGPT,通过从语言模态蒸馏其反事实推理能力来增强模型的视觉反事实推理能力,在CounterVQA的所有难度级别上均取得了一致的改进。数据集和代码将后续发布。

英文摘要

Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning (inferring alternative outcomes under hypothetical conditions) remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released later.

2511.17826 2026-06-01 cs.LG cs.CL stat.ML 版本更新

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

跨张量并行大小的确定性推理,消除训练-推理不匹配

Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, Zirui Liu

发表机构 * Independent Researcher(独立研究者) University of Minnesota, Minneapolis, Minnesota, USA(明尼苏达大学) Rice University, Houston, Texas, USA(莱斯大学) NVIDIA Corp., Santa Clara, California, USA(NVIDIA公司)

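AI summary: Different tensor-parallel sizes cause nondeterministic inference because floating-point arithmetic is non-associative. The paper proposes Tree-Based Invariant Kernels (TBIK), which produce bit-wise identical results across TP sizes and eliminate the precision mismatch between inference and training engines in RL training.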

详情
AI中文摘要

确定性推理对于大型语言模型(LLM)应用(如LLM-as-a-judge评估、多智能体系统和强化学习(RL))日益关键。然而,现有的LLM服务框架表现出非确定性行为:当系统配置(例如张量并行(TP)大小、批大小)变化时,即使采用贪心解码,相同的输入也可能产生不同的输出。这是由于浮点运算的非结合性以及GPU间归约顺序不一致导致的。虽然先前的工作通过批不变核解决了与批大小相关的非确定性,但跨不同TP大小的确定性仍然是一个开放问题,特别是在RL设置中,训练引擎通常使用全分片数据并行(即TP=1),而部署引擎依赖多GPU TP以最大化推理吞吐量,从而在两者之间产生自然的不匹配。这种精度不匹配问题可能导致RL训练性能次优甚至崩溃。我们识别并分析了TP引起不一致的根本原因,并提出了基于树的核(TBIK),这是一组TP不变的矩阵乘法和归约原语,无论TP大小如何,都能保证比特级相同的结果。我们的关键见解是通过统一的层次二叉树结构对齐GPU内和GPU间的归约顺序。我们在Triton中实现了这些核,并将其集成到vLLM和FSDP中。实验证明,在不同TP大小下,确定性推理的概率发散为零,且具有比特级可重复性。此外,在采用不同并行策略的RL训练流程中,我们在vLLM和FSDP之间实现了比特级相同的结果。代码可在https://github.com/nanomaoli/llm_reproducibility获取。

英文摘要

Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize the inference throughput, creating a natural mismatch between the two. This precision mismatch problem may lead to suboptimal performance or even collapse for RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. In addition, we achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategies. Code is available at https://github.com/nanomaoli/llm_reproducibility.
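
The root cause and the tree-alignment idea can be reproduced on the CPU with a few lines of NumPy. This is only a sketch of the principle, not the Triton kernels from the paper: `tree_sum` and `sharded_tree_sum` are hypothetical helpers, shards stand in for GPUs, and both the vector length and the shard count are assumed to be powers of two.

```python
import numpy as np

# Floating-point addition is not associative, so the reduction order matters.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left

def tree_sum(values):
    """Sum float32 values with a fixed balanced binary tree.

    Each pass adds adjacent pairs, which is one level of the tree.
    The length must be a power of two.
    """
    vals = np.asarray(values, dtype=np.float32)
    while len(vals) > 1:
        vals = vals[0::2] + vals[1::2]
    return vals[0]

def sharded_tree_sum(values, tp_size):
    """Reduce each contiguous shard with the tree, then combine the partial
    sums with the same tree, mimicking intra- and inter-device reduction."""
    shards = np.split(np.asarray(values, dtype=np.float32), tp_size)
    partials = np.array([tree_sum(s) for s in shards], dtype=np.float32)
    return tree_sum(partials)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
results = {tp: sharded_tree_sum(x, tp) for tp in (1, 2, 4, 8)}
```

Because every shard covers a complete subtree of the global tree, the sharded reduction performs exactly the same additions in the same order for each `tp_size`, so the four results share one bit pattern. A naive left-to-right sum per shard carries no such guarantee.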

2509.12440 2026-06-01 cs.CL cs.AI 版本更新

MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

MedFact:大型语言模型在中文医学文本上的事实核查能力基准测试

Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai

发表机构 * Xunfei Healthcare Technology Co., Ltd.(讯飞医疗科技有限公司)

AI总结 为评估LLM在中文医学文本中的事实核查能力,构建了包含2116个专家标注实例的MedFact基准,涵盖13个专科、8种错误类型等,并发现模型在错误定位上表现不足,存在“过度批评”现象。

Comments Accepted to The Fifth Workshop on Generation, Evaluation, and Metrics (GEM) at ACL 2026

详情
AI中文摘要

在医疗应用中部署大型语言模型(LLM)需要具备事实核查能力,以确保患者安全和法规合规。我们引入了MedFact,一个具有挑战性的中文医学事实核查基准,包含来自多样化真实文本的2,116个专家标注实例,涵盖13个专科、8种错误类型、4种写作风格和5个难度级别。构建采用混合AI-人类框架,其中迭代的专家反馈优化AI驱动的多标准过滤,以确保高质量和难度。我们评估了20个领先的LLM在真实性分类和错误定位方面的表现,结果显示模型通常能判断文本是否包含错误,但难以精确定位错误,顶级模型的表现仍不及人类。我们的分析揭示了“过度批评”现象,即模型倾向于将正确信息误判为错误,而高级推理技术(如多智能体协作和推理时扩展)可能加剧这一问题。MedFact突显了部署医疗LLM的挑战,并为开发事实可靠的医疗AI系统提供了资源。

英文摘要

Deploying Large Language Models (LLMs) in medical applications requires fact-checking capabilities to ensure patient safety and regulatory compliance. We introduce MedFact, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework where iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization, and results show models often determine if text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals the "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which can be exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources to develop factually reliable medical AI systems.

2503.22996 2026-06-01 cs.CL 版本更新

Rethinking Sparse Mixture of Experts from a Unified Perspective

从统一视角重新思考稀疏混合专家模型

Giang Do, Hung Le, Truyen Tran

发表机构 * Applied Artificial Intelligence Initiative (A2I2), Deakin University, Victoria, Australia(应用人工智能倡议(A2I2),迪肯大学,澳大利亚维多利亚州)

AI总结 针对稀疏混合专家模型中固定预算导致无关选择或遗漏关键分配的问题,提出基于线性规划的统一框架USMoE,通过统一机制和评分实现灵活专家选择,提升性能并降低推理成本。

Comments 35 pages

详情
Journal ref
ICML 2026
AI中文摘要

稀疏混合专家(SMoE)模型在保持恒定计算开销的同时扩展了模型容量。SMoE方法分为两类:Token Choice(将每个令牌路由到固定数量的专家)和Expert Choice(为每个专家分配固定数量的令牌)。然而,对令牌或专家使用固定预算导致两种方法都会选择不相关的令牌-专家对或忽略关键分配,从而降低整体性能。为填补这一空白,我们通过线性规划的视角从统一角度重新思考SMoE,为SMoE模型提供了通用公式。此外,我们引入了统一稀疏混合专家(USMoE),这是一个包含统一机制和统一评分的新框架,以克服这些限制。我们提供了理论证明和实证证据,展示了USMoE的有效性。在多种数据设置(干净和损坏)、多个领域(包括文本和视觉任务)以及不同学习方法(无训练和基于训练)上的广泛评估表明,USMoE不仅比现有SMoE方法带来了显著的性能提升,而且实现了更灵活的专家选择预算,在不影响模型性能的情况下降低了推理成本。我们的实现已在https://github.com/giangdip2410/USMoE公开。

英文摘要

Sparse Mixture of Experts (SMoE) models scale the capacity of models while maintaining constant computational overhead. SMoE methods fall into two categories: Token Choice, which routes each token to a fixed number of experts, and Expert Choice, which assigns a fixed number of tokens to each expert. However, the use of fixed budgets for tokens or experts causes both approaches to select irrelevant token-expert pairs or overlook critical assignments, which degrades overall performance. To fill that gap, we rethink SMoE from a unified perspective through the lens of linear programming, which provides a general formulation for SMoE models. Furthermore, we introduce Unified Sparse Mixture of Experts (USMoE), a novel framework comprising a unified mechanism and a unified score to overcome these limitations. We provide both theoretical justification and empirical evidence demonstrating USMoE's effectiveness. Extensive evaluations across diverse data settings (clean and corrupted), multiple domains (including texts and vision tasks), and different learning approaches (training-free and training-based) show that USMoE not only delivers significant performance improvements over existing SMoE methods, but also enables more flexible expert selection budgets, reducing inference costs without compromising model performance. Our implementation is publicly available at https://github.com/giangdip2410/USMoE.
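The two routing families named in this abstract can be contrasted on a toy score matrix. This is a sketch under stated assumptions, not the USMoE method (whose linear-programming formulation the abstract does not spell out): the function names, the score values, and the budgets `k` and `capacity` are all hypothetical.

```python
def token_choice(scores, k):
    """Each token (row) picks its top-k experts (columns)."""
    pairs = set()
    for t, row in enumerate(scores):
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        pairs.update((t, e) for e in top)
    return pairs


def expert_choice(scores, capacity):
    """Each expert (column) picks its top-`capacity` tokens (rows)."""
    pairs = set()
    num_experts = len(scores[0])
    for e in range(num_experts):
        top = sorted(range(len(scores)), key=lambda t: scores[t][e], reverse=True)[:capacity]
        pairs.update((t, e) for t in top)
    return pairs


# Rows are tokens, columns are experts; entries are router affinities.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]

print(sorted(token_choice(scores, k=1)))
# [(0, 0), (1, 0), (2, 0), (3, 1)]
print(sorted(expert_choice(scores, capacity=2)))
# [(0, 0), (1, 0), (2, 1), (3, 1)]
```

With a fixed per-token budget, expert 0 receives three of the four tokens; with a fixed per-expert budget, token 2 is sent to expert 1 despite its low score of 0.3. These are small instances of the imbalance and irrelevant-pair problems that the abstract attributes to fixed budgets.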

2510.20853 2026-06-01 eess.AS cs.CL cs.SD 版本更新

Beyond Hearing: Learning Task-Agnostic ExG Representations from Earphones via Physiology-Informed Tokenization

超越听觉:通过生理学启发的标记化从耳机学习任务无关的ExG表示

Hyungjun Yoon, Seungjoo Lee, Yu Yvonne Wu, Xiaomeng Chen, Taiting Lu, Freddy Yifei Liu, Taeckyung Lee, Hyeongheon Cha, Haochen Zhao, Gaoteng Zhao, Dongyao Chen, Cecilia Mascolo, Sung-Ju Lee, Lili Qiu

发表机构 * KAIST(韩国科学技术院) Carnegie Mellon University(卡内基梅隆大学) University of Cambridge(剑桥大学) Shanghai Jiao Tong University(上海交通大学) Pennsylvania State University(宾夕法尼亚州立大学) UCLA(加州大学洛杉矶分校) Northwest University(西北大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) Microsoft Research(微软研究院)

AI总结 提出一种基于耳机的生理学启发的多频带标记化方法(PiMT),通过无干扰的日常ExG数据采集和重建任务学习鲁棒表示,实现跨多种任务(包括五种人类感官)的通用ExG监测。

Comments Accepted to ICLR 2026

详情
AI中文摘要

电生理(ExG)信号为人类生理学提供了有价值的见解,但由于两个关键限制,构建能够泛化到日常任务的基础模型仍然具有挑战性:(i)数据多样性不足,因为大多数ExG记录是在受控实验室中使用笨重、昂贵的设备收集的;以及(ii)任务特定的模型设计需要定制的处理(即目标频率滤波器)和架构,这限制了跨任务的泛化。为了解决这些挑战,我们引入了一种可扩展的、任务无关的野外ExG监测方法。我们使用基于耳机的硬件原型收集了50小时的无干扰自由生活ExG数据,以缩小数据多样性差距。我们方法的核心是生理学启发的多频带标记化(PiMT),它将ExG信号分解为12个生理学启发的标记,然后通过重建任务学习鲁棒的表示。这使得能够在全频谱范围内进行自适应特征识别,同时捕获任务相关信息。在我们新的DailySense数据集(第一个支持基于ExG的五种人类感官分析的数据集)以及四个公共ExG基准上的实验表明,PiMT在多种任务上始终优于最先进的方法。

英文摘要

Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i) insufficient data diversity, as most ExG recordings are collected in controlled labs with bulky, expensive devices; and (ii) task-specific model designs that require tailored processing (i.e., targeted frequency filters) and architectures, which limit generalization across tasks. To address these challenges, we introduce an approach for scalable, task-agnostic ExG monitoring in the wild. We collected 50 hours of unobtrusive free-living ExG data with an earphone-based hardware prototype to narrow the data diversity gap. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens, followed by a reconstruction task to learn robust representations. This enables adaptive feature recognition across the full frequency spectrum while capturing task-relevant information. Experiments on our new DailySense dataset, the first to enable ExG-based analysis across five human senses, together with four public ExG benchmarks, demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.

2503.05846 2026-06-01 cs.CL cs.AI 版本更新

EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

EMCEE:通过提取合成多语言上下文桥接知识与推理以提升大语言模型的多语言能力

Hamin Koo, Jaehyung Kim

发表机构 * Yonsei University(延世大学)

AI总结 提出EMCEE框架,通过从LLM自身提取并融合语言特定知识,结合推理输出,显著提升多语言任务性能,尤其在低资源语言上平均提升31.7%。

Comments ACL 2026 Main

详情
AI中文摘要

大语言模型(LLMs)在广泛任务中取得了显著进展,但其对以英语为中心的训练数据的严重依赖导致在非英语语言中性能大幅下降。虽然现有的多语言提示方法强调将查询重新表述为英语或增强推理能力,但它们往往未能融入对某些查询至关重要的语言和文化特定基础。为了解决这一局限性,我们提出了EMCEE(提取合成多语言上下文并合并),一个简单而有效的框架,通过从LLM自身显式提取和利用查询相关知识来增强其多语言能力。具体来说,EMCEE首先提取合成上下文以揭示LLM中编码的潜在语言特定知识,然后通过基于判断的选择机制动态地将这种上下文见解与面向推理的输出合并。在涵盖多种语言和任务的四个多语言基准上的大量实验表明,EMCEE始终优于先前的方法,总体平均相对提升16.4%,在低资源语言中提升31.7%。

英文摘要

Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. While existing multilingual prompting methods emphasize reformulating queries into English or enhancing reasoning capabilities, they often fail to incorporate the language- and culture-specific grounding that is essential for some queries. To address this limitation, we propose EMCEE (Extracting synthetic Multilingual Context and merging), a simple yet effective framework that enhances the multilingual capabilities of LLMs by explicitly extracting and utilizing query-relevant knowledge from the LLM itself. In particular, EMCEE first extracts synthetic context to uncover latent, language-specific knowledge encoded within the LLM, and then dynamically merges this contextual insight with reasoning-oriented outputs through a judgment-based selection mechanism. Extensive experiments on four multilingual benchmarks covering diverse languages and tasks demonstrate that EMCEE consistently outperforms prior approaches, achieving an average relative improvement of 16.4% overall and 31.7% in low-resource languages.

2510.11683 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

边界引导策略优化:面向扩散大语言模型的内存高效强化学习

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

发表机构 * Tsinghua University(清华大学)

AI总结 针对扩散大语言模型中似然函数难以处理导致强化学习内存开销大的问题,提出边界引导策略优化(BGPO),通过构造满足线性和等价性的下界实现内存高效训练,在数学求解、代码生成和规划任务中显著优于现有方法。

详情
AI中文摘要

将强化学习(RL)应用于扩散大语言模型(dLLMs)的一个关键挑战是其似然函数的难解性,而似然函数对于RL目标至关重要,因此在训练过程中需要相应的近似。现有方法通过自定义蒙特卡洛(MC)采样,利用证据下界(ELBO)近似对数似然,但由于需要保留所有MC样本用于RL目标中非线性项的梯度计算,导致显著的内存开销,从而限制了可行的样本量,导致似然近似不精确和RL目标失真。为了解决这个问题,我们提出了边界引导策略优化(BGPO),一种内存高效的RL算法,它最大化基于ELBO的目标的一个特殊构造的下界。该下界经过精心设计,满足两个关键性质:(1)线性:它是一个线性求和,其中每一项仅依赖于单个MC样本,从而能够跨样本进行梯度累积并确保恒定的内存使用;(2)等价性:在在线策略训练中,该下界的值和梯度与基于ELBO的目标相等,因此它也是对原始RL目标的有效近似。这些性质使得BGPO能够采用大的MC样本量,改进似然近似和RL目标估计,从而带来性能提升。实验表明,BGPO在数学问题求解、代码生成和规划任务中显著优于先前的dLLMs RL算法。我们的代码和模型可在https://github.com/THU-KEG/BGPO获取。

英文摘要

A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation during training. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, they incur significant memory overhead due to the need to retain all MC samples for the gradient computation of non-linear terms in the RL objective, and thus restrict feasible sample sizes, leading to imprecise likelihood approximations and a distorted RL objective. To address this, we propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, improving likelihood approximations and RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks. Our code and models are available at https://github.com/THU-KEG/BGPO.

2505.17607 2026-06-01 cs.AI cs.CL 版本更新

Symbolic Intermediaries as a Linguistic-Numerical Interface for LLM-Driven Geometric Reasoning

符号中介作为LLM驱动几何推理的语言-数值接口

João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas

发表机构 * Idiap Research Institute(Idiap研究所) École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) Honda Research Institute Europe(本田欧洲研究院) Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) National Biomarker Centre, CRUK-MI, University of Manchester(曼彻斯特大学国家生物标记中心)

AI总结 提出符号中介作为连接物理模拟器数值输出与语言模型推理的接口,通过符号回归将连续数值转化为符号表达式,并在协同优化循环中提升几何推理性能。

Comments 33 pages, 18 figures

详情
AI中文摘要

大型语言模型(LLM)在语言和符号对象上展示出推理能力,但直接解释物理模拟器的连续数值输出(例如距离、曲率和轨迹)的能力有限,这些输出难以进行离散分词。在从机构设计到运动规划等空间基础的工程推理任务中,这定义了一个根本性的差距,限制了LLM在更广泛几何领域(例如与物理模拟器接口)的应用。我们提出符号中介,即通过符号回归发现的紧凑解析表达式,作为一种结构化接口,将模拟器的数值轨迹转换为符号形式,语言模型可以解释、比较和批评,同时保留原始几何语义。围绕这个接口,我们构建了一个智能体协调与优化循环:设计智能体将自然语言规范映射为可执行模拟代码,批评智能体基于共享符号词汇进行推理,修订步骤将此反馈转化为基于基础的优化决策,从而实现无需参数更新的推理时泛化。在平面机构综合的MSynth基准上,所有三个评估的LLM智能体比预算匹配的遗传算法基线高出19-53%(带反馈时中位误差降低高达63%),对三种模型架构的批评条目分析表明,该接口将推理从通用结构评论转向基于基础的几何验证。将连续模拟输出转换为符号形式的原理可推广到任何需要以语言方式解释模拟器行为的领域。

英文摘要

Large Language Models (LLMs) display reasoning capabilities over linguistic and symbolic objects but have limited capabilities to directly interpret the continuous numerical outputs of physics simulators, e.g., distances, curvatures, and trajectories that resist discrete tokenisation. Across spatially grounded engineering reasoning tasks, from mechanism design to motion planning, this defines a fundamental gap, which limits the wider application of LLMs within broader geometrical domains, for example when interfacing with physics simulators. We propose symbolic intermediaries, compact analytical expressions discovered via symbolic regression, as a structured interface that translates a simulator's numerical traces into a symbolic form, which language models can interpret, compare, and critique while preserving the original geometric semantics. Around this interface we build an agentic coordination-and-refinement loop: a design agent maps natural-language specifications to executable simulation code, a critique agent reasons over the shared symbolic vocabulary, and a revision step turns this feedback into grounded refinement decisions, enabling inference-time generalization without parameter updates. On the MSynth benchmark for planar mechanism synthesis, all three evaluated LLM agents outperform a budget-matched genetic-algorithm baseline by 19-53% (up to 63% lower median error with feedback), and analysis of the critique entries across three model architectures shows that the interface shifts reasoning from generic structural commentary to grounded geometric verification. The principle of translating continuous simulation outputs into symbolic forms generalises to any domain where simulator behaviour must be interpreted linguistically.

2510.03415 2026-06-01 cs.PL cs.AI cs.CL cs.SE 版本更新

LLMs Lean on Priors, Not Programming Language Semantics

LLMs 依赖先验而非编程语言语义

Aditya Thimmaiah, Jiyang Zhang, Jayanth Srinivasa, Junyi Jessy Li, Milos Gligoric

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Cisco Research(思科研究)

AI总结 通过 PLSemanticsBench 基准测试,发现前沿大语言模型在程序执行任务中依赖预训练统计规律而非形式语义规则,语义变异和结构复杂度导致准确率大幅下降。

Comments Accepted at ICML 2026

详情
AI中文摘要

近期工作探究大语言模型(LLMs)是否基于显式规则而非预训练统计规律进行推理。程序执行提供了一个典型实例:形式语义通过符号转换规则定义行为,这些规则在分布偏移下可被系统性改变。我们研究 LLMs 能否通过程序执行基于形式语义进行推理,并引入 PLSemanticsBench,将轻量级 C 程序与两种语义系统(小步操作语义和 K 语义)配对,探测四种能力:组合规则得到最终状态、状态未变时选择规则、在长轨迹上维持这种条件推理、以及在新语义下遵循提供的规则。为解耦语义推理与语法熟悉度,我们重新定义熟悉运算符以引发符号-含义冲突,并引入仅通过提供规则定义的新符号,同时在人类编写、LLM 翻译和模糊生成的分割上以递增的结构复杂度进行压力测试。在 11 个前沿 LLM 上,标准语义下的最终状态准确率(高达 90%)在语义变异和结构复杂度增加时急剧下降,降幅达 40-60 个百分点。仅少数模型实现了非零的长程条件推理准确率,即使最佳系统也仅达到 35%。这些结果表明,当代 LLMs 往往依赖预训练的词汇关联,而非系统地基于提供的正式规则进行推理。PLSemanticsBench 公开于 https://EngineeringSoftware.github.io/PLSemanticsBench。

英文摘要

Recent work asks whether large language models (LLMs) condition their reasoning on explicit rules rather than statistical regularities from pretraining. Program execution provides a canonical instance: formal semantics define behavior through symbolic transition rules that can be systematically altered under distribution shift. We investigate whether LLMs can condition their reasoning on formal semantics through program execution and introduce PLSemanticsBench, pairing featherweight C programs with two semantic systems (small-step operational semantics and K semantics) and probing four capabilities: composing rules for final states, selecting rules when state is unmutated, sustaining such conditioning over long traces, and following supplied rules under novel semantics. To decouple semantic reasoning from syntactic familiarity, we redefine familiar operators to induce symbol-meaning conflict and introduce novel symbols defined only through the supplied rules, and stress-test models on Human-Written, LLM-Translated, and Fuzzer-Generated splits with increasing structural complexity. Across 11 frontier LLMs, strong final-state accuracy under standard semantics (up to 90%) drops sharply, by as much as 40-60 percentage points, under semantic mutations and increasing structural complexity. Only a handful of models achieve non-zero long-horizon conditioning accuracy, and even the best systems reach just 35%. Together, these results suggest that contemporary LLMs often rely on pretrained lexical associations rather than systematically conditioning on supplied formal rules. PLSemanticsBench is publicly available at https://EngineeringSoftware.github.io/PLSemanticsBench.
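The notion of redefining a familiar operator so that the symbol and its meaning conflict can be made concrete with a toy evaluator. This is a hypothetical sketch, not the benchmark's rule format: the abstract does not list the actual mutations, the rule tables below are invented for illustration, and the evaluator is a simple recursive (big-step) one rather than the small-step or K semantics the paper uses.

```python
import operator

# Rule tables map operator symbols to their meaning.
STANDARD_RULES = {"+": operator.add, "-": operator.sub, "*": operator.mul}
# Hypothetical mutation: "+" now means subtraction; other rules are unchanged.
MUTATED_RULES = {**STANDARD_RULES, "+": operator.sub}


def evaluate(expr, rules):
    """Evaluate a nested (op, left, right) tuple strictly according to `rules`."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return rules[op](evaluate(left, rules), evaluate(right, rules))


program = ("+", 2, ("*", 3, 4))  # written "2 + 3 * 4"

print(evaluate(program, STANDARD_RULES))  # 14
print(evaluate(program, MUTATED_RULES))   # -10
```

An executor that follows the supplied rules returns -10 for the mutated table, whereas one that relies on the familiar meaning of `+` would still answer 14. That gap is the kind of prior-versus-rule conflict the abstract describes.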

2509.14221 2026-06-01 cs.IR cs.CL 版本更新

GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing

GEM-Bench:生成引擎营销中广告注入响应生成的基准测试

Silan Hu, Shiqi Zhang, Yimin Shi, Xiaokui Xiao

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 针对生成引擎营销中广告注入响应生成缺乏专门基准的问题,提出首个综合基准GEM-Bench,包含多场景数据集、多维度指标和基线方案,实验表明简单提示方法虽能提升点击率但降低用户满意度,而基于预生成无广告响应的插入方法可缓解此问题但增加开销。

Comments Technical Report

详情
AI中文摘要

生成引擎营销(GEM)是一种新兴的生态系统,通过将相关广告无缝集成到生成引擎(如基于LLM的聊天机器人)的响应中来实现货币化。GEM的核心在于广告注入响应的生成与评估。然而,现有基准并非专门为此设计,限制了未来的研究。为填补这一空白,我们提出了GEM-Bench,这是首个针对GEM中广告注入响应生成的综合基准。GEM-Bench包含三个精心策划的数据集,涵盖聊天机器人和搜索场景;一个捕获用户满意度和参与度多个维度的指标本体;以及在一个可扩展的多智能体框架内实现的多个基线解决方案。我们的初步结果表明,虽然简单的基于提示的方法能实现合理的参与度(如点击率),但往往降低用户满意度。相比之下,基于预生成的无广告响应插入广告的方法有助于缓解这一问题,但引入了额外开销。这些发现凸显了未来需要研究设计更有效、更高效的GEM广告注入响应生成解决方案。该基准及所有相关资源已在https://gem-bench.org/公开。

英文摘要

Generative Engine Marketing (GEM) is an emerging ecosystem for monetizing generative engines, such as LLM-based chatbots, by seamlessly integrating relevant advertisements into their responses. At the core of GEM lies the generation and evaluation of ad-injected responses. However, existing benchmarks are not specifically designed for this purpose, which limits future research. To address this gap, we propose GEM-Bench, the first comprehensive benchmark for ad-injected response generation in GEM. GEM-Bench includes three curated datasets covering both chatbot and search scenarios, a metric ontology that captures multiple dimensions of user satisfaction and engagement, and several baseline solutions implemented within an extensible multi-agent framework. Our preliminary results indicate that, while simple prompt-based methods achieve reasonable engagement such as click-through rate, they often reduce user satisfaction. In contrast, approaches that insert ads based on pre-generated ad-free responses help mitigate this issue but introduce additional overhead. These findings highlight the need for future research on designing more effective and efficient solutions for generating ad-injected responses in GEM. The benchmark and all related resources are publicly available at https://gem-bench.org/.

2510.02919 2026-06-01 cs.CL 版本更新

Self-Reflective Generation at Test Time

测试时的自反生成

Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Nanyang Technological University(南洋理工大学) University of Edinburgh(爱丁堡大学) City University of Hong Kong(香港城市大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出SRGen框架,通过动态熵阈值识别高不确定性token并训练校正向量,在测试时进行自反生成以纠正概率分布,提升大模型推理的可靠性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地通过长思维链解决复杂推理任务,但其仅前向的自回归生成过程是脆弱的;早期token错误可能级联,这明确需要自反机制。然而,现有的自反要么对完整草稿进行修订,要么通过昂贵的训练学习自我修正,两者本质上都是反应性的且低效。为解决此问题,我们提出测试时的自反生成(SRGen),一种轻量级测试时框架,在不确定点生成前进行反思。在token生成过程中,SRGen利用动态熵阈值识别高不确定性token。对于每个识别的token,它训练一个特定的校正向量,充分利用已生成的上下文进行自反生成以纠正token概率分布。通过回顾性分析部分输出,这种自反使得决策更可靠,从而显著降低高不确定点错误的概率。在具有挑战性的数学推理基准和多种LLM上的评估表明,SRGen能显著增强模型推理能力。此外,我们的发现将SRGen定位为一种即插即用的方法,将反思融入生成过程以实现可靠的LLM推理,在有限开销下取得一致收益,并可与其他训练时(如RLHF)和测试时(如SLOT)技术结合。

英文摘要

Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both of which are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can significantly strengthen model reasoning. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning: it achieves consistent gains with bounded overhead and can be combined with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
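The step of flagging high-uncertainty tokens by entropy can be sketched as follows. The abstract does not specify SRGen's dynamic threshold, so the rule used here (flag a token when its entropy exceeds the mean plus `k` standard deviations of a sliding window of recent entropies) is purely an assumption for illustration, as are the function names and the `window` and `k` parameters. The corrective-vector training step is not shown.

```python
import math
from statistics import mean, pstdev


def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain(entropies, window=4, k=2.0):
    """Flag positions whose entropy exceeds a threshold computed from the
    preceding `window` entropies (assumed rule: mean + k * std)."""
    flags = []
    for t, h in enumerate(entropies):
        history = entropies[max(0, t - window):t]
        if len(history) < 2:
            flags.append(False)  # not enough context to set a threshold
            continue
        threshold = mean(history) + k * pstdev(history)
        flags.append(h > threshold)
    return flags


# Per-step entropies of a hypothetical generation; step 4 is a clear spike.
steps = [0.10, 0.12, 0.11, 0.09, 1.30, 0.10]
print(flag_uncertain(steps))
# [False, False, False, False, True, False]
```

Because the threshold adapts to the recent history, the same absolute entropy may be flagged in a confident stretch of text and ignored in a stretch that is uniformly uncertain.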

2509.23195 2026-06-01 cs.CL q-bio.NC 版本更新

The relative strength of hierarchical structure and statistics differs across the measures in naturalistic reading

自然阅读中层级结构与统计的相对强度因测量指标而异

Nan Wang, Hanlin Wu, Jiaxuan Li

发表机构 * Department of Brain and Cognitive Sciences, University of Rochester(罗切斯特大学脑科学与认知科学系) Department of Linguistics and Modern Languages, the Chinese University of Hong Kong(香港中文大学语言学与现代语言系) Department of Language Science, University of California Irvine(加州大学欧文分校语言科学系)

AI总结 本研究通过同步脑电图和眼动追踪,结合贝叶斯网络建模和回归分析,探究层级句法结构与统计因素在在线理解中的相对强度,发现层级结构在阅读前即可影响理解,但其强度因行为或神经层面而异。

详情
AI中文摘要

层级句法结构与非层级、统计或序列因素长期以来被视为解释在线理解的竞争理论。大量证据表明,层级和非层级因素都能塑造理解,而更开放的问题是层级何时以及以多强的影响力作用于理解。我们通过同步脑电图和眼动追踪来探讨这一问题,将句法深度作为操作化层级结构的变量。关于时间问题,层级句法结构在阅读句子之前就已影响阅读,并且最早可在阅读前108毫秒出现。这一点得到了转移概率分析和注视相关电位回归的支持。注视转移分析表明,读者更倾向于在句法核心词之间移动,而非按照序列词序,这表明扫描路径是由深层句法结构而非纯统计驱动的。关于强度问题,我们结合贝叶斯网络建模和回归分析,表明变量的强度取决于待解释的现象。贝叶斯网络分析显示,层级句法结构比统计特征携带更多的预测权重。注视相关电位回归表明,在回归分析中,层级句法结构显著预测了右前脑区的词级神经活动,但与词汇惊奇度相比普遍较弱。综合证据,我们的分析表明,层级结构可以在行为和神经层面预期性地引导受试者的在线理解,其强度随阅读行为的不同方面而变化。

英文摘要

Hierarchical syntactic structure and non-hierarchical, statistical, or sequential factors have long been framed as rival accounts of online comprehension. Ample evidence shows that both hierarchical and non-hierarchical factors can shape comprehension; the more open question is when, and how strongly, hierarchy exerts its influence. We addressed this question with co-registered EEG and eye-tracking, treating syntactic depth as the variable that operationalizes hierarchical structure. For the timing question, hierarchical syntactic structure is shown to influence reading before a sentence is read, with effects emerging as early as 108 ms before reading. This is supported by both transitional probability analysis and regression on fixation-related potentials. Analyses of fixation transitions showed that readers preferentially moved between syntactically central words rather than following serial word order, suggesting that scanpaths are driven by deep syntactic structure rather than by pure statistics. For the strength question, we combined Bayesian network modeling and regression analysis to show that the strength of a variable depends on the phenomenon to be explained. Bayesian network analysis showed that hierarchical syntactic structure carried more predictive weight than statistical features. Regression on fixation-related potentials demonstrated that hierarchical syntactic structure significantly predicted word-level neural activity in the front-right region, but that its effect was generally weaker than that of lexical surprisal. Taken together, our analyses suggest that hierarchical structure can anticipatorily guide readers' online comprehension at both the behavioral and the neural level, with its strength varying across different facets of reading behavior.

2501.04661 2026-06-01 cs.CL cs.AI 版本更新

Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions

超越记忆:使用短语结构评估大型语言模型中的语义泛化

Wesley Scivetti, Melissa Torgbi, Austin Blodgett, Mollie Shichman, Taylor Hudson, Claire Bonial, Harish Tayyar Madabushi

发表机构 * Georgetown University(乔治城大学) University of Bath(巴斯大学) DEVCOM U.S. Army Research Laboratory(美国陆军研究实验室) University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 通过构式语法构建诊断评估,测试大型语言模型在低频但人类易理解的短语结构上的语义理解与泛化能力,发现模型在句法相同但语义不同的构式上性能下降超40%。

Comments Camera Ready: AACL-IJCNLP (2025)

详情
AI中文摘要

预训练数据的网络规模带来了一个重要的评估挑战:将预训练数据中充分代表的案例的语言能力与对域外语言(特别是预训练数据中较少见的动态真实世界实例)的泛化能力区分开来。为此,我们利用构式语法(CxG)构建了一个诊断评估,系统性地评估大型语言模型(LLM)的自然语言理解。CxG为测试泛化提供了一个心理语言学上合理的框架,因为它明确地将句法形式与抽象的、非词汇意义联系起来。我们的新颖推理评估数据集包含英语短语构式,已知说话者能够抽象出常见的实例以理解和产生创造性实例。我们的评估数据集使用CxG评估两个核心问题:第一,模型是否能够“理解”那些可能在预训练数据中出现频率较低、但对人类而言直观且易于理解的句子的语义;第二,LLM是否能够在句法相同但意义不同的构式中部署适当的构式语义。我们的结果表明,包括GPT-o1在内的最先进模型在第二个任务上性能下降超过40%,揭示了模型无法像人类那样在句法相同的形式上进行泛化以得出不同的构式意义。我们公开了我们的新颖数据集和相关的实验数据,包括提示和模型响应。

英文摘要

The web-scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions: first, if models can 'understand' the semantics of sentences for instances that are likely to appear in pretraining data less often, but are intuitive and easy for people to understand. Second, if LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but with divergent meanings. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.

2508.08204 2026-06-01 cs.CL cs.AI 版本更新

Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

大型语言模型中推理时间不确定性的人类对齐与校准

Kyle Moore, Jesse Roberts, Daryl Watson

AI总结 本文评估了多种推理时间不确定性度量,发现它们与人类群体不确定性高度对齐,尽管与人类答案偏好不一致,但在正确性相关性和分布分析上表现出中等到强校准证据。

Comments We have discovered a critical error in the normalized entropy calculation that may have substantially inflated nearly all results herein. We have since fixed this error in a new work, but we believe that the new work is sufficiently dissimilar in focus, methods, dataset, and results as to be misleading if presented as a simple replacement. As such, we propose removal and retraction instead

详情
AI中文摘要

最近,评估大型语言模型的不确定性校准引起了广泛关注,以促进模型控制和调节用户信任。推理时间不确定性可能为模型或外部控制模块提供实时信号,对于应用这些概念以改善LLM用户体验尤为重要。尽管许多现有论文考虑模型校准,但相对较少的工作试图评估模型不确定性与人类不确定性的对齐程度。在这项工作中,我们使用既有度量和新颖变体评估了一系列推理时间不确定性度量,以确定它们与人类群体水平不确定性以及传统模型校准概念的接近程度。我们发现,许多度量显示出与人类不确定性强烈对齐的证据,尽管与人类答案偏好缺乏对齐。对于那些成功的度量,我们在正确性相关性和分布分析方面发现了中等到强校准证据。

英文摘要

There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference-time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns with human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment with human uncertainty, despite a lack of alignment with human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.

2507.17335 2026-06-01 cs.CV cs.CL 版本更新

TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition

TransLPRNet:用于单/双行中文车牌识别的轻量级视觉-语言网络

Guangzhu Xu, Zhi Ke, Pengcheng Zuo, Bangjun Lei

发表机构 * Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University(水电工程智能视觉监测湖北省重点实验室,中国三峡大学) College of Computer and Information Technology, China Three Gorges University(计算机与信息学院,中国三峡大学) Hubei Key Laboratory of Digital Finance Innovation(数字金融创新湖北省重点实验室) School of Information Engineering, Hubei University of Economics(信息工程学院,湖北经济学院)

AI总结 针对开放环境中车牌类型多样和成像条件复杂的问题,提出一种集成轻量视觉编码器和文本解码器的统一解决方案,通过预训练框架和透视校正网络实现单/双行中文车牌的高精度识别。

详情
AI中文摘要

开放环境中的车牌识别在各个领域广泛应用,但车牌类型和成像条件的多样性带来了显著挑战。为了解决基于CNN和CRNN的方法在车牌识别中遇到的局限性,本文提出了一种统一解决方案,该方案在针对单行和双行中文车牌的预训练框架内,集成了轻量级视觉编码器和文本解码器。为缓解双行车牌数据集的稀缺性,我们通过合成图像、将纹理映射到真实场景并与真实车牌图像混合,构建了单/双行车牌数据集。此外,为提高系统的识别精度,我们引入了一个透视校正网络(PTN),该网络将车牌角点坐标回归作为隐变量,并通过车牌视角分类信息进行监督。该网络具有更好的稳定性、可解释性和较低的标注成本。所提出的算法在粗定位扰动下的校正CCPD测试集上实现了99.34%的平均识别准确率。在细定位扰动下评估时,准确率进一步提高到99.58%。在双行车牌测试集上,平均识别准确率达到98.70%,处理速度高达每秒167帧,显示出较强的实际应用性。

英文摘要

License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system's recognition accuracy, we introduce a perspective correction network (PTN) that employs license plate corner coordinate regression as an implicit variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.

2502.12851 2026-06-01 cs.CL cs.AI 版本更新

MeMo: Towards Language Models with Associative Memory Mechanisms

Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli

Affiliations Human-centric ART, University of Rome Tor Vergata; University of Edinburgh; Almawave S.p.A.

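AI summary Proposes MeMo, an architecture that memorizes text directly in layered associative memories, which makes the model transparent and editable; experiments demonstrate the memorization capacity of single-layer and multi-layer configurations.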

Journal ref
Proceedings of Association for Computational Linguistics (Findings), 2025

English abstract

Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model editing, including forgetting texts. We experimented with the MeMo architecture, showing the memorization power of the one-layer and the multi-layer configurations.
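
To make "associative memory" and "forgetting" concrete, here is a minimal sketch of a classical correlation matrix memory. It is an illustration only and is not taken from the MeMo paper: the class `AssociativeMemory`, the random ±1 token codes, the dimension 256, and the `decode` helper are all assumptions made for the example. A key-value pair is stored by adding a scaled outer product to a matrix, recalled by multiplying the matrix with the key, and forgotten by subtracting the same outer product.

```python
import random


def random_code(dim, rng):
    """Random +/-1 vector; two such vectors are nearly orthogonal for large dim."""
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class AssociativeMemory:
    """Hypothetical correlation matrix memory: accumulates outer(value, key) / dim."""

    def __init__(self, dim):
        self.dim = dim
        self.matrix = [[0.0] * dim for _ in range(dim)]

    def _add_outer(self, key, value, sign):
        # Add sign * outer(value, key) / dim to the stored matrix.
        for row, v in zip(self.matrix, value):
            scale = sign * v / self.dim
            for c, k in enumerate(key):
                row[c] += scale * k

    def store(self, key, value):
        self._add_outer(key, value, 1.0)

    def forget(self, key, value):
        # Forgetting is the inverse edit: subtract the same outer product.
        self._add_outer(key, value, -1.0)

    def recall(self, key):
        # matrix @ key: the stored value plus small crosstalk from other pairs.
        return [dot(row, key) for row in self.matrix]


rng = random.Random(0)
DIM = 256  # assumed code size, chosen only to keep crosstalk small
codes = {tok: random_code(DIM, rng) for tok in ("the", "cat", "a", "dog")}

memory = AssociativeMemory(DIM)
memory.store(codes["the"], codes["cat"])  # memorize "the" -> "cat"
memory.store(codes["a"], codes["dog"])    # memorize "a" -> "dog"


def decode(vector):
    """Return the token whose code is most similar to the recalled vector."""
    return max(codes, key=lambda tok: dot(vector, codes[tok]))
```

In this sketch, `decode(memory.recall(codes["the"]))` recovers the stored successor, and calling `memory.forget(codes["the"], codes["cat"])` removes that association while leaving the other pair intact, which is the kind of explicit edit that a learned weight matrix does not offer directly.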

2505.22934 2026-06-01 cs.CL cs.AI cs.LG Updated version

Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

Haobo Zhang, Jiayu Zhou

Affiliations University of Michigan Ann Arbor

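AI summary To address the performance drop that occurs when merging LoRA fine-tuned models, the paper proposes OSRM, which constrains the orthogonality of LoRA subspaces before fine-tuning to reduce cross-task interference; it integrates seamlessly with existing merging algorithms, improving merged performance while preserving single-task accuracy.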

Comments 14 pages, 5 figures, 16 tables, accepted by ACL 2025

English abstract

Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), resulting in significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace *prior* to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.
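
The interference described in the abstract can be illustrated with a toy calculation. A LoRA update has the form ΔW = B·A, so the output shift it causes on an input x is B·(A·x); if every row of A is orthogonal to the inputs of another task, that shift is zero for that task. The sketch below is only an illustration of this linear-algebra fact and is not the OSRM procedure itself: the hard projection against raw input vectors, the tiny dimensions, and all function names are assumptions made for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(row, basis):
    """Remove from `row` its component inside the span of `basis`."""
    for b in basis:
        coeff = dot(row, b)
        row = [ri - coeff * bi for ri, bi in zip(row, b)]
    return row


def lora_shift(B, A, x):
    """Output shift (B @ A) @ x contributed by a LoRA update."""
    hidden = [dot(a_row, x) for a_row in A]       # A @ x, length r
    return [dot(b_row, hidden) for b_row in B]    # B @ (A @ x)


# Toy setup (hypothetical numbers): input width 4, LoRA rank 2.
other_task_inputs = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
A = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

basis = orthonormal_basis(other_task_inputs)
A_constrained = [project_out(row, basis) for row in A]

x_other = [1, 1, 0, 0]  # an input from the other task
x_own = [0, 0, 1, 1]    # an input from the task being fine-tuned
print(lora_shift(B, A, x_other))              # unconstrained: non-zero shift
print(lora_shift(B, A_constrained, x_other))  # constrained: all zeros
print(lora_shift(B, A_constrained, x_own))    # own task is still affected
```

The design point is that the constraint is placed on A using information about the other tasks' data, which is why the abstract frames the problem as an interplay between parameters and data distributions rather than a property of the weights alone.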

2411.13865 2026-06-01 cs.IR cs.AI cs.CL cs.LG Updated version

Breaking Information Cocoons: A Hyperbolic Framework for Balancing Exploration and Exploitation in Recommender Systems

Qiyao Ma, Menglin Yang, Mingxuan Ju, Tong Zhao, Neil Shah, Rex Ying

Affiliations University of California, Davis; The Hong Kong University of Science; Snap Inc.; Yale University

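AI summary Proposes HERec, a hyperbolic framework that balances exploration and exploitation in recommender systems through a semantic-enhanced hierarchical mechanism and automatic hierarchical clustering, effectively mitigating information cocoons.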

Comments Accepted to KDD 2026. Code: https://github.com/Martin-qyma/HERec

English abstract

Modern recommender systems often create information cocoons, restricting users' exposure to diverse content. The central challenge is to balance content exploration and exploitation while allowing users to adjust their recommendation preferences. Ideally, this balance can be captured with a hierarchical representation, where depth search facilitates exploitation and breadth search enables exploration. However, existing approaches face two fundamental limitations: Euclidean methods struggle to capture hierarchical structures, while hyperbolic methods, despite their superior hierarchical modeling, lack semantic understanding of user and item profiles and fail to provide a principled mechanism for balancing exploration and exploitation. To address these challenges, we propose HERec, a hyperbolic framework that effectively balances exploration and exploitation in recommender systems. Our framework introduces two key innovations: (1) a semantic-enhanced hierarchical mechanism that aligns rich textual descriptions with collaborative information directly in hyperbolic space. Theoretical gradient analysis demonstrates that this alignment effectively leverages the underlying hyperbolic manifold structure, resulting in more accurate modeling of users and items; (2) an automatic hierarchical clustering mechanism by optimizing Dasgupta's cost, which discovers hierarchical structures without requiring predefined hyperparameters, enabling user-adjustable exploration-exploitation trade-offs. Extensive experiments demonstrate that HERec consistently outperforms both Euclidean and hyperbolic baselines, achieving up to 5.49% improvement in utility metrics and 11.39% increase in diversity metrics, effectively mitigating information cocoons.
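
For readers unfamiliar with Dasgupta's cost, the following sketch evaluates it for a fixed tree: each pair of items contributes its similarity multiplied by the number of leaves under the pair's lowest common ancestor, so a good hierarchy separates similar items as low in the tree as possible. This shows only the standard definition of the objective, not how HERec optimizes it; the nested-tuple tree encoding, the genre labels, and the similarity values are assumptions made for the example.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf labels under a tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def dasgupta_cost(tree, similarity):
    """Sum over leaf pairs of similarity(i, j) * (leaves under their lowest common ancestor)."""
    if not isinstance(tree, tuple):
        return 0.0
    groups = [leaves(child) for child in tree]
    size = sum(len(group) for group in groups)
    cost = 0.0
    # Pairs whose lowest common ancestor is this node lie in different children.
    for left, right in combinations(groups, 2):
        for i in left:
            for j in right:
                cost += similarity(i, j) * size
    return cost + sum(dasgupta_cost(child, similarity) for child in tree)


# Hypothetical item similarities: two tight pairs, weak links elsewhere.
STRONG_PAIRS = {frozenset(("rock", "metal")), frozenset(("jazz", "blues"))}


def toy_similarity(i, j):
    return 1.0 if frozenset((i, j)) in STRONG_PAIRS else 0.1


good_tree = (("rock", "metal"), ("jazz", "blues"))  # similar items merged first
bad_tree = (("rock", "jazz"), ("metal", "blues"))   # similar items split at the root

print(dasgupta_cost(good_tree, toy_similarity))  # lower cost
print(dasgupta_cost(bad_tree, toy_similarity))   # higher cost
```

The tree that groups similar items together receives the lower cost, which is why minimizing this objective yields a hierarchy without a preset number of clusters.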

2504.11972 2026-06-01 cs.CL Updated version

Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses

Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa

Affiliations National Institute of Informatics, Japan; The University of Tokyo, Japan; Inria, LS2N, Nantes Université, France

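AI summary This study systematically evaluates LLM-as-a-judge on four extractive QA datasets, finds that its judgments correlate far more strongly with human evaluation than EM and F1 do, and analyzes factors such as answer-type sensitivity, prompt variations, and self-preference bias.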

Comments GEM Workshop at ACL 2026; code and data are available at https://github.com/Alab-NII/llm-judge-extract-qa

English abstract

Extractive QA tasks are commonly evaluated using Exact Match (EM) and F1-score, but these metrics often fail to reflect true model performance. Recent studies have proposed using large language models (LLMs) as judges (LLM-as-a-judge), yet they often lack comprehensive evaluation across datasets and overlook key factors such as sensitivity to answer types, prompt variations, and self-preference bias. In this work, we conduct a systematic study of LLM-as-a-judge across four extractive QA datasets and various prompt variations, assessing multiple LLM families in both answering and judging roles. Our results show that LLM-as-a-judge judgments correlate much more strongly with human evaluations than EM (0.22) and F1 (0.40), achieving correlations up to 0.85 with open-source models. Further analysis reveals that LLM-as-a-judge performs particularly well on number-related answers but faces challenges with more complex types, such as job titles. Contrary to findings in other NLP tasks, we observe no self-preference bias, even when the same model serves as both QA model and judge. Finally, we find that prompt phrasing has minimal impact, and zero-shot, context-free judging often yields the best evaluation performance.
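
As background for why EM and F1 can misjudge a correct answer, here is a minimal sketch of the two metrics in the widely used SQuAD style (lowercasing, stripping punctuation and English articles, then comparing tokens). The exact normalization used by the paper is not stated in the abstract, so the rules below and the function names are assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A correct but more verbose answer is penalized by both metrics.
print(exact_match("President Barack Obama", "Barack Obama"))  # 0.0
print(token_f1("President Barack Obama", "Barack Obama"))     # about 0.8
```

A human or an LLM judge would likely accept "President Barack Obama" as correct, while EM scores it 0 and F1 about 0.8; surface-overlap gaps of this kind are one reason these metrics can track human evaluation poorly.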

2404.14928 2026-06-01 cs.LG cs.AI cs.CL cs.SI Updated version

Graph Machine Learning in the Era of Large Language Models (LLMs)

Shijie Wang, Jiani Huang, Zhikai Chen, Yu Song, Wenzhuo Tang, Haitao Mao, Wenqi Fan, Hui Liu, Xiaorui Liu, Dawei Yin, Qing Li

Affiliations The Hong Kong Polytechnic University; Michigan State University; North Carolina State University; Baidu Inc

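AI summary This survey reviews how large language models can improve the generalization, transferability, and few-shot learning of graph machine learning, and how graphs can improve the reasoning and explainability of large language models.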

Comments Accepted by TIST

English abstract

Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecular discovery. With the advent of deep learning, Graph Neural Networks (GNNs) have emerged as a cornerstone in Graph Machine Learning (Graph ML), facilitating the representation and processing of graphs. Recently, Large Language Models (LLMs) have demonstrated unprecedented capabilities in language tasks and are widely adopted in a variety of applications such as computer vision and recommender systems. This remarkable success has also attracted interest in applying LLMs to the graph domain. Increasing efforts have been made to explore the potential of LLMs in advancing Graph ML's generalization, transferability, and few-shot learning ability. Meanwhile, graphs, especially knowledge graphs, are rich in reliable factual knowledge, which can be utilized to enhance the reasoning capabilities of LLMs and potentially alleviate their limitations such as hallucinations and lack of explainability. Given the rapid progress of this research direction, a systematic review summarizing the latest advancements for Graph ML in the era of LLMs is necessary to provide an in-depth understanding to researchers and practitioners. Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterophily and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.