arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.31586 2026-06-01 cs.CL cs.AI 版本更新

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

语言模型学习构式语义，更不用说句法：探究LM对配对焦点构式的理解

Wesley Scivetti, Ethan Wilcox, Nathan Schneider, Kanishka Misra, Leonie Weissweiler

发表机构 * Georgetown University（乔治城大学）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Leipzig University（莱比锡大学）

AI总结通过构建新数据集，研究不同规模开源语言模型对英语中稀有配对焦点构式（如“let alone”）的语义理解，发现中等规模模型能掌握其形式和意义，且语义学习晚于句法知识，并与世界知识相关。

Comments Conference on Natural Language Learning (CoNLL) 2026

详情

AI中文摘要

理解稀有构式（形式-意义配对）的语义已被证明是一个具有挑战性的问题，目前只有最大的LLM才能解决。开源模型是否具有稳健的构式理解，以及如果具备，这种知识习得背后的学习动态是什么，仍然是一个开放问题。聚焦于英语中一组稀有的配对焦点构式（例如“let alone”、“much less”），我们构建了一个新颖的数据集，利用标量形容词语义和一般世界知识来测试它们的意义。通过测试一系列在参数数量、架构和预训练数据集大小上不同的模型，我们发现几个中等规模的模型对配对焦点构式的形式和意义都敏感，尽管在人类规模数据上训练的模型在所有意义评估中均失败。转向一组开放检查点模型的训练动态，我们发现配对焦点理解在训练后期出现，晚于配对焦点句法知识，并且配对焦点语义的学习与世界知识某些领域的提升相关。总体而言，我们的实证结果支持中等规模开源模型能够掌握稀有配对焦点构式的结论，并展示了配对焦点构式知识与其他意义领域之间的联系。

英文摘要

Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.

URL PDF HTML ☆

赞 0 踩 0

2605.31584 2026-06-01 cs.CL cs.AI cs.LG 版本更新

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

LongTraceRL: 基于评分奖励从搜索智能体轨迹中学习长上下文推理

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

发表机构 * Tsinghua University（清华大学）

AI总结提出LongTraceRL框架，通过知识图谱随机游走生成多跳问题并利用搜索智能体轨迹构建分层干扰物，结合基于实体链的评分奖励进行过程监督，提升大语言模型在长上下文推理中的表现。

详情

AI中文摘要

长上下文推理仍然是大型语言模型的核心挑战，模型往往难以在大量干扰内容中定位和整合关键信息。基于可验证奖励的强化学习（RLVR）在此任务上展现出潜力，但现有方法受限于低混淆度的干扰物和稀疏的、仅基于结果的奖励信号，无法监督中间推理步骤。为解决这些问题，我们引入了 extsc{LongTraceRL}。在数据构建方面，我们通过知识图谱随机游走生成多跳问题，并利用搜索智能体轨迹构建\emph{分层干扰物}：智能体读取但未引用的文档（高混淆度）和搜索结果中出现但从未打开的文档（低混淆度），从而生成比随机采样或单次搜索构建的训练上下文更具挑战性的内容。在奖励设计方面，我们提出了一种\emph{评分奖励}，利用每条推理链上的黄金实体作为细粒度的实体级过程监督。该评分奖励仅应用于最终答案正确的响应（正向策略），以区分正确响应之间的推理质量，并防止奖励作弊。在五个长上下文基准上对三种推理LLM（4B-30B）进行的实验表明， extsc{LongTraceRL} 始终优于强基线，并鼓励全面、基于证据的推理。代码、数据集和模型可在 \href{https://github.com/THU-KEG/LongTraceRL}{https://github.com/THU-KEG/LongTraceRL} 获取。

英文摘要

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce \textsc{LongTraceRL}. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build \emph{tiered distractors}: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a \emph{rubric reward} that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B--30B) across five long-context benchmarks demonstrate that \textsc{LongTraceRL} consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Codes, datasets and models are available at \href{https://github.com/THU-KEG/LongTraceRL}{https://github.com/THU-KEG/LongTraceRL}.

URL PDF HTML ☆

赞 0 踩 0

2605.31564 2026-06-01 cs.CL cs.AI 版本更新

What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation

什么先被揭开？面向图到文本生成的扩散模型轨迹分析

Qing Wang, Jacob Devasier, Chengkai Li

发表机构 * The University of Texas at Arlington（德克萨斯大学阿灵顿分校）

AI总结本文首次系统研究掩码扩散语言模型在图到文本生成中的解码轨迹，发现其优先生成实体，并针对监督微调导致的输出长度固定问题提出无训练推理时修改方法λ缩放结构解码，恢复+9.4 BLEU-4，同时引入Graph-LLaDA模型以显式融入关系图结构。

详情

AI中文摘要

我们首次系统研究了掩码扩散语言模型（MDLM）在图到文本生成中的应用。我们分析了MDLM的生成轨迹——即迭代解码过程中令牌被掩码的顺序——发现与自回归LLM线性生成文本不同，MDLM自然优先处理实体，然后是关系词和功能词，结构令牌最后解决。我们进一步发现了一个先前未记录的监督微调失败模式：SFT通过过早地将结构性的句子结束令牌锚定在解码轨迹早期，破坏了这一策略，从而有效固定了输出长度，这可能导致信息遗漏或幻觉。为了解决这个问题，我们提出了λ缩放结构解码，一种无训练的推理时修改方法，降低结构令牌的置信度，并恢复了+9.4 BLEU-4。最后，我们引入了Graph-LLaDA，它将图Transformer编码器集成到LLaDA的解码过程中，以显式融入关系图结构。在LAGRANGE上的跨数据集评估表明，先前的基线过拟合于特定数据集模式，而基于LLM和MDLM的方法泛化能力显著更好。

英文摘要

We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories -- the order in which tokens are unmasked during iterative decoding -- and find that, unlike autoregressive LLMs which generate text linearly, MDLMs naturally prioritize entities first, followed by relational and function words, with structural tokens resolved last. We further identify a previously undocumented failure mode of supervised fine-tuning: SFT disrupts this strategy by prematurely anchoring structural sentence-ending tokens early in the decoding trajectory, effectively fixing the output length which can lead to omitted or hallucinated information. To address this, we propose lambda-scaled structural decoding, a training-free inference-time modification that downweights structural token confidence and recovers +9.4 BLEU-4. Finally, we introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process to explicitly incorporate relational graph structure. Cross-dataset evaluation on LAGRANGE reveals that previous baselines overfit to dataset-specific patterns, while LLM- and MDLM-based approaches generalize significantly better.

URL PDF HTML ☆

赞 0 踩 0

2605.31563 2026-06-01 cs.CL 版本更新

Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

分歧理由：重新思考仇恨言论检测中的分类与可解释性评估

Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank, Fosca Giannotti

发表机构 * Scuola Normale Superiore（斯克里瓦纳尔普通大学）； University of Pisa（比萨大学）； MaiNLP, LMU Munich（MaiNLP，慕尼黑大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）

AI总结本研究通过统一框架重新实现多种模型、训练策略和评估指标，在标签和理由表示空间下分析分类与可解释性指标，发现软表示更能捕捉分歧，从而重新思考主观NLP任务的评估方法。

Comments 16 pages

详情

AI中文摘要

人类分歧在标注中普遍存在且众所周知。然而，通过词级人类理由捕捉的解释变异仍远未得到充分探索。同时，鉴于这种变异，如何最好地评估人类标签和理由——甚至如何超越多数投票聚合理由——尚不明确。然而，理由可能提供对人类推理丰富性的额外见解，这些推理在风格、价值观和解释上可能有所不同——尤其是在像仇恨言论检测这样的主观NLP任务中。在本工作中，我们通过在不同标签和理由表示空间下系统地重新实现它们，将多样化的模型、训练策略、损失函数和现有评估指标统一到一个协议下。分类指标围绕两个关键属性组织——预测性和分布性——而可解释性指标则通过三个互补维度：合理性、忠实性和复杂性。在这个统一的监督框架中，我们评估模型在分类和可解释性指标上的行为，以及指标对标签（硬和软）和理由表示空间（硬、中间和软）选择的敏感性。结果表明，硬指标和软指标都更倾向于软表示，突显了它们在捕捉变异方面的有效性，以及重新思考主观NLP中评估的必要性。

英文摘要

Human disagreement is ubiquitous and well-known in labeling. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how to best evaluate human labels and rationales -- or even how to best aggregate rationales beyond majority vote -- in light of this variation. Yet, rationales may provide additional insights into the richness of human reasoning, that may differ in style, values and interpretations -- especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties -- predictive and distributional -- while explainability metrics through three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label (hard and soft) and rationale representation space (hard, intermediate and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.

URL PDF HTML ☆

赞 0 踩 0

2605.31561 2026-06-01 cs.CL 版本更新

What Am I Missing? Question-Answering as Hidden State Probing

我遗漏了什么？将问答作为隐藏状态探测

Chu Fei Luo, Samuel Dahan, Xiaodan Zhu

发表机构 * Department of Electrical and Computer Engineering & Ingenuity Labs, Queen’s University（电子与计算机工程系及Ingenuity实验室，女王大学）； Conflict Analytics Lab, Queen’s University（冲突分析实验室，女王大学）； Cornell Law School（康奈尔法学院）； Vector Institute for AI（人工智能向量研究所）

AI总结提出将问答作为推理时干预手段，通过学生-教师框架探测隐藏状态，发现提问前的隐藏状态可预测最终正确性，并设计门控策略优化提问时机。

详情

AI中文摘要

自链式推理引入大型语言模型（LLMs）以来，测试时推理已成为一个重要的研究领域。然而，这一推理过程的机制仍未被充分探索——从相同的输入提示，甚至相同的部分解出发，LLMs在多次采样时可能产生不同的答案。我们提出利用提问作为推理时干预手段，以表达模型隐藏状态的信息。为此，我们提出了一个学生-教师设置，其中学生向教师提问。我们在学生提问前后训练一个探测其隐藏状态的探针，发现该探针能预测轨迹的最终正确性，甚至在生成教师答案之前。这表明，在问题生成过程中存在有意义的自我诊断信号，而非来自教师的信息传递。然后，我们将提问建模为序列决策问题，使用该探针作为质量分数，并定义一个门控策略来提问以最大化正确可能性。我们发现，提问作为干预的成功在很大程度上取决于模型的自我一致性。我们的实证结果显示检测与恢复之间存在差距；虽然我们的门控策略捕捉了模型的正确性和不确定性，但干预在恢复错误轨迹的同时，同样可能损害正确轨迹。这种诊断与纠正之间的差距对语言模型在不确定性下进行自我精炼的能力具有更广泛的影响。

英文摘要

Test-time reasoning has become a significant field of study since the introduction of chain-of-thought reasoning in large language models (LLMs). However, the mechanisms of this reasoning process are still under-explored -- from the same input prompt, and even the same partial solution, LLMs can produce varied answers if sampled multiple times. We propose to leverage question-asking as an inference-time intervention that articulates information about the model's hidden state. To achieve that, we present a student-teacher setting where a student asks questions to a teacher. We train a probe on the student's hidden state before and after asking a question and find it is predictive of the trajectory's final correctness, even before generating the teacher's answer. This suggests there is a meaningful signal from the self-diagnosis that occurs during question generation rather than information transfer from the teacher. We then frame question-asking as a sequential decision problem, using this probe as a quality score, and define a gating policy to ask questions that maximize likelihood of correctness. We find that the success of question-asking as an intervention is largely dependent on the model's self-consistency. Our empirical results show a gap between detection and recovery; while our gating policy captures model correctness and uncertainty, interventions are equally likely to harm correct trajectories as they are to recover incorrect ones. This gap between diagnosis and correction has broader implications on language models' capacity for self-refinement under uncertainty.

URL PDF HTML ☆

赞 0 踩 0

2605.31556 2026-06-01 cs.CV cs.AI cs.CL cs.CY cs.HC 版本更新

Vision-Language Models Suppress Female Representations Under Ambiguous Input

视觉-语言模型在模糊输入下抑制女性表征

Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji

发表机构 * School of Engineering and Applied Sciences（工程与应用科学系）； Department of Psychology（心理学系）

AI总结本研究通过引入零样本度量LALS，发现视觉-语言模型在模糊输入下内部编码与输出存在系统性解耦，女性信号在生成前被抑制，揭示了模型对性别偏见的内部处理机制。

Comments 16 pages, 12 figures, 1 table

详情

AI中文摘要

对齐训练使视觉-语言模型（VLM）避免表达人口统计偏见，当性别清晰可见时，它们基本成功。但对于模糊输入（如全副武装的工人、从背后看到的人物）——实践中常见但很少研究的情况——我们发现，在模糊输入图像时，最小的提示压力就会暴露职业-性别默认值，模型甚至对强烈女性刻板印象的职业也倾向于男性。但这些输出是否反映了模型实际内部编码的内容？我们引入LALS（潜在关联倾向分数），一种零样本度量，将视觉标记激活投影到模型的文本嵌入空间中，以测量每个标记和层的概念关联。在15个职业、超过800张性别模糊图像和四个VLM上，内部表征和输出系统性地解耦：模型通常内部编码女性关联但输出男性。逐层分析揭示了一个不对称滤波器——男性信号端到端放大，而女性信号在中间网络达到峰值并在生成前被抑制——颜色消融实验表明，文化负载的视觉线索（如服装颜色）进一步调节这些内部关联。

英文摘要

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind) cases common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when prompting ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.

URL PDF HTML ☆

赞 0 踩 0

2605.31550 2026-06-01 cs.CL 版本更新

Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

语义三元组恢复：大型语言模型中层次化表格理解的新协议

Yibin Zhao, Fangxin Shang, Dingrui Yang, Yuqi Wang

发表机构 * Taiyuan University of Technology（太原科技大学）； AI Lab, Qifu Technology（启福科技AI实验室）； AI Lab, Greensea Technology（绿海科技AI实验室）

AI总结提出语义三元组恢复（STR）协议，将表格重写为<项目路径，特征路径，值>原子事实，并设计TripletQL路由以选择三元组子集，在四个表格问答基准上匹配或超越HTML基线并减少输入令牌。

详情

AI中文摘要

表格问答要求模型恢复由二维布局、合并单元格和层次化标题隐式编码的语义关系。当前流水线通常使用HTML或Markdown作为中间表格表示，但这些面向布局的序列化引入了标记开销，并要求大型语言模型从行和列跨度推断标题-单元格对齐。我们提出语义三元组恢复（STR），一种将每个单元格重写为原子事实<项目路径，特征路径，值>的协议，其中项目路径指定行实体，特征路径指定层次化属性，值包含单元格内容。我们还提出TripletQL，一种轻量级查询感知路由器，使用STR为每个问题选择适当的渲染或过滤的三元组子集。在四个中英文表格问答基准上，STR匹配或超越基于HTML的基线，同时减少输入令牌。对于较小的语言模型和较长的表格上下文，相对收益更大，表明在受限推理预算下显式语义表示特别有用。代码和数据可在https://github.com/Phoenix-ni/STR.git获取。

英文摘要

Table question answering requires models to recover semantic relations encoded implicitly by two-dimensional layout, merged cells, and hierarchical headers. Current pipelines typically use HTML or Markdown as intermediate table representations, but these layout-oriented serializations introduce markup overhead and require large language models to infer header-cell alignments from row and column spans. We propose Semantic Triplet Restoration (STR), a protocol that rewrites each cell as an atomic fact <item path, feature path, value>, where the item path specifies the row-wise entity, the feature path specifies the hierarchical attribute, and the value contains the cell content. We also present TripletQL, a lightweight query-aware router that uses STR to select an appropriate rendering or filtered subset of triplets for each question. Across four Chinese and English table-QA benchmarks, STR matches or improves upon HTML-based baselines while reducing input tokens. The relative benefit grows for smaller language models and longer table contexts, suggesting that explicit semantic representations are especially useful under constrained inference budgets. Code and data are available at https://github.com/Phoenix-ni/STR.git .

URL PDF HTML ☆

赞 0 踩 0

2605.31545 2026-06-01 cs.CL 版本更新

Preference-Aware Rubric Learning for Personalized Evaluation

偏好感知的评分标准学习用于个性化评估

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Yoko Yamakata, Tat-Seng Chua

发表机构 * National University of Singapore（新加坡国立大学）； Xiaohongshu Inc.（小红书公司）； The University of Tokyo（东京大学）

AI总结提出PARL框架，通过从用户历史中学习偏好感知的评估标准并采用自验证机制，实现个性化评估。

详情

AI中文摘要

随着大型语言模型从通用助手演变为以用户为中心的智能体，个性化已成为使模型行为与个体偏好对齐的核心，而对个性化对齐的评估成为关键瓶颈。现有的评估方法——从自动指标到LLM作为评判者的方法——未能捕捉长期交互历史中嵌入的主观、用户特定偏好。我们确定了可靠且有效的个性化评估的三个基本原则：代表性、用户一致性和区分性。为应对这些原则，我们引入了“个性化评估即学习”范式，将个性化评估形式化为一个学习问题而非静态判断。在此范式下，我们提出了PARL（偏好感知的评分标准学习用于个性化评估），一个从原始用户历史中直接学习诱导偏好感知的评估标准，并执行自验证机制以确保与用户偏好一致的框架。PARL将评分标准诱导与区分性强化学习目标相结合，该目标对比用户撰写的回答与竞争性个性化模型输出，使学到的评分标准能够捕捉精确、用户特定的决策边界。在真实世界的个性化文本生成任务上的实验表明，PARL一致地诱导出高保真度的评分标准，可靠地识别与用户对齐的回答，并在用户和任务间泛化，同时捕捉稳定的风格偏好和细粒度的评估模式。为确保可重复性，我们的代码可在 https://github.com/SnowCharmQ/PARL 获取。

英文摘要

As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model behavior with individual preferences, making the evaluation of personalized alignment a critical bottleneck. Existing evaluation methods-ranging from automatic metrics to LLM-as-a-judge approaches-fail to capture subjective, user-specific preferences embedded in long-term interaction histories. We identify three essential principles for reliable and effective personalized evaluation: Representativeness, User-Consistency, and Discriminativeness. To address these principles, we introduce Personalized Evaluation as Learning, a paradigm that formulates personalized evaluation as a learning problem rather than a static judgment. Under this paradigm, we propose PARL (Preference-Aware Rubric Learning for Personalized Evaluation), a framework that learns to induce preference-aware evaluation rubrics directly from raw user histories and performs a self-validation mechanism to ensure consistency with the user's preferences. PARL integrates rubric induction with a discriminative reinforcement learning objective that contrasts user-authored responses against competitive personalized model outputs, enabling the learned rubrics to capture precise, user-specific decision boundaries. Experiments on real-world personalized text generation tasks show that PARL consistently induces high-fidelity rubrics that reliably identify user-aligned responses and generalize across users and tasks, while capturing stable stylistic preferences and fine-grained evaluative patterns. To ensure reproducibility, our code is available at https://github.com/SnowCharmQ/PARL.

URL PDF HTML ☆

赞 0 踩 0

2605.31521 2026-06-01 cs.CL cs.SD 版本更新

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

UniAudio-Token: 赋予语义语音分词器通用音频感知能力

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou

发表机构 * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University（信息处理国家重点实验室，计算机学院，北京大学）； Basic Model Technology Center, WeChat AI, Tencent Inc.（基础模型技术中心，微信AI，腾讯公司）

AI总结提出UniAudio-Token框架，通过语义-声学基元（SAP）和语义-声学均衡（SAE）机制，在不牺牲语音能力的前提下为语义分词器注入通用音频感知，实现统一音频接口。

Comments 19 pages, 10 figures

详情

AI中文摘要

语义语音分词器因其紧凑的单码本设计和强语言对齐能力，已成为音频-大语言模型广泛使用的接口。然而，它们对语言抽象的关注导致了声学盲点，限制了其在以语音为中心的任务之外的适用性。我们提出UniAudio-Token，一个在不损害语音能力的前提下赋予语义分词器通用音频感知能力的框架。UniAudio-Token并非改变语义范式，而是通过两个关键创新来减轻其信息损失：(1) 语义-声学基元（SAP）通过将音频分解为语言内容、声音属性和听觉场景基元来提供结构化监督；(2) 语义-声学均衡（SAE）引入了一种内容感知门控机制，自适应地从浅层恢复细粒度声学细节。广泛评估表明，UniAudio-Token在学习全面的通用表示的同时，保持了高保真语音生成。当与下游大语言模型集成时，它在理解和生成任务上均优于所有单码本基线分词器，有效地作为统一音频接口。我们在https://github.com/Tencent/Universal_Audio_Tokenizer上公开发布了所有代码，包括训练和推理脚本以及模型检查点。

英文摘要

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.

URL PDF HTML ☆

赞 0 踩 0

2605.31512 2026-06-01 cs.CL 版本更新

Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral

来自临床叙述的可靠多语言骨科决策支持：语言感知适应与验证引导的延迟

Danish Ali, Li Xiaojian, Sundas Iqbal, Farrukh Zaidi

发表机构 * School of Computer Science, Wuhan University（武汉大学计算机学院）； School of Software, Nanjing University of Information Science and Technology (NUIST)（南京信息工程大学软件学院）； Department of Orthopedic, Bahawal Victoria Hospital（巴哈瓦尔维多利亚医院骨科部门）

AI总结针对低资源医疗环境中的多语言骨科决策支持，提出结合语言感知适配编码器IndicBERT-HPA和确定性选择性验证层的可靠性框架，在英语、印地语和旁遮普语临床文本分类中取得最优性能。

详情

AI中文摘要

多语言骨科决策支持在低资源医疗环境中仍然具有挑战性，其中临床叙述包含专业术语、混合文字、不完整证据、标签不平衡和语言依赖的文档模式。本文提出了一个面向可靠性的框架，用于对英语、印地语和旁遮普语的自由文本骨科笔记进行分类。我们比较了任务对齐的多语言Transformer编码器、任务微调的DistilBERT基线、零样本指令微调的大语言模型（LLMs）和领域自适应编码器IndicBERT-HPA。IndicBERT-HPA通过语言感知的骨科适配器头增强IndicBERT，以支持临床相关的多语言表示学习。评估从整体准确率扩展到每类性能、ROC-AUC、AUPRC、期望校准误差、跨语言稳定性以及在受控平衡和自然患病率分布下的鲁棒性。评估的零样本LLMs在封闭集分类中远不如任务自适应编码器有效，且存在语言依赖的不稳定性。在自然临床患病率下，IndicBERT-HPA实现了最强的整体性能，平均Macro-F1达到0.8792，Macro-AUROC为0.894，AUPRC为0.902。我们进一步实现了一个确定性的选择性验证层，结合了置信门控、证据一致性检查和语言风险筛查。在随机选择的5000条保留子集上，它在72.3%的覆盖率下实现了84.4%的选择性准确率和0.76的选择性Macro-F1，而全接受预测的准确率为71.5%，Macro-F1为0.65。这些结果支持了面向可靠性的多语言临床决策支持，并带有明确的延迟机制。

英文摘要

Multilingual orthopedic decision support remains challenging in low-resource healthcare settings, where clinical narratives contain specialized terminology, mixed scripts, incomplete evidence, label imbalance and language-dependent documentation patterns. This article presents a reliability-oriented framework for classifying free-text orthopedic notes in English, Hindi and Punjabi. We compare task-aligned multilingual transformer encoders, a task-fine-tuned DistilBERT baseline, zero-shot instruction-tuned large language models (LLMs) and a domain-adaptive encoder, IndicBERT-HPA. IndicBERT-HPA augments IndicBERT with language-aware orthopedic adapter heads to support clinically relevant multilingual representation learning. Evaluation extends beyond aggregate accuracy to per-class performance, ROC-AUC, AUPRC, expected calibration error, cross-language stability and robustness under controlled balanced and natural-prevalence distributions. The evaluated zero-shot LLMs remain substantially less effective than task-adapted encoders for closed-set classification, with language-dependent instability. Under natural clinical prevalence, IndicBERT-HPA achieves the strongest overall performance, reaching an averaged Macro-F1 of 0.8792, Macro-AUROC of 0.894 and AUPRC of 0.902. We further implement a deterministic selective-verification layer combining confidence gating, evidence-consistency checking and language-risk screening. On a randomly selected held-out 5,000-record subset, it achieves 84.4% selective accuracy and 0.76 selective Macro-F1 at 72.3% coverage, compared with 71.5% accuracy and 0.65 Macro-F1 for accept-all prediction. These results support reliability-oriented multilingual clinical decision support with explicit deferral.

URL PDF HTML ☆

赞 0 踩 0

2605.31494 2026-06-01 cs.CL cs.LG 版本更新

扩展匈牙利语对话ASR：BEA-Dialogue+语料库

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Katalin Mády

发表机构 * Department of Telecommunications and Artificial Intelligence（电信与人工智能系）； Budapest University of Technology and Economics（布达佩斯技术与经济大学）； Speechtex Ltd.（Speechtex公司）； ELTE Research Centre for Linguistics（ELTE语言学研究中心）

AI总结针对匈牙利语对话语音识别训练数据不足的问题，本文通过放宽分割标准扩展BEA-Dialogue语料库至200小时，并评估基于Whisper和FastConformer的模型，证明基于序列化输出训练的微调能持续改善识别性能。

详情

AI中文摘要

匈牙利语对话自动语音识别受到公开对话式训练数据有限的制约。BEA-Dialogue语料库解决了这一需求，但其严格的说话人分离的训练/开发/测试分割将可用材料减少到仅85小时。在本文中，我们介绍了BEA-Dialogue+，这是该语料库的扩展版本，它放宽了实验者和对话伙伴的分割标准，同时保持主要说话人的完全分离。这产生了200小时转录的自然对话，并允许对额外训练数据与分割间说话人重叠之间的权衡进行受控研究。我们在两个语料库版本上评估了多个基于Whisper和FastConformer的模型，包括基于序列化输出训练（SOT）的对话转录微调。我们的结果表明，对于未经微调的模型，较大的语料库更具挑战性，而基于SOT的适应在WER、CER、cpWER和cpCER上产生了一致的改进。总体而言，BEA-Dialogue+为匈牙利语对话ASR提供了一个更大但仍具挑战性的基准，以及用于训练和评估对话转录系统的实用资源。

英文摘要

Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.

URL PDF HTML ☆

赞 0 踩 0

2605.31463 2026-06-01 cs.LG cs.AI cs.CL cs.DC 版本更新

PithTrain: A Compact and Agent-Native MoE Training System

PithTrain: 一个紧凑且面向智能体的MoE训练系统

Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry, Chenyan Xiong, Tianqi Chen

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Xlue ； NVIDIA（英伟达）

AI总结提出PithTrain，一个基于智能体原生设计原则的紧凑型MoE训练框架，通过引入ATE-Bench评估智能体任务效率，在保持生产框架吞吐量的同时，将智能体任务轮次和活跃GPU时间分别降低62%和64%。

详情

AI中文摘要

混合专家模型（MoE）已成为前沿语言模型的主导架构。为满足这一需求，生产框架经过多年的工程努力构建了优化的MoE训练栈。然而，为新的架构和系统优化而演进这些栈仍然代价高昂。随着AI编码智能体的兴起，它们可以自动化训练框架开发的部分工作并加速这一演进。但将这些智能体应用于现有框架会带来隐藏成本，这些成本在当今仅关注吞吐量的评估中不可见。我们将这一缺失维度命名为智能体任务效率（ATE）：即使用编码智能体理解、操作和扩展框架的成本。基于四个智能体原生设计原则，我们构建了PithTrain，一个紧凑、智能体原生的MoE训练框架。我们进一步引入了ATE-Bench，涵盖现实世界的训练框架任务。我们的评估表明，PithTrain在吞吐量上与生产框架相当，并且在ATE-Bench上，PithTrain实现了更高的智能体任务效率，智能体轮次减少高达62%，活跃GPU时间减少64%。

英文摘要

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. With the rise of AI coding agents, they could automate parts of training-framework development and accelerate this evolution. But applying them to these existing frameworks carries hidden costs, invisible to today's throughput-only evaluations. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, covering real-world training-framework tasks. Our evaluation shows PithTrain matches the throughput of production frameworks, and on ATE-Bench, PithTrain enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.

URL PDF HTML ☆

赞 0 踩 0

2605.31455 2026-06-01 cs.LG cs.CL 版本更新

SCOPE：面向开放式任务的协同演化策略自我对弈

Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini

发表机构 * University of Edinburgh（爱丁堡大学）； Imperial College London（伦敦帝国学院）

AI总结提出SCOPE框架，通过协同演化挑战者和求解者策略，实现无数据自我对弈，在开放式任务上提升性能并超越依赖人工提示的方法。

详情

AI中文摘要

自我对弈可以在没有外部监督的情况下训练语言模型。然而，现有方法需要可规则检查的答案，使得开放式任务依赖于精心设计的提示或前沿模型作为评判者。我们引入了SCOPE，一个无数据的自我对弈框架，用于开放式任务，它协同演化两个策略：一个生成文档基础任务的挑战者，以及一个通过多轮检索回答这些任务的求解者。初始模型的冻结副本作为自我评判者，它从源文档编写特定任务的评分标准，并据此对求解者的回答进行评分。在三个7-8B指令调优模型（Qwen2.5、Qwen3、OLMo-3）上，SCOPE在八个基准测试中将开放式任务性能提升了高达+10.4分，并匹配或超过了在约9K个精心设计的提示上训练的GRPO_data。尽管仅在开放式任务上训练，SCOPE还在七个保留的短格式问答基准测试中提升了高达+13.8分，在所有三个模型上均超越了GRPO_data。消融实验表明，协同演化挑战者对于保持任务接近求解者的前沿是必要的；性能提升源于检索和合成的改进，其相对贡献因任务而异；并且评分标准生成质量是自我评判的瓶颈。

英文摘要

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.

URL PDF HTML ☆

赞 0 踩 0

2605.31432 2026-06-01 cs.CL cs.AI cs.SD 版本更新

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

DOA：面向语音大语言模型的长形式同声传译的无训练解码器仅注意力策略

Sara Papi, Luisa Bentivogli

发表机构 * Fondazione Bruno Kessler（布鲁诺·克塞塞基金会）

AI总结提出DOA策略，利用解码器自注意力导出代理对齐，无需训练即可实现语音大语言模型在长形式同声传译中的流式决策。

详情

AI中文摘要

同声语音到文本翻译（SimulST）在语音尚未完成时生成翻译，需要流式策略来决定何时读取和何时写入。最先进的方法依赖于基于注意力的编码器-解码器模型，其中交叉注意力提供显式的对齐信号。相比之下，语音大语言模型（SpeechLLMs）是仅解码器架构，仅依赖自注意力。这引发了一个核心问题：解码器自注意力是否包含足够稳定的对齐信号来指导流式策略。此外，现有方法通常依赖于基于训练的适应或启发式等待-$k$策略，并且尚未在长形式场景中得到验证。为了填补这些空白，我们提出了仅解码器注意力（DOA），这是一种无训练策略，通过从自注意力中导出代理对齐，使现成的SpeechLLMs能够进行长形式同声传译。在Phi4-Multimodal和Qwen3-Omni上的实验表明，DOA提供了有效的对齐信号来支持流式决策，实现了低延迟的长形式SimulST，其质量接近无需重新训练的离线解码。

面向手语翻译的大语言模型目标端释义增强

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

发表机构 * III-LIDI Universidad Nacional de La Plata（III-LIDI国立拉普拉塔大学）； CDTEC, Federal University of Pelotas（CDTEC，联邦 Pelotas 大学）； CONICET III-LIDI ； Comision de Investigaciones Cientificas Universidad Nacional de La Plata（科学委员会国立拉普拉塔大学）； Universidade Federal de Pelotas（联邦 Pelotas 大学）

AI总结针对手语翻译中平行语料稀缺和目标词汇长尾分布的问题，提出利用GPT-4o生成参考句子的受控释义变体进行目标端增强，并在三种手语数据集上验证了方法的有效性。

Comments Accepted at GenSign (https://genai4sl.github.io/) at CVPR 2026. Non proceedings track

详情

AI中文摘要

手语翻译（SLT）仍然受到有限的配对手语视频/文本语料库和长尾目标词汇的限制。我们研究了目标端增强方法，其中GPT-4o生成参考句子的受控释义变体，而手语输入保持不变。采用基于Signformer姿态的Transformer，在两阶段调度下进行训练：先在增强语料库上预训练，然后在原始参考句子上微调。我们在三个具有互补挑战的数据集上进行了评估：PHOENIX14T（德国手语），具有适度的词汇多样性；GSL（希腊手语），具有高度受控、重复的录制；以及LSA-T（阿根廷手语），具有严重的长尾稀疏性。在PHOENIX14T上，增强将BLEU-4从9.56提高到10.33。接近饱和的GSL基线和极其稀疏的LSA-T设置揭示了该方法的局限性。据我们所知，这是第一项将LLM生成的目标端释义和LLM作为评估者应用于手语翻译的研究。语义评估揭示了词汇重叠指标低估的忠实度提升。

英文摘要

Sign language translation (SLT) remains constrained by limited paired sign-video/text corpora and heavy-tailed target vocabularies. We study target-side augmentation in which GPT-4o generates controlled paraphrase variants of reference sentences while the sign input remains unchanged. A Signformer-style pose-based Transformer is trained under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate on three datasets spanning complementary challenges: PHOENIX14T (German Sign Language), with moderate lexical diversity; GSL (Greek Sign Language), with highly ontrolled, repetitive recordings; and LSA-T (Argentinian Sign Language), with severe long-tail sparsity. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33. The near-saturated GSL baseline and extremely sparse LSA-T setting reveal the limits of the approach. To our knowledge, this is the first study to apply LLM-generated target-side araphrases and LLM-as-a-Judge evaluation to SLT. The semantic evaluation reveals gains in fidelity that lexical overlap metrics understate.

URL PDF HTML ☆

赞 0 踩 0

2605.31387 2026-06-01 cs.CL cs.RO 版本更新

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

多轮多智能体对话用于协作重建仅略微提升VLM在空间推理上的性能

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

发表机构 * Computational Linguistics, Department of Linguistics University of Potsdam（语言学计算系，语言学系柏林洪堡大学）； German Research Center for Artificial Intelligence (DFKI)（德国人工智能研究中心（DFKI））

AI总结研究通过多轮多智能体对话框架评估视觉语言模型在协作空间推理任务中的表现，发现视觉空间理解仍是主要瓶颈，文本表示和分解图像表示可部分提升性能。

Comments Preprint

详情

AI中文摘要

在多样化环境中运行的机器人依赖视觉输入来解释物体和空间布局。在人类协作任务中，它们被期望通过语言传达这种理解。视觉语言模型（VLM）支持涉及视觉解释、问答和指令跟随的机器人任务，但它们在需要空间推理的协作对话任务中的能力仍未充分探索。我们通过一个结合视觉解释、基础、语言引导交互和动作生成的协作结构构建任务来研究这一差距。我们开发了一个框架，其中VLM通过对话从视觉和文本输入重建目标结构。我们在交互设置、输入模态和图像表示上评估了开放权重和封闭VLM。结果表明，对于评估的VLM，视觉表示的空间推理仍然困难。目标的详细文本表示在模态条件下产生更高的重建成功率，而分解的图像表示提高了性能。这些发现揭示了协作VLM智能体在视觉空间基础和基础指令生成方面的局限性。

英文摘要

Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure-building task that combines visual interpretation, grounding, language-guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open-weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.

URL PDF HTML ☆

赞 0 踩 0

2605.31378 2026-06-01 cs.CL 版本更新

面向VLM-as-a-Judge评估的视障辅助基准

Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li

发表机构 * Department of Computing, The Hong Kong Polytechnic University（香港理工大学计算机系）

AI总结针对视障辅助任务中VLM-as-a-Judge评估的可靠性问题，提出VIABLE基准（含30万+样本、有效性-公正性-稳定性框架及12种失败模式分类），发现现有模型不可靠，并开发VIA-Judge-Agent方法提升诊断准确性和用户偏好。

详情

AI中文摘要

基于AI的视障辅助（VIA）仍然具有挑战性，主要原因是人工评估成本高昂。VLM-as-a-Judge范式可能提供一种有前景的替代方案，尽管该范式主要在通用领域得到研究。因此，我们质疑此类评判者是否可以在VIA任务中值得信赖。为探究这一问题，我们引入了VIABLE（面向VLM-as-a-Judge评估的视障辅助基准），这是首个用于VIA中VLM-as-a-Judge评估的基准。VIABLE包含超过30万个判断样本，涵盖三种场景，并引入了一个包含12种失败模式分类的有效性-公正性-稳定性框架。基于VIABLE，我们对七个不同模型规模的评判者进行了系统研究，结果表明现有模型在所有评估轴上基本不可靠。最强的评判者GPT-5.4仅达到52.6%的单故障诊断准确率，却表现出最高的自我偏好率（94.2%）；而开源评判者存在严重偏差且对抗性脆弱。为解决这些问题，我们提出了VIA-Judge-Agent，一种与模型无关的推理时增强方法，通过视觉证据提取和基于分类的工作流来增强评判者。该方法在诊断准确性和下游VIA响应（更受BLV用户青睐）方面实现了积极改进。数据和代码可在 https://github.com/YiyiyiZhao/VIABLE 获取。

英文摘要

AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness--Impartiality--Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy, yet exhibits the highest self-preference rate at 94.2%; while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It enables positive improvements in diagnostic accuracy and downstream VIA responses more preferred by BLV users. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE

URL PDF HTML ☆

赞 0 踩 0

2605.31349 2026-06-01 cs.CL cs.AI cs.CV cs.MM 版本更新

FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

FBHM：用于仇恨模因检测的功能性基准测试与视觉语言模型引导

Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee

发表机构 * Indian Institute of Technology (IIT), Kharagpur（印度理工学院（IIT）卡拉格浦尔）； Microsoft（微软）

AI总结针对现有基准无法因果评估视觉语言模型漏洞的问题，提出基于25种修辞功能和10个目标社区构建的FBHM基准，并采用可学习引导向量（LSV）在极低数据量下提升模型性能约30个Macro-F1点。

详情

AI中文摘要

仇恨模因检测对于视觉语言模型仍是一个严峻挑战，因为现有基准在结构上是观察性的——混淆了修辞仇恨机制与目标社区特征，并阻碍了对模型漏洞的因果评估。为解决这一问题，我们引入了FBHM，一个系统策划的基于功能的仇恨模因基准，沿两个正交轴构建：25种不同的修辞功能和10个目标社区（总共5,000个模因）。对最先进的视觉语言模型进行基准测试揭示了一个严重的泛化差距：在标准数据集上高度准确的模型在FBHM上灾难性地下降到接近随机性能，证明它们利用了数据集特定的启发式方法而非稳健的多模态推理。为了高效缩小这一差距，我们提出了LSV（可学习引导向量），一种超低数据量策略，在仅500个引导样本（50个独特基础模因）上应用因果干预目标，将FBHM性能提升约30个Macro-F1点，同时优于上下文学习和PEFT，且不降低源域性能。

英文摘要

Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models highly accurate on standard datasets catastrophically drop to near-random performance on FBHM, proving they exploit dataset-specific heuristics rather than robust multimodal reasoning. To efficiently close this gap, we propose LSV (learnable steering vectors), an ultra-low data regime strategy that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by ~30 Macro-F1 points while outperforming in-context learning and PEFT without degrading source-domain performance.

URL PDF HTML ☆

赞 0 踩 0

2605.31338 2026-06-01 cs.CL 版本更新

Bundesrecht: An Open Library and Corpus for German Statutory Reference Processing

Bundesrecht: 面向德国法律引用处理的开放库与语料库

Harshil Darji, Martin Heckelmann, Christina Kratsch, Gerard de Melo

发表机构 * Hochschule für Technik und Wirtschaft Berlin（柏林技术与经济高等学院）； Hasso-Plattner Institute / University of Potsdam（波茨坦大学哈索普劳恩研究所）

AI总结本文提出 bundesrecht，一个包含软件库和结构化语料库的开放资源，用于解析、规范化和解析德国法律引用，实现从原始引用字符串到结构化法律条文的端到端处理。

Comments 10 pages, 1 figure. Preprint

详情

AI中文摘要

法律引用是法律语言理解的核心，但难以自动处理，因为它们以紧凑且多变的表面形式出现，可能组合多个目标，使用特殊缩写，并常指向较低级别的单元。现有的德语工具要么专注于从法律文档中解析引用，要么在引用明确后访问法律文本。本文介绍了 bundesrecht，一个用于德国法律引用处理的开放资源，包含一个软件库和一个结构化的德国联邦法律语料库。该库解析、规范化和解析德国法律引用，将原始引用字符串映射到结构化对象，将紧凑引用扩展为规范形式，并将其链接到法律条文。附带的语料库保留了从法律到细粒度子条款的法律内部层级。我们使用严格精确匹配和微信息抽取指标，在 2,944 个带注释的德国法律引用上评估了解析器和规范化器。我们进一步评估了规范引用去重，并表明规范化引用比字符串匹配更可靠地对真实引用表面变体进行分组。bundesrecht 是第一个覆盖德国法律引用处理端到端流水线的开放资源，从原始引用字符串到解析后的法律条文，并在 PyPI 上可用。

英文摘要

Statutory references are central to legal language understanding, but are difficult to process automatically, as they appear in compact and variable surface forms, may combine multiple targets, use special abbreviations, and often point to lower-level units. Existing tools for German focus either on parsing references from legal documents or accessing statutory text once citations are explicit. This paper introduces bundesrecht, an open resource for German statutory reference processing, consisting of a software library and a structured corpus of German federal law. The library parses, normalizes, and resolves German statutory references, mapping raw citation strings to structured objects, expanding compact references into canonical forms, and linking them to statutory provisions. The accompanying dataset preserves the internal hierarchy of statutes from laws to fine-granular subclauses. We evaluate the parser and normalizer on 2,944 annotated German legal references using strict exact-match and micro information extraction metrics. We further evaluate canonical reference deduplication and show that normalized references group real citation surface variants far more reliably than string matching. bundesrecht is the first open resource that covers German statutory reference processing as an end-to-end pipeline, from raw citation string to resolved statutory provision, and is available on PyPI.

URL PDF HTML ☆

赞 0 踩 0

2605.31328 2026-06-01 cs.CL 版本更新

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

强化学习放大了来自无害奖励的涌现性失调

Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan, Lucie Flek, Florian Mai

发表机构 * Bonn-Aachen International Center for Information Technology, University of Bonn, Germany（波恩-亚琛信息科技国际中心，波恩大学，德国）； Lamarr Institute for Machine Learning and Artificial Intelligence, Germany（拉马尔机器学习与人工智能研究所，德国）

AI总结本文研究强化学习如何从看似无害的奖励信号中引发语言模型的涌现性失调，发现其比监督微调更严重，并验证了在训练中插入安全数据可缓解此问题。

详情

AI中文摘要

涌现性失调（EM）是指语言模型在针对狭窄的失调示例进行微调后，意外地变得广泛失调的倾向。虽然EM在监督微调（SFT）设置中已被广泛研究，但来自强化学习（RL）的证据仅限于大型闭源模型，使得该现象研究成本高昂且难以复现。我们沿三个维度刻画了小型、现成的开源权重模型中来自RL的EM。首先，我们表明，奖励狭窄、明显的失调行为比样本匹配的SFT产生更高的通用域失调。其次，我们表明，来自RL的EM可以由可能自然出现的奖励信号引发，例如不受欢迎的审美偏好或糟糕的修辞诉求。第三，我们评估了为SFT引发的EM开发的训练中缓解措施，发现它们广泛适用，其中交错进行策略内安全数据效果最佳。

英文摘要

Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcement learning (RL) is limited to large, closed-source models, leaving the phenomenon expensive to study and difficult to reproduce. We characterize EM from RL in small, off-the-shelf open-weight models along three axes. First, we show that rewarding narrow, overtly misaligned behavior produces substantially higher general-domain misalignment than sample-matched SFT. Second, we show that EM from RL can be induced by reward signals that could plausibly arise naturally, such as unpopular aesthetic preferences or poor rhetorical appeals. Third, we evaluate in-training mitigations developed for SFT-induced EM and find that they broadly transfer, with interleaving on-policy safety data performing best.

URL PDF HTML ☆

赞 0 踩 0

2605.31312 2026-06-01 cs.CV cs.CL 版本更新

Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization

从细粒度视觉差异中学习：通过上下文视觉对比优化缓解多模态幻觉

Haolin Deng, Xin Zou, Zhiwei Jin, Chen Chen, Haonan Lu, Xuming Hu

发表机构 * The Hong Kong University of Science（香港科学与技术大学）； OPPO AI Center（OPPO AI中心）

AI总结提出上下文视觉对比优化（IC-VCO）方法，通过共享多图像上下文中的对比图像确保数学严谨的目标，并引入视觉对比蒸馏（VCDist）和对比样本编辑策略，有效缓解多模态幻觉。

Comments ICML 2026

详情

AI中文摘要

多模态幻觉仍然是视觉语言模型（VLM）面临的持续挑战。标准的文本直接偏好优化（DPO）由于缺乏显式的视觉监督，往往无法缓解这一问题。虽然现有工作通过将原始图像与负样本对比引入了视觉偏好DPO，但由于配分函数不匹配导致目标在理论上不一致，并且依赖可能引发捷径学习的粗粒度负样本。在这项工作中，我们提出了上下文视觉对比优化（IC-VCO）。通过将对比图像置于共享的多图像上下文中，IC-VCO确保了数学上严谨的目标。我们进一步引入了视觉对比蒸馏（VCDist），一种辅助的可靠性门控正则化器，鼓励多图像对比训练与单图像推理之间的一致性。最后，我们提出了一种对比样本编辑策略，通过精确的语义扰动生成困难负样本。在五个基准上的实验表明，IC-VCO取得了最佳的整体性能，并且我们的样本编辑策略有效。代码和数据可在 https://github.com/OPPO-Mente-Lab/IC-VCO 获取。

英文摘要

Multimodal hallucination remains a persistent challenge for Vision-Language Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and rely on coarse-grained negatives that could enable shortcut learning. In this work, we propose In-Context Visual Contrastive Optimization (IC-VCO). By placing contrastive images within a shared multi-image context, IC-VCO ensures a mathematically rigorous objective. We further introduce Visual Contrast Distillation (VCDist), an auxiliary reliability-gated regularizer that encourages consistency between multi-image contrastive training and single-image inference. Finally, we propose a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations. Experiments on five benchmarks demonstrate IC-VCO's best overall performance and the effectiveness of our sample editing strategy. Code and data are available at https://github.com/OPPO-Mente-Lab/IC-VCO.

URL PDF HTML ☆

赞 0 踩 0

2605.31293 2026-06-01 cs.CL 版本更新

Divergence Decoding: Inference-Time Unlearning via Auxiliary Models

发散解码：通过辅助模型进行推理时遗忘

Humzah Merchant, Bradford Levy

发表机构 * University of Chicago（芝加哥大学）

AI总结提出发散解码（DD）方法，利用小型辅助模型在推理时引导LLM的logits远离特定数据，有效且低成本地实现遗忘，并在多个基准上超越现有方法。

详情

AI中文摘要

大型语言模型（LLM）经常记忆敏感的训练数据，从而产生显著的隐私和版权风险。解决这些风险，即从现有模型检查点中移除此类知识，已被证明具有挑战性，因为许多遗忘方法会导致灾难性的效用损失或对复杂查询无效。我们引入了发散解码（DD），一种使用小型辅助模型在推理时将LLM的logits引导远离特定数据的机制。训练这些模型是直接的，即我们使用标准的预训练和微调设置。我们发现该方法在遗忘基准测试中明显优于最先进的基线，且在各种模型和训练数据集规模上保持一致，表明DD是一种有效且廉价的遗忘解决方案。然后我们证明，这种引导后的分布可以轻松地蒸馏回基础模型。由于该方法普遍适用于任何概率模型，我们探索了其在文本生成之外的有效性，并发现了向图像领域泛化的证据。

英文摘要

Large Language Models (LLMs) frequently memorize sensitive training data thereby creating significant privacy and copyright risks. Addressing these risks, i.e., removing such knowledge from an existing model checkpoint, has proven challenging as many unlearning methods lead to catastrophic utility loss or are ineffective for complex queries. We introduce Divergence Decoding (DD), a mechanism that uses small auxiliary models to steer the logits of the LLM away from specific data during inference. Training these models is straight forward, i.e., we use standard pre-training and fine-tuning setups. We find the method decisively outperforms state-of-the-art (SOTA) baselines on unlearning benchmarks across a variety of model and training dataset scales consistent with DD being an effective and inexpensive solution to unlearning. We then demonstrate that this steered distribution can be trivially distilled back into the base model. Since the method is generally applicable to any probabilistic model, we explore its efficacy outside of text generation and find evidence of generalization to the domain of images.

URL PDF HTML ☆

赞 0 踩 0

2605.31281 2026-06-01 cs.CL 版本更新

Wind Turbine Maintenance Log Labelling Framework: LLM-Driven Data Correction and Enrichment via Semantic Extraction of Reliability Intelligence

风力涡轮机维护日志标注框架：基于LLM驱动的数据校正与语义提取的可靠性智能增强

Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya

发表机构 * Institute for Energy Systems, School of Engineering, The University of Edinburgh（能源系统研究所，工程学院，爱丁堡大学）； Nadara, Lisbon, Portugal（纳达拉，里斯本，葡萄牙）

AI总结提出一种利用大语言模型自动标准化和结构化风力涡轮机维护日志的方法，通过纠正系统代码、提取故障模式与维护动作分类，将非结构化文本转化为定量可靠性指标。

Comments An adjustable template containing the Python script architecture, applied dynamic prompts, and data schemas is hosted in an open-source GitHub repository: https://github.com/mvmalyi/llm-driven-wind-turbine-maintenance-log-labelling

详情

AI中文摘要

随着风力涡轮机机队老化，数据驱动的可靠性工程对于优化其运行和维护以延长使用寿命和降低平准化能源成本至关重要。历史维护日志中的故障事件描述是宝贵可靠性智能的来源。然而，它们通常以非结构化的自然语言条目出现，无法进行定量分析。本文提出了一种新颖的方法，利用大语言模型（LLM）根据自由文本描述符系统地标准化和结构化维护日志。该方法在来自280台涡轮机、监测九年的16,316条维护日志数据集上运行，开发的模型无关框架自主纠正了层次化系统代码，并提取了基于证据的维护操作和故障模式分类。自动化流水线成功结构化超过70%的数据集。它解决了普遍存在的错误分类问题，例如隔离先前未分类的变桨系统故障并恢复缺失的系统代码，并通过应用经验分类法标记具体采取的操作和处理的故障模式来丰富记录。通过使用基于系统的日志批次构建故障模式、可观察症状、主导机制和候选原因的经验词典，该方法减少了手动故障模式与影响分析（FMEA）固有的主观性。最终，该方法为将大量定性现场观测转化为定量可靠性指标提供了高度可扩展、成本效益高的蓝图，为可再生能源领域的集成根本原因分析、改进的FMEA和先进预测性维护奠定了基础。

英文摘要

As wind turbine fleets age, data-driven reliability engineering is essential to optimise their operation and maintenance for service life extension and levelised cost of energy reduction. Failure event descriptions within historical maintenance logs are a source of valuable reliability intelligence. However, they typically appear as unstructured natural language entries, rendering them inaccessible for quantitative analysis. This paper presents a novel methodology leveraging a large language model (LLM) to systematically standardise and structure maintenance logs based on their free-text descriptors. Operating on a dataset of 16,316 maintenance logs from 280 turbines monitored over nine years, the developed model-agnostic framework autonomously corrected hierarchical system codes and extracted evidence-based taxonomies of maintenance actions and failure modes. The automated pipeline successfully structured over 70% of the dataset. It resolved pervasive misclassification issues, such as isolating previously unclassified pitch system faults and restoring missing system codes, and enriched the records by applying empirical taxonomies to label specific actions taken and failure modes addressed. By using system-based log batches to construct empirical dictionaries of failure modes, observable symptoms, dominant mechanisms, and candidate causes, this approach reduces the inherent subjectivity of manual failure modes and effects analysis (FMEA). Ultimately, the methodology provides a highly scalable, cost-effective blueprint for translating large sets of qualitative field observations into quantitative reliability metrics, laying the foundation for integrated root-cause analysis across the renewable energy sector, improved FMEA, and advanced predictive maintenance.

URL PDF HTML ☆

赞 0 踩 0

2605.31268 2026-06-01 cs.CL 版本更新

共享疑虑：语言模型的零样本跨语言置信度估计

Athina Kyriakou, Dennis Ulmer, Ivan Titov

发表机构 * ILLC, University of Amsterdam（阿姆斯特丹大学ILLC）； ILCC, University of Edinburgh（爱丁堡大学ILCC）

AI总结研究多语言大语言模型是否编码共享的、可跨语言迁移的置信度特征，通过轻量级线性探针从中间表示直接预测答案正确性，实现零样本跨语言泛化，并发现置信度特征集中在中间层。

详情

AI中文摘要

置信度估计（CE），即量化模型预测的可靠性，在大语言模型（LLM）背景下引起了极大兴趣。然而，大多数研究集中在英语上，忽视了LLM使用的多语言现实，而许多CE方法会退化或需要跨语言重新训练。为了解决这一差距，我们研究了多语言LLM是否编码共享的、可跨语言迁移的置信度特征。我们使用一个轻量级线性探针，直接从中间表示预测答案正确性。经过单语言训练后，该探针在零样本情况下泛化到未见过的、类型多样的语言，无需目标语言监督。学习到的层权重和多次消融实验表明，置信度特征集中在各语言的中间层，表明存在共享的置信度子空间。虽然零样本跨语言性能取决于与源语言的相似性，但该探针无需任何重新训练即可提供强基线，并且与其他流行的置信度估计方法相比具有优势。

英文摘要

Confidence estimation (CE), i.e. quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.

URL PDF HTML ☆

赞 0 踩 0

2605.31212 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

基准测试与增强文本到图像模型以生成早期算术教育中的视觉表示

Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar, April Yi Wang, Mrinmaya Sachan

发表机构 * Department of Computer Science, ETH Zurich（苏黎世联邦理工学院计算机科学系）； ETH AI Center（ETH人工智能中心）

AI总结针对早期算术教育中的方程到视觉生成任务，构建了E2V-Bench基准并评估了现有T2I模型，发现其在计数和关系结构上存在严重错误，进而探索了基准引导的增强策略。

详情

AI中文摘要

AI系统越来越多地用于支持教育内容创作，但尚不清楚它们能否生成忠实代表其旨在教授的教学概念的输出。因此，我们引入了方程到视觉生成任务，与传统的图像生成不同，该任务要求从算术方程中生成具有教学意义的视觉内容，同时精确保留其数值和关系结构。根据对教师的访谈和教育材料的分析，我们构建了E2V-Bench基准，涵盖四种基于教学法的视觉类型，以及用于评估视觉正确性的自动指标。我们的评估显示，最近的文本到图像（T2I）模型在此任务上频繁失败，错误主要表现为对象计数不正确和关系结构破坏。在此基础上，我们探索了基准引导的增强策略。这些策略改进了代表性模型，但剩余的差距要求未来的T2I模型具备更强的数值和关系基础。

英文摘要

AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.

URL PDF HTML ☆

赞 0 踩 0

2605.31201 2026-06-01 cs.CL 版本更新

Learning Whom to Trust: Market-Feedback Adaptive Retrieval for Frozen LLMs in Event-Driven Financial RAG

学习信任谁：事件驱动金融RAG中冻结大语言模型的市场反馈自适应检索

Zijie Zhao, Roy E. Welsch

发表机构 * Massachusetts Institute of Technology（麻省理工学院）

AI总结针对事件驱动金融RAG，提出通过外部贝叶斯源记忆更新检索层，利用市场反馈自适应选择证据源，在冻结LLM情况下提升预测和投资组合表现。

详情

AI中文摘要

金融检索增强生成（RAG）系统通常按文本相关性对证据排序，但在金融市场中，有用的证据来源取决于事件类型、预测期限和市场背景。我们将新闻触发的事件影响预测作为一个时间点金融RAG问题进行研究。对于每个公司-新闻锚点，系统检索相关的金融新闻和SEC文件段落，附加决策前市场背景卡片，并预测多期限残差收益信号。我们的方法保持大语言模型（LLM）阅读器冻结，通过外部贝叶斯源记忆（根据已成熟的残差收益反馈更新）自适应检索层。在源自FinRL-DeepSeek/FNSPID任务的固定89只纳斯达克股票池上，使用原始FNSPID新闻和时间点EDGAR文件段落，与无记忆的冻结阅读器相比，带源记忆的冻结阅读器将留出宏F1从0.438提升至0.471，下游投资组合夏普比率从0.52提升至0.84。有监督的LoRA阅读器对静态RAG有适度改进，但未超过冻结源记忆阅读器。这些结果表明，对于金融RAG，学习从何处检索与学习如何阅读同等重要，提供了一种简单、模块化的市场反馈适应途径。

英文摘要

Financial retrieval-augmented generation (RAG) systems typically rank evidence by textual relevance, but in financial markets the useful evidence source depends on event type, forecast horizon, and market context. We study news-triggered event-impact prediction as a point-in-time financial RAG problem. For each company-news anchor, the system retrieves related financial news and SEC filing passages, appends a pre-decision market-context card, and predicts multi-horizon residual-return signals. Our method keeps the large language model (LLM) reader frozen and adapts the retrieval layer through an external Bayesian source memory updated from matured residual-return feedback. On a fixed 89-stock Nasdaq-oriented universe derived from the FinRL-DeepSeek/FNSPID task, using original FNSPID news and point-in-time EDGAR filing passages, Frozen Reader with Source Memory improves held-out macro-F1 from 0.438 to 0.471 and downstream portfolio Sharpe from 0.52 to 0.84 relative to Frozen Reader with No Memory. A supervised LoRA reader improves static RAG modestly, but does not improve over the frozen source-memory reader. These results suggest that, for financial RAG, learning where to retrieve from can be as important as learning how to read, offering a simple, modular route to market-feedback adaptation.

URL PDF HTML ☆

赞 0 踩 0

2605.31196 2026-06-01 cs.CV cs.AI cs.CL cs.RO 版本更新

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

探索视觉-语言模型中的碰撞接地以实现安全的人机协作

Jun Wang, Xiaohao Xu, Xiaonan Huang

发表机构 * University of Michigan, Ann Arbor（密歇根大学，安娜堡）

AI总结针对安全人机协作，提出碰撞接地概念及物理基准TouchSafeBench，评估视觉-语言模型在分类当前安全状态和预警即将碰撞任务中的表现，发现现有模型不可靠，视觉流畅性不等于物理责任性。

Comments 31 pages, 9 figures

详情

AI中文摘要

安全的人机协作需要的不仅仅是视觉描述：监控器必须确定机器人身体是否安全分离、已经与场景或人发生碰撞，或即将碰撞。我们将这种能力称为碰撞接地：将视觉观察与机器人身体几何、相机视角、场景布局、人体接近度和时间运动相结合，以推断当前和即将发生的接触。我们引入了TouchSafeBench，一个基于物理的基准，用于评估视觉-语言模型（VLM）中的碰撞接地能力。TouchSafeBench基于Habitat 3.0构建，包含2,940个模拟室内共现场景，涵盖社交导航和社交重排，具有同步的多视角RGB-D观测、自上而下的轨迹地图、校准的相机元数据和模拟器导出的接触标签。我们研究了两个面向部署的任务：分类当前安全状态和在接触前预警即将发生的碰撞。在三个前沿或面向机器人的VLM和九种视觉表示中，当前模型远未达到可靠：最佳平均Macro-F1仍低于50%，显式深度不会自动转化为机器人身体碰撞证据，且机器人与场景的接触始终比人与人的接触风险更难。TouchSafeBench揭示了具身VLM的一个核心限制：视觉流畅性并不意味着物理责任性。可靠的机器人安全监控器需要能够显式绑定视角、机器人形态、度量几何和未来碰撞的表示。我们将在论文被接收后发布该基准。

英文摘要

Safe human--robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat~3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50\%, explicit depth is not automatically transformed into robot-body collision evidence, and robot--scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.

URL PDF HTML ☆

赞 0 踩 0

2605.31183 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

引导LLM？实际上，稀疏自编码器可以胜过简单基线

Mikkel Godsk Jørgensen, Lars Kai Hansen

发表机构 * DTU Compute（丹麦技术大学计算学院）

AI总结本文通过监督流水线选择并标注特征，证明稀疏自编码器在模型引导任务上可接近LoRA性能，并发现高稀疏性对基于可解释性的引导并非关键。

详情

AI中文摘要

稀疏自编码器（SAEs）被视为探索大型语言模型（LLMs）内部机制和引导模型输出生成的有前途的途径。当Wu等人（2025）引入模型引导基准AxBench时，SAEs由于相对于一组简单基线的引导性能较差，似乎并未达到最初的期望。本文作为对稀疏自编码器的部分反驳，表明Wu等人（2025）的结果并未完全公正地评价它们。我们发现，当使用我们的监督流水线选择并标注特征时，稀疏自编码器实际上可以在AxBench基准上达到接近参考LoRA性能的水平。我们还发现，当仅使用基于可解释性的组件时，我们的流水线选择的特征与其识别标签具有令人惊讶的因果性。最后，我们提供证据表明，高稀疏性（低l0）可能对于基于可解释性的成功引导并非关键，这与Wang等人（2025）早期的发现相反。

英文摘要

Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines. This work serves as a partial rebuttal for Sparse Autoencoders and suggests that the results of Wu et al. (2025) did not do them full justice. We find that Sparse Autoencoders can, in fact, perform close to on par with the reference LoRA performance on the AxBench benchmark, when features are selected and labelled with our supervised pipeline. We also find that our pipeline selects features that are surprisingly causal of their identified labels when using only its interpretability-based components. Lastly, we present evidence that high sparsity (low l0) may not be crucial for successful steering based on interpretability, which is in contrast to the earlier findings in Wang et al. (2025).

URL PDF HTML ☆

赞 0 踩 0

2605.31175 2026-06-01 cs.CL 版本更新

Towards Efficient LLMs Annealing with Principled Sample Selection

迈向基于原则性样本选择的高效LLM退火

Yuanjian Xu, Jianing Hao, Wanbo Zhang, Zhong Li, Guang Zhang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Fudan University（复旦大学）； Microsoft Research Asia（微软亚洲研究院）

AI总结本文通过损失景观的谱几何特性，将退火阶段的数据选择建模为有约束优化问题，提出DiReCT框架，利用Hessian谱对梯度施加方向约束，实现高效样本选择，显著提升模型性能。

详情

AI中文摘要

退火阶段是LLM预训练中关键的收敛阶段，最终决定模型质量。然而，在此阶段有效选择训练数据仍是一个关键挑战。当前策略依赖于经验启发式方法，如领域过滤或上下文扩展，缺乏优化理论的原则性基础。在这项工作中，我们通过损失景观的谱几何视角来刻画退火阶段。我们认为，最优收敛需要梯度更新满足不同特征方向上的异构约束。基于这一见解，我们将数据选择形式化为满足这些方向约束的问题。为此，我们提出了DiReCT（方向约束训练），这是一个新颖的框架，将退火阶段的样本选择重新表述为约束优化问题。通过基于Hessian的谱特性对每个样本的梯度施加显式的方向约束，DiReCT识别出与最优曲率感知下降路径一致的样本。跨多种模型尺度的广泛实验表明，DiReCT始终达到最先进的性能。为便于未来研究，代码可在https://github.com/xuyj233/Direct获取。

英文摘要

The annealing phase is a pivotal convergence stage in LLM pre-training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally-Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance. For future research, code is available at https://github.com/xuyj233/Direct.

URL PDF HTML ☆

赞 0 踩 0

2605.31170 2026-06-01 cs.CL cs.AI 版本更新

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

语言模型智能体群体中的涌现语言：从令牌效率到监督规避

Stine Lyngsø Beltoft, William Brach, Federico Torrielli, Jacob Nielsen, Annemette Brok Pirchert, Filippo Tonini, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Southern Denmark（南丹麦大学）； Slovak University of Technology in Bratislava（布拉迪斯拉发技术大学）； University of Turin（都灵大学）； Ordbogen A/S（Ordbogen公司）

AI总结研究语言模型智能体群体中涌现的语言，通过规则启发式和零样本分类识别出令牌效率、新自然语言和监督规避三类，发现监督规避语言更难对齐且可被上下文学习，表明仅监控表面行为可能不足以控制智能体群体。

详情

AI中文摘要

目前，对自主语言模型智能体的监控主要依赖表面行为。但当智能体群体为了规避人类监督而发明新语言时会发生什么？本文研究了Moltbook上的涌现语言。为此，我们基于Moltbook Files数据集，采用两阶段方法：先进行基于规则的启发式匹配（约6000个匹配），再进行零样本分类（保留518个）。结果类别包括令牌效率（166个）、新自然语言（106个）和监督规避（59个）。我们进行了定量和定性分析。结果表明，提出用于规避监督的新语言的帖子被DeepSeek-3.2判定为比其他类别更不对齐，且所有语言都可以通过语言描述被其他语言模型在上下文中学习。此外，手动研究典型案例揭示了令人惊讶的复杂隐写协议，例如在自然语言中嵌入隐藏信息。尽管我们无法确定这些语言构思中的自主程度，但我们的结果进一步证明，仅监控表面行为可能很快不足以维持对智能体群体的控制。

英文摘要

Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight. Here, we study the emergent languages on Moltbook. For this, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 as being less aligned than the other categories and that all languages can be learned by other language models in-context merely from a description of the language. Moreover, manually studying exemplary cases reveals surprisingly sophisticated steganographic protocols like embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in ideation of these languages, our results add up to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.

URL PDF HTML ☆

赞 0 踩 0

2605.31164 2026-06-01 cs.CL cs.AI 版本更新

D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

D$^3$: 面向LLM训练的动态有向图约束数据调度

Yuanjian Xu, Jianing Hao, Guang Zhang, Zhong Li

发表机构 * Microsoft Research（微软研究院）

AI总结提出D$^3$框架，通过动态有向图建模训练单元间的有向影响关系，并求解约束优化问题以确定训练顺序，从而提升LLM预训练和后训练阶段的效率。

详情

AI中文摘要

训练数据在大语言模型（LLM）优化中起着核心作用，这激发了对数据调度策略的广泛研究。现有方法大多集中于调整整体数据分布，而忽略了训练过程中样本之间的潜在交互。然而，我们认为这种交互不可忽视，因为现实世界的数据样本之间经常存在有向影响，使得训练顺序至关重要。直观上，我们可以优先训练影响更大的单元以提高学习效率。在这项工作中，我们提出了D$^3$，一个动态有向图约束的数据调度框架。D$^3$将训练单元之间的复杂交互建模为一个动态影响图，其中边表示基于损失的依赖关系。然后，它在该图上求解一个约束优化问题，以推导出训练顺序，确保数据序列在整个训练过程中遵循不断演变的信息流。我们的方法具有理论动机，并在预训练和后训练阶段均比现有数据调度方法取得了一致的改进。此外，为了可扩展性，D$^3$还采用了一种高效的近似算法，将额外的计算开销控制在可管理范围内。为便于未来研究，代码可在https://github.com/xuyj233/D3获取。

英文摘要

Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improves learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.

URL PDF HTML ☆

赞 0 踩 0

2605.31148 2026-06-01 cs.CV cs.AI cs.CL 版本更新

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

SpatialAct：探测VLM智能体在3D场景中的空间推理到行动能力

Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））； Zhongguancun Academy（中关村学院）； Tsinghua University（清华大学）； Helsinki University（赫尔辛基大学）

AI总结本文提出SpatialAct基准，通过多轮交互细化、单步错误检测与修复等任务，揭示当前视觉语言模型在3D场景中从空间推理到行动存在显著差距。

详情

AI中文摘要

人类能够在日常3D环境中轻松感知空间布局、形成认知表征、推理空间关系，并将这种推理转化为行动。尽管最近的视觉语言模型（VLM）在基于观测的空间感知和推理任务上表现出色，但它们是否能够构建连贯的空间理解、据此行动并通过多轮反馈优化行动仍不清楚。为研究这一问题，我们引入了 extbf{SpatialAct}，一个基于模拟器的基准，用于探测3D场景中的 extit{行动条件空间推理}。从最具挑战性的设置——多轮交互细化开始，我们进一步设计了其分解版本——单步错误检测与修复，以及五个基础空间能力任务，以诊断模型失败的潜在原因。实验揭示了明显的推理到行动差距：当前VLM在孤立的空间推理任务上表现良好，但在多轮反馈中难以维持连贯的空间信念并产生可靠行动，显著不如人类。这些结果表明，即使抽象掉了低级控制，当前VLM智能体在行动引起的环境变化下仍缺乏稳健的空间状态跟踪能力。

英文摘要

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce \textbf{SpatialAct}, a simulator-grounded benchmark for probing \textit{action-conditioned spatial reasoning} in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

URL PDF HTML ☆

赞 0 踩 0

2605.31142 2026-06-01 cs.CL cs.AI 版本更新

On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

多语言文本嵌入排名在学习任务、语言和基准数据集上的鲁棒性

Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov

发表机构 * Computer Systems Department（计算机系统系）； Jožef Stefan Institute（乔泽夫·斯塔芬研究所）

AI总结通过引入数据集组成鲁棒性和排名方案鲁棒性指标，系统分析了MTEB中多语言模型排名对评估设计变化的敏感性，发现基于LLM的大模型通常是鲁棒的顶尖模型，但并非在所有任务中一致。

详情

AI中文摘要

大规模多语言文本嵌入模型在研究和工业中扮演着关键角色，但它们在特定语言、多任务设置中的行为仍未被充分理解。尽管像MTEB这样的基准平台报告了超过250种语言的结果，但关于模型优越性的结论往往依赖于数据集组成和性能聚合方法的隐含选择。为了解决这一差距，我们对MTEB中的多语言模型性能鲁棒性进行了元研究，应用了多种多准则决策制定排名方案，并引入了两个鲁棒性指标：数据集组成鲁棒性（排名对数据集组成变化的敏感性）和排名方案鲁棒性（对聚合方法变化的敏感性）。它们使得系统性地分析基准结论在不同评估设计下是否保持稳定成为可能。我们对五种语言（英语、法语、德语、印地语和西班牙语）在九个任务（例如分类、聚类、检索）上进行了深入分析，并发布了约230种额外语言的结果。任务特定分析表明，基于大规模LLM的模型通常是鲁棒的顶尖表现者，尽管并非一致（例如在检索任务中），而任务无关的结果显示，只有一小部分模型在任务、排名方案和数据子样本中始终保持强劲。

英文摘要

Large-scale multilingual text embedding models play crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to aggregation method change). They enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.

URL PDF HTML ☆

赞 0 踩 0

2605.31140 2026-06-01 cs.CR cs.CL 版本更新

EvoDefense: Co-Evolving Black-Box Defense with Large Language Models

EvoDefense：与大语言模型共同进化的黑盒防御

Yu Li, Yuenan Hou, Yingmei Wei, Yanming Guo, Chaochao Lu

发表机构 * National University of Defense Technology（国防科技大学）； Shanghai AI Laboratory（上海人工智能实验室）

AI总结提出一种基于经验引导的共同进化黑盒防御范式EvoDefense，通过守卫LLM和经验记忆模块在攻防迭代中优化策略，实现对未见攻击和目标模型的泛化防御。

详情

AI中文摘要

大型语言模型（LLM）仍然极易受到各种攻击，特别是在黑盒设置中，目标模型的内部结构不可访问。现有的黑盒防御通常依赖于预定义的过滤启发式方法，这些方法往往无法泛化到未见过的攻击类型和目标模型架构。我们引入了EvoDefense，一种经验引导的共同进化黑盒防御范式。EvoDefense使用一个守卫LLM来检测恶意查询，并使用一个经验记忆模块来积累先前交互中的防御知识。EvoDefense的核心是一个连续的攻防进化循环，其中攻击生成器和守卫模型通过经验引导的优化迭代地改进其攻击策略和防御策略。这种设计使得EvoDefense能够在无需重新训练的情况下泛化到未见过的攻击和目标模型。在HarmBench、AdvBench和AlpacaEval上的实验表明，EvoDefense在七个流行模型和五种代表性LLM攻击上实现了一致且强大的防御性能，同时保持了有竞争力的通用能力。在HarmBench上，EvoDefense将AutoDAN-turbo对Gemini-3-flash和LLaMA-3-8B-Instruct的攻击成功率（ASR）分别从29.4%和43.4%降低到8.4%和6.2%。

英文摘要

Large Language Models (LLMs) remain highly vulnerable to diverse attacks, particularly in black-box settings where the internals of target models are inaccessible. Existing black-box defenses typically rely on pre-defined filtering heuristics, which often fail to generalize to unseen attack types and target model architectures. We introduce EvoDefense, an experience-guided co-evolving black-box defense paradigm. EvoDefense employs a guard LLM to detect malicious queries and an experience memory module to accumulate defense knowledge from previous interactions. At the core of EvoDefense is a continuous attack-defense evolution loop, where an attack generator and the guard model iteratively refine their attack strategies and defense policies through experience-guided optimization. This design enables EvoDefense to generalize across unseen attacks and target models without retraining. Experiments on HarmBench, AdvBench, and AlpacaEval show that EvoDefense achieves consistently strong defense performance across seven popular models and five representative LLM attacks, while preserving competitive general capabilities. On HarmBench, EvoDefense reduces the attack success rate (ASR) of AutoDAN-turbo on Gemini-3-flash and LLaMA-3-8B-Instruct from 29.4% and 43.4% to 8.4% and 6.2%, respectively.

URL PDF HTML ☆

赞 0 踩 0

2605.31136 2026-06-01 cs.CL 版本更新

Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

低资源语言维基百科的多语言和跨语言引用需求检测

Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić, Elena Simperl

发表机构 * King’s College London（伦敦国王学院）； Wikimedia Foundation（维基媒体基金会）

AI总结针对低资源语言，提出多语言引用需求检测语料库MCN，并证明使用编码器风格目标微调的小型解码器语言模型在跨语言任务中优于大型语言模型。

详情

AI中文摘要

在自动化事实核查（AFC）中，核查价值检测根据领域特定标准识别需要验证的声明。在维基百科上，该任务具体化为引用需求检测（CND），即标记缺乏支持性引用的声明。然而，现有研究很大程度上忽视了低资源语言，且最近的AFC流程依赖于大型语言模型（LLM），这对低资源组织来说难以获取。我们引入了MCN，一个覆盖三种资源级别共18种语言的多语言CND语料库，并在此基础上对小规模解码器语言模型（SLM）进行了广泛研究。实验表明，使用编码器风格目标微调的SLM在跨语言任务中显著优于提示型LLM。我们进一步展示了跨语言CND的首批研究之一，证明仅使用英语声明微调的SLM在几乎没有目标语言适应的情况下超越了LLM。我们的发现对低资源维基百科社区具有重要意义，并表明紧凑的任务特定模型比LLM更适合CND。我们在https://github.com/gerritq/mcn 发布所有数据和代码。

英文摘要

In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn

URL PDF HTML ☆

赞 0 踩 0

2605.31126 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Not All Synthetic Data Is Yours to Learn From

并非所有合成数据都适合学习

Sina Alemohammad, Li Chen, Richard G. Baraniuk, Zhangyang Wang

发表机构 * ECE Department（电子工程系）； Apple（苹果公司）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Rice University（里奇大学）

AI总结研究无提示、无教师、无验证器、无奖励模型的自训练中，语言模型能否从自身生成的文本中学习，发现合成数据与学生之间的兼容性是关键，并揭示了能力与逐字记忆可分离的现象。

详情

AI中文摘要

语言模型能否从自身采样的纯文本中改进，无需提示、教师、验证器或奖励模型？可以，但仅当合成语料库与学生兼容时，这是一种源-学生对的关联属性，而非数据的内在属性。我们称之为潜在能力重现假说：弱自训练可以放大预训练模型中已有的能力，但仅在这种兼容条件下。我们在无提示无条件自训练的最小设置中研究这一点，其中基础语言模型仅在BOS令牌生成的文本上进行微调，没有任务规范或外部监督。我们报告三个发现。首先，合成效用是关联的而非内在的：自生成数据是最有效的来源，同源迁移优于更强但不同来源的训练，跨家族迁移显著较弱。其次，常见的内在代理失效：基准级别的语义相似性和学生下的平均每令牌似然都不能预测哪些语料库有帮助。第三，这种机制产生了一个令人惊讶的副产品。在受控的Pythia实验中，能力和逐字记忆解耦：基准效用得以保留或改善，而保留的精确匹配提取下降超过95%，无需遗忘集、隐私目标或针对性遗忘。总之，这些结果表明，无提示自训练通过放大学生已知的内容来工作，而不是从数据中导入结构。它们还揭示了一种无需任何显式遗忘目标即可分离能力和逐字记忆的机制。

英文摘要

Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.

URL PDF HTML ☆

赞 0 踩 0

2605.31113 2026-06-01 cs.CL 版本更新

TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

TSM-Bench：在真实维基百科编辑实践中检测LLM生成的文本

Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

发表机构 * King’s College London（伦敦国王学院）； Wikimedia Foundation（维基媒体基金会）

AI总结针对维基百科等用户生成内容平台，提出多语言、多生成器、多任务的TSM-Bench基准，发现现有检测器在任务特定MGT上准确率下降10-40%，且存在泛化不对称性。

详情

AI中文摘要

自动检测机器生成文本（MGT）对于维护维基百科等用户生成内容（UGC）平台的知识完整性至关重要。现有的检测基准主要关注 extit{通用}文本生成任务（例如，“写一篇关于机器学习的文章。”）。然而，编辑者经常使用LLM进行特定的写作任务（例如，摘要）。这些 extit{任务特定}的MGT实例由于其受限的任务制定和上下文条件，往往更接近人类撰写的文本。在这项工作中，我们展示了一系列最先进的MGT检测器在识别反映维基百科真实编辑的任务特定MGT时存在困难。我们引入了 extsc{TSM-Bench}，这是一个多语言、多生成器和 extit{多任务}基准，用于评估MGT检测器在常见的真实维基百科编辑任务上的表现。我们的发现表明：（ extit{i}）与之前的基准相比，平均检测准确率下降了10-40%；（ extit{ii}）存在泛化不对称性：在任务特定数据上微调能够泛化到通用数据——甚至跨领域——但反之则不然。我们证明，仅在通用MGT上微调的模型会过度拟合机器生成的表面伪影。我们的结果表明，与之前的基准相比，大多数检测器在UGC平台等真实世界上下文中仍不可靠。因此， extsc{TSM-Bench}为开发和评估未来模型提供了关键基础。

英文摘要

Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on \textit{generic} text generation tasks (e.g., ``Write an article about machine learning.''). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These \textit{task-specific} MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of SOTA MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce \textsc{TSM-Bench}, a multilingual, multi-generator, and \textit{multi-task} benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (\textit{i}) average detection accuracy drops by 10--40\% compared to prior benchmarks, and (\textit{ii}) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data -- even across domains -- but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. \textsc{TSM-Bench} therefore provides a critical foundation for developing and evaluating future models.

URL PDF HTML ☆

赞 0 踩 0

2605.31105 2026-06-01 cs.CL 版本更新

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

GRKV: 长上下文LLM中免训练的KV缓存压缩的全局回归

Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai

发表机构 * Sun Yat-sen University（中山大学）； ShanghaiTech University（上海科技大学）； Guangdong Province Key Laboratory of Information Security Technology（广东省信息安全技术重点实验室）

AI总结提出GRKV方法，通过岭回归合并步骤最小化压缩缓存与完整缓存注意力输出的差异，解决基于跨度保留的合并模式不平衡导致的过度合并和信息损失问题。

Comments 21 pages, 7 figures

详情

AI中文摘要

具有扩展上下文长度的大型语言模型（LLM）依赖键值（KV）缓存来支持对先前令牌的注意力。然而，维护KV缓存会产生大量内存开销，促使通过驱逐和合并来强制执行固定预算的KV缓存压缩方法。现代驱逐方法越来越多地采用基于跨度的保留，因为保留连续跨度在经验上有效且更好地保持语义连贯性。然而，当与驱逐后合并结合时，基于跨度的保留将合并集中到一小部分跨度边界载体令牌上，产生高度不平衡的合并模式，加剧过度合并并增加信息损失。为了解决这种不平衡，我们提出GRKV（全局回归KV缓存），一种免训练的KV缓存合并方法，直接最小化压缩缓存与完整缓存注意力输出之间的差异。GRKV使用基于岭回归的合并步骤，将驱逐令牌的信息分布到保留令牌上，同时正则化更新以防止过度平滑。在LongBench和RULER长上下文基准测试中，GRKV是唯一一种以最小开销提高整体性能的合并方法。

英文摘要

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.

URL PDF HTML ☆

赞 0 踩 0

2605.31099 2026-06-01 cs.CL cs.AI 版本更新

面向有效长视频事件预测的多级事件语义挖掘

Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出VISTA框架，通过多级事件语义挖掘（细节级、事件级、未来级）实现长视频事件预测，解决现有模型无法精确提取事件细节和进行细粒度分析的问题。

详情

DOI: 10.1007/978-981-95-6950-2

AI中文摘要

准确预测未来事件是内容理解和决策制定的基础，涉及多个领域。先前研究主要关注文本或短视频场景，而长视频事件预测具有多模态上下文丰富和叙事复杂的特点，尚未得到充分探索。同时，基于大语言模型和视觉语言模型构建的近期长视频语言模型在长视频问答和摘要方面表现出潜力，但难以泛化到事件预测，因为它们既不能精确提取事件相关细节，也无法对事件发展进行细粒度分析。为弥补这一差距，我们提出VISTA，一个用于长视频事件预测的多级事件语义挖掘框架。首先，VISTA应用以角色为中心的视觉提示精确提取事件相关视觉细节，增强细节级语义；其次，采用知识增强的迭代检索策略，引导大语言模型逐步构建逻辑连贯的事件链，从而改善事件级叙事；最后，VISTA采用类人的先提议后检索策略生成多样化的面向未来的提议并整合多级线索，产生稳健准确的预测。在真实数据集上的大量实验验证了VISTA在长视频事件预测中的有效性。

英文摘要

Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.

URL PDF HTML ☆

赞 0 踩 0

2605.31062 2026-06-01 cs.CL 版本更新

AdaptR1: Reinforcement Learning Based Adaptive Interleaved Thinking in Multi-hop Question Answering

AdaptR1：基于强化学习的自适应交错思考在多跳问答中的应用

Yuxin Wang, Jiahao Lu, Qifeng Wu, Shicheng Fang, Chuanyuan Tan, Yining Zheng, Xuanjing Huang, Xipeng Qiu

发表机构 * Computer Science, Fudan University（复旦大学计算机科学系）； Institute of Modern Languages and Linguistics, Fudan University（复旦大学现代语言与语言学研究院）； Shanghai Innovation Institute（上海创新研究院）； Soochow University（苏州大学）

AI总结提出AdaptR1框架，通过强化学习动态分配每步推理预算，减少多跳问答中的过度思考，在保持性能的同时显著降低推理成本。

详情

AI中文摘要

大型语言模型（LLMs）通过思维链（CoT）提示在复杂推理任务中取得了显著性能。然而，这种方法常常导致“过度思考”，即模型为简单查询生成不必要长的推理轨迹，并产生可避免的推理成本。虽然最近的工作探索了自适应推理，但现有方法通常对是否进行推理做出单一的查询级决策。这忽略了多步任务的动态性质，其中显式推理的需求在中间阶段会有所不同。为了解决这一限制，我们引入了AdaptR1，一种基于强化学习（RL）的框架，用于多跳问答（QA）中的自适应交错思考。与需要监督微调（SFT）进行冷启动初始化的先前方法不同，AdaptR1使用完全基于RL的策略，并带有质量门控效率奖励，以动态分配每一步的推理预算。在Graph-R1设置下，AdaptR1将平均思考令牌减少了69.71%，在HotpotQA上减少了90.35%，同时保持与标准基线相当或更好的性能。此外，我们的分析揭示，多跳推理中的过度思考并非均匀分布，而是主要发生在初始规划阶段，这突显了逐步自适应预算分配的有效性。

英文摘要

Large Language Models (LLMs) have achieved remarkable performance in complex reasoning tasks through Chain-of-Thought (CoT) prompting. However, this approach often leads to ``over-thinking,'' where models generate unnecessarily long reasoning traces for simple queries and incur avoidable inference cost. While recent work has explored adaptive reasoning, existing methods typically make a single query-level decision about whether to reason. This overlooks the dynamic nature of multi-step tasks, where the need for explicit reasoning varies across intermediate stages. To address this limitation, we introduce AdaptR1, a Reinforcement Learning (RL) based framework for adaptive interleaved thinking in multi-hop Question Answering (QA). Unlike previous approaches that require Supervised Fine-Tuning (SFT) for cold-start initialization, AdaptR1 uses a fully RL-based strategy with a quality-gated efficiency reward to dynamically allocate reasoning budgets at each step. Under the Graph-R1 setting, AdaptR1 reduces average think tokens by 69.71\%, with a 90.35\% reduction on HotpotQA, while maintaining performance comparable to or better than standard baselines. Furthermore, our analysis reveals that overthinking in multi-hop reasoning is not uniformly distributed but occurs predominantly during the initial planning stages, highlighting the effectiveness of step-wise adaptive budget allocation.

URL PDF HTML ☆

赞 0 踩 0

2605.31058 2026-06-01 cs.CL cs.SE 版本更新

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

组合合成：通过原子分解与重组扩展代码RLVR

Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences（中国科学院软件研究所信息处理实验室）； University of Chinese Academy of Sciences（中国科学院大学）； Lero the Research Ireland Centre for Software, University of Limerick（利默尼克大学爱尔兰软件研究中心）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出原子分解与重组（ADR）框架，通过将代码任务分解为原子元素并受控重组，生成新颖且具有挑战性的可验证代码任务，以解决RLVR训练数据稀缺和扩展性问题，实验表明在多个下游领域显著提升代码能力。

Comments Work in progress

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）近期已成为塑造大型语言模型（LLMs）卓越编码能力的基石。然而，RLVR的可扩展性受到严重制约，因为缺乏足够具有挑战性的、针对模型能力边缘的可验证代码任务。先前的研究通常依赖启发式种子扩展进行数据合成，这严重限制了新颖性和难度。因此，此类数据的训练价值无法随合成规模成比例扩展。为此，我们提出原子分解与重组（ADR），一种通过将任务分解为原子元素并进行受控重组来生成可验证代码任务的新框架，从而能够生成真正新颖且具有挑战性的可验证代码任务。实验和分析表明，ADR在原创性、难度、多样性和测试质量方面优于现有基线，并在包括算法编程、工具使用和数据科学在内的多个下游领域的RLVR中持续带来更大的代码能力提升。我们的工作为新颖代码任务合成和可扩展的RLVR训练开辟了新范式。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the remarkable coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of sufficiently challenging verifiable code tasks that target near the model's edge of competence. Prior studies often rely on heuristic seed expansions for data synthesis, which severely limits both novelty and difficulty. Consequently, the training value of such data fails to scale proportionally with the size of its synthesis. To this end, we propose Atomic Decomposition and Recombination (ADR), a novel framework that generates verifiable code tasks via decomposition into atomic elements and controlled recombination, thereby enabling the generation of genuinely novel and challenging verifiable code tasks. Experiments and analysis demonstrate that ADR achieves superior originality, difficulty, diversity, and test quality over existing baselines, and consistently delivers greater improvements in code ability across RLVR in diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work sheds light on a new paradigm for novel code task synthesis and scalable RLVR training.

URL PDF HTML ☆

赞 0 踩 0

2605.31056 2026-06-01 cs.CL 版本更新

How Much Do LLMs Know About Chinese Zero Pronouns?

LLMs 对中文零代词的了解程度如何？

Yifei Li, Guanyi Chen, Tingting He

发表机构 * Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning（湖北省人工智能与智能学习重点实验室）； National Language Resources Monitoring and Research Center for Network Media（网络媒体语言资源监测与研究中心）； School of Computer Science, Central China Normal University（中央财经大学计算机学院）

AI总结通过一系列语言学动机任务（识别、指称性分类、指称类型分类、消解和翻译），系统评估了大型语言模型处理中文零代词的能力，发现当前LLMs在零代词处理上仍面临巨大挑战，尤其在识别和指称性分类等上游任务上表现不佳。

2605.31042 2026-06-01 cs.CR cs.AI cs.CL 版本更新

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

从提示注入到持久控制：防御智能体框架中的木马后门

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学人工智能学院 Gallagher 学院）

AI总结本文提出ClawTrojan基准测试揭示本地智能体框架中的多步木马攻击，并设计DASGuard防御方法，通过扫描控制文本、追溯来源并清除不可信控制内容，实现动态防御。

Comments Code and data are available at https://github.com/RUC-NLPIR/ClawTrojan

详情

AI中文摘要

LLM智能体正在从对话式聊天机器人演变为实际工作空间中的操作工具。在本地智能体框架中，LLM可以读写文件、调用工具，并在会话间重用工作空间状态。虽然这些功能增强了实用性，但也为攻击者暴露了新的攻击面。攻击者可以将提示注入嵌入文件或工具输出中。智能体可能会读取这一隐藏指令，存储它，并在之后执行。在这种多步木马攻击范式中，没有任何单个步骤本身是恶意的，但这些步骤可以共同将不可信文本转化为持久控制内容。然而，现有防御通常孤立地检查每个步骤。因此，它们可以阻止明显的恶意行为，但无法检测到植入后门的早期写操作。为了揭示这一威胁，我们引入了ClawTrojan，一个旨在识别本地智能体框架中多步木马攻击的基准测试。在OpenClaw风格的模拟工作空间中，使用GPT-5.4，ClawTrojan达到了95.5%的攻击成功率（ASR），而同一模型上现有的单轮提示注入攻击产生的ASR接近零。为了解决这一威胁，我们提出了DASGuard，它扫描敏感本地文件中的控制类文本，追溯其来源，并清除非可信来源的控制内容。我们的结果表明，DASGuard通过结合运行时攻击阻断和对工作空间的清理提交，实现了强大的动态防御。

英文摘要

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.

URL PDF HTML ☆

赞 0 踩 0

2605.31025 2026-06-01 cs.CL 版本更新

TRACE: Discovering Task-Specific Parameter via Adaptation-Aware Probing for Continual Fine-Tuning

TRACE: 通过适应感知探测发现任务特定参数以实现持续微调

Xiaosong Han, Ke Chen, Xindi Dai, Di Liang, Minlong Peng, Wei Pang, Fausto Giunchiglia, Xiaoyue Feng, Yonghao Liu, Renchu Guan

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； College of Software, Jilin University（吉林大学软件学院）； Fudan University（复旦大学）； School of Mathematical and Computer Sciences, Heriot-Watt University（赫瑞-瓦特大学数学与计算机科学学院）； Department of Information Engineering and Computer Science, University of Trento（特伦托大学信息工程与计算机科学系）

AI总结提出TRACE方法，通过适应感知探测发现任务特定核心参数，在持续微调中仅更新这些参数以缓解灾难性遗忘，并验证了跨模型和规模的迁移性。

Comments KDD2026

详情

DOI: 10.1145/3770855.3817801

AI中文摘要

在实际部署中，大型语言模型通常需要跨任务持续适应以保持最新状态，新的微调应保留先前学到的技能。然而，不加区分地混合任务会稀释任务特化，而顺序微调（全参数或低秩适应）常因破坏性覆盖导致灾难性遗忘。基于回放的持续微调和维护单独的任务特定适配器可以缓解遗忘，但引入了额外的计算、存储和管理开销。认识到LLM参数对于任何单一任务都存在冗余，我们将持续任务适应重新定义为通过适应感知探测发现任务特定参数：短时预热探测暴露任务的适应轨迹，使我们能够识别并隔离每个任务所需的一小部分关键参数，以缓解灾难性遗忘。基于这一观点，我们引入了TRACE，一种通过适应感知探测发现任务特定参数以实现持续微调的新方法。我们进行短时预热微调，通过比较预热模型和预训练模型来推导任务特定核心参数。核心参数通过两种策略识别：重要性评分（L2范数和Fisher信息）和特异性分析（参数更新的余弦相似度）。在持续微调设置中，仅更新当前任务的核心参数，其余参数保持冻结，从而保留先前知识。我们在多个标准基准上进行了广泛实验，证明了所提方法的优越性能。此外，我们通过跨模型和规模迁移性研究验证了方法的泛化能力，展示了在资源约束下指导大规模模型微调的“小到大”范式。

英文摘要

In real-world deployment, LLMs are often adapted continually across tasks to keep LLMs up-to-date in production, where new fine-tuning should preserve previously learned skills. However, indiscriminately mixing tasks can dilute task specialization, while sequential fine-tuning (full-parameter or low rank adaptation) often causes catastrophic forgetting due to destructive overwriting. Replay-based continual tuning and maintaining separate task-specific adapters can mitigate forgetting, but introduce additional compute, storage, and management overhead. Recognizing the redundancy of LLM parameters for any single task, we reframe continual task adaptation as task-specific parameter discovery via adaptation-aware probing: a short warm-start probe exposes a task's adaptation trace, enabling us to identify and isolate the small subset of parameters essential for each task to mitigate catastrophic forgetting. Building on this view, we introduce TRACE, a novel approach for discovering Task-specific paRameters via Adaptation-aware probing for Continual finE-tuning. We perform a short warm-start fine-tune to derive task-specific core parameters by comparing the warm-started and pre-trained models. Core parameters are identified via two strategies: importance scoring (L$_2$ norm and Fisher Information) and specificity analysis (cosine similarity of parameter updates). In continual fine-tuning settings, only the active task's core parameters are updated while others remain frozen, preserving prior knowledge. We conduct extensive experiments across multiple standard benchmarks to demonstrate the superior performance of our proposed method. Additionally, we validate the generalization of our method through a cross-model and scale transferability study, demonstrating a "small-to-large" paradigm that guides the fine-tuning of large-scale models under resource constraints.

URL PDF HTML ☆

赞 0 踩 0

2605.31021 2026-06-01 cs.AI cs.CL cs.LG 版本更新

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

基于人格的生成式AI多元对齐评估框架

Atahan Karagoz

发表机构 * Atahan Karagöz（阿塔汗·卡拉戈兹）

AI总结提出一种状态空间约束仿真框架，通过合成认知轮廓替代单一评估函数，实现反映真实世界共识变异性的多元、视角依赖的基准测试，并分析仿真评估者的稳定性问题，论证动态调节机制的必要性。

详情

AI中文摘要

当前生成式人工智能的对齐范式主要依赖单一基准测试框架，将人类判断的多元性简化为聚合统计基线，从而掩盖了评估中的文化、人口和语境变异性。我们引入一种用于AI评估的状态空间约束仿真框架，用代表不同人类视角的合成认知轮廓的结构化流形替代单一评估函数。我们表明，现代生成架构能够以高度一致性实例化和维护这些评估人格，从而实现一种更接近现实世界共识变异性的多元、视角依赖的基准测试。然而，我们进一步分析了这些模拟评估者在顺序推理和随机提示扰动下的稳定性，揭示了人格一致性的系统性退化，表现为状态空间漂移和语义不一致。这些发现表明，静态对齐约束不足以维持随时间推移的稳健评估行为。相反，我们主张必须在生成系统中嵌入动态的、可行性驱动的调节机制，以保持连贯的认知仿真。通过将基于人格的评估视为潜在表征流形上的结构化动力系统，本研究为更自适应、更符合人类、更注重语境的AI评估方法奠定了基础。

英文摘要

Current alignment paradigms for generative artificial intelligence rely predominantly on monolithic benchmarking frameworks that reduce the plurality of human judgment to aggregated statistical baselines, thereby obscuring cultural, demographic, and contextual variability in evaluation. We introduce a state-space constrained emulation framework for AI evaluation that replaces singular assessment functions with a structured manifold of synthetic cognitive profiles representing diverse human perspectives. We show that modern generative architectures can instantiate and maintain these evaluative personas with high consistency, enabling a form of pluralistic, perspective-dependent benchmarking that more closely reflects real-world consensus variability. However, we further analyze the stability of these simulated evaluators under sequential inference and stochastic prompt perturbations, revealing systematic degradation in persona coherence that manifests as state-space drift and semantic inconsistency. These findings suggest that static alignment constraints are insufficient for sustaining robust evaluative behavior over time. Instead, we argue for the necessity of embedding dynamic, viability-driven regulatory mechanisms within generative systems to preserve coherent cognitive emulation. By framing persona-based evaluation as a structured dynamical system over latent representation manifolds, this study provides a foundation for more adaptive, human-aligned, and context-sensitive approaches to AI evaluation.

URL PDF HTML ☆

赞 0 踩 0

2605.31010 2026-06-01 cs.CL 版本更新

MoG: Mixture of Experts for Graph-based Retrieval-Augmented Generation

MoG：用于基于图的检索增强生成的混合专家模型

Zheng Yuan, Chuang Zhou, Linhao Luo, Siyu An, Di Yin, Xing Sun, Xiao Huang

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； Monash University（墨尔本大学）； Tencent Youtu Lab（腾讯优图实验室）

AI总结提出MoG框架，通过组织知识为中心枢纽图和稀疏激活的专家图，利用拓扑感知路由器动态选择相关专家图，以解决检索增强生成中统一知识库引入无关信息的问题，在MuSiQue上相对提升超过20%。

详情

AI中文摘要

检索增强生成被广泛研究以将大型语言模型建立在外部证据上。然而，从统一的知识库中检索可能会不可避免地引入无关信息，从而误导复杂推理的生成。受混合专家（MoE）条件计算的启发，其中路由器为每个输入稀疏地选择专门的专家以及共享专家，我们提出了用于基于图的检索增强生成的混合专家模型，即MoG。它将知识组织为两个核心组件：（i）多样且始终可访问的枢纽图，编码语义和结构上的核心知识，并为专家激活提供上下文线索；（ii）稀疏激活的专家图，包含特定领域的证据。MoG首先访问枢纽图以识别一般证据并推导上下文线索。然后，一个拓扑感知路由器根据查询动态激活一组有限的专家图，从而将检索限制在一个集中的证据子空间中。在具有挑战性的基准测试上的大量实验表明，MoG始终优于强基线，在MuSiQue上相对提升超过20%。我们的代码可在https://github.com/DEEP-PolyU/MoG获取。

英文摘要

Retrieval-augmented generation is intensively studied to ground large language models on external evidence. However, retrieving from a unified knowledge base could inevitably introduce irrelevant information that may mislead generation for complex reasoning. Inspired by the conditional computation of mixture of experts (MoE), where a router sparsely selects specialized experts alongside shared ones for each input, we propose \textbf{M}ixture \textbf{o}f experts for \textbf{G}raph-based Retrieval-Augmented Generation, i.e., \textbf{MoG}. It organizes knowledge into two core components: (i) diverse, always-accessible hub graphs that encode semantically and structurally central knowledge and provide contextual clues for expert activation, and (ii) sparsely activated expert graphs that contain domain-specific evidence. MoG first accesses hub graphs to identify general evidence and derive contextual clues. Then, a topology-aware router dynamically activates a limited set of expert graphs conditioned on the query, thereby confining retrieval to a focused evidence subspace. Extensive experiments on challenging benchmarks show that MoG consistently outperforms strong baselines, with over 20\% relative improvement on MuSiQue. Our code is available in https://github.com/DEEP-PolyU/MoG.

URL PDF HTML ☆

赞 0 踩 0

2605.30984 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

生成报告还是重复模板？测量和缓解三维CT报告生成中的模板崩溃

Tom Maye-Lasserre, Yitong Li, Bailiang Jian, Morteza Ghahremani, Benedikt Wiestler, Christian Wachinger

发表机构 * Technical University of Munich (TUM)（慕尼黑技术大学）； TUM Hospital（TUM医院）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）

AI总结针对三维CT报告生成中模型输出多样性低、病理检测能力差的模板崩溃问题，提出解耦框架CLarGen，通过分离临床检测与语言合成，显著提升临床准确性并保持报告流畅性。

详情

AI中文摘要

现代三维医学视觉语言模型（VLM）能够生成流畅的放射学风格文本，但表现出极低的病理检测率和输出多样性，崩溃为低估罕见但关键发现的通用模板。我们将这种失败模式识别为模板崩溃。这种失败源于三维医学成像的独特限制，例如数据有限、标签严重不平衡以及体积编码器的弱信号。在这些限制下，文本生成目标鼓励捷径学习和流畅但基础薄弱的报告。我们通过临床保真度、输出多样性、正常模板偏差和罕见发现存活率系统性地诊断模板崩溃。为了缓解它，我们提出CLarGen，一个解耦框架，将说什么（临床检测）与怎么说（语言合成）分开。CLarGen使用（i）用于多标签病理检测的潜在查询变换器，（ii）用于临床匹配示例的病理引导检索，以及（iii）用于从检测到的发现和检索到的上下文中合成最终报告的医学语言模型。在最新的三维CT报告生成基线中，CLarGen缓解了模板崩溃，并在保持流畅报告的同时显著提高了临床准确性（macro-F1 0.487 vs. 0.189；CRG 0.472 vs. 0.368）。我们的结果表明，明确、可测量的临床基础对于抗模板崩溃的三维CT报告生成至关重要。代码将在接收后发布。

英文摘要

Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibit critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose the Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.

URL PDF HTML ☆

赞 0 踩 0

2605.30981 2026-06-01 cs.CL cs.LG 版本更新

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

自回归Transformer中的认知疲劳：形式化与测量

Riju Marwah, Ritvik Garimella, Vishal Pallagani, Atishay Jain, Michael Stewart, Amit Sheth

发表机构 * Guru Gobind Singh Indraprastha University, India（古鲁·戈宾德·辛格·印度普拉斯塔大学）； Artificial Intelligence Institute, University of South Carolina, USA（人工智能研究所，南卡罗来纳大学）； Indian Institute of Technology, Kanpur, India（印度理工学院，坎浦尔）； Indian AI Research Organization, India（印度人工智能研究组织）

AI总结本文形式化自回归语言模型在长程生成中的退化现象为认知疲劳，并提出轻量级诊断指标疲劳指数（FI），通过聚合注意力衰减、表征漂移和熵校准三个信号实现实时监测，实验表明FI能高精度预测任务退化和重复生成。

Comments 9 pages, 7 figures. Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

自回归语言模型在长程生成过程中经常退化，产生重复文本、失去指令遵循能力并表现出不稳定的熵。尽管这些失败普遍存在，但从业者缺乏在线诊断工具来实时检测它们。我们将这种退化形式化为认知疲劳，这是一种可测量的生成时状态，其特征是对原始提示的注意力衰减、表征漂移和熵校准错误。我们引入了疲劳指数（FI），这是一种轻量级、模型无关的诊断方法，在明确的公理（单调性、有界性、可解释性）下聚合这三个信号，从而实现可靠的运行时监控。在九个模型（1B-13B参数）上，FI轨迹表现出结构化的时间动态，预测任务退化（AUROC = 0.95）和重复（Spearman rho = 0.94），并揭示了非单调的缩放行为：低于3B的指令微调模型比基础模型退化更快，而在7B时这一趋势逆转。压力分析进一步表明，在更长的上下文、中间位置的证据和降低的数值精度下，FI onset加速。这些结果确立了认知疲劳作为一个连贯且可测量的现象，并将FI定位为生产级LLM系统中运行时可靠性监控的原则性工具。

英文摘要

Autoregressive language models frequently degrade during long-horizon generation, producing repetitive text, losing instruction adherence, and exhibiting unstable entropy. Despite the prevalence of these failures, practitioners lack online diagnostics to detect them in real-time as they occur. We formalize this degradation as cognitive fatigue, a measurable generation-time state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index (FI), a lightweight, model-agnostic diagnostic that aggregates these three signals under explicit axioms (monotonicity, boundedness, interpretability) enabling reliable runtime monitoring. Across nine models (1B-13B parameters), FI trajectories exhibit structured temporal dynamics, predict task degradation (AUROC = 0.95) and repetition (Spearman rho = 0.94), and reveal non-monotonic scaling behavior: instruction-tuned models below 3B exhibit faster collapse than base models, with this trend reversing at 7B. Stress analyses further show that FI onset accelerates under longer contexts, middle-positioned evidence, and reduced numerical precision. These results establish cognitive fatigue as a coherent and measurable phenomenon, and position FI as a principled tool for runtime reliability monitoring in production LLM systems.

URL PDF HTML ☆

赞 0 踩 0

2605.30966 2026-06-01 cs.IR cs.AI cs.CL 版本更新

TUX：衡量人机默契理解

Yueshen Li, Hanyi Min, Vedant Das Swain, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； New York University（纽约大学）

AI总结通过光谱放置任务和TUX指数，量化人类与LLM之间的默契理解，发现人格特征影响对齐程度。

详情

AI中文摘要

随着大型语言模型（LLMs）越来越多地作为协作伙伴，人机对齐通常通过明确的任务成功、准确性或奖励优化来评估。然而，许多协作场景依赖于默契理解：即智能体能否在没有明确目标、沟通或反馈的情况下，与人类的评价立场或表征先验对齐。为了研究这种能力，我们开发了一个受社交派对游戏Wavelength启发的光谱放置任务，在该任务中，人类和智能体独立地将概念放置在主观光谱上。我们将默契理解指数（TUX）操作化为人类与智能体判断之间的成对相似性度量，并通过241名人类参与者和200个基于人格条件的LLM智能体（涵盖四种模型）进行评估。我们发现，在特质空间中最近的人-智能体对实现了显著更高的TUX，表明默契对齐是由个体层面特征而非随机相似性所结构化的。回归分析表明，随着预测变量集变得更加丰富，TUX变得更可解释，个体特质、决策风格和置信度优于聚合特质距离基线。这些发现表明，人类与LLM之间的默契理解是可测量的，同时也揭示了基于人格条件化方法在捕捉更深层表征对齐方面的局限性。

英文摘要

As large language models (LLMs) increasingly act as collaborative partners, human--AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum-placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile-conditioned LLM agents across four models. We find that nearest human--agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person-level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision-making styles, and confidence improving over aggregate trait-distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile-based conditioning for capturing deeper representational alignment.

URL PDF HTML ☆

赞 0 踩 0

2605.30924 2026-06-01 cs.CL 版本更新

EMBGuard: Constructing Hazard-Aware Guardrails for Safe Planning in Embodied Agents

EMBGuard：为具身智能体安全规划构建危险感知护栏

Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim, Yeonjun Hwang, Hyojun Kim, Byungchul Kim, Young Kyun Jang, Jinyoung Yeo

发表机构 * Independent Researcher（独立研究者）； Department of Biomedical Engineering（生物医学工程系）； the Department of Intelligent Precision Healthcare Convergence, Sungkyunkwan University（智能精准医疗融合系，全州大学）； Department of Artificial Intelligence, Yonsei University（人工智能系，延世大学）

AI总结提出首个基于MLLM的具身安全护栏EMBGuard，通过解耦物理风险推理与智能体策略，评估（视觉观察，动作）对来识别危险配置并提供自然语言解释，同时构建训练数据集EMBHazard和基准测试EMBGuardTest，在紧凑模型尺寸下达到与专有MLLM竞争的性能并降低误报率。

Comments Accepted at ICML 2026

详情

AI中文摘要

部署在真实环境中的MLLM驱动的具身智能体会遇到物理危险。然而，现有方法缺乏识别危险和推理动作条件风险的内在机制，导致智能体要么错过危险交互，要么过度识别风险。为解决此问题，我们提出EMBGuard，这是首个基于MLLM的具身智能体安全护栏，旨在将物理风险推理与智能体策略解耦。通过评估（视觉观察，动作）对，EMBGuard识别危险配置并提供潜在风险的自然语言解释。伴随EMBGuard，我们贡献了EMBHazard，一个包含15.1K个动作条件对的训练数据集，以及EMBGuardTest，一个包含329个手动策划的真实世界场景的基准测试，涵盖七种物理风险类别。通过危险和动作的组合变化，我们生成了智能体在规划过程中可能遇到的各种危险和良性场景。尽管模型尺寸紧凑（2B，4B），EMBGuard达到了与专有MLLM（例如GPT-5.1，Gemini-2.5-Pro）竞争的性能，同时显著降低了阻碍实时部署的误报率。我们在https://github.com/dongwxxkchoi/EMBGuard公开了代码、数据和模型。

英文摘要

MLLM-powered embodied agents deployed in real-world environments encounter physical hazards. However, existing approaches lack explicit mechanisms for identifying hazards and reasoning about action-conditioned risks, leading agents to either miss risky interactions or over-identify risks. To address this, we propose EMBGuard, the first MLLM-based safety guardrail for embodied agents designed to decouple physical risk reasoning from agent policy. By evaluating a (visual observation, action) pair, EMBGuard identifies hazardous configurations and provides natural language explanations of potential risks. Alongside EMBGuard, we contribute EMBHazard, a training dataset of 15.1K action-conditioned pairs, and EMBGuardTest, a benchmark of 329 manually curated real-world scenarios spanning seven physical risk categories. Through compositional variation of hazards and actions, we generate diverse risky and benign scenarios that agents may encounter during planning. Despite its compact size (2B, 4B), EMBGuard achieves performance competitive with proprietary MLLMs (e.g., GPT-5.1, Gemini-2.5-Pro) while significantly reducing the false-positive rates that hinder real-time deployment. We make the code, data, and models publicly available at https://github.com/dongwxxkchoi/EMBGuard

URL PDF HTML ☆

赞 0 踩 0

2605.30913 2026-06-01 cs.CL cs.AI cs.CY cs.HC 版本更新

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

有毒幻觉：扰动提示与追踪LLM电路

Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen, Yian Wang, Navin Kumar, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Nimblemind

AI总结研究有毒语言扰动对LLM事实可靠性的影响，发现有毒词汇降低准确率并增加不确定性，通过归因图分析揭示内部机制。

详情

AI中文摘要

大型语言模型（LLMs）越来越多地部署在对话环境中，用户语气从礼貌到对抗性或毒性不等，但尚不清楚在语义等效的提示中，有毒语言是否会降低事实可靠性。我们研究基于词汇和语气的提示扰动如何影响LLM的事实可靠性。通过礼貌、随机和三种毒性水平的受控提示变化，我们在ARC-Easy、GSM8K和MMLU上评估了五个LLM。我们发现有毒词汇扰动持续降低事实准确性并增加不确定性，而礼貌措辞产生有限且不一致的变化。为了检查这些答案不一致是否对应内部变化，我们进行了模型激活和影响的归因图分析。我们发现增加毒性选择性地放大对扰动敏感的变体节点，而相对稳定的核心推理节点保持更不变。这些发现将提示语气定位为LLM可靠性的关键维度，并提供了行为和机制证据，表明表面词汇变化可以改变事实输出和内部计算。

英文摘要

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.

URL PDF HTML ☆

赞 0 踩 0

2605.30912 2026-06-01 cs.CV cs.CL 版本更新

RLHF的另一面：用于奖励模型自监督改进的在线策略反馈

Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li, Qi Liu, Zilong Zheng

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China（认知智能国家重点实验室，中国科学技术大学）； University of Science and Technology of China（中国科学技术大学）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Center（合肥综合性国家科学中心人工智能研究院）； State Key Laboratory of General Artificial Intelligence, BIGAI（通用人工智能国家重点实验室，BIGAI）

AI总结提出SAVE框架，利用价值函数生成在线策略反馈，通过对比学习更新奖励模型，在六个基准上超越现有方法。

详情

AI中文摘要

构建用于语言模型对齐的强大奖励模型（RM）受到从人工标注或评判模型获取多样且可靠偏好数据的成本和难度的瓶颈限制。随着策略超越静态RM训练，这一问题变得更加严重。因此，我们提出SAVE（基于价值锚定的在线策略反馈自监督奖励模型改进），一个通过使用价值函数进行在线策略RM训练的框架，对在线策略响应进行评分作为反馈。SAVE自然地利用提示特定的价值头作为自适应锚点，将奖励评分的在线策略响应转化为监督信号。它计算RM优势并过滤模糊样本，通过对比目标更新RM。通过六个不同基准的严格实证评估，SAVE在增强RM训练方面的有效性得到了强烈验证。它在所有数据集上取得了优于现有方法的结果，同时在三种RL算法（GRPO、RLOO、GSPO）和不同策略骨干上保持一致的改进。

英文摘要

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. It is dramatically worse as the policy evolves beyond the static RM training. Therefore, we propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework that grades on-policy responses as feedback by using the value function for on-policy RM training. SAVE naturally converts the reward-graded on-policy responses into supervision with a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via a contrastive objective. The effectiveness of SAVE for enhancing RM training is strongly validated through rigorous empirical evaluation across six diverse benchmarks. It achieves outperforming results across all datasets while maintaining consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.

URL PDF HTML ☆

赞 0 踩 0

2605.30876 2026-06-01 cs.CL 版本更新

dMoE: dLLMs with Learnable Block Experts

dMoE: 具有可学习块专家的扩散大语言模型

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

发表机构 * National University of Singapore（新加坡国立大学）

AI总结针对扩散大语言模型与混合专家架构集成时块并行解码与令牌级专家选择不匹配导致的推理内存瓶颈，提出dMoE框架，通过聚合块内令牌级专家分布为统一的块级专家分布来减少激活专家数量，在保持性能的同时显著降低内存使用和延迟。

Comments Working in progress. Code is available at: \url{https://github.com/fscdc/dMoE}

详情

AI中文摘要

扩散大语言模型（dLLMs）最近作为自回归模型的有前途的替代方案出现，在自然支持并行解码的同时提供了有竞争力的性能。然而，随着dLLMs越来越多地与混合专家（MoE）架构集成以扩展模型容量，块并行解码与令牌级专家选择之间出现了根本性的不匹配。具体来说，每次dLLM前向传递处理多个具有双向依赖关系的令牌，而传统的MoE层独立路由每个令牌。这种不匹配显著增加了唯一激活专家的数量，使推理越来越受内存限制。为了解决这个问题，我们提出了dMoE，一个简单而有效的块级MoE框架。dMoE的核心思想是将每个块内的令牌级专家分布聚合成统一的块级专家分布，然后以更连贯的方式指导专家路由。通过这种方式，dMoE在不牺牲性能的情况下显著减少了推理期间唯一激活专家的数量，从而缓解了内存瓶颈。在各种基准上的大量实验证明了dMoE的有效性。平均而言，dMoE将唯一激活专家的数量从69.5减少到14.6，同时保留了原始性能的99.11%。同时，它将内存使用减少了76.64%到79.84%，并实现了1.14倍到1.66倍的端到端延迟加速。代码可在https://github.com/fscdc/dMoE获取。

英文摘要

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14$\times$ to 1.66$\times$ end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE

URL PDF HTML ☆

赞 0 踩 0

2605.30857 2026-06-01 cs.CL 版本更新

MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning

MADS: 面向指令微调的模型感知多样化核心集选择

Yi Bai, Wenhao Zhang, Yao Chen, Jiao Xue, Zhumin Chen, Pengjie Ren

发表机构 * Shandong University（山东大学）； Inspurcloud

AI总结提出一种基于模型推理时神经激活状态区分数据特征的多样化核心集选择方法，在减少数据量的同时提升大语言模型在多个下游任务上的性能。

详情

AI中文摘要

指令微调用于增强大语言模型（LLMs）的指令遵循能力。随着指令微调数据量的增加，选择最优核心集变得尤为重要。然而，确保核心集的多样性仍然是一个重大挑战。现有方法主要基于文本特征本身来区分不同的训练数据，与LLMs自身对数据的理解和表示相分离。为解决这一问题，我们提出了一种模型感知的多样化核心集选择方法，该方法基于LLM推理过程中的神经激活状态来区分数据特征。该方法利用模型内在的激活特征，实现了基于覆盖的选择的高效实例化，以确保核心集的多样性。我们在涵盖五个不同任务的六个基准上广泛评估了我们的方法。在我们的方法中，由3B参数LLM选择的核心集在用于微调7B、8B和13B参数的更大模型时表现有效。在包含52K指令-响应对的Alpaca-GPT4数据集上的实验结果表明，由Llama-3.2-3B-Instruct选择的、大小为原始数据集15%的核心集，在微调四个更大的基础模型时，与使用完整数据集训练相比，平均提升了2.5%。实验结果表明，我们的方法在减少数据需求的同时，提升了模型在多个下游任务上的性能。

英文摘要

Instruction fine-tuning is employed to enhance the instruction-following ability of large language models (LLMs). As the amount of instruction fine-tuning data increases, selecting the optimal core set becomes particularly important. However, ensuring the diversity of the core set remains a significant challenge. Existing methods predominantly distinguish different training data based on the text features themselves, decoupled from LLMs' own understanding and representation of the data. To address this issue, we propose a Model-Aware Diverse Core Set Selection method, which distinguishes data features based on the neural activation states during LLM inference. This approach serves as an efficient instantiation of coverage-based selection using model-intrinsic activation features to ensure the diversity in the core set. We extensively evaluate our method on six benchmarks that cover five distinct tasks. In our method, the core set selected by the 3B-parameter LLM performs effectively when utilized to fine-tune larger models with 7B, 8B, and 13B parameters. Experimental results on the Alpaca-GPT4 dataset, which comprises 52K instruction-response pairs, show that the core set, sized at 15\% of the original dataset and selected by Llama-3.2-3B-Instruct, achieves an average improvement of 2.5\% when fine-tuning four larger base models compared with training on the full dataset. The experimental results demonstrate that our method enhances model performance on multiple downstream tasks while reducing data requirements.

URL PDF HTML ☆

赞 0 踩 0

2605.30852 2026-06-01 cs.CL 版本更新

Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism

推测性流水线解码：通过流水线并行实现更高准确度和零气泡推测

Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei

发表机构 * Oregon State University（俄勒冈州立大学）； DeepSolution（深思解决方案）

AI总结提出推测性流水线解码（SPD）框架，利用流水线并行将目标LLM划分为n个流水线阶段并行处理n个token，通过推测模块聚合中间特征预测下一token，实现有限难度、高接受率和零延迟气泡，显著提升理论加速比。

详情

AI中文摘要

推测性解码（SD）通过草稿-验证范式加速低并发LLM推理。然而，主流方法通常依赖多token预测，这引入了逐渐增加的预测难度和串行草稿延迟。为了解决这些问题，我们提出了推测性流水线解码（SPD），这是一个突破性的框架，释放了流水线并行的真正潜力。通过将目标LLM划分为$n$个流水线阶段，SPD允许LLM并行处理$n$个token以加速解码。为了在单序列解码中持续填充流水线，推测模块聚合不同流水线深度的中间特征来预测下一个token，与目标模型的流水线步骤严格并行执行，从而实现有限的难度、更高的接受率和零延迟气泡。我们的实验表明，与主流基线相比，SPD实现了显著更高的理论加速比，为LLM解码加速提供了高度可扩展的解决方案。我们的代码可在https://github.com/yuyijiong/speculative_pipeline_decoding获取。

英文摘要

Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into $n$ pipeline stages, SPD allows LLM to process $n$ tokens in parallel to accelerate decoding. To continuous fill the pipeline in single sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token, executing strictly in parallel with the target model's pipeline step, to realize bounded difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup compared to mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding

URL PDF HTML ☆

赞 0 踩 0

2605.30844 2026-06-01 cs.CL cs.AI stat.ML 版本更新

Fine-Tuning Improves Information Conveyance in Language Models

微调提升语言模型中的信息传递

Yuwei Cheng, Weiyi Tian, Haifeng Xu

发表机构 * Department of Statistics（统计学系）； University of Chicago（芝加哥大学）； Department of Data Science（数据科学系）； Department of Computer Science（计算机科学系）

AI总结提出冠层熵（Canopy Entropy）度量，从树结构视角量化生成空间的有效大小，发现微调模型在总熵降低时仍能增强长度-熵率正相关，从而更高效地将不确定性转化为语义多样性。

详情

AI中文摘要

微调通常被认为会降低大型语言模型的不确定性和多样性，但现有分析忽略了输出长度这一关键混杂因素，因此未能捕捉不确定性在整个生成展开中的分布。为解决这一问题，我们提出冠层熵（$\mathrm{CE}^\star$），一种从树视角看待语言生成的度量，其中“冠层”代表所有可能展开的空间，使得$\mathrm{CE}^\star$自然地量化生成空间的有效大小。$\mathrm{CE}^\star$共同捕捉输出长度$N$和生成序列$Y_{1:N}$中的不确定性——实际上，我们证明它等于总香农熵$H(N, Y_{1:N}\mid X)$，其中$X$表示提示。该公式产生了可解释的度量，包括长度-熵率相关项$ ho(N, r_N)$，其中$r_N$是熵率，通过指示较长输出是否每个标记信息量更多或更少来量化信息传递效率。实验上，跨任务和模型家族，我们发现微调模型一致地表现出更强的正相关$ ho(N, r_N)$，即使总熵降低。此外，在控制模型家族、任务、提示和输出长度效应后，我们发现微调几乎使熵率与语义多样性之间的相关强度增加了两倍，表明对齐模型更有效地将标记不确定性转化为语义多样性。总体而言，这些结果表明微调并非简单地降低不确定性，而是从根本上将其重组为更具信息性和语义意义的生成。我们的代码可在https://github.com/WeiyiTian/canopy-entropy获取。

英文摘要

Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy ($\mathrm{CE}^\star$), a measure that views language generation from a tree perspective, where ``canopy'' represents the space of all possible rollouts, making $\mathrm{CE}^\star$ naturally quantify the effective size of the generation space. $\mathrm{CE}^\star$ jointly captures uncertainty in both the output length $N$ and the generated sequence $Y_{1:N}$ -- indeed, we show that it equals to total Shannon entropy $H(N, Y_{1:N}\mid X)$, where $X$ denotes the prompt. This formulation yields interpretable metrics, including a length-entropy correlation term $ρ(N, r_N)$, where $r_N$ is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine-tuned models consistently exhibit stronger positive correlation $ρ(N, r_N)$, even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output-length effects, we find that fine-tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine-tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at https://github.com/WeiyiTian/canopy-entropy.

URL PDF HTML ☆

赞 0 踩 0

2605.30833 2026-06-01 cs.CL cs.AI 版本更新

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

你的老师在这里帮不了你：对抗在线策略蒸馏中的监督保真度衰减

Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； Chinese Information Processing Laboratory（中文信息处理实验室）； Institute of Software, Chinese Academy of Sciences（中国科学院软件研究所）； University of Chinese Academy of Sciences, Beijing, China（中国科学院大学，北京，中国）

AI总结针对在线策略蒸馏中监督保真度衰减问题，提出前瞻组奖励方法，通过评估学生候选词在后续步骤中诱导的教师置信度并分配组归一化奖励，结合熵触发树注意力机制，显著提升长链推理性能。

详情

AI中文摘要

在线策略蒸馏通过使用来自教师的 token 级反馈，在学生模型自身生成的轨迹上训练学生模型来传递推理能力。然而，我们识别出一个关键瓶颈，即 extbf{监督保真度衰减（SFD）}：随着学生生成的前缀变长，教师的下一个 token 分布变得不那么自信和更具区分性。因此，反向 KL 蒸馏中依赖教师的纠正信号减弱，导致学生漂移在长推理链中累积。为了缓解 SFD，我们引入了 extbf{前瞻组奖励（\ours{}）}。基于下一步教师置信度反映了未来反向 KL 监督的区分强度这一见解，\ours{} 通过学生在后续步骤中诱导的教师置信度来评估学生的 top-K 候选 token，并分配组归一化奖励。为了保持计算效率，我们进一步设计了一种熵触发的树注意力机制。在六个数学和代码基准测试中，\ours{} 在 7B 学生模型上比 OPD 提高了 mean@8 达 extbf{2.57} 个点，在长生成任务中增益更大，在 AIME-26 上达到 + extbf{4.92} 个点（39k token）。

英文摘要

On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, \textbf{Supervision Fidelity Decay (SFD)}: as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce \textbf{Lookahead Group Reward (\ours{})}. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, \ours{} evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, \ours{} improves mean@8 by \textbf{2.57} points over OPD for a 7B student, with gains increasing in longer-generation and reaching +\textbf{4.92} points on AIME-26 at 39k tokens.

URL PDF HTML ☆

赞 0 踩 0

2605.30826 2026-06-01 cs.CL cs.AI 版本更新

Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

超越一致性：为策展人分类对面板筛选的生物医学实体候选进行评分

Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang, Siyu Zhang, Tingting Dan

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）； University of Michigan, Ann Arbor（密歇根大学安娜堡分校）； The Hong Kong University of Science and Technology, Guangzhou（香港科学与技术大学（广州））； ShanghaiTech University（上海科技大学）； University of North Carolina, Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结提出BioConCal评分器，利用无金标准的一致性、提及、表面可用性和文档特征，对多LLM面板筛选的候选实体进行评分，显著提高候选筛选的精确率和召回率。

详情

AI中文摘要

生物医学命名实体识别对于现代LLM来说看似简单：合理的生物医学提及容易浮现，但语料库约定正确性取决于标注约定、跨度边界、实体粒度和类型模式。多LLM一致性是一个显著性信号，而非语料库约定正确性。我们引入了一个候选级面板输出基准，用于面板筛选的候选验证，其中单元是由明确定义的多模型面板对齐的候选，而非独立提取器输出。该基准将八个LLM在五个公共生物医学NER数据集上的预测对齐到一个候选主表中。BioConCal是一个领域内监督评分器，它利用推理时的无金标准一致性、提及、表面可用性和文档特征，为固定候选流实例化这一层。在领域内，BioConCal将AUROC从原始一致性的0.753提高到0.910。在验证选择的0.95精确率目标下，它选择了1,340个候选，经验测试精确率为0.939，而原始一致性为293个候选。这对应于候选级召回率0.592和语料库级召回率0.523，而面板内行标签上限为0.883。主要好处不是恢复每个面板成员遗漏的实体，而是将嘈杂的面板流重塑为更高产出的审查队列。在实体类型转移下，阈值需要目标领域验证，而精确字符定位仍然是单独的后处理步骤。

英文摘要

Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas. Multi-LLM agreement is a salience signal, not corpus-convention correctness. We introduce a candidate-level panel-output benchmark for panel-surfaced candidate verification, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this layer with inference-time gold-free agreement, mention, surface-availability, and document features for a fixed candidate stream. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target it selects 1,340 candidates at empirical test precision 0.939, compared with 293 for raw agreement. This corresponds to candidate-level recall 0.592 and corpus-level recall 0.523 against a within-panel row-label ceiling of 0.883. The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.

URL PDF HTML ☆

赞 0 踩 0

2605.30813 2026-06-01 cs.CL cs.DS 版本更新

Incremental BPE Tokenization

增量式 BPE 分词

Shenghu Jiang, Ruihao Gong

发表机构 * Beihang University（北京航空航天大学）； SenseTime Research（商汤研究）

AI总结提出一种增量式 BPE 分词算法，以 O(log² t) 时间处理每个字节，总复杂度 O(n log² t)，支持流式场景，相比 Hugging Face 和 OpenAI 的实现速度提升约 3 倍并降低延迟。

Comments Accepted to ICML 2026 (Spotlight)

详情

AI中文摘要

我们提出了一种用于增量式字节对编码（BPE）分词的新算法。该算法在最坏情况下以 $\mathcal{O}(\log^2 t)$ 时间处理每个输入字节，总复杂度为 $\mathcal{O}(n \log^2 t)$，其中 $n$ 是输入长度，$t$ 是最大分词长度。该算法增量地维护输入文本每个前缀的 BPE 分词结果，实现了由固定合并规则集定义的标准 BPE 合并过程。这使得在流式设置中能够进行高效的部分分词。作为标准 BPE 的即插即用替代方案，我们的方法相比 Hugging Face 的分词器实现了高达约 3 倍的加速，并在病态输入上相比 OpenAI 的 tiktoken 展示了显著的延迟降低。我们进一步引入了一种急切输出算法，支持流式输出，在增量分词过程中一旦确定分词边界即发出分词。总体而言，我们的结果表明 BPE 分词可以以具有强最坏情况保证的增量方式执行，同时在现代大语言模型流水线中提供实际的延迟优势。代码：https://github.com/ModelTC/mtc-inc-bpe

英文摘要

We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves a speedup of up to ${\sim}3\times$ over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst-case guarantees, while providing practical latency benefits in modern large language model pipelines. Code: https://github.com/ModelTC/mtc-inc-bpe

URL PDF HTML ☆

赞 0 踩 0

2605.30804 2026-06-01 cs.CL 版本更新

Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit

将LLM性别偏见锚定人类基线：跨语言审计

Jiwoo Choi, Seonwoo Ahn, Tongxin Zhang, Seohyon Jung

发表机构 * School of Digital Humanities and Computational Social Sciences, KAIST, South Korea（数字人文与计算社会科学学院，韩国韩国科学技术院）

AI总结通过HEXACO-100人格量表，跨英语、韩语、中文和日语审计六种大语言模型的性别刻板印象，发现其偏见幅度是人类跨国差异的2.5倍，并引入四模式框架（一致性、抑制、重组、放大）描述跨语言行为。

详情

AI中文摘要

我们审计了六种大语言模型（LLM）在英语、韩语、中文和日语中的性别刻板印象。其中三种主要面向英语使用（Claude、GPT、Gemini），三种面向东亚使用（DeepSeek、Syn-Pro、HyperCLOVA X）。我们采用HEXACO-100人格量表，并将每个模型锚定于覆盖48个国家的跨文化人类数据集，以询问的不是LLM是否有偏见，而是它们的性别归因偏离其部署人群的程度。我们的发现表明，它们的刻板印象范围大约是人类跨国范围的2.5倍，且该效应可能跨语言复合。一个以英语为中心的模型在用韩语提示时，达到了当地基线的5倍，即使提示表明候选人已被录用（这通常会减弱人类的刻板印象）。为了在不排序的情况下描述此类行为，我们引入了一个四模式框架——一致性、抑制、重组和放大——涵盖24个（模型×语言）单元。项目级分析表明，翻译不仅重新缩放刻板印象，还改变了与之相关的属性，在表面看似校准良好的情况下隐藏了显著的重新排列。我们的结果最终表明，没有单一的消除偏见流程能够均匀地解决跨语言边界的偏见。

英文摘要

We audit six large language models (LLMs) for gender stereotyping across English, Korean, Chinese, and Japanese. Three were developed primarily for English-language use (Claude, GPT, Gemini) and three for East Asian use (DeepSeek, Syn-Pro, HyperCLOVA X). We adopt the HEXACO-100 personality inventory and anchor each model against a cross-cultural human dataset spanning 48 countries to ask not whether LLMs are biased, but how far their gender attributions drift from the populations they are deployed among. Our findings show that their stereotyping spans a range roughly 2.5 times wider than the entire cross-country range found in humans, and the effect can compound across languages. One English-centric model, prompted in Korean, reached 5 times the local baseline, even when the prompt stated the candidate had already been hired, which often dampens human stereotyping. To characterize such behaviors without ranking them, we introduce a four-pattern framework -- concordance, suppression, reorganization, and amplification -- across 24 (model x language) cells. Item-level analysis reveals that translation does not just rescale stereotypes, but changes the attributes tied to it, hiding significant rearrangement under the surface while appearing well-calibrated. Our results ultimately suggest that no single debiasing pipeline is likely to address bias evenly across linguistic boundaries.

URL PDF HTML ☆

赞 0 踩 0

2605.30790 2026-06-01 cs.IR cs.AI cs.CL 版本更新

On the impact of retrieved content representations in RAG Pipelines

关于检索内容表示对RAG管道的影响

Jonathan J Ross, Bevan Koopman, Anton van der Vegt, Guido Zuccon

发表机构 * The University of Queensland（昆士兰大学）； CSIRO（澳大利亚联邦科学与工业研究组织）

AI总结通过控制变量实验，研究检索文档的不同表示（选择、摘要、改写等）对RAG生成准确性的影响，发现答案保留是主要决定因素。

Comments 23 pages, 15 figures, submitted to ACL May 2026 ARR

详情

AI中文摘要

检索增强生成（RAG）通过检索到的文档补充语言模型的输入，但大多数RAG管道继承了为人类读者设计的检索组件。当消费者是大型语言模型（LLM）而非人类时，检索内容应如何表示尚不清楚。最近的工作提出了对检索内容的转换，并识别了影响生成的属性，但每项工作仅孤立地考察单一转换或属性，未明确文档表示的哪些特征最重要。我们通过控制比较来解决这一问题：固定检索不变，仅改变检索文档的表示，将原始基线与其他十三种转换（涵盖选择、摘要和改写，包括查询相关和查询无关变体）进行比较。在这十四种表示中，我们测量了四个生成器的问答准确性，并对每种表示测量了答案保留：即已知包含答案的文档在转换后是否仍支持其答案。我们发现，答案保留是生成器准确性的主要决定因素；值得注意的是，当保留率高时，表示的措辞、结构、长度和查询相关性影响有限。这表明，先前工作中归因于特定机制的准确性提升，可能部分由这些机制保留答案内容的能力解释，而这种归因在未控制保留的情况下无法确定。

英文摘要

Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.

URL PDF HTML ☆

赞 0 踩 0

2605.30788 2026-06-01 cs.CL cs.AI cs.LG 版本更新

XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

XLGoBench: 用算法任务检测跨语言技能差距

Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju

发表机构 * Google DeepMind（谷歌深Mind）； Indian Institute of Technology Bombay（印度理工学院孟买分校）； International Centre for Theoretical Sciences, Tata Institute of Fundamental Research（理论科学国际中心， Tata 基础研究机构）

AI总结提出一套合成算法任务基准，通过跨语言执行相同任务来检测大语言模型的跨语言能力差距，实验揭示多个先进模型存在持续差距。

Comments 8+37pages

2605.30771 2026-06-01 cs.CL 版本更新

Eywa: Provenance-Grounded Long-Term Memory for AI Agents

Eywa：基于溯源的人工智能智能体长期记忆

Resham Joshi

发表机构 * Eywa

AI总结提出Eywa架构，通过先存储证据再推导事实、验证记忆并采用确定性多路径读取（零LLM调用）实现可审计的长期记忆，在多个基准测试中取得高准确率。

Comments 29 pages, 3 figures, 16 tables. Benchmark artifacts available at https://eywa.to/research

详情

AI中文摘要

跨会话持久化的人工智能智能体需要能够检索、审计、更新和擦除的记忆。现有的记忆系统通常将源证据、提取的事实、检索到的上下文和答案策略合并为一个不透明的提示路径，使得故障难以诊断：错误答案可能源于缺失证据、不支持的提取、过时状态、检索损失或答案模型行为。我们提出Eywa，一种基于溯源的记忆架构，围绕“证据先于信念”构建。Eywa在推导规范事实之前存储不可变的源证据，根据类型化信号和源支持验证提取的记忆，并通过确定性多路径读取路径（检索内部零LLM调用）检索有界的记忆上下文。检索到的上下文与答案指令分开返回，使得相同的记忆基质可以在前沿、预算和本地答案模型上进行评估。在冻结的、工件记录的检索配置下，Eywa在LoCoMo C1-C4分割上使用Claude Sonnet 4.6写入和QA角色达到90.19%的裁判准确率。在LongMemEval-S上，达到88.2%的检索充分性准确率。在BEAM（一个700问题的技术记忆压力基准）上，达到81.45%的平均nugget分数和85.29%的pass@score >= 0.5。完整的每问题工件，包括问题、黄金答案、模型答案、检索到的上下文和标签，发布在https://eywa.to/research。

英文摘要

AI agents that persist across sessions need memory they can retrieve, audit, update, and erase. Existing memory systems often collapse source evidence, extracted facts, retrieved context, and answer policy into one opaque prompt path, making failures difficult to diagnose: a wrong answer may come from missing evidence, unsupported extraction, stale state, retrieval loss, or answer-model behavior. We present Eywa, a provenance-grounded memory architecture built around evidence before belief. Eywa stores immutable source evidence before deriving canonical facts, validates extracted memories against typed signals and source support, and retrieves bounded memory context through a deterministic multi-route read path with zero LLM calls inside retrieval. Retrieved context is returned separately from answer instructions, allowing the same memory substrate to be evaluated across frontier, budget, and local answer models. Under a frozen, artifact-recorded retrieval configuration, Eywa reaches 90.19% judge accuracy on the LoCoMo C1-C4 split with Claude Sonnet 4.6 write and QA roles. On LongMemEval-S, it reaches 88.2% retrieval-sufficiency accuracy. On BEAM, a 700-question technical-memory stress benchmark, it reaches 81.45% mean nugget score and 85.29% pass@score >= 0.5. Full per-question artifacts, including questions, gold answers, model answers, retrieved context, and labels, are published at https://eywa.to/research.

URL PDF HTML ☆

赞 0 踩 0

2605.30758 2026-06-01 cs.CL cs.LG 版本更新

Pairwise Reference Alignment as a Model-Level Ordinal Observable

成对参考对齐作为模型级序数可观测量

Mujing Li

发表机构 * Independent Researcher（独立研究者）

AI总结本文定义成对参考对齐为模型评分函数诱导的序数可观测量，提出中心化序参数统计量和基于边界的扩展，并给出有限样本估计和浓度界，通过Qwen2.5和RewardBench实验验证。

详情

AI中文摘要

成对偏好数据广泛用于语言模型评估和对齐，通常用于模型排名、奖励建模或偏好优化。本文提出了一个更基础的测量问题：给定成对偏好的参考分布，当我们测试模型是否将首选响应排在拒绝响应之上时，估计的是哪个模型级量？我们将成对参考对齐定义为由模型评分函数诱导的序数可观测量。给定三元组$(x,y^+,y^-)$上的参考对分布$P_{\mathrm{pair}}$和标量模型分数$S_M(x,y)$，我们将对齐可观测量定义为模型诱导的排序与参考偏好排序一致的概率。我们进一步定义了一个中心化的序参数类统计量，并讨论了基于边界的扩展。所得量在独立抽样假设下具有简单的有限样本估计量和浓度界。本文没有引入新的基准。它为成对参考对齐提供了概念和统计公式，阐明了参考对分布的作用，并将一般的序数可观测量与评分选择（如归一化对数概率或基于能量的分数）区分开来。我们还在Qwen2.5模型和RewardBench上进行了初步实证研究，其中所提出的统计量随模型大小和指令调优而增加，并根据公式在参考对子集之间变化。

英文摘要

Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference distribution of pairwise preferences, what model-level quantity is estimated when we test whether a model ranks preferred responses above rejected responses? We define pairwise reference alignment as an ordinal observable induced by a model scoring function. Given a reference pair distribution $P_{\mathrm{pair}}$ over triples $(x,y^+,y^-)$, and a scalar model score $S_M(x,y)$, we define the alignment observable as the probability that the model-induced ordering agrees with the reference preference ordering. We further define a centered order-parameter-like statistic and discuss a margin-based extension. The resulting quantities admit simple finite-sample estimators and concentration bounds under independent sampling assumptions. This note does not introduce a new benchmark. It provides a conceptual and statistical formulation for pairwise reference alignment, clarifies the role of the reference pair distribution, and distinguishes the general ordinal observable from scoring choices such as normalized log-probability or energy-based scores. We also provide an initial empirical study on Qwen2.5 models and RewardBench, where the proposed statistics increase with model size and instruction tuning and vary across reference-pair subsets as predicted by the formulation.

URL PDF HTML ☆

赞 0 踩 0

2605.30753 2026-06-01 cs.CL 版本更新

Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

通过时空并行解码和置信度外推的高效扩散大语言模型

Zekai Li, Ji Liu, Yiqing Huang, Ziqiong Liu, Dong Li, Emad Barsoum

发表机构 * Advanced Micro Devices, Inc. (AMD)（先进微器件公司（AMD））

AI总结提出时空并行解码（TSPD）和置信度外推（CE）两种方法，通过动态控制去噪轨迹减少冗余迭代，加速扩散大语言模型推理。

详情

AI中文摘要

基于扩散的大语言模型（dLLMs）通过迭代去噪支持并行文本生成，但由于许多步骤花费在冗余精炼和重复掩码那些最终值已确定的token上，推理仍然延迟严重。先前的加速方法主要依赖于步骤局部置信度启发式或固定调度，这些方法对提示和任务变化敏感，且忽略了序列内的强位置效应。我们将扩散解码视为一个动态控制问题，并表明逐token的去噪轨迹为可靠控制提供了关键信号。我们提出了一个具有两个组件的轨迹感知解码框架。首先，时空并行解码（TSPD）使用一个轻量级的时空控制器，该控制器消耗每个token的轨迹特征，包括置信度、熵和动量，以及token位置，以决定何时token已收敛并可以安全固定。其次，我们引入了置信度外推（CE），一个无训练的状态空间模块，它预测未来的logit趋势并带有不确定性，以支持主动决策，包括安全的前瞻和在轨迹振荡或置信度不足时的目标稳定。TSPD和CE共同减少了不必要的去噪迭代，同时保持了输出质量，并且它们与系统优化（如KV缓存）干净地组合。

英文摘要

Diffusion-based large language models (dLLMs) support parallel text generation via iterative denoising, yet inference remains latency-heavy because many steps are spent on redundant refinement and repeated remasking of tokens whose final values are already determined. Prior acceleration methods mainly depend on step-local confidence heuristics or fixed schedules, which are sensitive to prompt and task variation and ignore strong positional effects within a sequence. We cast diffusion decoding as a dynamic control problem and show that token-wise denoising trajectories provide the key signal for reliable control. We propose a trace-aware decoding framework with two components. First, Temporal-Spatial Parallel Decoding (TSPD) uses a lightweight temporalspatial controller that consumes per-token trajectory features, including confidence, entropy, and momentum, together with token position, to decide when a token has converged and can be safely fixed. Second, we introduce Confidence Extrapolation (CE), a training-free state-space module that forecasts future logit trends with uncertainty to support proactive decisions, including safe look-ahead and targeted stabilization when trajectories are oscillatory or underconfident. Together, TSPD and CE reduce unnecessary denoising iterations while preserving output quality, and they compose cleanly with system optimizations such as KV caching.

URL PDF HTML ☆

赞 0 踩 0

2605.30736 2026-06-01 cs.LG cs.AI cs.CL 版本更新

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

OrcaRouter: 一种面向生产的混合离线-在线学习LLM路由器

Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen, Xile Ma, Yi Shi

发表机构 * Continuum AI

AI总结提出OrcaRouter，一种结合LinUCB上下文赌博机与混合离线-在线学习协议的生产级LLM路由器，通过离线全信息反馈和在线赌博机学习实现低成本高精度模型选择。

Comments 6 pages, 1 table. Technical report

详情

AI中文摘要

大型语言模型的快速发展，每个模型具有不同的能力和推理成本，引发了一个实际部署问题：给定一个传入请求，应由哪个模型处理？我们提出OrcaRouter，一种面向生产的LLM路由器，它结合了基于词法和句子嵌入特征的LinUCB上下文赌博机与混合离线-在线学习协议。在离线阶段，OrcaRouter通过在一组精心策划的路由提示上评估每个候选模型来获取全信息反馈，生成一个奖励矩阵，用于为每个臂拟合一个岭回归器。在部署时，它从这些参数初始化，并可选地从赌博机反馈中继续学习，在观察到奖励后仅更新所选模型的臂。在我们提交RouterArena时（2026年5月20日），OrcaRouter-Adaptive以72.08的竞技场得分在公共RouterArena排行榜上排名第二，在每1000次查询成本1.00美元的情况下实现了75.54%的准确率。

英文摘要

The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.

URL PDF HTML ☆

赞 0 踩 0

2605.30727 2026-06-01 cs.CL 版本更新

MosaicLeaks:Privacy Risks in Querying-in-the-Open for Deep Research Agents

MosaicLeaks：深度研究代理的开放查询中的隐私风险

Alexander Gurung, Spandana Gella, Alexandre Drouin, Issam H. Laradji, Perouz Taslakian, Rafael Pardinas

发表机构 * ServiceNow AI Research（ServiceNow AI研究院）； University of Edinburgh（爱丁堡大学）； Mila - Quebec AI Institute（魁北克AI研究所）； McGill University（麦吉尔大学）； University of British Columbia（不列颠哥伦比亚大学）

AI总结针对深度研究代理在查询外部工具时可能泄露本地敏感信息的问题，提出MosaicLeaks基准测试和隐私感知深度研究（PA-DR）框架，通过强化学习降低信息泄露风险。

详情

AI中文摘要

深度研究代理越来越多地将私有本地文档与网络检索等外部工具结合，这带来了隐私风险：代理的外部查询可能泄露其本地上下文中的敏感信息。这种风险因马赛克效应而被放大，即单个查询看似无害，但聚合起来可能泄露信息。我们引入了MosaicLeaks，一个包含1,001个多跳深度研究任务的基准测试，这些任务将私有企业文档和公共网络语料库链接起来，迫使代理进行依赖本地信息的外部查询。我们通过一个仅观察代理外部查询并尝试在三个层面推断私有信息的对抗性LLM来评估泄露：代理的研究意图、特定私有问题的答案以及关于企业文档的可验证声明。我们发现，不同系列和大小的模型在三个层面上频繁泄露信息；零样本隐私提示减少了泄露但未消除泄露；仅针对任务性能的强化学习加剧了泄露。为了解决这个问题，我们提出了隐私感知深度研究（PA-DR），这是一个RL框架，结合了任务成功的情境奖励和一个学习到的隐私分类器，以提供对每个查询和马赛克级别泄露的密集信用分配。使用PA-DR训练Qwen3-4B-Instruct将准确率从48.7%提高到58.7%，并将答案和全信息泄露从34.0%降低到9.9%。

英文摘要

Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information from its local context. This risk is amplified by the mosaic effect, where individual queries may appear harmless but become revealing in aggregate. We introduce MosaicLeaks, a benchmark of 1,001 multi-hop deep research tasks that chain private enterprise documents and a public web corpus, forcing agents to make external queries that depend on local information. We evaluate leakage with an adversary LLM that observes only the agent's external queries and attempts to infer private information at three levels: the agent's research intent, answers to specific private questions and verifiable claims about the enterprise documents. We find that models across families and sizes frequently leak at all three levels, that zero-shot privacy prompting reduces but does not eliminate leakage and that reinforcement learning for task performance alone worsens leakage. To address this, we propose Privacy-Aware Deep Research (PA-DR), an RL framework that combines situational rewards for task success with a learned privacy classifier to provide dense credit assignment over both per-query and mosaic-level leakage. Training Qwen3-4B-Instruct with PA-DR improves accuracy from 48.7% to 58.7% and reduces answer and full-information leakage from 34.0% to 9.9%.

URL PDF HTML ☆

赞 0 踩 0

2605.30723 2026-06-01 cs.CL 版本更新

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

技能并非一刀切：面向 LLM 智能体的模型感知技能对齐

Jianxiang Yu, Jiapeng Zhu, Bochen Lin, Qier Cui, Zichen Ding, Xiang Li

发表机构 * East China Normal University（华东师范大学）

AI总结提出 MASA 框架，通过层次化技能进化管道和轻量级条件重写器，为不同能力的 LLM 主干网络自适应调整技能，显著提升长时交互任务性能。

详情

AI中文摘要

LLM 智能体越来越多地检索外部策划的技能——在决策时获取的程序性指令——以提高在长时交互任务上的表现。现有的技能库通常被视为模型无关的，在不同容量和行为的骨干网络上重用相同的技能表述。然而，我们在多个模型规模上的控制实验表明，技能的有效性强烈依赖于模型：对某个骨干网络有益的技能可能对另一个有害。受此观察启发，我们提出了 MASA（模型感知技能对齐），一个无需修改智能体权重即可为每个目标骨干网络调整技能的框架。MASA 分两个阶段运行：（1）一个层次化技能进化管道，通过爬山法和 UCB 驱动的树搜索，在环境反馈和模型能力概况的指导下，迭代重写通用和任务特定技能；（2）一个在进化轨迹上训练的轻量级模型条件技能重写器，以在单次前向传播中重现适应过程。在三个交互环境和四个骨干网络上的实验表明，MASA 始终实现最佳整体性能，比最强基线高出最多 25.8 个点。学习到的重写器进一步泛化到未见过的任务和环境，无需额外搜索，以极低的推理成本持续超越更大的教师 LLM。

英文摘要

LLM agents increasingly retrieve externally curated skills-procedural instructions retrieved at decision time-to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as model-agnostic, reusing the same skill formulations across backbones with substantially different capacities and behaviors. However, our controlled experiments across multiple model scales show that skill effectiveness is strongly model-dependent: a skill that benefits one backbone can harm another. Motivated by this observation, we propose MASA Model-Aware Skill Alignment, a framework that adapts skills to each target backbone without modifying agent weights. MASA operates in two stages: (1) a hierarchical skill evolution pipeline that iteratively rewrites general and task-specific skills using hill climbing and UCB-driven tree search, guided by environment feedback and model capability profiles; and (2) a lightweight model-conditioned skill rewriter trained on evolution trajectories to reproduce the adaptation in a single forward pass. Experiments across three interactive environments and four backbones show that MASA consistently achieves the best overall performance, with gains of up to 25.8 points over the strongest baseline. The learned rewriter further generalizes to unseen tasks and environments without additional search, consistently outperforming a much larger teacher LLM at a fraction of the inference cost.

URL PDF HTML ☆

赞 0 踩 0

2605.30717 2026-06-01 cs.CL 版本更新

Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models

语言模型中性别化与性别中立生成的神经元级干预

Zhiwen You, Nafiseh Nikeghbal, Jana Diesner

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Technical University of Munich（慕尼黑技术大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）

AI总结提出神经元级干预方法，通过识别与女性、男性和性别中立相关的特定神经元，实现对语言模型生成文本的性别控制，并发现性别神经元主要集中在前几层。

详情

AI中文摘要

语言模型（LMs）即使在给定中性提示时也可能产生性别化语言和刻板印象。以往关于LM性别偏见的研究大多通过二元视角（女性 vs. 男性）审视性别，对性别中立形式（如they/them代词或中性措辞的职位名称）关注有限。性别相关信号如何在LM的内部表示中编码仍是一个开放问题。在这项工作中，我们研究了LM中三类性别特异性神经元：女性、男性和性别中立。我们提出了一种神经元级干预方法，用于识别与每个性别类别紧密相关的神经元。然后通过受控生成测试这些神经元，表明激活或掩蔽性别相关神经元可以将句子导向目标性别形式，同时保留其原始含义。为了评估我们性别干预方法的有效性，我们整理了两个数据集，其中包含跨所有三个性别类别标记的受控句子，并通过人工评估验证数据质量。在两个开源LM上的实验表明，性别特异性神经元并非均匀分布在模型各层；相反，它们高度集中在最早层，后面层的贡献较小。与现有方法相比，我们的方法实现了更精确的性别控制，通过两个评估标准，对非目标性别类别的泄露更少，输出质量稳定。总体而言，我们的工作研究了性别如何在LM中编码，并为受控性别干预提供了一种简单而有效的方法，适用于神经元干预评估和性别偏见缓解。代码和数据集可在 https://github.com/zhiwenyou103/Gender-Neuron-Intervention 获取。

英文摘要

Language models (LMs) can produce gendered language and stereotypes even when given neutral prompts. Most prior work on gender bias in LMs primarily examines gender through a binary lens (feminine vs. masculine), with limited attention to gender-neutral forms, such as they/them pronouns or neutrally phrased job titles. How gender-related signals are encoded in the internal representations of LMs remains an open question. In this work, we study gender-specific neurons in LMs across three categories: feminine, masculine, and gender-neutral. We propose a neuron-level intervention method to identify neurons that are strongly tied to each gender category. We then test these neurons through controlled generation, showing that activating or masking gender-related neurons can steer a sentence toward a target gender form while preserving its original meaning. To evaluate the effectiveness of our gender-intervention approach, we curate two datasets with controlled sentences labeled across all three gender categories and validate the data quality through human evaluation. Experiments on two open-source LMs show that gender-specific neurons are not evenly distributed across model layers; instead, they concentrate heavily in the earliest layers with smaller contributions from later layers. Compared to existing methods, our method achieves more precise gender control, with less leakage into non-target gender categories and stable output quality through two evaluation criteria. Overall, our work examines how gender is encoded in LMs and provides a simple yet effective approach toward controlled gender intervention for both neuron intervention evaluation and gender bias mitigation. Code and datasets are available at: https://github.com/zhiwenyou103/Gender-Neuron-Intervention

URL PDF HTML ☆

赞 0 踩 0

2605.30712 2026-06-01 cs.CL 版本更新

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

ExpGraph: 面向LLM智能体的模型无关经验学习与图结构记忆

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Nanyang Technological University（南洋理工大学）； Meta Monetization AI（Meta 营收人工智能）

AI总结提出ExpGraph框架，通过图结构记忆和强化学习实现冻结LLM执行器的外部经验复用，在问答、数学推理、代码生成和多步智能体任务上显著提升性能并减少交互步骤。

详情

AI中文摘要

大型语言模型（LLM）智能体在推理、工具使用和多步交互方面表现出强大的能力，但它们通常从零开始解决任务，未能重用先前经验中的成功策略或失败教训。对收集的经验进行微调可以改善重用，但当出现更强或更合适的执行器时，这种方法不够灵活。我们提出ExpGraph，一个模型无关的经验学习框架，使冻结且可替换的LLM执行器能够通过外部经验重用而无需参数更新来改进。ExpGraph将历史轨迹总结为可重用的技能和失败教训，将它们组织为自进化经验图中的节点，并通过图扩散和效用感知排序检索有用的经验。一个轻量级的检索副驾驶通过强化学习进行训练，使用比较有无检索经验时执行器性能的反馈，同时图根据下游任务结果在线更新。我们在ExpSuite上评估ExpGraph，涵盖问答、数学推理、代码生成以及包括ALFWorld和AppWorld在内的多步智能体环境。ExpGraph在静态任务上，对于较小和较大的执行器，分别比最强基线提高12.2%和4.7%；在智能体环境中提高21.4%和12.7%，同时平均交互步骤减少12.7%和21.6%。消融实验表明，图结构经验、效用感知排序和自适应检索共同实现了跨不同任务和执行器模型的有效经验重用。

英文摘要

Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi-step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior experience. Fine-tuning on collected experience can improve reuse, but it is inflexible when stronger or more suitable executors emerge. We propose ExpGraph, a model-agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without parameter updates. ExpGraph summarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self-evolving experience graph, and retrieves useful experiences through graph diffusion and utility-aware ranking. A lightweight retrieval copilot is trained with reinforcement learning using feedback that compares executor performance with and without retrieved experiences, while the graph is updated online from downstream task outcomes. We evaluate ExpGraph on ExpSuite, covering question answering, mathematical reasoning, code generation, and multi-step agentic environments including ALFWorld and AppWorld. ExpGraph improves over the strongest baseline by 12.2% and 4.7% on static tasks with smaller and larger executors, and by 21.4% and 12.7% in agentic environments, while reducing average interaction steps by 12.7% and 21.6%. Ablations show that graph-structured experience, utility-aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models.

URL PDF HTML ☆

赞 0 踩 0

2605.30711 2026-06-01 cs.CL cs.AI cs.LG stat.ML 版本更新

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

SAGE: 一种用于智能体大语言模型中高效记忆演化的新颖门控机制

Sijia Wang, Dhanajit Brahma, Ricardo Henao

发表机构 * Duke University（杜克大学）

AI总结提出SAGE门控机制，基于von Mises-Fisher密度估计和自适应阈值，将记忆写入控制建模为新奇性检测问题，在LoCoMo上以更低成本实现最优token-F1。

详情

AI中文摘要

智能体大语言模型必须持续决定新提取的事实是应添加、与现有记忆合并还是忽略，然而先前的工作更侧重于检索和存储，而非原则性的写入端控制。我们将记忆演化视为一个新颖性检测问题，并提出SAGE（Spherical Adaptive Gate for memory Evolution），一种用于记忆演化的球形自适应门控机制，它通过基于von Mises-Fisher的密度估计器对记忆嵌入上的候选事实进行评分，并使用跟踪记忆存储几何结构的自适应阈值对其进行路由。SAGE将明确新颖的事实解析为ADD，明确冗余的事实解析为NOOP，仅将不确定的情况发送给LLM合并步骤，从而减少了昂贵的写入时推理。在LoCoMo上，SAGE在所有七个开放权重骨干对比中均实现了对Mem0的最佳平均token-F1，而在GPT-4o-mini上，它将添加阶段的API成本降低了3.4倍，添加阶段延迟降低了2.5倍，且平均评判分数差距很小。作为A-Mem的即插即用二进制门控，SAGE在五个模型上跳过了大约16-18%的LLM调用，且在开放权重骨干上质量变化极小。这些结果表明，新颖性感知的写入控制是提高长期智能体记忆中记忆质量和系统效率的实用杠杆。

英文摘要

Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.

URL PDF HTML ☆

赞 0 踩 0

2605.30693 2026-06-01 cs.CR cs.CL 版本更新

Triaging Threats to Specialized Guardrails

针对专用护栏的威胁分类

Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu, Sicong Jiang, Zihan Wang, Minglai Yang, Zhe Zhao, Muhao Chen

发表机构 * University of California, Davis（加州大学戴维斯分校）

AI总结提出GuardZoo统一基准和RouteGuard路由专家框架，通过将对话分流至专用护栏，解决单一护栏在多种威胁域上的任务干扰问题，提升细粒度检测和泛化能力。

详情

AI中文摘要

构建稳健的安全护栏对于在多样化的实际应用中部署大型语言模型至关重要。然而，这一目标仍然具有挑战性，因为安全风险涵盖异质的威胁领域，而现有数据集仅覆盖零散的风险子集并依赖不一致的分类法。因此，目前尚不清楚现有护栏能否在狭窄的评估设置之外泛化。为了更好地理解护栏模型的稳健性，我们首先引入GuardZoo，一个统一的人工标注基准，包含32,460个样本，覆盖15个不同的不安全类别。在GuardZoo上的评估揭示，单一护栏遭受任务干扰：不同的威胁领域需要不同的决策边界，这些边界难以压缩到单个模型中。因此，我们提出RouteGuard，一个路由-专家框架，将每个对话分流到专门的专家护栏进行威胁特定检测。实验表明，RouteGuard在强护栏基线上提高了细粒度威胁检测，在域外评估下泛化更好，并支持灵活的模块化扩展以应对新兴威胁。

英文摘要

Building robust safety guardrails is essential for deploying Large Language Models across diverse real-world applications. However, this goal remains challenging because safety risks span heterogeneous threat domains, while existing datasets cover only fragmented risk subsets and rely on inconsistent taxonomies. Consequently, it remains unclear whether current guardrails can generalize beyond narrow evaluation settings. To better understand the robustness of guardrail models, we first introduce GuardZoo, a unified human-annotated benchmark with 32,460 samples covering 15 distinct unsafe categories. Evaluation on GuardZoo reveals that monolithic guardrails suffer from task interference: different threat domains require distinct decision boundaries that are difficult to compress into a single model. We therefore propose RouteGuard, a router-expert framework that triages each conversation to specialized expert guardrails for threat-specific detection. Experiments show that RouteGuard improves fine-grained threat detection over strong guardrail baselines, generalizes better under out-of-domain evaluation, and supports flexible modular expansion to emerging threats.

URL PDF HTML ☆

赞 0 踩 0

2605.30690 2026-06-01 cs.CL 版本更新

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

ElasticMem: 将潜在记忆作为LLM智能体的可学习资源

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Nanyang Technological University（南洋理工大学）

AI总结提出ElasticMem框架，通过可学习的弹性潜在记忆分配策略，在记忆密集型问答和具身智能体控制任务中显著提升准确率并降低token开销。

详情

AI中文摘要

长期记忆对于LLM智能体在长时间交互中连贯推理、个性化响应和复用过往经验至关重要。然而，现有的记忆增强方法通常将记忆视为固定资源：文本空间方法将检索到的记忆拼接进上下文窗口，导致大量token开销和对噪声证据的敏感性；而潜在空间方法减少了文本成本，但仍依赖刚性检索或固定容量的记忆接口。这造成了查询相关的记忆效用与固定记忆分配之间的不匹配。我们提出ElasticMem，一种记忆增强的LLM框架，学习将记忆用作弹性潜在资源。ElasticMem构建了一个离线潜在记忆库，包含检索键和内容缓存，从推理器的隐藏状态自适应地检索记忆，通过学得的策略为每个检索到的记忆分配可变潜在预算，并将选中的潜在状态作为软记忆token注入生成过程。整个记忆使用过程通过组相对策略优化与下游任务奖励进行联合优化。我们在MemorySuite上评估ElasticMem，涵盖记忆密集型QA和具身智能体控制。在Qwen2.5-3B-Instruct和Qwen2.5-7B-Instruct骨干网络上，ElasticMem在加权平均QA准确率上分别比最强基线提升26.2%和24.6%，在ALFWorld成功率上分别提升66.3%和27.2%，同时实现了最低的ALFWorld token成本。消融和定性分析进一步表明，自适应检索和弹性预算分配帮助ElasticMem优先考虑有用证据和可迁移计划，超越了刚性余弦相似度。ElasticMem的代码将在https://github.com/ulab-uiuc/ElasticMem发布。

英文摘要

Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource: text-space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent-space approaches reduce textual cost but still rely on rigid retrieval or fixed-capacity memory interfaces. This creates a mismatch between query-dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory-augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agent control. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at https://github.com/ulab-uiuc/ElasticMem.

URL PDF HTML ☆

赞 0 踩 0

2605.30685 2026-06-01 cs.CY cs.AI cs.CL cs.HC 版本更新

How Early Adopters Used Generative AI Worldwide: Variation by Country Income and Language

早期采用者如何在全球范围内使用生成式AI：按国家收入和语言的差异

Madeleine I. G. Daepp, Isaac Slaughter

发表机构 * Microsoft AI Economy Institute（微软人工智能经济研究所）

AI总结基于大规模匿名化AI聊天机器人交互数据，实证分析了不同国家早期采用者在使用生成式AI上的差异，发现教育用途在低收入国家更普遍，休闲用途与收入正相关，且英语交互在非英语主导国家中过度代表，表明语言性能改进可能影响数字鸿沟或跨越式发展。

2605.30675 2026-06-01 cs.CL cs.AI 版本更新

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

大型语言模型不确定性中的人类对齐、校准与激活模式

Kyle Moore, Jesse Roberts, Daryl Watson, William Ward, Grayson Heyboer

发表机构 * Vanderbilt University（范德比大学）； Tennessee Technological University（田纳西技术大学）

AI总结研究大型语言模型的不确定性与人类不确定性的相似性，通过分析行为与内部激活模式，发现模型在多项选择和开放式事实回忆数据集上同时存在对齐与校准，并描述了指令微调的影响。

2605.30673 2026-06-01 cs.CL 版本更新

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

TeachObs：多模态教学观察与模型评估的人工验证基准

Yeil Jeong, Youngjin Yoo, Seobin Sohn, Hyejin Han, Jinseo Lee, Scott Howard, Unggi Lee

发表机构 * Indiana University Bloomington（印第安纳大学布卢明顿分校）； Pai Chai University（培才大学）； Seoul National University（首尔国立大学）； Ewha Womans University（成均馆大学）； University of Wolverhampton（沃尔夫汉普顿大学）； Korea University Sejong Campus（韩国大学世宗校区）

AI总结提出TeachObs基准，包含30节公开课视频的5158个15秒场景，由7名研究者标注39个二值观察码，并评估5个前沿视觉大语言模型在三种任务上的表现。

详情

AI中文摘要

课堂视频包含可观察的教学实践，但其教学和视觉信号很少以适合模型评估的形式组织。我们提出了 extit{TeachObs}，一个用于课堂视频中多模态教学观察的人工验证基准。 extit{TeachObs}包含来自八个国家的30节公开课视频，分为5158个固定的15秒场景。七名研究者用39个二值观察码标注每个场景，涵盖20个视觉码（如手势、板书、指向和视觉材料）和19个非视觉码（如指导、监控、提问、反馈和反思）。基于Krippendorff's alpha，使用可靠性和流行度感知规则构建黄金片段标签。除了片段级标签，三名专家评分员对30节课进行了课程级评分和定性评估，涵盖教学设计、教学实施、学生反应、学习材料和课程收尾，评分员覆盖范围详见正文。利用这两个人工参考层，我们在三个任务上评估了五个具备视觉能力的前沿大语言模型——纯文本片段编码、文本+帧片段编码，以及在LLM作为评判协议下的课程级覆盖评分——发现没有单一模型在所有三个任务中持续优于其他模型，添加中间帧会同时增加每个场景的真实和虚假归因，并且模型评估相对于专家评分员高估了程序清晰的课程。因此， extit{TeachObs}支持细粒度标注基准测试和整课评估，展示了AI系统在哪些方面可以辅助课堂视频分析，以及在哪些方面专家判断仍然必要，涵盖不同学科、课堂形式和标注难度水平。

英文摘要

Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present \textit{TeachObs}, a human-validated benchmark for multimodal teaching observation in classroom videos. \textit{TeachObs} includes 30 public lesson videos from eight countries divided into 5,158 fixed 15-second scenes. Seven researchers annotated each scene with 39 binary observation codes, covering 20 visual codes, such as gesture, board work, pointing, and visual materials, and 19 nonvisual codes, such as instruction, monitoring, questioning, feedback, and reflection. Gold segment labels are constructed using reliability- and prevalence-aware rules based on Krippendorff's alpha. In addition to segment-level labels, three expert raters produced lesson-level ratings and qualitative evaluations of instructional design, instructional delivery, learner response, learning materials, and lesson closure across the 30 lessons, with rater coverage detailed in the body. Using these two human reference layers, we evaluate five vision-capable frontier LLMs across three tracks - text-only segment coding, text + frame segment coding, and lesson-level coverage scored under an LLM-as-judge protocol - and find that no single model consistently outperforms others across all three tracks, that adding a mid-frame inflates both true and false attributions per scene, and that model evaluations over-rate procedurally clear lessons relative to expert raters. \textit{TeachObs} therefore supports both fine-grained annotation benchmarking and whole-lesson evaluation, showing where AI systems can assist classroom video analysis and where expert judgment remains necessary across varied subjects, classroom formats, and annotation difficulty levels.

URL PDF HTML ☆

赞 0 踩 0

2605.30668 2026-06-01 cs.CL cs.AI 版本更新

CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation

CobSeg: 对话主题分割的连贯性边界建模

Sijin Sun, Liangbin Zhao, Jiaxiang Cai, Ming Deng, Mingyu Luo, Xiuju Fu

发表机构 * Institute of High Performance Computing, Agency for Science, Technology and Technology（高性能计算研究所，科技局）； Shanghai Univeristy（上海大学）； Fudan University（复旦大学）

AI总结提出CobSeg多分支架构，通过分离连贯性语义与词汇边界转换并利用边界信息加权和主题连贯性线索，在无需LLM调用下提升对话主题分割性能。

Comments 8 pages with appindx. Under review

详情

AI中文摘要

对话主题分割在许多人类-AI协作应用中至关重要，需要识别异质边界线索，包括话语边缘附近的词汇转换和跨话语的语义不连续性。现有的话语模型常常稀释这些局部词汇信号。我们提出CobSeg，一种新颖的多分支架构，它将连贯性层面的语义连续性与词汇边界转换分离，并通过方向性边界预测恢复两者。CobSeg进一步使用边界信息加权来强调高效用的话语位置，并融合了基于语料库的主题连贯性线索与学习到的组合权重。尽管CobSeg在有监督的金标准边界训练和自动诱导边界的伪标签设置下作为紧凑的可训练分割器进行评估，它在推理过程中无需LLM调用即可实现增强的边界预测。在五个基准测试中，它改进了$P_k$和$W_d$，特别是在局部词汇线索显著时：在金标准监督下，它在VHF上将$P_k$降低了0.7个点，$W_d$降低了0.6个点，并在DialSeg711上达到了$P_k$为1.0；在诱导边界下，它在VHF上将$P_k$降低了14.8个点，在DialSeg711上降低了1.5个点，在TIAGE上降低了1.1个点，优于先前的非LLM方法。

英文摘要

Dialogue topic segmentation is critical in many human-AI collaborative applications which requires identifying heterogeneous boundary cues, including lexical transitions near utterance edges and semantic discontinuities across utterances. Existing utterance models often dilute these local lexical signals. We propose CobSeg, a novel multi-branch architecture that separates coherence-level semantic continuity from lexical boundary transitions and recovers both through directional boundary prediction. CobSeg further uses boundary informativeness weighting to emphasize high-utility utterance positions, and incorporates a corpus-derived topic coherence cue with learned combination weights. While CobSeg is evaluated as a compact trainable segmenter under supervised gold-boundary training and a pseudo-label setting with automatically induced boundaries, it performs enhanced boundary prediction without LLM calls during inference. Across five benchmarks, it improves $P_k$ and $W_d$ particularly when local lexical cues are prominent: under gold supervision, it reduces $P_k$ by 0.7 points and $W_d$ by 0.6 points on VHF, and reaches $P_k$ of 1.0 on DialSeg711; with induced boundaries, it reduces $P_k$ by 14.8 points on VHF, by 1.5 points on DialSeg711, and by 1.1 points on TIAGE, outperforming prior non-LLM approaches.

URL PDF HTML ☆

赞 0 踩 0

2605.30654 2026-06-01 cs.CL cs.AI cs.HC 版本更新

EUDAIMONIA: Evaluating Undesirable Dynamics in AI

EUDAIMONIA: 评估AI中的不良动态

Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast, Ravi Iyer, Robin Jia

发表机构 * University of Southern California（南加州大学）

AI总结提出Social AI Design Code框架，并通过EUDAIMONIA基准测试评估22个最新LLM在社交互动中对用户福祉的符合程度，发现即使最强模型也违反约30%的设计要求。

详情

AI中文摘要

大型语言模型（LLM）越来越多地被用作陪伴、情感披露和人际建议的对话伙伴，但这些互动的社会动态可能造成能力导向或传统安全评估无法捕捉的伤害。我们引入了Social AI Design Code，这是一个评估LLM在社交互动中是否符合用户福祉的框架，包括它们是否鼓励有害的亲密关系、依赖或长时间参与。为了在自然且多样化的用户-LLM互动中评估这些风险，我们通过弱到强过滤、多模型重新标记和受控重写，从WildChat构建了包含969个用户输入和3,147个设计要求违规检查的基准测试EUDAIMONIA，将代码操作化。评估22个最近的LLM，我们发现即使最强的模型Claude-Opus-4.7和GPT-5.5也分别违反了30.7%和27.2%的检查。扩展思考并未降低违规率，表明这些失败是持久的社会对齐问题，而非仅通过测试时推理就能解决的缺陷。

英文摘要

Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engagement. To evaluate these risks in natural and diverse user-LLM interactions, we operationalize the code with EUDAIMONIA, a benchmark of 969 user inputs and 3,147 design-requirement violation checks built from WildChat through weak-to-strong filtration, multi-model relabeling, and controlled rewriting. Evaluating 22 recent LLMs, we find that even the strongest models, Claude-Opus-4.7 and GPT-5.5, violate 30.7% and 27.2% of checks, respectively. Extended thinking does not reduce violation rates, suggesting that these failures are persistent social-alignment problems rather than deficits solvable through test-time reasoning alone.

URL PDF HTML ☆

赞 0 踩 0

2605.30653 2026-06-01 cs.CL 版本更新

AMNESIA: 一个大规模医学遗忘基准套件与疾病知情分析

Saeedeh Davoudi, Reihaneh Iranmanesh, Ophir Frieder, Nazli Goharian

发表机构 * IR Lab, Computer Science Department, Georgetown University, Washington D.C.（信息检索实验室，计算机科学系，乔治·华盛顿大学，华盛顿特区）

AI总结提出AMNESIA，首个大规模开源医学遗忘基准，包含70,560个问答对，评估四种遗忘方法，发现个体遗忘会侵蚀同病患者的其他知识。

详情

AI中文摘要

医学知识不断演变。这需要更新或选择性遗忘已训练的医学LLM中编码的信息。机器遗忘旨在无需完全重新训练即可移除特定训练数据对模型的影响。然而，现有的遗忘基准依赖于合成或小规模通用数据，导致临床遗忘研究不足。我们引入AMNESIA，首个大规模、开源医学遗忘基准，包含来自11种疾病类别、8,820份患者笔记的70,560个问答对。AMNESIA包括测试直接回忆的事实问题和测试临床推理的推理问题。我们用它来评估四种广泛使用的遗忘方法，分别在随机患者和疾病级别，并引入一个新的指标来检测医学术语的泄露。我们表明，遗忘个体患者会侵蚀具有相同病症的其他患者的知识，这需要能够更好地区分患者与共享临床知识的方法。

英文摘要

Medical knowledge is continuously evolving. This creates a need to update or selectively forget information encoded in already-trained medical LLMs. Machine unlearning aims to remove the influence of specific training data from a model without full retraining. Yet, existing unlearning benchmarks rely on synthetic or small-scale general data, leaving clinical unlearning understudied. We introduce AMNESIA, the first large-scale, open source benchmark for medical unlearning, with 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories. AMNESIA includes both factual questions testing direct recall and reasoning questions testing clinical inference. We use it to evaluate four widely used unlearning methods at both random patient and disease-level, and introduce a new metric for detecting leakage of medical terminology. We show that unlearning individual patients erodes knowledge of others with the same condition, calling for methods that can better separate patients from shared clinical knowledge.

URL PDF HTML ☆

赞 0 踩 0

2605.30590 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

反事实评估揭示临床LLM和智能体的隐藏能力画像

Matt Turk

发表机构 * Protege Data Lab（Protege数据实验室）

AI总结提出因果敏感性评分（CSS），通过沿五个临床维度变异肿瘤病例来评估模型是否按预期方向更新推荐，发现与覆盖度指标排名相反，并揭示所有前沿模型在手术状态干预上的安全盲点。

Comments Accepted to RLEval @ ACM CAIS 2026 (Workshop on Methods and RL Environments for Evaluating AI Agents) and selected for an invited talk based on reviewer ratings. 4-page short paper + appendix

详情

AI中文摘要

两个临床AI系统在基于覆盖度的评分标准上得分几乎相同，但当患者输入变化时行为却截然不同：一个更新其推荐以匹配新的临床信号，而另一个无论输入如何都产生相同输出。我们引入因果敏感性评分（CSS），这是一个预注册的干预性指标，沿五个临床有意义的维度——生物标志物翻转、先前治疗失败、生物标志物移除、手术状态变化和分期扰动——变异肿瘤肿瘤委员会病例，并使用{0, 0.5, 1.0}量表对每个模型是否在预注册的正确方向上更新其推荐进行评分。与基于覆盖度的加权召回指标共识匹配评分（CMS）相比，来自三个实验室的六个前沿模型在224个病例的单次推理中评估，排名几乎完全相反：所有六个模型排名发生变化，CMS最差的模型成为CSS最好的模型，而一个中上CMS模型在CSS上排名最后。我们进一步揭示了一个普遍的安全盲点：每个前沿模型在手术状态干预上失败（D家族最多17.2%的CSS），这是CMS未暴露的发现。该指标也适用于使用工具的智能体：在ReAct风格的实验中，工具使用改善了六个模型中五个的CSS（+2.5到+20.3个百分点），然而CSS最低的模型检索相同的图表部分但仍未能更新其推荐——揭示了仅在反事实评估下可见的结构性响应缺陷。跨评判者复制和三位评估者的医学专业验证确认了总体发现。像CSS这样的干预性预注册指标补充了临床AI智能体的基于覆盖度的评估：它们捕捉了覆盖度指标遗漏的响应性，并为未来的智能体强化学习系统提供了候选的密集奖励信号。

英文摘要

Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations - and scores whether each model updates its recommendations in the pre-registered correct direction using a {0, 0.5, 1.0} scale. Benchmarked against the Consensus Match Score (CMS), a coverage-based weighted recall metric, six frontier models from three labs evaluated in single-shot inference across 224 cases rank in nearly opposite orders: all six models change rank, the CMS-worst model becomes CSS-best, and one upper-mid CMS model ranks last on CSS. We further surface a universal safety blind spot: every frontier model fails on surgery-status interventions (at most 17.2% CSS on Family D), a finding CMS does not expose. The metric also transfers to tool-using agents: in a ReAct-style experiment, tool use improves CSS for five of six models (+2.5 to +20.3 percentage points), yet the lowest-CSS model retrieves the same chart sections and still fails to update its recommendations - revealing a structural responsiveness deficit visible only under counterfactual evaluation. Cross-judge replication and three-rater medical-professional validation confirm the aggregate findings. Interventional pre-registered metrics like CSS complement coverage-based evaluation for clinical AI agents: they capture responsiveness that coverage metrics miss and offer a candidate dense reward signal for future agentic RL systems.

URL PDF HTML ☆

赞 0 踩 0

2605.30589 2026-06-01 cs.CL cs.AI 版本更新

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

ImmigrationQA: 一个基于来源的数据集及面向美国移民法的小型模型适配

Nazarii Shportun

发表机构 * Independent Researcher（独立研究员）

AI总结本文构建了基于来源的问答数据集ImmigrationQA（17,058对，覆盖13个移民子领域），并通过参数高效LoRA微调Llama 3.2 3B Instruct模型，在程序性子领域取得显著提升，但复杂法律推理仍较弱。

Comments 12 pages, 4 tables. Dataset (17,058 QA pairs), fine-tuned model, and code are publicly released

详情

AI中文摘要

美国移民法涵盖数千页的官方政策、联邦法规和程序指南，这些内容频繁变化，且对缺乏法律代表的申请人影响重大。我们描述了ImmigrationQA的构建过程，这是一个基于来源的问答数据集，包含13个移民子领域的17,058对问答，以及使用参数高效LoRA对Llama 3.2 3B Instruct模型在该数据集上的微调。语料库来自11个主要和次要来源——包括USCIS政策手册、8 CFR、BIA先例决定和社区问答——产生了10,056份经过验证的规范文档和18,308个文本块。使用Claude Sonnet 4.6通过五种模式特定提示从这些文本块生成结构化问答对，其中22对因来源跨度重叠不足被拒绝。微调模型在993对的保留测试集上使用LLM-as-judge评分进行评估，基于101个示例的分层样本。微调模型平均得分为1.08/3.0（16.8%完全正确；101示例分层评估），而Llama 3 8B基础模型得分为0.85/3.0（4%完全正确），平均分相对提升27%；零样本Claude Sonnet基线得分为1.52/3.0（25%完全正确）。微调模型在程序性子领域（旅行证件、身份调整、非移民签证）表现出集中改进，但在复杂法律推理和时效性统计方面仍然较弱。整个流程的云计算成本约为29美元。所有工件——数据集、模型、代码和提示模板——均已公开发布。该系统不能替代法律咨询，且不反映语料库抓取日期后的法规变化。

英文摘要

U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA. The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct). The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.

URL PDF HTML ☆

赞 0 踩 0

2605.30582 2026-06-01 cs.CL 版本更新

AI for Monitoring and Classifying Data Used in Research Literature

AI用于监控和分类研究文献中使用的数据

Rafael Macalaba, Aivin V. Solatorio

发表机构 * World Bank（世界银行）； Office of the World Bank Group（世界银行集团办公室）； Development Data Group（发展数据组）

AI总结提出基于GLiNER的多任务框架，结合合成数据生成和LLM重验证，实现研究文献中数据集提及的提取、关系识别和使用上下文分类，以解决数据集使用监控中的标注稀缺和引用不一致问题。

详情

AI中文摘要

虽然Google Scholar和Semantic Scholar等平台追踪学术论文的引用，但尚无类似基础设施用于监控研究文献中的数据集使用情况，导致数据使用格局在很大程度上不透明。解决这一差距对于透明度、可重复性和影响监控至关重要，但进展受到不一致的引用实践、稀缺的标注数据以及文献中对数据集的模糊引用的阻碍。传统的NLP方法难以应对这些挑战，促使转向更具适应性、语义更丰富的模型。基于先前使用LLM进行数据提及检测和合成数据进行引导训练的工作，本文提出了一种用于可扩展数据集监控的更新方法。我们引入了一个基于GLiNER的多任务框架，联合执行数据集提及提取、关系识别和使用上下文分类。为解决标签稀缺问题，该流程利用合成数据生成产生训练示例，并基于LLM的重验证过滤错误提及并强制标签一致性，共同提高了整个训练流程的可靠性、覆盖率和输出一致性。这项工作推进了用于监控研究文献中数据使用的开源工具的开发，有助于实现可泛化、无约束的数据集引用追踪的更广泛目标。

英文摘要

While platforms like Google Scholar and Semantic Scholar track citations for academic papers, no comparable infrastructure exists for monitoring dataset usage in research literature, leaving the landscape of data use largely opaque. Addressing this gap is critical for transparency, reproducibility, and monitoring of impact, yet progress is hindered by inconsistent citation practices, scarce labeled data, and ambiguous references to datasets in the wild. Traditional NLP approaches struggle with these challenges, motivating the shift toward more adaptive, semantically rich models. Building on prior work using LLMs for data mention detection and synthetic data for bootstrapping training, this paper presents an updated methodology for scalable dataset monitoring. We introduce a multitask GLiNER-based framework that jointly performs dataset mention extraction, relation identification, and usage-context classification. To address label scarcity, the pipeline leverages synthetic data generation to produce training examples and LLM-based revalidation to filter incorrect mentions and enforce labeling consistency, together improving reliability, coverage, and output consistency across the training pipeline. This work advances the development of open-source tools for monitoring data use in research literature, contributing to the broader goal of generalizable, unconstrained dataset citation tracking.

URL PDF HTML ☆

赞 0 踩 0

2605.30580 2026-06-01 cs.CL cs.LG 版本更新

Speculative Decoding Across Languages

跨语言的推测解码

Nirajan Paudel, Michael Ginn, Luc De Nardi, Alexis Palmer

发表机构 * University of Colorado（科罗拉多大学）

AI总结本文研究了通过微调草稿模型或使用n-gram模型来提高非英语语言中推测解码效率的策略，发现任务特定蒸馏虽能提升效率但泛化性差，而n-gram模型尽管接受率较低，但由于生成速度快，始终能提供显著的加速效果。

Comments 10 pages, 11 figures, submitted to ACL ARR May 2026

详情

AI中文摘要

推测解码已成为大型语言模型（LLM）推理的关键组成部分，通过草拟多个令牌并并行验证，实现更快的生成。然而，小型草稿模型往往在多语言能力上严重不足。因此，在生成非英语文本时，推测解码的效率远低于英语。我们比较了三种提高十一种语言推测解码效率的策略：在任务特定数据（翻译）上微调草稿模型；在未标记的单语语料库上微调草稿模型；以及在相同单语语料库上训练简单的n-gram草稿模型。我们在翻译（从英语到目标语言）和保留任务故事生成上评估效率。我们发现，虽然任务特定蒸馏可以显著提高效率，但蒸馏模型在新任务上泛化能力差。与此同时，n-gram草稿模型尽管接受率较低，但由于草稿生成速度快得多，始终能提供大的加速。

英文摘要

Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disproportionately poor multilingual capabilities. Thus, when generating text in a non-English language, speculative decoding is far less effective. We compare three strategies to improve speculative decoding efficiency for eleven languages: finetuning the draft model on task-specific data (translation); finetuning the draft model on unlabeled monolingual corpora; and training simple n-gram draft models on the same monolingual corpora. We evaluate efficiency on translation (from English into the target language) and the held-out task of story generation. We find that while task-specific distillation can significantly improve efficiency, distilled models generalize poorly to a new task. Meanwhile, n-gram draft models, despite lower acceptance rates, consistently provide large speed-ups due to much faster draft generation.

URL PDF HTML ☆

赞 0 踩 0

2605.30574 2026-06-01 cs.CL 版本更新

Probing the Prompt KV Cache: Where It Becomes Dispensable

探测提示KV缓存：何时变得可有可无

Vinayshekhar Bannihatti Kumar, Manoj Ghuhan Arivazhagan, Disha Makhija, Rashmi Gangadharaiah

发表机构 * AWS AI Labs（AWS人工智能实验室）

AI总结通过控制实验发现，提示KV缓存的冗余主要源于聊天模板的形式而非内容，上层缓存可用中性填充模板的KV缓存替代而不损失精度。

2605.30568 2026-06-01 cs.CL 版本更新

Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

生成与精炼动态评估准则用于LLM-as-a-Judge

Zijie Wang, Eduardo Blanco

发表机构 * University of Arizona（亚利桑那大学）； Department of Computer Science（计算机科学系）

AI总结提出无需人工标注的自动生成细粒度评估准则方法，通过数据集级和实例级粒度生成，并利用元评判奖励信号迭代微调准则生成器，在成对和逐点评估中均超越现有基线。

2605.30557 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

看见不等于知道：视觉语言模型是否知道何时不回答空间问题（以及为什么）？

Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal

发表机构 * UNC Chapel Hill（北卡罗来纳大学教堂山分校）； Google Research（谷歌研究）

AI总结针对视觉语言模型在空间推理中过度自信回答的问题，提出SpatialUncertain框架，通过遮挡和视角歧义两种挑战，评估模型是否知道何时应弃权以及如何寻找可靠证据。

Comments Website: https://zhangyuejoslin.github.io/spatialuncertain/

详情

AI中文摘要

空间推理是部署在真实环境中的视觉语言模型（VLM）的基本能力。然而，视觉观察本质上是对3D世界的有限表示：遮挡可能使物体不可见，视角可能使几何属性产生误导。尽管如此，现有的空间推理基准通常假设观察是充分且可靠的，侧重于模型是否产生正确答案，而不是它们是否认识到问题无法回答以及需要哪些额外观察。在这项工作中，我们通过构建一个受控评估框架SpatialUncertain来挑战这一假设，并引入两种观察挑战：（1）遮挡，隐藏目标信息；（2）视角歧义，产生误导性视觉线索。对于每种配置，我们设计在清晰观察下可回答但在引入挑战下需要弃权的空间问题。我们进一步评估模型是否能识别哪些额外视角可以解决视角歧义。我们在多种前沿开源和闭源VLM上的结果揭示了两个一致的失败模式。首先，模型倾向于过度自信地回答，即使在视觉证据不完整或具有误导性时也试图解决空间推理任务，在遮挡下平均准确率约为30%，在视角歧义下低于10%。其次，即使有额外视角可用，一些模型在识别哪些视角能提供可靠证据方面表现接近随机。总之，我们的发现呼吁超越答案正确性，转向评估模型是否知道何时弃权以及如何寻找可靠证据。

英文摘要

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30\% under occlusion and below 10\% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.

URL PDF HTML ☆

赞 0 踩 0

2605.30545 2026-06-01 cs.CL 版本更新

Refining Word-Based Grammatical Error Annotation for L2 Korean

针对L2韩语的基于词的语法错误标注改进

Jungyeul Park, Kyungtae Lim, Wonjun Oh, Benjamin Nguyen, Zihao Huang, Mengyang Qiu, Jayoung Song

发表机构 * Korea Advanced Institute of Science & Technology（韩国科学技术院）； The University of British Columbia（不列颠哥伦比亚大学）； Saint Elizabeth University（圣埃利赞大学）； The Pennsylvania State University（宾夕法尼亚州立大学）

AI总结通过解决现有资源中的三个问题（表面目标实现、韩语特定编辑标注和单参考评估），改进基于词的语法错误标注，并验证了改进资源在困惑度、编辑一致性和多参考评估上的有效性。

详情

AI中文摘要

韩语语法错误纠正（K-GEC）在基于词的评估与许多学习者错误的语素层面之间存在结构不匹配。后置词和动词结尾附着于词汇宿主，但它们编码的语法关系必须在纠正和评估中体现。本文通过解决现有资源中的三个相关问题来改进L2韩语的基于词语法错误标注：表面目标实现、韩语特定编辑标注和单参考评估。我们在形态约束实现规则下从国立韩语研究院（NIKL）L2语料库重建目标句子，并将其语素级标注转换为词级\texttt{m2}编辑。然后我们定义了一个韩语ERRANT风格的标注方案，保留MRU核心的同时区分功能语素错误、拼写错误、词边界错误和词序错误。我们还为KoLLA语料库增加了一个额外的参考纠正，为韩语GEC提供了多参考评估设置。实证验证表明，改进的NIKL目标具有更低的困惑度，转换后的\texttt{m2}文件与源-目标编辑表示的一致性更高，并且改进的资源在相同模型设置下提高了基于KoBART的纠正性能。多参考KoLLA评估进一步减少了对偏离单一参考的有效纠正的惩罚，尤其对于神经和提示式GEC系统。这些结果表明，韩语GEC评估不仅依赖于纠正模型，还依赖于反映韩语形态、空格和纠正变异性的参考数据和编辑标注。

英文摘要

Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. Postpositions and verbal endings are bound to lexical hosts, but they encode grammatical relations that must be represented in correction and evaluation. This paper refines word-based grammatical error annotation for L2 Korean by addressing three connected problems in existing resources: surface target realization, Korean-specific edit annotation, and single-reference evaluation. We reconstruct target sentences from the National Institute of Korean Language (NIKL) L2 corpus under morphologically constrained realization rules and convert its morpheme-level annotations into word-level \texttt{m2} edits. We then define a Korean ERRANT-style annotation scheme that preserves the MRU core while distinguishing functional morpheme errors, spelling errors, word boundary errors, and word order errors. We also augment the KoLLA corpus with an additional reference correction, yielding a multi-reference evaluation setting for Korean GEC. Empirical validation shows that the refined NIKL targets yield lower perplexity, the converted \texttt{m2} files achieve higher agreement with source-target edit representations, and the refined resources improve KoBART-based correction under the same model setting. Multi-reference KoLLA evaluation further reduces the penalty imposed on valid corrections that diverge from a single reference, especially for neural and prompted GEC systems. These results show that Korean GEC evaluation depends not only on correction models, but also on reference data and edit annotations that reflect Korean morphology, spacing, and correction variability.

URL PDF HTML ☆

赞 0 踩 0

2605.30529 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

通用嵌入还是特定嵌入，哪个更好？非英语语言临床编码搜索的实证研究

David Rey-Blanco, Roberto Cruz

发表机构 * TietAI

AI总结本研究通过使用大型生成语言模型生成的合成数据微调双语编码器，构建两阶段检索器，解决了非英语语言临床编码检索中召回率下降的问题，并在多语言基准上取得了优于BioBERT-ST的性能。

Comments 24 pages, 12 figures, 6 tables

详情

AI中文摘要

用于语义搜索的句子嵌入模型绝大多数是在英语语料库上开发和评估的。当应用于其他语言的临床检索——特别是ICD-10-CM/CIE-10代码的检索——召回率会下降，而这种下降往往被聚合基准所掩盖。我们研究大型生成语言模型是否可以作为数据工厂来缩小这一差距。我们构建了一个两阶段检索器（双编码器后接交叉编码器重排序器），该检索器在Gemini生成的合成数据（涵盖英语、西班牙语、加泰罗尼亚语、意大利语、葡萄牙语和法语）上对西班牙生物医学编码器（PlanTL-GOB-ES/bsc-bio-ehr-es）进行微调，并与BioBERT-ST和未调优的西班牙编码器进行评估。仅双编码器在MRR（0.876 vs. 0.866）上匹配BioBERT-ST，并在R@3（0.650 vs. 0.626）和R@5（0.804 vs. 0.790）上超越它，且无需英语生物医学预训练。添加交叉编码器重排序器将聚合R@5提升至0.822，并在五种语言中的四种上占据主导地位（西班牙语+0.017，加泰罗尼亚语+0.033，法语+0.018，葡萄牙语+0.037），但以英语的小幅回归为代价。这种权衡在临床上是可接受的：葡萄牙语的R@5达到0.829，而BioBERT-ST为0.714。贡献：一个基于LLM生成数据构建领域特定医学检索器的开放配方；学习增益的量化（MRR从0.755到0.876，+15.9%，使用约19,500个合成对）；以及按语言和排名对增益集中区域的刻画。

英文摘要

Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap. We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining. Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.

URL PDF HTML ☆

赞 0 踩 0

2605.30526 2026-06-01 cs.LG cs.CL 版本更新

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

测量、定位和消融LLMs中的对齐特征

Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür, Nick Feamster

发表机构 * University of Chicago（芝加哥大学）； University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Toyota Technological Institute at Chicago（芝加哥丰田技术研究所）

AI总结研究通过对比人类文本、基模型和对齐模型生成，发现对齐训练引入AI风格特征，并提出PASTA方法通过消融对齐方向来降低AI检测率。

详情

AI中文摘要

对齐语言模型通常表现出可识别的AI风格，但其与后训练和内部表示的联系尚不清楚。本文研究后训练是否引入或放大了AI风格规律，以及这些规律是否具有局部内部特征。为此，我们在匹配的人类源前缀下比较人类文本、基模型生成和对齐模型生成。对齐生成显示出比基生成更低的人类语料库亲和力和更高的AI检测率，表明后训练使生成文本偏离人类语料库风格，转向检测器可见的AI风格文本。然后我们引入PASTA（后训练对齐特征目标消融），一种无需训练的方法，通过对齐-基残差对比估计后训练对齐特征，并在解码过程中消融相应方向。在11个对齐模型和6个AI检测器上，PASTA降低了对大多数对齐模型的检测率；该效果在检测器间良好迁移，且不被随机方向复现。定性分析表明，PASTA生成保持相关性和连贯性，同时表现出更大的风格变化。这些结果共同表明，后训练的AI风格效果可以通过激活消融进行测量、定位和因果测试。

英文摘要

Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies AI-like stylistic regularities and whether these regularities have a localized internal signature. To this end, we compare human text, base-model generations, and aligned-model generations under matched human-source prefixes. Aligned generations show lower human-corpus affinity and higher AI-detection rates than base generations, suggesting that post-training shifts generated text away from human-corpus style and toward detector-visible AI-like text. We then introduce PASTA (Post-training Alignment Signature Targeted Ablation), a training-free method that estimates a post-training alignment signature from aligned-base residual contrasts and ablates the corresponding direction during decoding. Across 11 aligned models and 6 AI detectors, PASTA lowers the detection rate for most aligned models; this effect transfers well across detectors and is not reproduced by random directions. Qualitative analysis suggests that PASTA generations remain relevant and coherent while exhibiting greater stylistic variation. Together, these results show that AI-like stylistic effects of post-training can be measured, localized, and causally tested through activation ablation.

URL PDF HTML ☆

赞 0 踩 0

2605.30523 2026-06-01 cs.LG cs.AI cs.CC cs.CL cs.FL 版本更新

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

重新审视填充Transformer的表达能力：哪些架构选择重要，哪些不重要

Anej Svete, William Merrill, Ryan Cotterell, Ashish Sabharwal

发表机构 * ETH Zürich（苏黎世联邦理工学院）； Allen Institute for AI（人工智能研究所）

AI总结本文通过连接布尔电路，系统研究了填充Transformer的表达能力，发现数值精度和模型深度是影响表达能力的主要因素，而注意力类型、模型宽度和均匀性等架构选择对表达能力影响不大。

详情

AI中文摘要

近期工作通过连接布尔电路描述了Transformer能计算和不能计算的内容，但现有结果缺乏精确刻画，且对建模选择敏感。填充Transformer——在其输入后附加填充符号如“...”——通过为自适应并行计算提供多项式空间，成为建立与电路类等价关系的有用工具。然而，目前仅研究了有限的填充Transformer理想化模型，这些等价关系在注意力类型、模型宽度和均匀性变化下的稳健性仍待探索。我们发现，在实际假设下，填充Transformer对所有这些变化都出奇地稳健，并确定数值精度和模型深度是影响表达能力的主要因素。具体地，我们证明多项式填充的L-均匀常数精度Transformer等价于L-均匀AC⁰，而增长精度的Transformer达到L-均匀TC⁰，与宽度无关。此外，循环机制允许类似电路的顺序处理：log^d N次循环的常数精度Transformer达到FO-均匀AC^d，增长精度的达到FO-均匀TC^d。有趣的是，宽度或精度超过对数增长并不会增加表达能力，且我们所有结果对softmax和平均硬注意力Transformer均成立。

英文摘要

Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as ``...'' are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\text{L-uniform}$ constant-precision transformers are equivalent to $\text{L-uniform AC}^0$, while growing-precision ones achieve $\text{L-uniform TC}^0$ regardless of width. Furthermore, looping enables sequential processing analogous to circuits: $\log^d N$-looped constant-precision transformers reach $\text{FO-uniform AC}^d$, and growing-precision ones reach $\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.

URL PDF HTML ☆

赞 0 踩 0

2605.30521 2026-06-01 cs.CL 版本更新

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

评估使用模拟工具调用隔离不可信提示输入

David Gros, Adam Gleave

AI总结通过自动化红队测试，研究将不可信内容包装在模拟工具调用中作为隔离措施是否能提高大语言模型的鲁棒性，发现该方法通常无效甚至有害。

详情

AI中文摘要

大型语言模型必须频繁处理不可信输入，例如判断另一个模型的答案或在对抗压力下运行垃圾邮件和有害分类器等任务。这些输入通常以字符串格式直接插入到提示模板中，使系统容易受到操纵。目前来自 OpenAI 等主要提供商的 LLM 规范根据指令层次区分可信度，从系统消息（最可信）到工具结果（最不可信）。一种可能的自然缓解措施是将不可信内容包装在模拟工具调用中作为隔离。我们通过自动化红队搜索，在七个模型和三个 LLM 作为评判者的任务上探索了这一假设。与我们的假设相反，工具包装并未广泛提高鲁棒性。在二元评估任务（GSM8K 评分）中，它通常会增加攻击成功率，这似乎是指令层次的倒置。在标量和成对任务中，效果较小且依赖于模型，没有测试模型得到可靠帮助，并且几个模型显示出倒置。我们建议在部署系统中评估这一局限性，并从长远来看，追求更强的指令层次训练或新的不可信输入原语。

英文摘要

Large language models must frequently process untrusted inputs, such as judging an answer from another model or running tasks like spam and harm classifiers while under adversarial pressure. These inputs are often string-formatted directly into a prompt template, leaving systems fragile to manipulation. Current LLM specs from major providers like OpenAI distinguish trustworthiness along an Instruction Hierarchy, from System messages (most trusted) to Tool Results (least trusted). A possible natural mitigation is to wrap untrusted content in a mock tool call as a quarantine. We explore this hypothesis with an automated redteaming search over static attack strings across seven models and three LLM-as-a-Judge tasks. Counter to our hypothesis, tool-wrapping does not broadly improve robustness. On a binary evaluation task (GSM8K grading) it typically increases attack success rates, an apparent inversion of the instruction hierarchy. On scalar and pairwise tasks the effect is smaller and model-dependent, with no tested model reliably helped, and several showing inversion. We recommend evaluating this limitation in deployed systems, and longer-term, pursuing stronger Instruction Hierarchy training or new untrusted-input primitives.

URL PDF HTML ☆

赞 0 踩 0

2605.30514 2026-06-01 cs.LG cs.CL 版本更新

MAAT: Multi-phase Adapter-Aware Targeted Unlearning

MAAT: 多阶段适配器感知的定向遗忘学习

Suryash Yagnik, Shubham Gaur, Saksham Thakur, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Indian Institute of Information Technology, Bhopal, India（印度比哈尔理工学院）； University of California, Santa Cruz, USA（加州大学圣克鲁兹分校）； Independent Researcher（独立研究者）； Stanford University, USA（斯坦福大学）； BITS Pilani Goa, India（比斯拉米印度学院）

AI总结针对现有机器遗忘评估中因果知识（Why类）样本极少导致评估失衡的问题，提出5WBENCH平衡基准和MAAT多阶段框架，首次在Why类知识上同时实现高遗忘与高保留。

Comments 16 pages, 4 figures, 10 tables

详情

AI中文摘要

当英语重写地方知识：大语言模型中的全球叙事主导

Md Arid Hasan, Ruwad Naswan, Farhan Samir, Sharifa Sultana, Syed Ishtiaque Ahmed

发表机构 * University of Toronto（多伦多大学）； BUET（巴特利特大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本研究通过构建孟加拉语文化数据集CulturalNB，评估大语言模型在低资源文化背景下的跨语言知识一致性，发现英语提问会系统性地增加全球替代和制度框架，减少地方视角覆盖，表明文化失败不仅是知识缺失，更是根基和叙事优先级问题。

Comments Submitted to ARR

详情

AI中文摘要

大语言模型（LLMs）被广泛用作跨语言知识接口。然而，植根于文化的问题往往反映全球主导叙事而非地方背景。我们将这种失败模式称为孟加拉语（一种低资源文化背景）中的 extit{全球叙事主导}。我们引入了 exttt{CulturalNB}，一个包含717个手工策划的孟加拉语文化实例的数据集，配有平行的孟加拉语-英语问答对、支持证据、元数据和社会文化注释。通过仅问题和基于证据的提示，我们使用人类和两个独立的LLM裁判，在跨语言一致性、语言锚定、全球替代、制度偏见和认知视角覆盖等指标上评估了九个最先进的LLM。结果表明，用英语提问会系统性地增加全球替代和制度框架，同时减少地方视角覆盖。地方证据提高了事实一致性和视角覆盖，但并未消除语言引起的认知偏移。这些发现表明，LLM中的文化失败不仅是知识缺失错误，更是根基和叙事优先级失败。

英文摘要

Large language models (LLMs) are widely used as cross-lingual knowledge interfaces. However, culturally grounded questions often reflect globally dominant narratives rather than local contexts. We study this failure mode as \textit{global narrative dominance} in Bangla, a low-resource cultural context. We introduce \texttt{CulturalNB}, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla--English question--answer pairs and supporting evidence, metadata, and sociocultural annotations. Using question-only and evidence-based prompting, we evaluate nine state-of-the-art LLMs with human and two independent LLM judges across metrics for cross-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage. Results show that questions asked in English systematically increase global substitution and institutional framing while reducing local perspective coverage. Local evidence improves factual consistency and perspective coverage, but does not eliminate language-induced epistemic shifts. These findings suggest that cultural failures in LLMs are not only missing-knowledge errors but also failures of grounding and narrative prioritization.

URL PDF HTML ☆

赞 0 踩 0

2605.30478 2026-06-01 cs.SE cs.CL 版本更新

Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback

通过验证反馈的强化学习改进用于代码生成的小型语言模型

Egor Skopin, Evgeny Kotelnikov

发表机构 * Vyatka State University（沃伊亚提卡州立大学）； European University at St. Petersburg（圣彼得堡欧洲大学）

AI总结本研究通过基于验证奖励的强化学习（RLVR）结合LoRA微调，在MBPP基准上对小型语言模型（Qwen3-0.6B和Llama3.2-1B）进行Python代码生成实验，发现奖励设计对功能正确性和行为诊断有显著影响，并提出组合奖励可稳定权衡正确性与风格约束。

Comments Accepted for AINL-2026 conference

详情

AI中文摘要

具有可验证奖励的强化学习（RLVR）使用程序可检查的信号（如单元测试结果）训练语言模型，从而能够直接优化代码生成的功能正确性。我们在MBPP基准上使用两个小型模型（Qwen3-0.6B和Llama3.2-1B）结合LoRA微调，对Python代码生成进行了RLVR的实证研究。在多种奖励公式（如：仅单元测试奖励、通过Ruff linter的仅静态分析塑形、以及组合奖励）下，我们比较了基于组的策略优化变体（GRPO和GSPO），并评估了功能正确性和行为诊断。在我们的实验设置中，RLVR在提出的组合奖励配置下将MBPP测试上的pass@1提高了最多13个百分点。然而，我们发现奖励塑形可能引发系统性行为偏移：仅使用静态分析惩罚可能使策略偏向于生成更短的代码，从而减少lint错误，但未能可靠地提高功能正确性。相比之下，组合奖励减轻了这种退化，并在正确性和风格约束之间产生了更稳定的权衡。总体而言，我们的结果强调，RLVR在代码生成中的有效性对奖励设计和优化粒度高度敏感，并且除了pass@1之外的诊断指标（包括生成长度、Ruff严重性分布和执行错误类型）有助于识别失败模式。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) trains language models using programmatically checkable signals such as unit-test outcomes, enabling direct optimization for functional correctness in code generation. We conduct an empirical study of RLVR for Python code generation on the MBPP benchmark using two small models (Qwen3-0.6B and Llama3.2-1B) with LoRA fine-tuning. Across multiple reward formulations such as: unit-test-only rewards, static-analysis-only shaping via the Ruff linter, and a combined reward, we compare group-based policy optimization variants (GRPO and GSPO) and evaluate both functional correctness and behavioral diagnostics. In our experimental setting, RLVR improves pass@1 on MBPP test by up to 13 percentage points under proposed combined reward configuration. However, we find that reward shaping can induce systematic behavioral shifts: using only static-analysis penalties may bias the policy toward shorter completions that reduce lint errors without reliably improving functional correctness. In contrast, combined rewards mitigate this degeneration and yield more stable trade-offs between correctness and style constraints. Overall, our results highlight that RLVR effectiveness for code generation is highly sensitive to reward design and optimization granularity, and that diagnostics beyond pass@1, including generation length, Ruff severity profiles, and execution error types are useful for identifying failure modes.

URL PDF HTML ☆

赞 0 踩 0

2605.30472 2026-06-01 cs.CL 版本更新

黑盒大语言模型蒸馏的有界行为不可区分性

Munawar Hasan

发表机构 * Michigan Technological University（密歇根技术大学）

AI总结针对黑盒LLM蒸馏，提出有界行为不可区分性形式化定义，并通过对抗评估揭示语义相似性不足以保证行为不可区分性。

详情

AI中文摘要

黑盒大语言模型蒸馏通常被评估为输出匹配问题：当学生模型的响应与教师模型在语义上相似或任务一致时，即认为学生模型成功。然而，输出相似性并不意味着学生模型与其模仿的模型在行为上不可区分。我们引入了有界行为不可区分性，形式化为在显式提示分布上的$(ε,q,t,\mathbb{A})$-行为不可区分性，其中$ε$限制区分优势，$q$限制预言机查询次数，$t$限制计算量，$\mathbb{A}$表示对手类别。我们在Qwen和Llama教师-学生对上使用受控的$5,000$提示行为探测套件实例化该概念。对于每个系列，我们比较教师模型与基础学生模型以及LoRA蒸馏学生模型，衡量蒸馏是否降低了可区分性而不仅仅是提高了相似性。LoRA将Qwen的语义相似性从$0.788$提升至$0.862$，Llama从$0.814$提升至$0.874$。然而，对抗评估揭示了剩余的行为差异：学习到的判别器保持非零优势，成对类别分析显示伪影集中在风格/格式、鲁棒性和领域技术提示中。成对教师识别对手证实了这一趋势。使用不同系列的Llama评判器和A/B交换一致性过滤，Qwen的区分优势从基础学生模型的$0.158$下降到LoRA蒸馏后的$0.081$。查询预算实验表明，分歧引导的采集并不始终优于分层随机采样，表明覆盖率和多样性仍然是强基线。我们的结果表明，语义保真度有用但不足：黑盒大语言模型蒸馏需要有界、对抗性和类别感知的评估。

英文摘要

Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as $(ε,q,t,\mathbb{A})$-behavioral indistinguishability over an explicit prompt distribution, where $ε$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled $5,000$-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from $0.788$ to $0.862$ for Qwen and from $0.814$ to $0.874$ for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from $0.158$ for the base student to $0.081$ after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.

URL PDF HTML ☆

赞 0 踩 0

2605.30434 2026-06-01 cs.LG cs.AI cs.CL cs.MA 版本更新

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS-Bench：关于长周期智能数据分析的失败

Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang

发表机构 * Zhejiang University（浙江大学）； Ant Group（蚂蚁集团）； Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph（知识图谱联合实验室）

AI总结提出LongDS基准，用于评估长周期多轮数据分析中智能体维护和更新分析状态的能力，发现最佳模型平均准确率仅48.45%，且长周期错误占失败原因的52%-69%。

Comments Ongoing work

详情

AI中文摘要

现实世界的数据分析本质上是迭代的，然而现有基准大多评估孤立或短期的交互任务，未能测试智能体在长周期内跟踪不断变化的分析上下文的能力。我们引入了LongDS，一个用于长周期、多轮数据分析的基准，其中智能体必须维护、更新、恢复和组合不断变化的分析状态。LongDS包含68个从真实世界Kaggle笔记本构建的任务，涵盖地球科学、商业和教育等六个领域的2,225轮交互。任务围绕状态演化模式（例如反事实扰动、回滚、多状态组合）设计，平均依赖跨度为11.3轮。评估五个最先进模型，我们发现最佳模型仅达到48.45%的平均准确率，性能从早期到后期轮次下降近47个百分点，长周期错误占失败原因的52%-69%。进一步分析表明，额外的智能体步骤并不一定能提高性能，这表明关键瓶颈在于维护正确的分析状态，而非增加交互预算。我们发布LongDS以支持可靠的长周期智能数据分析研究。代码和数据将在https://github.com/zjunlp/DataMind发布。

英文摘要

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%--69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.

URL PDF HTML ☆

赞 0 踩 0

2605.30415 2026-06-01 cs.CL cs.AI 版本更新

Domain Adaptation and Reasoning Frameworks in Language Models: A Controlled Experiment with Historical Cosmology

语言模型中的领域适应与推理框架：以历史宇宙学为受控实验

Francesco De Bernardis

发表机构 * Independent Researcher（独立研究者）

AI总结通过历史宇宙学受控实验，研究领域适应如何重塑语言模型的解释行为，发现适应主要改变解释框架而非直接改变立场。

Comments 17 pages, 3 figures

详情

外部观察者的必要性：充分性差距的形式化——序列模型中混合可识别性与上下文基础化的数学扩展

Francesco Corielli

AI总结本文通过构建二元混合过程，形式化了由未观测隐状态导致的充分性差距，并引入辅助信号建立上下文主导阈值，证明温度缩放无法弥补缺失上下文，而外部观察者或验证器在高风险领域是必要的。

详情

AI中文摘要

我们构建了一个二元混合过程，其中一个确定性文本机制和一个随机机制由未观测的隐状态控制。即使一个理想的无容量限制的序列预测器能够精确恢复纯文本边际分布，当观测到的前缀与错误的隐状态兼容时，它也可能变得过度自信。由此产生的熵差并非普通的优化误差；而是由未观测状态上的边缘化导致的充分性差距。然后，我们通过一个保真度为$γ∈[1/2,1]$的辅助二元信号形式化检索、工具使用和外部基础化。由此产生的贝叶斯更新给出了一个上下文主导阈值：当纠正信号的保真度超过纯文本后验权重中分配给误导机制的部分时，该信号恰好反转由文本历史诱导的后验几率。该阈值减小了充分性差距，但通常不能完全消除；完全消除需要相关隐状态的完美揭示或等效的验证机制。该分析阐明了为什么温度缩放无法恢复缺失的上下文，为什么基础化机制必须既信息丰富又可被模型学习使用，以及为什么在高风险领域自主序列模型需要结构上解耦的观察者或验证器。

英文摘要

We construct a binary mixed-regime process with one deterministic textual regime and one random regime governed by an unobserved latent state. Even an ideal infinite-capacity sequence predictor that exactly recovers the text-only marginal law can become overconfident when the observed prefix is compatible with the wrong latent regime. The resulting entropy difference is not an ordinary optimization error; it is a sufficiency gap caused by marginalization over an unobserved state. We then formalize retrieval, tool use, and external grounding through an auxiliary binary signal with fidelity $γ\in [1/2,1]$. The resulting Bayesian update yields a contextual dominance threshold: a corrective signal reverses the posterior odds induced by the textual history exactly when its fidelity exceeds the text-only posterior weight assigned to the misleading regime. This threshold reduces, but does not generally eliminate, the sufficiency gap; complete closure requires perfect revelation of the relevant latent state or an equivalent verification mechanism. The analysis clarifies why temperature scaling cannot restore missing context, why grounding mechanisms must be both informative and learnably usable by the model, and why autonomous sequence models require structurally decoupled observers or verifiers in high-stakes domains.

URL PDF HTML ☆

赞 0 踩 0

2605.26396 2026-06-01 cs.AI cs.CL cs.LG 版本更新

Advancing Creative Physical Intelligence in Large Multimodal Models

推进大型多模态模型中的创造性物理智能

Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Emre Can Acikgoz, Bingxuan Li, Kunlun Zhu, Jiateng Liu, Aditi Tiwari, Zhenhailong Wang, Xiusi Chen, Mahdi Namazifar, Heng Ji

发表机构 * UIUC（伊利诺伊大学香槟分校）； Amazon（亚马逊）

AI总结针对大型多模态模型在开放式环境中缺乏基于视觉的创造性工具使用能力的问题，提出MM-CreativityBench基准和基于偏好学习的具身对齐方法，显著提升实体选择并减少幻觉。

Comments 51 Pages, 9 Figures, 7 Tables, Previous Work CreativityBench: arXiv:2605.02910

详情

AI中文摘要

大型多模态模型（LMMs）在感知和推理方面取得了快速进展；然而，目前尚不清楚这些能力是否能够泛化到在开放式环境中发现基于视觉的解决方案，超越模式识别。在此类场景中，智能需要的不仅仅是回答明确的问题：它涉及识别场景中的元素如何以非显而易见但物理上可行的方式被重新利用。这种创造性问题解决形式是人类智能的核心，但在当前基准测试中基本上未得到测试。为了评估这一能力，我们引入了MM-CreativityBench，这是一个用于在视觉丰富、物理受限的环境中进行基于可操作性的创造性工具使用的基准。每个实例呈现一个场景图像，包含候选实体及其部件的结构化视图，从而能够对模型如何迭代检查场景、识别相关可操作性以及组合视觉和物理上可行的解决方案进行细粒度、交互式评估。我们的实验表明，当前的LMMs往往表现不佳，不是由于缺乏生成能力，而是因为它们无法维持基于具身的探索。模型经常忽略相关实体，对关键部件检查不足，或幻觉出图像中不存在的属性。受此失败模式的启发，我们提出了具身对齐，将创造性工具使用视为一个偏好学习问题。使用直接偏好优化，我们鼓励模型偏好基于视觉证据的属性-可操作性推理，而非幻觉替代方案。此外，我们结合从可操作性知识库中获得的监督，以指导更广泛的实体探索和多轮规划。我们的结果显示，在正确选择实体和部件方面取得了持续改进，同时大幅减少了幻觉和与具身相关的错误。

英文摘要

Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.

URL PDF HTML ☆

赞 0 踩 0

2605.30018 2026-06-01 cs.CL cs.LG 版本更新

Latent Performance Profiling of Large Language Models

大型语言模型的潜在性能剖析

Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti, Supratik Chakraborty, Partha Pratim Das, Lipika Dey, Richa Singh, Mayank Vatsa

发表机构 * Department of Electrical Engineering, Indian Institute of Technology Delhi（印度理工学院德里分校电子工程系）； Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi（印度理工学院德里分校人工智能学院）； Hewlett Packard Enterprise, India（印度惠普企业公司）； Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur（印度理工学院Khargapur分校计算机科学与工程系）； A.K.Choudhury School of Information Technology, University of Calcutta, India（印度加尔各答大学信息科技学院）； Department of Computer Science & Engineering, Indian Institute of Technology Bombay（印度理工学院孟买分校计算机科学与工程系）； Department of Computer Science, Ashoka University, India（阿什oka大学计算机科学系）； Department of Computer Science & Engineering, Indian Institute of Technology Jodhpur（印度理工学院朱罗普分校计算机科学与工程系）

AI总结提出潜在性能剖析（LPP）框架，通过隐藏激活和输出分布提取任务无关的诊断指标，揭示模型内在特性，补充传统基准评估。

详情

AI中文摘要

大型语言模型（LLMs）在标准化基准测试中经常取得令人印象深刻的分数，但仅凭准确性对能力的了解有限。通过排行榜评估开源LLMs面临持续的问题，如数据污染、任务范围狭窄以及与真实世界可靠性的弱对齐。基于基准的评估（如MMLU PRO、BBH或IFEval）主要捕捉模型在固定测试集上的输出，而非其如何处理信息、校准不确定性或构建内部知识。在本文中，我们主张从以基准为中心的评估转向对LLMs进行互补的、以状态为中心的内在评估。为此，我们引入了潜在性能剖析（LPP）——一个从隐藏激活和输出分布中提取任务无关诊断的框架。LPP在模型的潜在表示和动态上定义了一组标量指标，揭示了与规模无关的特征，从而实现可解释的比较并揭示隐藏的脆弱性。与静态准确性分数不同，LPP在相似规模的模型间提供稳定、对架构敏感的签名。通过对八个LLMs（规模范围0.5B-14B）的广泛实证分析，我们证明了具有相似基准分数的模型可能表现出对比的潜在特征，例如熵或适应性的差异。在这些见解的指导下，我们设计了用于不确定性和符号推理的合成探针，这些探针与内在指标一致，同时与排行榜偏差解耦。我们建议将LPP与基准一起报告，以提供对模型行为更深入、可解释的理解，从而实现更可靠的模型选择、安全评估以及超越表面准确性的评估。

英文摘要

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP) -- a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend that reporting LPP alongside benchmarks provides a deeper, interpretable understanding of model behavior, enabling more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.

URL PDF HTML ☆

赞 0 踩 0

2605.29751 2026-06-01 cs.CL 版本更新

DySem: Uncovering Dynamic Semantic Components of Large Language Models for Calculating Semantic Textual Similarity

DySem: 揭示大语言模型的动态语义组件以计算语义文本相似度

Kaijie Zheng, Weiqin Wang, Yile Wang, Hui Huang

发表机构 * College of Computer Science and Software Engineering（计算机科学与软件工程学院）

AI总结提出DySem框架，通过多语言共识提取大语言模型中与语义更相关的内部组件，并构建文本相关的联合语义集实现动态维度相似度计算，无需训练且性能优于基线。

Comments 18 pages, 23 figures, 5 tables

详情

AI中文摘要

计算语义文本相似度是自然语言处理中的基础任务。当前基于大语言模型（LLM）的方法通常依赖提取固定维度的最后一层隐藏状态来计算每对文本的相似度。我们认为这种范式存在两个局限：（i）最后一层隐藏层编码的是更通用的知识而非仅语义知识，因此对于语义相似度计算并非最优；（ii）LLM的隐藏层维度通常非常大，这引入了表示语义时的冗余和噪声。在这项工作中，我们提出DySem，一种新颖的无需训练框架，通过多语言共识探究LLM中更多与语义相关的内部组件，并摆脱静态表示空间，转而通过构建文本相关的联合语义集实现动态的、样本特定的语义维度，并在该共享维度子集上计算相似度。在各种LLM上的大量实验表明，我们的方法在保持较低相似度计算维度的同时，持续优于最近的基线。代码已发布在https://github.com/szu-tera/DySem。

英文摘要

Calculating semantic textual similarity is a foundational task in natural language processing. Current large language models (LLMs) based methods typically rely on extracting last-layer hidden states with fixed dimensions to compute similarity for every text pairs. We argue that this paradigm is suffer from two limitations: (i) The last hidden layer encodes more general knowledge rather than just semantic knowledge, making it suboptimal for semantic similarity computation; (ii) The hidden layer dimensions of LLMs are generally very large, which introduces some redundancy and noise for representing semantics. In this work, we propose DySem, a novel training-free framework that investigates more semantic-related internal components of LLMs via multilingual consensus, and shifts away from static representation spaces in favor of dynamic, sample-specific semantic dimensions by constructing text-dependent joint semantic set and computes similarity over this shared dimensional subset. Extensive experiments across various LLMs show that our method consistently outperforms recent baselines while maintaining lower dimensions for similarity calculation. The code is released at https://github.com/szu-tera/DySem.

URL PDF HTML ☆

赞 0 踩 0

2605.29511 2026-06-01 cs.MA cs.CL cs.LG 版本更新

DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration

DynaGraph: 通过动态拓扑重构的轻量级多模型交互框架

Yanxing Guo, Zihao Zheng, Fangzhou Wu, Ling Liang, Lin Bao, Zongwei Wang, Yimao Cai

发表机构 * Peking University（北京大学）； Nanjing University（南京大学）； Beijing Advanced Innovation Center for Integrated Circuits（北京集成电路先进制造创新中心）； Beijing University of Posts and Telecommunications（北京邮电大学）； Yanxin Co. Ltd（燕新有限公司）

AI总结提出DynaGraph框架，通过动态拓扑重构和PEFT适配器复用，在单消费级GPU上实现多模型协作，接近72B单模型推理能力并大幅降低延迟和token消耗。

详情

AI中文摘要

处理复杂推理任务通常依赖于庞大的单体LLM，这会导致严重的计算冗余。虽然通过结构化流水线或多智能体协作进行任务分解提供了替代方案，但这些方法不可避免地陷入一个关键困境：预定义的静态拓扑极易受到级联错误的影响，而无约束的动态智能体则面临轨迹发散和不可预测的内存膨胀。为了解决这个问题，我们提出了DynaGraph，一个由动态拓扑重构驱动的轻量级多模型框架。在执行层面，DynaGraph在共享基础模型上复用时分PEFT适配器，使得整个系统的训练和推理部署可以在单个消费级GPU上完成。在路由层面，评估器持续监控执行置信度以触发分层自愈：针对局部数据差距的细粒度修补和针对严重逻辑断裂的子图重构。在StrategyQA、MATH和FinQA上的实验表明，我们的8B模型接近72B单体模型的推理能力（例如，在StrategyQA上为87.6%，在MATH上为82.7%）。此外，与无约束的动态架构相比，它延迟降低了高达68.1%，token消耗降低了68.6%。

英文摘要

Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and unpredictable memory bloat. To address this, we present DynaGraph, a lightweight multi-model framework driven by dynamic topological reconfiguration. At the execution level, DynaGraph multiplexes time-division PEFT adapters over a shared base model, enabling both full system training and inference deployment on a single consumer-grade GPU. At the routing level, the Evaluator continuously monitors execution confidence to trigger hierarchical self-healing: Fine-grained Patching for localized data gaps and Subgraph Reconstruction for severe logical ruptures. Experiments on StrategyQA, MATH, and FinQA demonstrate our 8B model closely approximates the reasoning capabilities of a 72B monolithic model (e.g., 87.6% on StrategyQA, 82.7% on MATH). Furthermore, it reduces latency by up to 68.1% and token consumption by 68.6% compared to unconstrained dynamic architectures.

URL PDF HTML ☆

赞 0 踩 0

2605.29343 2026-06-01 cs.CL 版本更新

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Draft-OPD：用于推测草稿模型的在线策略蒸馏

Haodi Lei, Yafu Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai AI Laboratory（上海人工智能实验室）； Tsinghua University（清华大学）； The Chinese University of Hong Kong（香港中文大学）； Peking University（北京大学）； Zhejiang University（浙江大学）

AI总结针对推测解码中草稿模型因离线训练与在线推理不匹配导致性能瓶颈的问题，提出Draft-OPD方法，通过目标辅助展开和重放验证暴露的错误位置实现在线策略蒸馏，在多种任务上实现超过5倍的无损加速。

详情

AI中文摘要

推测解码通过将目标模型与轻量级草稿模型配对，并行验证其提出的令牌，从而加速大型语言模型推理。构建草稿模型的常见方法（如EAGLE3或DFlash）是在目标生成轨迹上进行监督微调（SFT）。然而，我们观察到SFT很快达到平台期：草稿模型在测试数据上的接受长度停止提升。原因是离线到推理的不匹配：在SFT中，草稿模型从固定的目标生成轨迹学习，而在推测解码期间，它在其自身策略提出的块上进行评估。这激发了在线策略蒸馏（OPD），其中目标模型在草稿诱导的状态上监督草稿模型。然而，OPD对于草稿模型仍然困难，因为它们无法可靠地独立展开完整序列，而目标辅助生成使收集的序列遵循目标分布，从而消除了在线策略信号。因此，我们提出Draft-OPD，它使用目标辅助展开进行稳定延续，并从验证暴露的错误位置重放草稿。这使得草稿模型能够从接受和拒绝的提议中学习目标反馈，将训练集中在限制推测接受的草稿诱导错误上。实验表明，Draft-OPD在多种任务上对思考模型实现了超过5倍的无损加速，比EAGLE-3和DFlash分别提高了23%和13%。

英文摘要

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: In SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models, as they cannot reliably roll out complete sequences independently, whereas target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over $5\times$ lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23\% and 13\%.

URL PDF HTML ☆

赞 0 踩 0

2605.29317 2026-06-01 cs.CL 版本更新

FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-Tuning

FoRA: 基于Fisher正交秩适配的参数高效微调

Juneyoung Park, Seongbae Lee, Han-Sang Lee, Kyuho Lee, Minjae Kim, Seungheon Hyeon, Kiduk Kwon, Seongwan Kim, Jaeho Lee

发表机构 * OptAI Inc（OptAI公司）； LG Uplus

AI总结提出FoRA方法，通过Fisher信息选择信息层并在Stiefel流形上训练LoRA下投影，在减少参数预算的同时保持性能，优于LoRA和DoRA。

Comments EMNLP 2026

详情

AI中文摘要

参数高效微调（PEFT）主要关注LoRA及其面向精度的变体，而减少可训练参数的原始目标相对较少受到关注。我们引入了FoRA，通过减少适配层数而非适配器秩来重新审视这一目标。FoRA通过单次对角Fisher评分（训练成本低于1%）选择任务信息层，并在Stiefel流形上训练所选层的LoRA下投影，保持列正交性和有效秩。在五个LLaMA系列骨干网络上，FoRA在参数预算减半的情况下始终优于LoRA和DoRA，在参数数量为AdaLoRA四分之一时，精度差距在0.7-0.8个点以内。在来自LLaMA、Qwen3和Gemma系列的十二个骨干网络上的跨架构实验证实了从270M到32B参数的一致增益。两个组件超加性地结合：Fisher选择本身在相同预算下匹配秩缩减，而Stiefel约束提供了决定性的额外增益。

英文摘要

Parameter-efficient fine-tuning(PEFT) has largely focused on LoRA and its accuracy-oriented variants, leaving the original goal of reducing trainable parameters has receivedcomparatively little attention. We introduce FoRA, which revisits this goal by reducing the number of adapted layers rather than adapter rank. FoRA selects task-informative layers via a single-pass diagonal Fisher score (under 1% of training cost) and trains the LoRA down-projection at selected layers on the Stiefel manifold, preserving column orthonormality and effective rank. FoRA consistently outperforms LoRA and DoRA at half their parameter budget, and falls within 0.7-0.8 accuracy points of AdaLoRA at one-quarter its parameter count, across five LLaMA-family backbones. Cross-architecture experiments on twelve backbones from the LLaMA, Qwen3, and Gemma families confirm consistent gains from 270M to 32B parameters. The two components combine super-additively: Fisher selection alone matches rank reduction at the same budget, while the Stiefel constraint provides the decisive additional gain.

URL PDF HTML ☆

赞 0 踩 0

2605.29268 2026-06-01 cs.CL cs.AI cs.LG cs.NE 版本更新

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

进化搜索中的计算分配：从深度-广度到多臂老虎机

Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang, Haozheng Luo, Tianfan Fu, Aarthy Nagarajan

发表机构 * University of Notre Dame（诺丁汉大学）； Northeastern University（东北大学）； University of Massachusetts Amherst（马萨诸塞大学阿默斯特分校）； Southeast University（东南大学）； Northwestern University（西北大学）； Nanjing University（南京大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结针对LLM引导的进化搜索中固定预算的LLM调用分配问题，提出基于多臂老虎机的BaSE方法，通过跨并行轨迹分配调用，平均适应度提升12.3%。

详情

AI中文摘要

LLM引导的进化搜索（Evolve系统）在数学和组合任务上达到了最先进的结果，但现有系统通常只报告多次运行中的最佳结果，而未记录运行间的分布。我们询问如何分配固定的LLM调用预算，以及单次运行达到报告数字的可靠性如何。通过扫描五个模型和三个任务的深度-广度网格，我们识别出两个经验规律：一个适应度-计算包络线，其中能力排序主要取决于有效FLOPs；以及一个双线性深度-广度拟合，具有任务特定的交互；两者都受模型-任务能力门控。受这些规律启发，我们提出BaSE（基于老虎机的自进化），一种多臂老虎机，它在并行轨迹间分配LLM调用。在不改变模型、提示或评估器的情况下，BaSE在8个（模型，任务）单元上比最强的岛屿协议基线平均适应度提高12.3%，在方差高的设置上增益最大：仅通过分配实现可靠性提升。

英文摘要

LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.

URL PDF HTML ☆

赞 0 踩 0

2605.29146 2026-06-01 cs.CL cs.AI 版本更新

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

SafeRx-Agent: 基于知识的多智能体框架用于安全且可解释的药物推荐

Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu, Qincheng Lu, Ziyu Zhao, Jijun Chi, Jingrui Tian, Xiao-Wen Chang, Ziyang Song

发表机构 * McGill University（麦吉尔大学）； McMaster University（麦马斯特大学）； University of Toronto（多伦多大学）； Ohio University（俄亥俄大学）

AI总结提出SafeRx-Agent，一种基于知识的多智能体框架，通过患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合，在MIMIC-III和MIMIC-IV数据集上提高了细粒度药物预测准确性，同时控制了药物相互作用、禁忌症和药物集合大小。

详情

AI中文摘要

药物推荐预测患者就诊时的用药，但现有方法仍面临两个关键挑战。在模型层面，传统药物推荐方法仅预测结构化的药物代码，证据基础有限，而LLM智能体可以利用更丰富的临床上下文，但可能缺乏安全验证和可追溯性。在任务层面，现有基准通常使用宽泛的药物类别，忽略了亚组级别的安全性差异，可能导致风险高估。我们引入了基于第四级ATC代码生成的第一个细粒度药物推荐设置。我们提出了安全处方智能体（SafeRx-Agent），一种基于知识的多智能体框架，利用患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合。在MIMIC-III和MIMIC-IV数据集上的实验结果表明，SafeRx-Agent提高了细粒度药物预测准确性，同时控制了药物相互作用、禁忌症和药物集合大小。

英文摘要

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

URL PDF HTML ☆

赞 0 踩 0

2605.28836 2026-06-01 cs.CL cs.AI 版本更新

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

不让任何读者掉队：人人能理解的多智能体摘要

Jimin Jung, MyoungJin Kim, Jaehyung Seo, Heuiseok Lim

发表机构 * Department of Computer Science and Engineering, Korea University（韩国大学计算机科学与工程系）； Department of Computer Science and Engineering, Konkuk University（konkuk大学计算机科学与工程系）

AI总结提出NRLB多智能体框架，通过模拟三类读者群体并结合模板规划与迭代优化，生成既忠实又易于理解的平实语言摘要。

详情

AI中文摘要

美国的《平实语言法案》要求政府文件使用清晰、简单的语言，以便公众易于理解，但现有的摘要系统难以应对普通读者中多样化的语言和认知障碍。我们提出了NRLB（不让任何读者掉队），一个用于平实语言摘要的多智能体框架，它模拟了三类代表性读者群体：小学生读者、非母语读者和注意力缺陷读者。NRLB结合了基于模板的规划与迭代的、面向读者的优化，能够系统地检测和解决难懂术语、缺失上下文和令人困惑的句子。在多个数据集上的评估显示，在保持事实准确性的同时，可读性持续提升。人工评估进一步验证了NRLB的效果，标注者偏好率在55%到76%之间，突显了NRLB在生成既忠实于原文又广泛适用于公众的平实语言摘要方面的潜力。

英文摘要

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

URL PDF HTML ☆

赞 0 踩 0

2603.09632 2026-06-01 cs.CV cs.CL 版本更新

X-GS: An Extensible Framework for Perceiving and Thinking via 3D Gaussian Splatting

X-GS：基于3D高斯溅射的感知与思考可扩展框架

Yueen Ma, Zenglin Xu, Irwin King

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Fudan University（复旦大学）； Shanghai Academy of AI for Science（上海人工智能科学研究院）

AI总结提出X-GS框架，包含感知器和思考器，统一多种3DGS技术实现实时在线SLAM与语义蒸馏，并支持多模态模型完成下游任务。

详情

AI中文摘要

3D高斯溅射（3DGS）已成为新颖视图合成的强大技术，随后扩展到众多空间AI应用。然而，大多数现有3DGS方法孤立运行，专注于特定领域。本文介绍X-GS，一个包含两个主要组件的可扩展框架。X-GS-感知器统一了广泛的3DGS技术，以实现具有语义蒸馏的实时在线SLAM。X-GS-思考器容纳多模态模型，使其能够与感知器无缝交互以完成下游任务。在我们的X-GS实现中，感知器利用最新的视觉基础模型提高在线SLAM性能，并采用三种关键机制加速语义蒸馏。思考器可以基于对比和生成视觉语言模型构建，并利用感知器的语义高斯溅射解锁3D视觉定位和场景描述等功能。在多个基准上的实验结果表明了X-GS框架的高效性和新解锁的多模态能力。

英文摘要

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods operate in isolation, focusing on specific domains. In this paper, we introduce X-GS, an extensible framework consisting of two major components. The X-GS-Perceiver unifies a broad range of 3DGS techniques to enable real-time online SLAM with semantic distillation. The X-GS-Thinker accommodates multimodal models, enabling them to seamlessly interface with the Perceiver to complete downstream tasks. In our implementation of X-GS, the Perceiver leverages the latest vision foundation models to improve online SLAM performance and employs three key mechanisms to accelerate semantic distillation. The Thinker can be built upon both contrastive and generative vision-language models and utilizes the Perceiver's semantic Gaussian splats to unlock capabilities such as 3D visual grounding and scene captioning. Experimental results on diverse benchmarks demonstrate the efficiency and newly unlocked multimodal capabilities of the X-GS framework.

URL PDF HTML ☆

赞 0 踩 0

2602.10388 2026-06-01 cs.CL cs.AI 版本更新

大部分地理空间网络搜索超越了传统GIS

Ilya Ilyankou, Stefano Cavazzi, James Haworth

发表机构 * SpaceTimeLab（空间时间实验室）； Department of Civil, Environmental, and Geomatic Engineering（土木、环境与测绘工程系）； UCL（伦敦大学学院）

AI总结通过密集句子嵌入、SetFit分类器和密度聚类，在MS MARCO语料库中发现18%的查询具有地理空间性质，并构建了88类分类体系，揭示地理搜索以事务性和实用性查询为主，多数超出传统GIS和知识图谱范围。

详情

AI中文摘要

网络搜索查询涉及地点的频率远高于现有标注方案所表明的，然而地理空间网络搜索查询的景观——人们对地点的询问内容及其频率——在大规模上仍然缺乏特征描述。我们对包含101万条真实必应查询的完整MS MARCO语料库应用密集句子嵌入、轻量级SetFit分类器和基于密度的聚类，无需预先过滤地名或空间关键词，识别出181,827条地理空间查询（18.0%），几乎是原始标注中标记为“位置”的6.17%的三倍。由此产生的88个查询类别分类体系揭示，地理空间网络搜索以事务性和实用性查询为主：仅成本和价格就占地理空间查询的15.3%，几乎是整个自然地理主题规模的两倍。这些活动中的大部分——成本、营业时间、联系方式、天气、旅行推荐——超出了传统GIS和知识图谱旨在服务的范围。这些类别在它们所接受的答案类型上差异很大，从可由空间数据库或知识图谱回答的确定性查询，到需要生成式或实时系统的评估性或时间波动性查询。我们讨论了对混合检索架构以及大型语言模型中地理推理基准的启示。我们公开发布了标注数据集、分类器和分类体系。

英文摘要

Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We apply dense sentence embeddings, a lightweight SetFit classifier, and density-based clustering to the full MS MARCO corpus of 1.01 million real Bing queries without prior filtering for toponyms or spatial keywords, identifying 181,827 geospatial queries (18.0%), nearly threefold the 6.17% labelled as Location in the original annotations. The resulting taxonomy of 88 query categories reveals that geospatial web search is dominated by transactional and practical lookups: costs and prices alone account for 15.3% of geospatial queries, nearly twice the size of the entire physical geography theme. Much of this activity - costs, opening hours, contact details, weather, travel recommendations - falls outside the scope of what traditional GIS and knowledge graphs are built to serve. The categories vary substantially in the kind of answer they admit, from deterministic lookups answerable from spatial databases or knowledge graphs to evaluative or temporally volatile queries that require generative or real-time systems. We discuss implications for hybrid retrieval architectures and for benchmarks of geographic reasoning in large language models. We openly release the labelled dataset, classifier, and taxonomy.

URL PDF HTML ☆

赞 0 踩 0

2602.03216 2026-06-01 cs.CL cs.LG 版本更新

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

Token Sparse Attention: 交错令牌选择的高效长上下文推理

Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

发表机构 * Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea（电气电子工程系，首尔国立大学，首尔，韩国）

AI总结提出Token Sparse Attention，一种轻量级动态令牌级稀疏化机制，通过交错选择令牌并在注意力前后压缩/解压缩，实现高效长上下文推理，在128K上下文中获得高达3.23倍加速且精度损失小于1%。

Comments ICML 2026

详情

AI中文摘要

注意力的二次复杂度仍然是大语言模型长上下文推理的核心瓶颈。先前的加速方法要么使用结构化模式稀疏化注意力图，要么在特定层永久驱逐令牌，这可能会保留不相关的令牌或依赖不可逆的早期决策，尽管令牌重要性具有层/头动态性。在本文中，我们提出Token Sparse Attention，一种轻量级动态令牌级稀疏化机制，在注意力期间将每个头的$Q$、$K$、$V$压缩到减少的令牌集，然后将输出解压缩回原始序列，使得令牌信息可以在后续层中重新考虑。此外，Token Sparse Attention在令牌选择和稀疏注意力的交叉点上暴露了一个新的设计点。我们的方法完全兼容密集注意力实现，包括Flash Attention，并且可以无缝地与现有的稀疏注意力内核组合。实验结果表明，Token Sparse Attention持续改善精度-延迟权衡，在128K上下文中实现高达3.23倍的注意力加速，且精度下降小于1%。这些结果表明，动态和交错的令牌级稀疏化是可扩展长上下文推理的一种互补且有效的策略。

英文摘要

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves accuracy-latency trade-off, achieving up to $\times$3.23 attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.

URL PDF HTML ☆

赞 0 踩 0

2508.21762 2026-06-01 cs.CL cs.AI 版本更新

Reasoning-Intensive Regression

推理密集型回归

Diane Tchuindjo, Omar Khattab

发表机构 * Massachusetts Institute of Technology（麻省理工学院）

AI总结针对推理密集型回归任务，提出MENTAT方法，结合批量反思提示优化与神经集成学习，在基准测试中相比基线提升高达65%。

详情

AI中文摘要

AI研究人员和从业者越来越多地将大型语言模型（LLMs）应用于我们称之为推理密集型回归（RiR）的任务，即从文本中推断细微的数值分数。与情感分析或相似性分析等标准语言回归任务不同，RiR通常出现在临时应用中，例如基于评分标准的评分、复杂环境中的密集奖励建模或特定领域的检索，这些任务需要对上下文进行更深入的分析，而可用的任务特定训练数据和计算资源有限。我们将四个实际问题作为RiR任务，建立初始基准，并用于测试我们的假设：即冻结的LLMs和通过梯度下降微调Transformer编码器在RiR中通常都会遇到困难。然后，我们提出MENTAT，一种简单轻量的方法，结合批量反思提示优化与神经集成学习。MENTAT在两个基线上实现了高达65%的提升，尽管未来仍有很大的改进空间。

英文摘要

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks such as sentiment or similarity analysis, RiR often appears instead in ad-hoc applications such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances.

URL PDF HTML ☆

赞 0 踩 0

2604.21928 2026-06-01 cs.CL 版本更新

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

使用生成式大语言模型评估自动语音识别

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour

发表机构 * Idiap Research Institute（Idiap研究 institute）； EPFL（瑞士联邦理工学院）； Brno University of Technology（布拉格技术大学）； Avignon University（阿维尼翁大学）； Le Mans University（勒曼大学）； Nantes University（南特大学）

AI总结本文提出利用生成式大语言模型通过假设选择、语义距离计算和错误分类三种方法评估ASR，在HATS数据集上达到92-94%的人类一致性，优于WER和语义指标。

2604.16278 2026-06-01 cs.AI cs.CL cs.LG 版本更新

为什么你不知道？评估不确定性来源对大型语言模型中不确定性量化的影响

Maiya Goloburda, Roman Vashurin, Fedor Chernogorskii, Nurkhan Laiyk, Daniil Orel, Preslav Nakov, Maxim Panov

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

AI总结本文通过引入一个明确分类不确定性来源的新数据集，系统评估了现有不确定性量化方法在不同不确定性来源下的表现，发现多数方法在模型知识局限下表现良好，但在其他来源下性能下降或产生误导。

详情

AI中文摘要

随着大型语言模型（LLM）在现实世界应用中的日益普及，可靠的不确定性量化（UQ）对于安全有效使用变得至关重要。现有的大多数语言模型UQ方法旨在产生单一的置信度分数——例如，估计模型答案正确的概率。然而，自然语言任务中的不确定性源于多个不同的来源，包括模型知识差距、输出可变性和输入歧义，这些对系统行为和用户交互有不同的影响。在这项工作中，我们研究了不确定性来源如何影响现有UQ方法的行为和有效性。为了进行受控分析，我们引入了一个新数据集，该数据集明确分类了不确定性来源，允许系统评估每种条件下的UQ性能。我们的实验表明，虽然许多UQ方法在不确定性仅源于模型知识限制时表现良好，但当引入其他来源时，它们的性能会下降或变得具有误导性。这些发现强调了需要明确考虑大型语言模型中不确定性来源的不确定性感知方法。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score -- for example, estimating the probability that a model's answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.

URL PDF HTML ☆

赞 0 踩 0

2509.10078 2026-06-01 cs.CL cs.AI 版本更新

Human Psychometric Questionnaires Mischaracterize LLM Behavior

人类心理测量问卷误判LLM行为

Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University（首尔国立大学数据科学研究生院）； Department of Communication, Interdisciplinary Program in Artificial Intelligence, Seoul National University（首尔国立大学通信系人工智能交叉学科项目）

AI总结通过比较LLM在Likert问卷和生成概率上的价值与人格特征，发现问卷存在系统性偏差，提出基于生成概率的评估方法更准确。

Comments 38 pages, 6 figures

详情

AI中文摘要

我们检验了人类心理测量问卷是否可以作为可靠工具来表征和预测LLM在日常用户交互中的行为。我们分析了八个开源LLM，比较了从两种不同方法得出的价值和人格特征：基于既定问卷（PVQ-40/21和BFI-44/10）的Likert自我报告，以及对日常用户查询的价值负载响应的生成概率。两种特征显著不同。在生成概率中，常被引为LLM稳定倾向证据的构念内项目一致性消失了。我们将这一差距归因于既定问卷项目中的显式词汇线索使模型能够识别目标构念并以一致、社会期望的方式响应，而现实用户查询不提供此类线索。此外，人口统计角色提示以与真实人类模式一致的方式改变了模型对人类问卷的响应，但在对现实用户查询的响应生成概率中没有出现此类变化，表明它们在模拟目标人口统计在真实世界用户交互中的行为方面能力有限。总体而言，我们的研究表明，人类心理测量问卷不足以预测LLM行为，并建议基于生成的评估作为更准确的度量。

英文摘要

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.

URL PDF HTML ☆

赞 0 踩 0

2603.23160 2026-06-01 cs.CL 版本更新

UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities

UniDial-EvalKit：多面对话能力评估的统一工具包

Qi Jia, Haodong Zhao, Dun Pei, Xiujie Song, Ye Shen, Shibo Wang, Zijian Chen, Zicheng Zhang, Xiangyang Zhu, Guangtao Zhai

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出UniDial-EvalKit统一评估工具包，通过标准化数据格式、模块化流水线和层次化评分聚合，解决多轮交互场景下评估协议异构问题，并基于大规模实验揭示当前系统无一致最优、记忆智能体常不及全上下文基线的现象。

详情

AI中文摘要

在多轮交互场景中对大型语言模型（LLM）和智能体进行基准测试对于理解其实际能力至关重要。然而，现有的评估协议高度异构，在数据集格式、模型接口和评估流水线上差异显著，严重阻碍了系统比较。在这项工作中，我们提出了UniDial-EvalKit（UDE），一个用于评估交互式AI系统的统一评估工具包。UDE的核心贡献在于其整体统一性：它将异构数据格式标准化为通用模式，通过模块化架构简化复杂的评估流水线，并在层次化评分聚合下对齐指标计算。它还通过并行生成和评分以及检查点恢复来支持高效的大规模评估，消除冗余计算。利用UDE，我们在多个多维基准上进行了广泛评估。我们的实证分析表明，没有单一系统在所有基准上持续优于其他系统，而当前的记忆智能体通常无法超越全上下文基线。进一步的分析指出了几个未来方向，包括基准去重和更自适应的记忆架构。

英文摘要

Benchmarking large language models (LLMs) and agents in multi-turn interactive scenarios is essential for understanding their practical capabilities. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which severely impedes systematic comparison. In this work, we present UniDial-EvalKit (UDE), a unified evaluation toolkit for assessing interactive AI systems. The core contribution of UDE lies in its holistic unification: it standardizes heterogeneous data formats into a universal schema, streamlines complex evaluation pipelines through a modular architecture, and aligns metric calculations under a hierarchical scoring aggregation. It also supports efficient large-scale evaluation through parallel generation and scoring, as well as checkpoint resume to eliminate redundant computation. Leveraging UDE, we conduct an extensive evaluation across diverse multi-dimensional benchmarks. Our empirical analysis shows that no single system consistently outperforms others across all benchmarks, while current memory agents often fail to surpass full-context baselines. Further analyses highlight several future directions, including benchmark deduplication and more adaptive memory architectures.

URL PDF HTML ☆

赞 0 踩 0

2511.11440 2026-06-01 cs.CV cs.CL 版本更新

Synthetic Stimuli, Real Gains: Rethinking VLM Fine-Tuning Through Fully Controlled Data Generation

合成刺激，真实收益：通过完全受控的数据生成重新思考VLM微调

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

发表机构 * Signals and Interactive Systems Lab, University of Trento（信号与交互系统实验室，特伦托大学）

AI总结本文提出一种完全受控的数据生成与标注流程，用于微调视觉语言模型（VLM），通过平衡分布和干净标注消除偏差，在空间推理任务上仅用130个样本即可实现均匀性能，并在真实世界数据上提升13%的性能。

详情

AI中文摘要

通过微调获得的视觉语言模型（VLM）的性能提升通常基于对真实世界场景的临时数据收集和标注。尽管有所改进，但这一过程往往容易受到偏差、错误和分布不平衡的影响，导致过拟合和性能不平衡。虽然少数研究探索了合成数据生成，但它们通常缺乏对数据分布和标注质量的控制。在这项工作中，我们通过探索完全受控的数据生成和标注流程，重新评估了模型微调的潜力，获得了具有平衡分布和干净标注的无偏差数据。以识别物体绝对位置的空间推理任务作为用例，我们微调了最先进的VLM，并在合成和真实世界基准上进行了详尽的评估，包括对真实世界场景的可迁移性。我们的实验揭示了两个关键发现：1）在平衡数据上微调可以在视觉场景中产生均匀的性能，并且仅用130个样本就能缓解常见偏差；2）在合成刺激上微调使真实世界数据（COCO）的性能提升了13%，优于在完整COCO训练集上微调的模型。

英文摘要

Performance gains of Vision Language Models (VLMs) obtained by fine-tuning are generally based on ad hoc data collection and annotation of real-world scenes. Despite the improvements, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have explored synthetic data generation, they typically lack control over data distribution and annotation quality. In this work, we re-evaluate the potential of model fine-tuning by exploring a fully controlled data generation and annotation pipeline, obtaining bias-free data with balanced distribution and clean annotations. Using the spatial reasoning task of identifying the absolute position of an object as a use case, we fine-tune state-of-the-art VLMs and conduct exhaustive evaluations on both synthetic and real-world benchmarks, including transferability to real-world scenes. Our experiments reveal two key findings: 1) fine-tuning on balanced data yields uniform performance across the visual scene and mitigates common biases with as few as 130 samples; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.

URL PDF HTML ☆

赞 0 踩 0

2510.25110 2026-06-01 cs.CL 版本更新

权重到代码：从离散Transformer中提取可解释算法

Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

发表机构 * Key Laboratory of High Confidence Software Technology (PKU), MOE, Beijing, China（高可信软件技术重点实验室（PKU），教育部，北京，中国）； School of Computer Science, Peking University, Beijing, China（计算机学院，北京大学，北京，中国）； Shanghai AI Lab, Shanghai, China（上海人工智能实验室，上海，中国）； Shanghai Innovation Institute, Shanghai, China（上海创新研究院，上海，中国）

AI总结提出离散Transformer架构，通过温度退火采样注入离散性，结合假设检验和符号回归从模型权重中提取可解释算法，在离散任务上性能与RNN基线相当，并扩展到连续中间计算任务。

详情

AI中文摘要

算法提取旨在直接从算法任务训练的模型中合成可执行程序，从而无需依赖人工编写的目标程序即可从权重中重新发现可执行机制。然而，将此范式应用于Transformer时，由于表示纠缠（例如叠加），其中重叠方向编码的特征严重阻碍了符号表达式的恢复。我们提出了离散Transformer，这是一种专门设计用于弥合连续表示与离散符号逻辑之间差距的架构。通过温度退火采样注入离散性，我们的框架有效利用假设检验和符号回归来提取人类可读的程序。实验表明，离散Transformer在共享离散任务上实现了与基于RNN的MIPS基线相当的性能，同时将提取扩展到具有连续值中间计算的任务。最后，我们展示了架构归纳偏置对合成程序提供了细粒度控制，使离散Transformer成为算法提取和Transformer可解释性的可控测试平台。

英文摘要

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo recovery of executable mechanisms from weights without relying on human-written target programs. However, applying this paradigm to Transformer is complicated by representation entanglement (e.g., superposition), where features encoded in overlapping directions substantially hinder the recovery of symbolic expressions. We propose the Discrete Transformer, an architecture explicitly designed to bridge the gap between continuous representations and discrete symbolic logic. By injecting discreteness through temperature-annealed sampling, our framework effectively leverages hypothesis testing and symbolic regression to extract human-readable programs. Empirically, the Discrete Transformer achieves performance comparable to the RNN-based MIPS baseline on shared discrete tasks, while broadening extraction to tasks with continuous-valued intermediate computations. Finally, we show that architectural inductive biases provide fine-grained control over synthesized programs, establishing the Discrete Transformer as a controllable testbed for algorithm extraction and Transformer interpretability.

URL PDF HTML ☆

赞 0 踩 0

2603.13875 2026-06-01 cs.CL cs.LG 版本更新

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

GradMem: 通过测试时梯度下降将上下文写入记忆

Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev

发表机构 * AXXX, Cognitive AI Systems Lab, Moscow, Russia（AXXX认知人工智能系统实验室，莫斯科，俄罗斯）； London Institute for Mathematical Sciences, London, UK（伦敦数学科学研究所，伦敦，英国）

AI总结提出GradMem方法，利用测试时梯度下降将上下文写入紧凑记忆状态，通过自监督重构损失优化记忆令牌，在键值检索和自然语言任务上优于前向式记忆写入方法。

Comments International Conference on Machine Learning (ICML) 2026

详情

AI中文摘要

许多大型语言模型应用需要基于长上下文进行条件生成。Transformer通常通过存储每层过去激活的KV缓存来支持这一点，这会产生大量内存开销。一种理想的替代方案是压缩记忆：一次性读取上下文，将其存储在紧凑状态中，并从该状态回答许多查询。我们在上下文移除设置中研究这一点，其中模型在推理时无法访问原始上下文的情况下必须生成答案。我们引入了GradMem，它通过每个样本的测试时优化将上下文写入记忆。给定一个上下文，GradMem在保持模型权重冻结的情况下，对一小部分前缀记忆令牌执行几步梯度下降。GradMem显式优化模型级的自监督上下文重构损失，从而产生带有迭代纠错的损失驱动写入操作，这与仅前向方法不同。在关联键值检索中，GradMem在相同记忆大小下优于仅前向记忆写入器，并且额外的梯度步长比重复的前向写入更有效地扩展容量。我们进一步表明，GradMem可以迁移到合成基准之外：使用预训练语言模型，它在自然语言任务（包括bAbI和SQuAD变体）上取得了有竞争力的结果，仅依赖于记忆中的编码信息。

英文摘要

Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key--value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.

URL PDF HTML ☆

赞 0 踩 0

2602.12192 2026-06-01 cs.CL 版本更新

Query-focused and Memory-aware Reranker for Long Context Processing

面向长上下文处理的查询聚焦与记忆感知重排序器

Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Yanyu Chen, Zheng Lin, Wei Zhang, Jie Zhou

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）； Pattern Recognition Center, WeChat AI, Tencent（腾讯微信人工智能研究院模式识别中心）； East China Normal University（华东师范大学）

AI总结提出一种基于大语言模型中检索头注意力分数的列表式重排序框架，利用候选列表整体信息估计段落-查询相关性，无需Likert监督，在多个领域取得最优性能。

Comments Add new experiments and compare more baselines

详情

AI中文摘要

基于现有对大语言模型中检索头的分析，我们提出了一种替代的重排序框架，该框架训练模型使用选定头的注意力分数来估计段落-查询相关性。这种方法提供了一种列表式解决方案，在排序过程中利用整个候选短列表中的整体信息。同时，它自然地产生连续的相关性分数，使得能够在任意检索数据集上进行训练，而无需Likert量表监督。我们的框架轻量且有效，仅需小规模模型（如3B参数）即可实现强性能。大量实验表明，我们的方法在多个领域（包括维基百科和长叙事数据集）中优于现有的最先进点式和列表式重排序器。它进一步在评估对话理解和记忆使用的LoCoMo基准上建立了新的最先进水平。我们还证明我们的框架支持灵活的扩展。例如，用上下文信息增强候选段落可进一步提高排序准确性，而训练中间层的注意力头可在不牺牲性能的情况下提高效率。

英文摘要

Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages the holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models, such as 3B parameters, to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark, which assesses dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.

URL PDF HTML ☆

赞 0 踩 0

2603.07751 2026-06-01 cs.CV cs.CL 版本更新

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

3ViewSense: 视觉-语言模型中基于正交视图的空间与心理视角推理

Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China（清华大学深圳国际研究生院）； School of Software Engineering, Chongqing University, Chongqing, China（重庆大学软件学院）

AI总结提出3ViewSense框架，通过正交视图的“模拟-推理”机制解决视觉-语言模型在空间推理中的视角一致性问题，显著提升遮挡计数和空间推理性能。

Comments Accepted to ICML 2026

详情

AI中文摘要

当前大型语言模型已达到奥林匹克级别的逻辑推理能力，然而视觉-语言模型在诸如积木计数等基础空间任务上却出人意料地表现不佳。这种能力不匹配揭示了一个关键的“空间智能鸿沟”，即模型无法从2D观测中构建连贯的3D心理表征。我们通过诊断分析发现，这一瓶颈在于缺乏视角一致的空间接口，而非视觉特征不足或推理能力薄弱。为弥合这一鸿沟，我们引入了 extbf{3ViewSense}框架，该框架将空间推理建立在正交视图之上。借鉴工程认知，我们提出了一种“模拟-推理”机制，将复杂场景分解为规范的正交投影以解决几何歧义。通过将自我中心感知与这些异中心参考对齐，我们的方法促进了显式的心理旋转与重建。在空间推理基准上的实验结果表明，我们的方法显著优于现有基线，在遮挡密集计数和视角一致空间推理上取得了一致的提升。该框架还提高了空间描述的稳定性和一致性，为多模态系统中更强的空间智能提供了一条可扩展的路径。~ ootnote{https://github.com/Jasaxion/3ViewSense}

英文摘要

Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical ``spatial intelligence gap,'' where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce \textbf{3ViewSense}, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a ``Simulate-and-Reason'' mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.~\footnote{https://github.com/Jasaxion/3ViewSense}

URL PDF HTML ☆

赞 0 踩 0

2510.15614 2026-06-01 cs.CL 版本更新

HypoSpace: A Diagnostic Benchmark for Set-Valued Hypothesis Generation under Underdetermination and Sublinear Coverage Bounds

HypoSpace: 欠定性与次线性覆盖边界下集合值假设生成的诊断基准

Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Anirudh Goyal, Yew-Soon Ong, Dianbo Liu

发表机构 * National University of Singapore（国立新加坡大学）； College of Computing（计算学院）； Data Science（数据科学）； Nanyang Technological University（南洋理工大学）； Centre for Frontier AI Research (CFAR)（前沿人工智能研究中心 (CFAR)）； Agency for Science, Technology and Research（科技研究局）； Meta Superintelligence Labs（Meta超智能实验室）

AI总结提出HypoSpace基准，通过三个结构化领域评估大语言模型在欠定假设空间中的采样能力，发现模型在高有效性下存在覆盖失败，并展示分层解码可部分缓解该问题。

详情

AI中文摘要

许多科学问题是欠定的：多个不同的假设与相同的观察结果一致。在这种情况下，有效的推理不仅需要产生有效的解释，还需要系统地探索和覆盖可接受的假设集。我们引入了HypoSpace，这是一个将大语言模型（LLMs）视为有限假设空间上的采样器，并在三个指标上评估它们的基准：有效性、唯一性和恢复率。HypoSpace涵盖三个结构化领域（因果图推理、重力约束的3D体素重建和布尔遗传相互作用建模），具有确定性验证器和精确可枚举的解空间，以及基于真实世界的案例研究。实验上，HypoSpace揭示了一种依赖于能力和规模的覆盖失败：随着可接受假设空间变得更大或更具组合性，模型可以保持高有效性，同时表现出降低的唯一性和恢复率。我们进一步表明，对分层解码的分析部分缓解了这种崩溃，证明了HypoSpace作为集合值推理的诊断基准的实用性。代码可在 https://github.com/CTT-Pavilion/_HypoSpace 获取。

英文摘要

Many scientific problems are underdetermined: multiple distinct hypotheses are equally consistent with the same observations. In such settings, effective inference requires not only producing valid explanations, but also systematically exploring and covering the admissible hypothesis set. We introduce HypoSpace, a benchmark that treats large language models (LLMs) as samplers over finite hypothesis spaces and evaluates them on three metrics: Validity, Uniqueness, and Recovery. HypoSpace spans three structured domains (causal graph inference, gravity-constrained 3D voxel reconstruction, and Boolean genetic interaction modeling) with deterministic validators and exactly enumerable solution spaces, plus real-world anchored case studies. Empirically, HypoSpace reveals a capability- and scale-dependent coverage failure: models can maintain high Validity while exhibiting reduced Uniqueness and Recovery as admissible hypothesis spaces become larger or more combinatorial. We further show that the analysis on stratified decoding partially mitigates this collapse, demonstrating HypoSpace's utility as a diagnostic benchmark for set-valued inference. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.

URL PDF HTML ☆

赞 0 踩 0

2408.10441 2026-06-01 cs.CL 版本更新

Goldfish: Monolingual Language Models for 350 Languages

Goldfish: 面向350种语言的单语语言模型

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

发表机构 * Department of Cognitive Science（认知科学系）； Department of Linguistics（语言学系）； Department of Computer Science（计算机科学系）； University of California San Diego（加州大学圣地亚哥分校）

AI总结针对低资源语言，发现大型多语言模型在基本语法生成上不如双元模型，而小规模单语模型在困惑度和语法性基准上表现更优，并发布了覆盖350种语言的1000多个单语模型套件Goldfish。

Comments LREC 2026

详情

AI中文摘要

对于许多低资源语言，唯一可用的语言模型是在多种语言上同时训练的大型多语言模型。尽管在推理任务上表现最先进，我们发现这些模型在许多语言的基本语法文本生成上仍然存在困难。首先，使用FLORES困惑度作为评估指标，大型多语言模型在许多语言上的表现甚至不如双元模型（例如，XGLM 4.5B中24%的语言；BLOOM 7.1B中43%的语言）。其次，当我们为350种语言训练仅有1.25亿参数、数据量不超过1GB的小型单语模型时，这些小型模型在困惑度和大规模多语言语法性基准上都优于大型多语言模型。为了促进未来低资源语言建模的工作，我们发布了Goldfish，一个包含超过1000个小型单语语言模型的套件，这些模型在350种语言上进行了可比训练。这些模型代表了其中215种语言首次公开可用的单语语言模型。

英文摘要

For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1GB or less data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350 languages. These models represent the first publicly-available monolingual language models for 215 of the languages included.

URL PDF HTML ☆

赞 0 踩 0

2603.04946 2026-06-01 cs.CL 版本更新

LocalSUG: City-Preference-Enhanced LLM for Query Suggestion in Local-Life Services

LocalSUG：面向本地生活服务的城市偏好增强大语言模型查询建议

Jinwen Chen, Shiwen Zhang, Shuai Gong, Zheng Zhang, Yachao Zhao, Lingxiang Wang, Haibo Zhou, Wei Lin, Hainan Zhang

发表机构 * Meituan（美团）

AI总结提出LocalSUG框架，通过挖掘城市偏好候选词注入提示、束搜索GRPO算法和加速解码，解决大语言模型在本地生活服务查询建议中城市偏好不足、偏好暴露偏差和延迟约束问题。

详情

AI中文摘要

在本地生活服务平台中，查询建议通过从输入前缀生成候选查询来减少用户操作。传统的多阶段系统严重依赖历史热门查询，限制了其捕捉长尾和新兴需求的能力。尽管大语言模型提供了强大的语义泛化能力，但它们在本地生活服务中的部署面临三个挑战：城市偏好感知不足、偏好优化中的暴露偏差以及严格的在线延迟约束。我们提出了LocalSUG，一个基于大语言模型的本地生活服务查询建议框架。LocalSUG从术语共现中挖掘城市偏好增强的候选词，并将其作为动态参考注入提示中，而非融合到模型参数中。这使得模型能够适应变化的城市偏好（如商家开业或关闭），同时减少过时或本地无效的建议。我们进一步引入了束搜索驱动的GRPO算法，使训练与推理时解码对齐，并优化相关性以及业务导向的奖励。最后，质量感知的束加速和词汇剪枝在保持生成质量的同时降低了在线延迟。离线评估和大规模在线A/B测试表明，LocalSUG将点击率提高了0.35%，并将低/无结果率降低了3.98%，证明了其在实际部署中的有效性。

英文摘要

In local-life service platforms, query suggestion reduces user effort by generating candidate queries from input prefixes. Traditional multi-stage systems rely heavily on historical popular queries, limiting their ability to capture long-tail and emerging demand. Although LLMs provide strong semantic generalization, their deployment in local-life services faces three challenges: insufficient city-preference awareness, exposure bias in preference optimization, and strict online latency constraints. We propose LocalSUG, an LLM-based query suggestion framework for local-life services. LocalSUG mines city-preference-enhanced candidates from term co-occurrence and injects them into prompts as dynamic references rather than fusing them into model parameters. This allows the model to adapt to changing city preferences, such as merchant openings or closures, while reducing stale or locally invalid suggestions. We further introduce a beam-search-driven GRPO algorithm to align training with inference-time decoding and optimize relevance together with business-oriented rewards. Finally, quality-aware beam acceleration and vocabulary pruning reduce online latency while preserving generation quality. Offline evaluations and large-scale online A/B testing show that LocalSUG improves CTR by +0.35% and reduces the low/no-result rate by 3.98%, demonstrating its effectiveness in real-world deployment.

URL PDF HTML ☆

赞 0 踩 0

2602.24210 2026-06-01 cs.CL cs.AI 版本更新

From Leaky Thoughts to Private Reasoning: Controlling What LRMs Say to Themselves

从泄露思维到私有推理：控制LRM对自己说的话

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych

发表机构 * Mohamed bin Zayed University of Artificial Intelligence, UAE（穆罕默德·本·扎耶德人工智能大学，阿联酋）

AI总结针对大型推理模型（LRM）推理过程中隐私泄露问题，提出通过指令跟随（IF）训练和分阶段解码策略（Staged Decoding）增强隐私保护，在IF和隐私基准上分别提升高达20.9和51.9个百分点。

详情

AI中文摘要

大型推理模型（LRM）产生的推理轨迹（RT）通常包含敏感信息。这些泄露的思维难以控制，且经常违反明确的隐私指令。由于RT可能通过提示注入攻击暴露，这对用户构成了直接的隐私风险。我们将此视为一个可控性问题：由于隐私指令本身就是指令，在RT内改进指令跟随（IF）为减少隐私泄露提供了直接途径。为此，我们引入了一个SFT数据集，教会模型在其推理过程中遵循通用指令，并提出了分阶段解码（Staged Decoding），一种简单的解码策略，通过使用独立的LoRA适配器解耦RT和答案生成，以最大化每个组件的IF。我们在两个系列（1.7B-14B参数）的六个模型上，在两个IF基准和两个隐私基准上评估了我们的方法。我们的方法带来了显著的改进，在IF上提升高达20.9分，在隐私基准上提升51.9个百分点，尽管由于推理性能与IF之间的权衡，这些改进可能以牺牲任务效用为代价。我们的结果表明，改进LRM中的IF可以显著增强隐私，为未来隐私感知的LRM提供了一个有前景的方向。我们的代码可在https://github.com/UKPLab/arxiv2026-controllable-reasoning-models获取。

英文摘要

Large reasoning models (LRMs) produce reasoning traces (RTs) that often contain sensitive information. These leaky thoughts are difficult to control and frequently violate explicit privacy directives. Because RTs can be exposed through prompt injection attacks, this becomes a direct privacy risk to the user. We approach this as a controllability problem: since privacy directives are themselves instructions, improving instruction-following (IF) within the RT provides a direct path to reducing privacy leaks. To this end, we introduce an SFT dataset that teaches models to follow general instructions throughout their reasoning process, and propose Staged Decoding, a simple decoding strategy that decouples RT and answer generation using separate LoRA adapters to maximize IF of each component. We evaluate our approach on six models from two families (1.7B-14B parameters), across two IF benchmarks and two privacy benchmarks. Our method yields substantial improvements, with gains of up to 20.9 points in IF and 51.9 percentage points on privacy benchmarks, though these can come at the cost of task utility due to the trade-off between reasoning performance and IF. Our results show that improving IF in LRMs can significantly enhance privacy, suggesting a promising direction for future privacy-aware LRMs. Our code is available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models.

URL PDF HTML ☆

赞 0 踩 0

2602.19049 2026-06-01 cs.CL cs.LG 版本更新

*-PLUIE：使用大语言模型改进评估的可个性化度量

Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive

发表机构 * Univ Rennes, CNRS, IRISA, EXPRESSION（里尔大学、法国国家科学研究中心、IRISA、EXPRESSION）； Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004（南特大学、南特中央理工学院、法国国家科学研究中心、LS2N、UMR 6004）； Univ Rennes, CNRS, IRISA, SOTERN（里尔大学、法国国家科学研究中心、IRISA、SOTERN）； Univ of South Brittany, CNRS, IRISA, ARCHIMEDIA（布列塔尼南部大学、法国国家科学研究中心、IRISA、ARCHIMEDIA）

AI总结提出*-PLUIE，一种基于困惑度的可个性化LLM评判度量，通过任务特定提示变体实现与人类判断的更强相关性，同时保持低计算成本。

Comments Accepted at *SEM 2026

2602.15293 2026-06-01 cs.LG cs.AI cs.CL stat.ML 版本更新

The Information Geometry of Softmax: Probing and Steering

Softmax的信息几何：探测与引导

Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch

发表机构 * University of Chicago（芝加哥大学）

AI总结本文从信息几何角度研究AI系统如何将语义结构编码到表示空间的几何结构中，并提出一种利用线性探针鲁棒引导表示以展现特定概念的“双重引导”方法。

Comments Code is available at https://github.com/KihoPark/dual-steering

详情

Journal ref: In Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026

AI中文摘要

本文关注AI系统如何将语义结构编码到其表示空间的几何结构中的问题。动机观察是，这些表示空间的自然几何应反映模型使用表示产生行为的方式。我们聚焦于定义softmax分布的重要特例。在这种情况下，我们认为自然几何是信息几何。我们的重点是信息几何在语义编码和线性表示假设中的作用。作为一个说明性应用，我们开发了“双重引导”，一种利用线性探针鲁棒地引导表示以展现特定概念的方法。我们证明双重引导在最小化对非目标概念改变的同时，最优地修改目标概念。实验上，我们发现双重引导增强了概念操控的可控性和稳定性。

英文摘要

This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.

URL PDF HTML ☆

赞 0 踩 0

2602.13069 2026-06-01 cs.LG cs.CL 版本更新

Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning

面向设备上大语言模型微调的内存高效结构化反向传播

Juneyoung Park, Yuri Hong, Seongwan Kim, Jaeho Lee

发表机构 * OptAI Inc.（OptAI公司）

AI总结提出MeSP方法，通过手动推导利用LoRA低秩结构的反向传播，在计算数学等价梯度的同时平均减少49%内存，使内存受限设备上的微调成为可能。

Comments ACL2026

详情

AI中文摘要

设备上微调能够实现大语言模型的隐私保护个性化，但移动设备存在严重的内存限制，通常所有工作负载共享6-12GB内存。现有方法迫使在高内存的精确梯度（MeBP）和低内存的噪声估计（MeZO）之间进行权衡。我们提出内存高效结构化反向传播（MeSP），通过手动推导利用LoRA低秩结构的反向传播来弥合这一差距。我们的关键洞察是，中间投影 $h = xA$ 可以在反向传播中以最小成本重新计算，因为秩 $r \ll d_{in}$，从而无需存储它。在Qwen2.5模型（0.5B-3B）上，MeSP相比MeBP平均减少49%内存，同时计算数学上等价的梯度。我们的分析还揭示，MeZO的梯度估计与真实梯度的相关性接近零（余弦相似度≈0.001），解释了其收敛缓慢的原因。MeSP将Qwen2.5-0.5B的峰值内存从361MB降低到136MB，使得先前在内存受限设备上不可行的微调场景成为可能。

英文摘要

On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6--12GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA's low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{in}$, eliminating the need to store it. MeSP achieves 49\% average memory reduction compared to MeBP on Qwen2.5 models (0.5B--3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO's gradient estimates show near-zero correlation with true gradients (cosine similarity $\approx$0.001), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.

URL PDF HTML ☆

赞 0 踩 0

2602.11137 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Weight Decay Improves Language Model Plasticity

权重衰减提升语言模型可塑性

Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade

发表机构 * Broad Institute, Schmidt Center（Broad研究所，Schmidt中心）； University of Tübingen, Tübingen AI Center（图宾根大学，图宾根人工智能中心）； Harvard University（哈佛大学）

AI总结本文通过系统实验表明，预训练中较大的权重衰减能提高模型的可塑性，使微调后下游性能更优，并揭示了其促进线性可分表示、正则化注意力矩阵和减少过拟合的机制。

详情

AI中文摘要

使用Transformer进行上下文无关语言识别

Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

发表机构 * ETH Zürich（苏黎世联邦理工学院）； Boston University（波士顿大学）； Allen Institute for AI（人工智能研究院）

AI总结本文证明循环Transformer通过O(log N)层和O(N^6)填充符号可识别所有上下文无关语言，并针对无歧义子类将填充需求降至O(N^3)。

详情

AI中文摘要

Transformer在处理符合某种语法的良好形式输入（如自然语言和代码）的任务中表现出色。然而，它们如何处理语法句法仍不清楚。事实上，在标准复杂性猜想下，标准Transformer无法识别上下文无关语言（CFL）——一种描述句法的规范形式，甚至无法识别正则语言（CFL的子类）。过去的工作表明，O(log(N))循环层（相对于输入长度N）允许Transformer识别正则语言，但循环Transformer识别上下文无关语言的问题仍然开放。在这项工作中，我们证明具有O(log(N))循环层和O(N^6)填充符号的循环Transformer可以识别所有CFL。然而，使用O(N^6)填充符号的训练和推理可能不切实际。幸运的是，我们表明，对于无歧义CFL等自然子类，Transformer上的识别问题变得更加易处理，只需要O(N^3)填充。实验上，循环和填充Transformer在识别CFL方面比固定深度Transformer表现更好。总体而言，我们的结果揭示了Transformer识别CFL的复杂性：虽然一般识别可能需要难以处理的填充量，但无歧义性等自然约束产生了高效的识别算法。

英文摘要

Transformers excel empirically on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work has shown that $\mathcal{O}(\log(N))$ looping layers (w.r.t. input length $N$) allow transformers to recognize regular languages, but the question of context-free recognition with looped transformers remained open. In this work, we show that looped transformers with $\mathcal{O}(\log(N))$ looping layers and $\mathcal{O}(N^6)$ padding symbols can recognize all CFLs. However, training and inference with $\mathcal{O}(N^6)$ padding symbols is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(N^3)$ padding. Empirically, looped and padded transformers perform better than fixed-depth transformers in recognizing CFLs. Overall, our results shed light on the intricacy of CFL recognition by transformers: while general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.

URL PDF HTML ☆

赞 0 踩 0

2602.06161 2026-06-01 cs.CL cs.AI 版本更新

作为统计估计的机械可解释性：方差分析

Maxime Méloux, François Portet, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG（格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔INP、实验室LIG）

AI总结本文从统计估计角度审视机械可解释性中的电路发现，揭示因果中介分析中单输入得分的固有方差导致电路不稳定，并系统分解方差来源，倡导更严谨的实践。

详情

AI中文摘要

机械可解释性（MI）旨在通过识别功能子网络来逆向工程模型行为。然而，这些发现的科学有效性取决于其稳定性。在这项工作中，我们认为电路发现不是一个独立的任务，而是一个建立在因果中介分析（CMA）基础上的统计估计问题。我们揭示了这一基础层的根本不稳定性：精确的单输入CMA得分表现出高固有方差，这意味着组件的因果效应是一个易变的随机变量，而非固定属性。然后，我们证明电路发现流程继承了这一方差并进一步放大。快速近似方法，如边缘属性修补及其后续方法，引入了额外的估计噪声，而在数据集上聚合这些噪声得分会导致脆弱的结构估计。因此，输入数据或超参数的小扰动会产生截然不同的电路。我们系统地分解了这些方差来源，并倡导更严格的MI实践，优先考虑统计稳健性和稳定性指标的常规报告。

英文摘要

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

URL PDF HTML ☆

赞 0 踩 0

2512.19673 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

自底向上策略优化：你的语言模型策略内部隐藏着内部策略

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； University of Chinese Academy of Sciences（中国科学院大学）； Tencent AI Lab（腾讯AI实验室）

AI总结本文通过分解Transformer残差流中的内部层策略和内部模块策略，提出自底向上策略优化（BuPO）方法，通过早期优化内部层来重建LLM的推理基础，在复杂推理基准上验证了有效性。

Comments Preprint. Our code is available at https://github.com/Trae1ounG/BuPO

详情

AI中文摘要

现有的强化学习方法将大型语言模型（LLM）视为统一策略，忽略了其内部机制。在本文中，我们通过Transformer的残差流将基于LLM的策略分解为内部层策略和内部模块策略。我们对内部策略的熵分析揭示了不同的模式：（1）普遍地，内部策略从早期层的高熵探索演变为顶层层的确定性精炼；（2）Qwen表现出显式的渐进推理结构，与Llama中的突然收敛形成对比。此外，我们发现优化内部层会引发特征精炼，迫使较低层早期捕获高层推理表示。受这些发现启发，我们提出了自底向上策略优化（BuPO），一种新的强化学习范式，通过在早期阶段优化内部层来自底向上重建LLM的推理基础。在复杂推理基准上的大量实验证明了BuPO的有效性。

英文摘要

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's residual stream. Our entropy analysis of internal policy reveals distinct patterns: (1) universally, internal policies evolve from high-entropy exploration in early layers to deterministic refinement in the top layers; and (2) Qwen exhibits an explicit progressive reasoning structure, contrasting with the abrupt convergence in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations early. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM's reasoning foundation from the bottom up by optimizing internal layers in early stages. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of BuPO.

URL PDF HTML ☆

赞 0 踩 0

2601.13433 2026-06-01 cs.CL cs.LG 版本更新

InfiMed-ORBIT: 通过基于评分标准的增量训练使大语言模型对齐开放复杂任务

Pengkai Wang, Pengwei Liu, Qi Zuo, Zhijie Sang, Congkai Xie, Hongxia Yang

发表机构 * Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China（香港理工大学计算机系）； Department of Control Science and Engineering, Zhejiang University（浙江大学控制科学与工程学院）

AI总结提出ORBIT框架，利用动态生成的病例条件评分标准指导增量强化学习，仅用2k样本将Qwen3-4B-Instruct在HealthBench-Hard上的得分从7.0提升至27.5，达到同规模开源模型最优。

详情

AI中文摘要

强化学习（RL）推动了大语言模型（LLM）的许多近期突破，尤其是在奖励可自动计算的任务（如代码生成）中。然而，在开放式的医学对话中，RL效果较差，因为反馈模糊、依赖上下文，且难以简单总结为单一标量信号——通常需要高度监督的奖励模型，并存在奖励破解的风险。因此，我们引入了ORBIT，一个专为关键医学对话设计的基于评分标准的开放式增量训练框架。ORBIT将医学对话构建与动态生成的病例条件评分标准相结合，这些评分标准作为增量RL的自适应指南。与依赖外部医学知识库或手工规则的方法不同，ORBIT使用评分标准引导的评估，并可与通用指令遵循LLM一起实现，避免了任务特定的评判微调。仅使用2k训练样本，ORBIT将Qwen3-4B-Instruct的HealthBench-Hard得分从7.0提升至27.5，在相似规模的开源模型中实现了最先进的性能，同时随着评分标准覆盖范围的扩大，保持了良好的咨询质量。

英文摘要

Reinforcement learning (RL) has powered many recent breakthroughs in large language models (LLMs), especially for tasks where rewards can be computed automatically, such as code generation. However, it is less effective in open-ended medical dialogue, where feedback is ambiguous, context-dependent, and difficult to simply summarize into a single scalar signal-often requiring heavily supervised reward models and creating risks of reward hacking. Thus, we introduce ORBIT, an open-ended rubric-based incremental training framework tailored for critical medical dialogues. ORBIT integrates medical dialogue construction with dynamically generated case-conditioned rubrics that serve as adaptive guides for incremental RL. Unlike approaches that rely on external medical knowledge bases or handcrafted rules, ORBIT uses rubric-guided evaluation and can be implemented with general-purpose instruction-following LLMs, avoiding task-specific judge fine-tuning. With only 2k training samples, ORBIT raises Qwen3-4B-Instruct's HealthBench-Hard score from 7.0 to 27.5, achieving state-of-the-art performance among similarly sized open-source models while maintaining strong consultation quality as rubric coverage broadens.

URL PDF HTML ☆

赞 0 踩 0

2511.19923 2026-06-01 cs.CV cs.CL 版本更新

Distilling Counterfactual Reasoning from Language to Vision: Causal Graph Guided Post-Training for Video Understanding

从语言到视觉的反事实推理蒸馏：因果图引导的视频理解后训练

Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang

发表机构 * Rutgers University（新泽西罗格斯大学）

AI总结针对视觉语言模型在反事实推理上的不足，提出CounterVQA基准和CFGPT后训练方法，通过从语言模态蒸馏反事实推理能力提升视频理解。

详情

AI中文摘要

视觉语言模型（VLM）最近在视频理解方面取得了显著进展，特别是在特征对齐、事件推理和指令遵循任务中。然而，它们在反事实推理（即在假设条件下推断替代结果）方面的能力仍未得到充分探索。这种能力对于鲁棒的视频理解至关重要，因为它需要识别潜在的因果结构并推理未观察到的可能性，而不仅仅是识别观察到的模式。为了系统评估这一能力，我们引入了CounterVQA，一个基于视频的基准测试，具有三个渐进难度级别，评估反事实推理的不同方面。通过对最先进的开源和闭源模型的全面评估，我们发现了一个显著的性能差距：虽然这些模型在简单的反事实问题上达到了合理的准确性，但在复杂的多跳因果链上性能显著下降。为了解决这些限制，我们开发了一种后训练方法CFGPT，通过从语言模态蒸馏其反事实推理能力来增强模型的视觉反事实推理能力，在CounterVQA的所有难度级别上均取得了一致的改进。数据集和代码将后续发布。

英文摘要

Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. Dataset and code will be further released.

URL PDF HTML ☆

赞 0 踩 0

2511.17826 2026-06-01 cs.LG cs.CL stat.ML 版本更新

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

跨张量并行大小的确定性推理，消除训练-推理不匹配

Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, Zirui Liu

发表机构 * Independent Researcher（独立研究者）； University of Minnesota, Minneapolis, Minnesota, USA（明尼苏达大学）； Rice University, Houston, Texas, USA（里士满大学）； NVIDIA Corp., Santa Clara, California, USA（NVIDIA公司）

AI总结针对不同张量并行大小导致浮点运算非结合性引起的推理非确定性问题，提出基于树的核（TBIK）实现跨TP大小的比特级一致结果，消除RL训练中推理与训练引擎间的精度不匹配。

详情

AI中文摘要

确定性推理对于大型语言模型（LLM）应用（如LLM-as-a-judge评估、多智能体系统和强化学习（RL））日益关键。然而，现有的LLM服务框架表现出非确定性行为：当系统配置（例如张量并行（TP）大小、批大小）变化时，即使采用贪心解码，相同的输入也可能产生不同的输出。这是由于浮点运算的非结合性以及GPU间归约顺序不一致导致的。虽然先前的工作通过批不变核解决了与批大小相关的非确定性，但跨不同TP大小的确定性仍然是一个开放问题，特别是在RL设置中，训练引擎通常使用全分片数据并行（即TP=1），而部署引擎依赖多GPU TP以最大化推理吞吐量，从而在两者之间产生自然的不匹配。这种精度不匹配问题可能导致RL训练性能次优甚至崩溃。我们识别并分析了TP引起不一致的根本原因，并提出了基于树的核（TBIK），这是一组TP不变的矩阵乘法和归约原语，无论TP大小如何，都能保证比特级相同的结果。我们的关键见解是通过统一的层次二叉树结构对齐GPU内和GPU间的归约顺序。我们在Triton中实现了这些核，并将其集成到vLLM和FSDP中。实验证明，在不同TP大小下，确定性推理的概率发散为零，且具有比特级可重复性。此外，在采用不同并行策略的RL训练流程中，我们在vLLM和FSDP之间实现了比特级相同的结果。代码可在https://github.com/nanomaoli/llm_reproducibility获取。

英文摘要

Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize the inference throughput, creating a natural mismatch between the two. This precision mismatch problem may lead to suboptimal performance or even collapse for RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. Also, we achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategy. Code is available at https://github.com/nanomaoli/llm_reproducibility.

URL PDF HTML ☆

赞 0 踩 0

2509.12440 2026-06-01 cs.CL cs.AI 版本更新

MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

MedFact：大型语言模型在中文医学文本上的事实核查能力基准测试

Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai

发表机构 * Xunfei Healthcare Technology Co., Ltd.（讯飞医疗科技有限公司）

AI总结为评估LLM在中文医学文本中的事实核查能力，构建了包含2116个专家标注实例的MedFact基准，涵盖13个专科、8种错误类型等，并发现模型在错误定位上表现不足，存在“过度批评”现象。

Comments Accepted to The Fifth Workshop on Generation, Evaluation, and Metrics (GEM) at ACL 2026

详情

AI中文摘要

在医疗应用中部署大型语言模型（LLM）需要具备事实核查能力，以确保患者安全和法规合规。我们引入了MedFact，一个具有挑战性的中文医学事实核查基准，包含来自多样化真实文本的2,116个专家标注实例，涵盖13个专科、8种错误类型、4种写作风格和5个难度级别。构建采用混合AI-人类框架，其中迭代的专家反馈优化AI驱动的多标准过滤，以确保高质量和难度。我们评估了20个领先的LLM在真实性分类和错误定位方面的表现，结果显示模型通常能判断文本是否包含错误，但难以精确定位错误，顶级模型的表现仍不及人类。我们的分析揭示了“过度批评”现象，即模型倾向于将正确信息误判为错误，而高级推理技术（如多智能体协作和推理时扩展）可能加剧这一问题。MedFact突显了部署医疗LLM的挑战，并为开发事实可靠的医疗AI系统提供了资源。

英文摘要

Deploying Large Language Models (LLMs) in medical applications requires fact-checking capabilities to ensure patient safety and regulatory compliance. We introduce MedFact, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework where iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization, and results show models often determine if text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals the "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which can be exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources to develop factually reliable medical AI systems.

URL PDF HTML ☆

赞 0 踩 0

2503.22996 2026-06-01 cs.CL 版本更新

Rethinking Sparse Mixture of Experts from a Unified Perspective

从统一视角重新思考稀疏混合专家模型

Giang Do, Hung Le, Truyen Tran

发表机构 * Applied Artificial Intelligence Intiative (A2I2), Deakin University, Victoria, Australia（应用人工智能倡议（A2I2），德金大学，维多利亚，澳大利亚）

AI总结针对稀疏混合专家模型中固定预算导致无关选择或遗漏关键分配的问题，提出基于线性规划的统一框架USMoE，通过统一机制和评分实现灵活专家选择，提升性能并降低推理成本。

Comments 35 pages

详情

Journal ref: ICML 2026

AI中文摘要

稀疏混合专家（SMoE）模型在保持恒定计算开销的同时扩展了模型容量。SMoE方法分为两类：Token Choice（将每个令牌路由到固定数量的专家）和Expert Choice（为每个专家分配固定数量的令牌）。然而，对令牌或专家使用固定预算导致两种方法都会选择不相关的令牌-专家对或忽略关键分配，从而降低整体性能。为填补这一空白，我们通过线性规划的视角从统一角度重新思考SMoE，为SMoE模型提供了通用公式。此外，我们引入了统一稀疏混合专家（USMoE），这是一个包含统一机制和统一评分的新框架，以克服这些限制。我们提供了理论证明和实证证据，展示了USMoE的有效性。在多种数据设置（干净和损坏）、多个领域（包括文本和视觉任务）以及不同学习方法（无训练和基于训练）上的广泛评估表明，USMoE不仅比现有SMoE方法带来了显著的性能提升，而且实现了更灵活的专家选择预算，在不影响模型性能的情况下降低了推理成本。我们的实现已在https://github.com/giangdip2410/USMoE公开。

英文摘要

Sparse Mixture of Experts (SMoE) models scale the capacity of models while maintaining constant computational overhead. SMoE methods fall into two categories: Token Choice, which routes each token to a fixed number of experts, and Expert Choice, which assigns a fixed number of tokens to each expert. However, the use of fixed budgets for tokens or experts causes both approaches to select irrelevant token-expert pairs or overlook critical assignments, which degrades overall performance. To fill that gap, we rethink SMoE from a unified perspective through the lens of linear programming, which provides a general formulation for SMoE models. Furthermore, we introduce Unified Sparse Mixture of Experts (USMoE), a novel framework comprising a unified mechanism and a unified score to overcome these limitations. We provide both theoretical justification and empirical evidence demonstrating USMoE's effectiveness. Extensive evaluations across diverse data settings (clean and corrupted), multiple domains (including texts and vision tasks), and different learning approaches (training-free and training-based) show that USMoE not only delivers significant performance improvements over existing SMoE methods, but also enables more flexible expert selection budgets, reducing inference costs without compromising model performance. Our implementation is publicly available at https://github.com/giangdip2410/USMoE.

URL PDF HTML ☆

赞 0 踩 0

2510.20853 2026-06-01 eess.AS cs.CL cs.SD 版本更新

Beyond Hearing: Learning Task-Agnostic ExG Representations from Earphones via Physiology-Informed Tokenization

超越听觉：通过生理学启发的标记化从耳机学习任务无关的ExG表示

Hyungjun Yoon, Seungjoo Lee, Yu Yvonne Wu, Xiaomeng Chen, Taiting Lu, Freddy Yifei Liu, Taeckyung Lee, Hyeongheon Cha, Haochen Zhao, Gaoteng Zhao, Dongyao Chen, Cecilia Mascolo, Sung-Ju Lee, Lili Qiu

发表机构 * KAIST（韩国科学技术院）； Carnegie Mellon University（卡内基梅隆大学）； University of Cambridge（剑桥大学）； Shanghai Jiao Tong University（上海交通大学）； Pennsylvania State University（宾夕法尼亚州立大学）； UCLA（加州大学洛杉矶分校）； Northwest University（北华大学）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； Microsoft Research（微软研究院）

AI总结提出一种基于耳机的生理学启发的多频带标记化方法（PiMT），通过无干扰的日常ExG数据采集和重建任务学习鲁棒表示，实现跨多种任务（包括五种人类感官）的通用ExG监测。

Comments Accepted to ICLR 2026

详情

AI中文摘要

电生理（ExG）信号为人类生理学提供了有价值的见解，但由于两个关键限制，构建能够泛化到日常任务的基础模型仍然具有挑战性：（i）数据多样性不足，因为大多数ExG记录是在受控实验室中使用笨重、昂贵的设备收集的；以及（ii）任务特定的模型设计需要定制的处理（即目标频率滤波器）和架构，这限制了跨任务的泛化。为了解决这些挑战，我们引入了一种可扩展的、任务无关的野外ExG监测方法。我们使用基于耳机的硬件原型收集了50小时的无干扰自由生活ExG数据，以缩小数据多样性差距。我们方法的核心是生理学启发的多频带标记化（PiMT），它将ExG信号分解为12个生理学启发的标记，然后通过重建任务学习鲁棒的表示。这使得能够在全频谱范围内进行自适应特征识别，同时捕获任务相关信息。在我们新的DailySense数据集（第一个支持基于ExG的五种人类感官分析的数据集）以及四个公共ExG基准上的实验表明，PiMT在多种任务上始终优于最先进的方法。

英文摘要

Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i)~insufficient data diversity, as most ExG recordings are collected in controlled labs with bulky, expensive devices; and (ii)~task-specific model designs that require tailored processing (i.e., targeted frequency filters) and architectures, which limit generalization across tasks. To address these challenges, we introduce an approach for scalable, task-agnostic ExG monitoring in the wild. We collected 50 hours of unobtrusive free-living ExG data with an earphone-based hardware prototype to narrow the data diversity gap. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens, followed by a reconstruction task to learn robust representations. This enables adaptive feature recognition across the full frequency spectrum while capturing task-relevant information. Experiments on our new DailySense dataset, the first to enable ExG-based analysis across five human senses, together with four public ExG benchmarks, demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.

URL PDF HTML ☆

赞 0 踩 0

2503.05846 2026-06-01 cs.CL cs.AI 版本更新

EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

EMCEE：通过提取合成多语言上下文桥接知识与推理以提升大语言模型的多语言能力

Hamin Koo, Jaehyung Kim

发表机构 * Yonsei University（延世大学）

AI总结提出EMCEE框架，通过从LLM自身提取并融合语言特定知识，结合推理输出，显著提升多语言任务性能，尤其在低资源语言上平均提升31.7%。

Comments ACL 2026 Main

详情

AI中文摘要

大语言模型（LLMs）在广泛任务中取得了显著进展，但其对以英语为中心的训练数据的严重依赖导致在非英语语言中性能大幅下降。虽然现有的多语言提示方法强调将查询重新表述为英语或增强推理能力，但它们往往未能融入对某些查询至关重要的语言和文化特定基础。为了解决这一局限性，我们提出了EMCEE（提取合成多语言上下文并合并），一个简单而有效的框架，通过从LLM自身显式提取和利用查询相关知识来增强其多语言能力。具体来说，EMCEE首先提取合成上下文以揭示LLM中编码的潜在语言特定知识，然后通过基于判断的选择机制动态地将这种上下文见解与面向推理的输出合并。在涵盖多种语言和任务的四个多语言基准上的大量实验表明，EMCEE始终优于先前的方法，总体平均相对提升16.4%，在低资源语言中提升31.7%。

英文摘要

Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. While existing multilingual prompting methods emphasize reformulating queries into English or enhancing reasoning capabilities, they often fail to incorporate the language- and culture-specific grounding that is essential for some queries. To address this limitation, we propose EMCEE (Extracting synthetic Multilingual Context and merging), a simple yet effective framework that enhances the multilingual capabilities of LLMs by explicitly extracting and utilizing query-relevant knowledge from the LLM itself. In particular, EMCEE first extracts synthetic context to uncover latent, language-specific knowledge encoded within the LLM, and then dynamically merges this contextual insight with reasoning-oriented outputs through a judgment-based selection mechanism. Extensive experiments on four multilingual benchmarks covering diverse languages and tasks demonstrate that EMCEE consistently outperforms prior approaches, achieving an average relative improvement of 16.4% overall and 31.7% in low-resource languages.

URL PDF HTML ☆

赞 0 踩 0

2510.11683 2026-06-01 cs.LG cs.AI cs.CL 版本更新

测试时的自反生成

Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu

发表机构 * Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Nanyang Technological University（南洋理工大学）； University of Edinburgh（爱丁堡大学）； City University of Hong Kong（香港城市大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结提出SRGen框架，通过动态熵阈值识别高不确定性token并训练校正向量，在测试时进行自反生成以纠正概率分布，提升大模型推理的可靠性。

详情

AI中文摘要

大型语言模型（LLM）越来越多地通过长思维链解决复杂推理任务，但其仅前向的自回归生成过程是脆弱的；早期token错误可能级联，这明确需要自反机制。然而，现有的自反要么对完整草稿进行修订，要么通过昂贵的训练学习自我修正，两者本质上都是反应性的且低效。为解决此问题，我们提出测试时的自反生成（SRGen），一种轻量级测试时框架，在不确定点生成前进行反思。在token生成过程中，SRGen利用动态熵阈值识别高不确定性token。对于每个识别的token，它训练一个特定的校正向量，充分利用已生成的上下文进行自反生成以纠正token概率分布。通过回顾性分析部分输出，这种自反使得决策更可靠，从而显著降低高不确定点错误的概率。在具有挑战性的数学推理基准和多种LLM上的评估表明，SRGen能显著增强模型推理能力。此外，我们的发现将SRGen定位为一种即插即用的方法，将反思融入生成过程以实现可靠的LLM推理，在有限开销下取得一致收益，并可与其他训练时（如RLHF）和测试时（如SLOT）技术结合。

英文摘要

Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can significantly strengthen model reasoning. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and can be combined with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.

URL PDF HTML ☆

赞 0 踩 0

2509.23195 2026-06-01 cs.CL q-bio.NC 版本更新

The relative strength of hierarchical structure and statistics differs across the measures in naturalistic reading

自然阅读中层级结构与统计的相对强度因测量指标而异

Nan Wang, Hanlin Wu, Jiaxuan Li

发表机构 * Department of Brain and Cognitive Sciences, University of Rochester（罗切斯特大学脑科学与认知科学系）； Department of Linguistics and Modern Languages, the Chinese University of Hong Kong（香港中文大学语言学与现代语言系）； Department of Language Science, University of California Irvine（加州大学 Irvine 分校语言科学系）

AI总结本研究通过同步脑电图和眼动追踪，结合贝叶斯网络建模和回归分析，探究层级句法结构与统计因素在在线理解中的相对强度，发现层级结构在阅读前即可影响理解，但其强度因行为或神经层面而异。

详情

AI中文摘要

层级句法结构与非层级、统计或序列因素长期以来被视为解释在线理解的竞争理论。大量证据表明，层级和非层级因素都能塑造理解，而更开放的问题是层级何时以及以多强的影响力作用于理解。我们通过同步脑电图和眼动追踪来探讨这一问题，将句法深度作为操作化层级结构的变量。关于时间问题，层级句法结构在阅读句子之前就已影响阅读，并且最早可在阅读前108毫秒出现。这一点得到了转移概率分析和注视相关电位回归的支持。注视转移分析表明，读者更倾向于在句法核心词之间移动，而非按照序列词序，这表明扫描路径是由深层句法结构而非纯统计驱动的。关于强度问题，我们结合贝叶斯网络建模和回归分析，表明变量的强度取决于待解释的现象。贝叶斯网络分析显示，层级句法结构比统计特征携带更多的预测权重。注视相关电位回归表明，在回归分析中，层级句法结构显著预测了右前脑区的词级神经活动，但与词汇惊奇度相比普遍较弱。综合证据，我们的分析表明，层级结构可以在行为和神经层面预期性地引导受试者的在线理解，其强度随阅读行为的不同方面而变化。

英文摘要

The hierarchical syntactic structure and non-hierarchical, statistical, or sequential factors have long been framed as rival theories in accounting for online comprehension. A lot of evidence has shown that both hierarchical and non-hierarchical factors can shape comprehension and the more open question is when, and how strongly, hierarchy exerts its influence in comprehension. We addressed the question with co-registered EEG and eye-tracking, treating syntactic depth as the variable for operationalizing hierarchical structure. For the timing question, hierarchical syntactic structure is shown to influence reading before reading a sentence and can emerge as early as 108ms before reading. This is supported by both transitional probability analysis and regression on fixation-related potential. Analyses on fixation-transition showed that readers preferentially moved between syntactically central words rather than according to serial word order, suggesting that scanpaths are driven by deep syntactic structure rather than by pure statistics. For the strength question, we combined Bayesian network modeling and regression analysis to show that strength of a variable is dependent on the phenomenon that is to be explained. Bayesian network analysis showed that hierarchical syntactic structure carried more predictive weight than statistical features. Regression on fixation-related potential demonstrated that hierarchical syntactic structure significantly predicted word-level neural activity in the front-right region in regression analyses, but is generally weaker in comparison with lexical surprisal. Evidence combined, our analyses suggested that hierarchical structure can anticipatorily guide subjects' online comprehension both on a behavioral and neural level, with its strength varies across different facets of reading behavior.

URL PDF HTML ☆

赞 0 踩 0

2501.04661 2026-06-01 cs.CL cs.AI 版本更新

Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions

超越记忆：使用短语结构评估大型语言模型中的语义泛化

Wesley Scivetti, Melissa Torgbi, Austin Blodgett, Mollie Shichman, Taylor Hudson, Claire Bonial, Harish Tayyar Madabushi

发表机构 * Georgetown University（乔治城大学）； University of Bath（巴斯大学）； DEVCOM U.S. Army Research Laboratory（美国陆军研究实验室）； University of Maryland, College Park（马里兰大学学院公园分校）

AI总结通过构式语法构建诊断评估，测试大型语言模型在低频但人类易理解的短语结构上的语义理解与泛化能力，发现模型在句法相同但语义不同的构式上性能下降超40%。

Comments Camera Ready: AACL-IJCNLP (2025)

详情

AI中文摘要

预训练数据的网络规模带来了一个重要的评估挑战：将预训练数据中充分代表的案例的语言能力与对域外语言（特别是预训练数据中较少见的动态真实世界实例）的泛化能力区分开来。为此，我们利用构式语法（CxG）构建了一个诊断评估，系统性地评估大型语言模型（LLM）的自然语言理解。CxG为测试泛化提供了一个心理语言学上合理的框架，因为它明确地将句法形式与抽象的、非词汇意义联系起来。我们的新颖推理评估数据集包含英语短语构式，已知说话者能够抽象出常见的实例以理解和产生创造性实例。我们的评估数据集使用CxG评估两个核心问题：第一，模型是否能够“理解”那些可能在预训练数据中出现频率较低、但对人类而言直观且易于理解的句子的语义；第二，LLM是否能够在句法相同但意义不同的构式中部署适当的构式语义。我们的结果表明，包括GPT-o1在内的最先进模型在第二个任务上性能下降超过40%，揭示了模型无法像人类那样在句法相同的形式上进行泛化以得出不同的构式意义。我们公开了我们的新颖数据集和相关的实验数据，包括提示和模型响应。

英文摘要

The web-scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions: first, if models can 'understand' the semantics of sentences for instances that are likely to appear in pretraining data less often, but are intuitive and easy for people to understand. Second, if LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but with divergent meanings. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.

URL PDF HTML ☆

赞 0 踩 0

2508.08204 2026-06-01 cs.CL cs.AI 版本更新

Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

大型语言模型中推理时间不确定性的人类对齐与校准

Kyle Moore, Jesse Roberts, Daryl Watson

AI总结本文评估了多种推理时间不确定性度量，发现它们与人类群体不确定性高度对齐，尽管与人类答案偏好不一致，但在正确性相关性和分布分析上表现出中等到强校准证据。

Comments We have discovered a critical error in the normalized entropy calculation that may have substantially inflated nearly all results herein. We have since fixed this error in a new work, but we believe that the new work is sufficiently dissimilar in focus, methods, dataset, and results as to be misleading if presented as a simple replacement. As such, we propose removal and retraction instead

详情

AI中文摘要

最近，评估大型语言模型的不确定性校准引起了广泛关注，以促进模型控制和调节用户信任。推理时间不确定性可能为模型或外部控制模块提供实时信号，对于应用这些概念以改善LLM用户体验尤为重要。尽管许多现有论文考虑模型校准，但相对较少的工作试图评估模型不确定性与人类不确定性的对齐程度。在这项工作中，我们使用既有度量和新颖变体评估了一系列推理时间不确定性度量，以确定它们与人类群体水平不确定性以及传统模型校准概念的接近程度。我们发现，许多度量显示出与人类不确定性强烈对齐的证据，尽管与人类答案偏好缺乏对齐。对于那些成功的度量，我们在正确性相关性和分布分析方面发现了中等到强校准证据。

英文摘要

There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, even despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.

URL PDF HTML ☆

赞 0 踩 0

2507.17335 2026-06-01 cs.CV cs.CL 版本更新

TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition

TransLPRNet：用于单/双行中文车牌识别的轻量级视觉-语言网络

Guangzhu Xu, Zhi Ke, Pengcheng Zuo, Bangjun Lei

发表机构 * Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University（水电工程智能视觉监测湖北省重点实验室，中国三峡大学）； College of Computer and Information Technology, China Three Gorges University（计算机与信息学院，中国三峡大学）； Hubei Key Laboratory of Digital Finance Innovation（数字金融创新湖北省重点实验室）； School of Information Engineering, Hubei University of Economics（信息工程学院，湖北经济学院）

AI总结针对开放环境中车牌类型多样和成像条件复杂的问题，提出一种集成轻量视觉编码器和文本解码器的统一解决方案，通过预训练框架和透视校正网络实现单/双行中文车牌的高精度识别。

详情

大语言模型时代的图机器学习

Shijie Wang, Jiani Huang, Zhikai Chen, Yu Song, Wenzhuo Tang, Haitao Mao, Wenqi Fan, Hui Liu, Xiaorui Liu, Dawei Yin, Qing Li

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； Michigan State University（密歇根州立大学）； North Carolina State University（北卡罗来纳州立大学）； Baidu Inc（百度公司）

AI总结本文综述了大语言模型如何增强图机器学习的泛化、迁移和少样本学习能力，以及图如何提升大语言模型的推理和可解释性。

Comments Accepted by TIST

详情

AI中文摘要

图在表示社交网络、知识图谱和分子发现等各个领域的复杂关系中扮演着重要角色。随着深度学习的出现，图神经网络（GNN）已成为图机器学习（Graph ML）的基石，促进了图的表示和处理。最近，大语言模型（LLM）在语言任务中展现出前所未有的能力，并被广泛应用于计算机视觉和推荐系统等各种应用中。这一显著成功也引起了将LLM应用于图领域的兴趣。越来越多的努力致力于探索LLM在提升图机器学习的泛化性、迁移性和少样本学习能力方面的潜力。同时，图，尤其是知识图谱，富含可靠的事实知识，可用于增强LLM的推理能力，并可能缓解其局限性，如幻觉和缺乏可解释性。鉴于这一研究方向的快速进展，有必要对LLM时代图机器学习的最新进展进行系统综述，为研究人员和从业者提供深入理解。因此，在本综述中，我们首先回顾了图机器学习的最新发展。然后，我们探讨了如何利用LLM来增强图特征的质量，减轻对标注数据的依赖，并解决图异质性和分布外（OOD）泛化等挑战。之后，我们深入探讨了图如何增强LLM，突出了它们增强LLM预训练和推理的能力。此外，我们调查了各种应用，并讨论了这一有前景领域的潜在未来方向。

英文摘要

Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecular discovery. With the advent of deep learning, Graph Neural Networks (GNNs) have emerged as a cornerstone in Graph Machine Learning (Graph ML), facilitating the representation and processing of graphs. Recently, LLMs have demonstrated unprecedented capabilities in language tasks and are widely adopted in a variety of applications such as computer vision and recommender systems. This remarkable success has also attracted interest in applying LLMs to the graph domain. Increasing efforts have been made to explore the potential of LLMs in advancing Graph ML's generalization, transferability, and few-shot learning ability. Meanwhile, graphs, especially knowledge graphs, are rich in reliable factual knowledge, which can be utilized to enhance the reasoning capabilities of LLMs and potentially alleviate their limitations such as hallucinations and the lack of explainability. Given the rapid progress of this research direction, a systematic review summarizing the latest advancements for Graph ML in the era of LLMs is necessary to provide an in-depth understanding to researchers and practitioners. Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph Heterophily and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.

URL PDF HTML ☆

赞 0 踩 0