arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.03299 2026-06-03 cs.CL Version update

LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models

Minh Chu Xuan, Tien-Phat Nguyen, Linh Ngo Van, Dinh Viet Sang, Nguyen Thi Ngoc Diep, Trung Le

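Affiliations: Hanoi University of Science and Technology; VNU University of Engineering and Technology; Monash University

AI summary: Proposes LLM-XTM, a framework that combines LLM-guided topic refinement with self-consistency uncertainty quantification to improve the coherence and alignment of cross-lingual topic models in a stable, black-box manner while reducing reliance on bilingual resources.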

Comments ACL 2026

Abstract

Cross-lingual topic modeling aims to discover shared semantic structures across languages, yet existing models depend on sparse bilingual resources and often yield incoherent or weakly aligned topics. Recent LLM-based refinements improve interpretability but are costly, document-level, and prone to hallucination, with prior white-box approaches requiring inaccessible token probabilities. We propose LLM-XTM, a framework that integrates LLM-guided topic refinement with self-consistency uncertainty quantification, enabling black-box, stable, and scalable enhancement of cross-lingual topic models. Experiments on multilingual corpora show that LLM-XTM achieves superior topic coherence and alignment while reducing reliance on bilingual dictionaries and expensive LLM calls.

2606.03990 2026-06-03 cs.LG cs.CL cs.CV Version update

Neuron Populations Exhibit Divergent Selectivity with Scale

Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

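Affiliations: UC Berkeley; TTIC

AI summary: Analyzes the distribution and properties of Rosetta Neurons across model scales, finding that their count follows a sublinear power law and their selectivity increases with scale, whereas non-Rosetta neurons remain weakly selective; proposes an analytical model that balances feature utility against neuron capacity to explain this polarization.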

Comments Project page and code: https://avdravid.github.io/rosetta-neuron-scaling/

Abstract

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
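The abstract states that the Rosetta population follows a sublinear power law but does not give its form. As a minimal worked illustration, assume (hypothetically, these symbols are not from the paper) that model size is measured by the total neuron count $N$, that the Rosetta count is $R(N)$, and that $c > 0$ and $0 < \alpha < 1$ are fitted constants. The two reported observations then follow directly:

```latex
R(N) = c\,N^{\alpha}, \qquad 0 < \alpha < 1
\;\Longrightarrow\;
\frac{dR}{dN} = c\,\alpha\,N^{\alpha-1} > 0
\quad\text{and}\quad
\frac{R(N)}{N} = c\,N^{\alpha-1} \xrightarrow[N \to \infty]{} 0 .
```

The first expression says the absolute number of Rosetta Neurons keeps growing with scale; the second says their share of all neurons shrinks, because the exponent $\alpha - 1$ is negative.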

2606.03982 2026-06-03 cs.CL Version update

Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics

Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

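AI summary: Controlled experiments show that, when comparing quantities with units, language models do not perform exact scale conversion but rely on heuristics over numerical differences and unit-scale differences, which leads to systematic errors near the comparison boundary.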

Abstract

Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.
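For contrast with the heuristic behavior described above, here is a minimal sketch of the exact shared-scale comparison that the abstract says LMs do not perform. The unit table and function names are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction

# Hypothetical unit table: metres per unit (illustrative, not from the paper).
METRES_PER_UNIT = {
    "mm": Fraction(1, 1000),
    "cm": Fraction(1, 100),
    "m": Fraction(1),
    "km": Fraction(1000),
}


def to_metres(value: str, unit: str) -> Fraction:
    """Convert a decimal string and a unit into an exact length in metres."""
    return Fraction(value) * METRES_PER_UNIT[unit]


def larger(first, second) -> str:
    """Compare two (value, unit) pairs after converting both to metres."""
    a, b = to_metres(*first), to_metres(*second)
    if a == b:
        return "equal"
    return "first" if a > b else "second"


# The abstract's example: 110 cm is 1.1 m, which is less than 1.2 m.
print(larger(("110", "cm"), ("1.2", "m")))
```

Exact rational arithmetic avoids floating-point noise right at the comparison boundary, which is where the abstract reports that model accuracy degrades.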

2606.03980 2026-06-03 cs.LG cs.CL Version update

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang

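Affiliations: Qwen Large Model Application Team, Alibaba; Sun Yat-sen University; The Chinese University of Hong Kong; Peking University; ETH Zürich; University of Zurich

AI summary: Proposes Skill-RM, a framework that reformulates reward modeling as the execution of a reusable reward-evaluation skill, unifying heterogeneous evaluation criteria by dynamically selecting and aggregating evidence; it outperforms traditional methods on reward benchmarks and downstream tasks.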

Abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.

2606.03969 2026-06-03 cs.CL cs.AI Version update

Quantifying Faithful Confidence Expression in Large Reasoning Models

Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

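Affiliations: Yale University

AI summary: To address the difficulty large reasoning models (LRMs) have in faithfully expressing intrinsic confidence in long chain-of-thought outputs, proposes a framework based on token probabilities, hidden states, and response consistency that systematically quantifies the alignment between linguistic decisiveness and internal uncertainty.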

Comments Code: https://github.com/yale-nlp/faithful_lrm

Abstract

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.

2606.03968 2026-06-03 cs.CL cs.AI Version update

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

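Affiliations: Amazon; Georgia Institute of Technology

AI summary: To address the rubric-quality bottleneck caused by a fixed query distribution in rubric-based RL, proposes QUBRIC, a framework that co-designs queries and rubrics using teacher-derived key points, contrastive rubric generation, and learnability filtering; it gains +5.5 points on ArenaHard and generalizes to legal, moral, and narrative reasoning tasks.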

Abstract

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
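The remark that uniformly failing responses leave training with no reward signal can be made concrete with group-relative advantages of the kind used in GRPO. The sketch below is an illustration under assumptions: the population standard deviation, the `eps` term, and the variance-based `is_learnable` check are stand-ins, and the paper's actual learnability filter may be defined differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and spread (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_learnable(rewards):
    """Hypothetical filter: keep a query only if its sampled rewards differ."""
    return max(rewards) > min(rewards)


all_fail = [0.0, 0.0, 0.0, 0.0]   # every sampled response fails the rubric
mixed = [0.0, 1.0, 0.0, 1.0]      # the rubric separates good from bad responses

print(group_relative_advantages(all_fail))  # every advantage is zero
print(is_learnable(all_fail), is_learnable(mixed))
```

When every response in a group receives the same reward, all advantages are zero and the policy gradient for that query vanishes, so filtering such queries out before training costs nothing in signal.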

2606.03967 2026-06-03 cs.CL cs.AI Version update

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

Quentin Fuxa, Dominik Macháček

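Affiliations: Charles University, MFF, ÚFAL; University of Edinburgh

AI summary: Proposes AlignAtt4LLM, which applies the AlignAtt policy to a decoder-only LLM for the first time through an explicit source span, offline selection of translation-specific alignment heads, selective qk-fast replay, and runtime query/key capture; it outperforms the baselines on English-to-German and English-to-Italian simultaneous translation.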

Comments Accepted to IWSLT 2026

Abstract

We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.
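A minimal sketch of the general AlignAtt decision rule may help here: a draft token is emitted only while the source position it attends to most lies outside the last few positions of the source received so far. The function name, the `guard` parameter, and the assumption that attention has already been aggregated over the selected alignment heads and restricted to the explicit source span are illustrative, not details from the paper.

```python
def alignatt_emit_count(attention, num_source, guard):
    """Return how many leading draft tokens are safe to emit.

    attention[i][j] is the (already aggregated) attention weight from draft
    token i to source position j. Emission stops at the first token whose
    most-attended source position falls within the last `guard` positions,
    because that token may depend on source content that is still arriving.
    """
    threshold = num_source - guard
    emitted = 0
    for row in attention:
        most_attended = max(range(num_source), key=lambda j: row[j])
        if most_attended >= threshold:
            break
        emitted += 1
    return emitted


attention = [
    [0.7, 0.2, 0.1, 0.0],  # attends mostly to source position 0
    [0.1, 0.6, 0.2, 0.1],  # attends mostly to source position 1
    [0.0, 0.1, 0.2, 0.7],  # attends mostly to the newest position: stop here
    [0.9, 0.1, 0.0, 0.0],
]
print(alignatt_emit_count(attention, num_source=4, guard=1))
```

A larger `guard` makes the policy more conservative, trading latency for stability; tokens after the stopping point are discarded and regenerated once more source arrives.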

2606.03965 2026-06-03 cs.CL cs.AI Version update

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

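AI summary: Proposes Agentic Chain-of-Thought Steering (ACTS), which trains a controller agent with reinforcement learning to adaptively choose reasoning strategies and steering phrases during inference, giving budget-aware strategy control that saves substantial tokens while preserving reasoning quality and supporting a controllable accuracy-efficiency trade-off.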

Abstract

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.

2606.03957 2026-06-03 cs.CL cs.AI cs.SD eess.AS Version update

Efficient ASR Training with Conversations that Never Happened

Máté Gedeon, Péter Mihajlik

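Affiliations: Dept. of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics; SpeechTex Ltd.; ELTE Research Centre for Linguistics

AI summary: For lower-resource languages and niche domains, proposes an augmentation pipeline that generates dialogue scenarios with an LLM, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances; experiments show that synthetic conversations improve ASR performance, and on a Hungarian benchmark a model trained on only 67 hours of real conversations plus 636 hours of simulated data outperforms a zero-shot model trained on 2700 hours.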

Abstract

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

2606.03948 2026-06-03 cs.CL Version update

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

Aziz Sharipov Ortega, Dominik Macháček

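Affiliations: Charles University, MFF, ÚFAL

AI summary: Adds simultaneous translation capability to the offline direct speech-to-text translation model Canary by combining it with the state-of-the-art AlignAtt policy; systems for Czech to English and English to German and Italian were submitted to the IWSLT 2026 Simultaneous Speech Translation Shared Task, showing high translation quality, low computational requirements, and multilingual support.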

Comments IWSLT 2026

Abstract

We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to the IWSLT 2026 Simultaneous Speech Translation Shared Task for Czech to English and English to German and Italian. The strengths of our system are: (1) high translation quality, outperforming similarly sized baselines both in low- and high-latency regimes in computationally unaware simulations; (2) low computational requirements, as the model has only 1B parameters; (3) multilinguality -- support for 25 source and 25 target languages.

2606.03928 2026-06-03 cs.LG cs.CL Version update

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia

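Affiliations: University of Southern California; University of Chicago

AI summary: To address the KV cache bottleneck caused by the long outputs of reasoning models, proposes VaSE, a value-aware stochastic eviction method that protects large-magnitude value states and introduces stochasticity; at 4x compression it improves accuracy by more than 4% over the strongest eviction method.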

Comments Codes: https://github.com/terarachang/VaSE

Abstract

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracy than the SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
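The two ingredients named above, protecting large-magnitude value states and randomizing eviction, can be illustrated with a small sketch. This is not the authors' procedure: the importance scores, the `protect_frac` and `temperature` parameters, and the Gumbel perturbation used to randomize the choice are all assumptions made for illustration.

```python
import math
import random


def choose_evictions(importance, value_norms, n_evict,
                     protect_frac=0.1, temperature=1.0, seed=0):
    """Pick cache positions to evict.

    Positions whose value-state norm is in the top `protect_frac` are never
    evicted. Among the rest, low-importance positions are preferred, but the
    ranking is perturbed with Gumbel noise so that decisions vary.
    """
    rng = random.Random(seed)
    n = len(importance)
    n_protect = max(1, int(n * protect_frac))
    by_norm = sorted(range(n), key=lambda i: value_norms[i], reverse=True)
    protected = set(by_norm[:n_protect])
    candidates = [i for i in range(n) if i not in protected]

    def noisy_key(i):
        u = max(rng.random(), 1e-12)          # guard against log(0)
        gumbel = -math.log(-math.log(u))
        return -importance[i] / temperature + gumbel

    ranked = sorted(candidates, key=noisy_key, reverse=True)
    return sorted(ranked[:min(n_evict, len(candidates))])


importance = [0.1] * 10
value_norms = [1.0] * 9 + [100.0]   # the last position has an outlier value state
print(choose_evictions(importance, value_norms, n_evict=5))
```

Whatever the random draw, position 9 is never returned, because the outlier check runs before any sampling; lowering `temperature` makes the choice approach a deterministic lowest-importance-first eviction.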

2606.03924 2026-06-03 cs.CL Version update

Knowledge Editing in Masked Diffusion Language Models

Haewon Park, Yohan Jo

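Affiliations: Graduate School of Data Science, Seoul National University

AI summary: Transfers locate-then-edit methods from autoregressive models to masked diffusion models, finding that the edit location transfers but multi-token edit performance degrades, and proposes a simple correction that optimizes the edit for intermediate states.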

Abstract

Knowledge editing aims to update or correct factual knowledge in a language model. A widely used approach, locate-then-edit, does this in two steps: it first localizes a fact within the model, then edits the weights there. To date, such methods have been developed exclusively on autoregressive models (ARMs). Whether their underlying assumptions hold for masked diffusion models (MDMs), which model text bidirectionally and generate by iterative denoising rather than next-token prediction, remains an open question. We address it by transferring locate-then-edit to MDMs and comparing two MDMs (LLaDA, Dream) with two ARMs (LLaMA, Qwen) at matched scale. Our central finding has two parts. First, where an edit is applied transfers across paradigms: causal tracing highlights the same early-to-mid-layer MLP at the last subject token in both, and editing is most effective there. Second, this shared location does not guarantee a shared outcome. Single-token edits succeed in both, but as targets grow longer, editing degrades systematically in the MDMs but not the ARMs. The failure stems from how the edited fact is generated: producing a multi-token target requires passing through partially unmasked intermediate states for which the edit was never optimized. Guided by this diagnosis, we introduce a simple correction that optimizes the edit for these states, substantially restoring multi-token performance.

2606.03871 2026-06-03 cs.CV cs.CL cs.LG Version update

Visual Instruction Tuning Aligns Modalities through Abstraction

Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga

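Affiliations: Area Science Park, Trieste, Italy

AI summary: Using probing analyses and causal interventions, finds that visual instruction tuning embeds visual features directly into the intermediate semantic layers of the LLM, bypassing the early unimodal processing layers, and aligns visual and textual representations by extending and strengthening the existing abstraction phase.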

Abstract

Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate layers are the semantic core of vision-language processing and play a critical role in the performance on a broad set of multimodal benchmarks. In addition, by comparing the geometry of semantically equivalent visual and textual representations, we find that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Finally, we confirm the functional role of this localized alignment by restricting fine-tuning to intermediate layers alone: this strategy preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time. Our results suggest that multimodal integration is a localized phenomenon driven by the repurposing of the internal abstraction engine of the LLM.

2606.03867 2026-06-03 cs.CL cs.AI Version update

A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

Cuong Vuong Tuan, Trang Mai Xuan, Tien-Cuong Nguyen, Vu-Duc Ngo, Thien Van Luong

Affiliations: Faculty of Artificial Intelligence and Data Science, Phenikaa University; VNPT AI, VNPT Group; MobiFone Research and Development Center, MobiFone Corporation; Business AI Lab, Faculty of Data Science and Artificial Intelligence, National Economics University, College of Technology

AI summary: Proposes a training-free mixture-of-agents framework that combines large language models and knowledge graphs, decomposing summarization into specialized agents (extraction, knowledge-aware abstraction, iterative refinement) unified by a multi-perspective consistency mechanism; it achieves state-of-the-art or competitive performance on English and Vietnamese datasets.

Comments Accepted by Neural Computing and Applications

Abstract

Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.

2606.03866 2026-06-03 cs.IR cs.AI cs.CL Version update

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu, Peng Jiang, Kun Gai

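Affiliations: Kuaishou Technology; Unaffiliated

AI summary: Proposes Taiji, a framework that generates high-quality CoT data through reverse-engineered reasoning and open-ended rejection sampling and uses Pareto Optimal Policy Optimization (POPO) to adaptively adjust cross-domain reward weights, achieving a Pareto-optimal trade-off between LLM semantic knowledge and recommender ID features; deployed on Kuaishou's advertising platform, it serves over 400 million users daily.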

Comments 8 pages, 2 figures

Abstract

Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's semantic space with the recommender's ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we utilize reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. Theoretically, it achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. Deployed on Kuaishou's advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating its robust scalability in web-scale environments.

2606.03846 2026-06-03 cs.CL cs.AI cs.LG Version update

Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models

Qi Cao, Takeshi Kojima, Andrew Gambardella, Helinyi Peng, Yutaka Matsuo, Yusuke Iwasawa

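Affiliations: The University of Tokyo

AI summary: Proposes a simple self-assessment method for uncertainty quantification in large language models based on semantic clustering and multiple-choice probabilities; it outperforms baseline methods across multiple models and datasets.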

Comments Findings of ACL 2026

Abstract

Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.
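The pipeline described in the abstract (cluster the samples, turn clusters into answer options, read off option probabilities) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `same_meaning` stands in for whatever semantic-equivalence check is used, and the option logits are assumed to be supplied by the caller from an LLM that scored each option letter.

```python
import math

def cluster_answers(samples, same_meaning):
    """Greedy clustering: add each sample to the first cluster whose
    representative (first member) is judged semantically equivalent."""
    clusters = []
    for s in samples:
        for c in clusters:
            if same_meaning(c[0], s):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def build_question(question, clusters):
    """Turn cluster representatives into lettered multiple-choice options."""
    letters = "ABCDEFGHIJ"
    options = [(letters[i], c[0]) for i, c in enumerate(clusters)]
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in options]
    lines.append("Answer:")
    return "\n".join(lines), options

def option_confidences(option_logits):
    """Softmax over the logits assigned to each option letter (hypothetical input)."""
    m = max(option_logits.values())
    exps = {k: math.exp(v - m) for k, v in option_logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}
```

In practice the equivalence check and the way option probabilities are extracted would follow the paper; both are placeholders here.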

2606.03825 2026-06-03 cs.LG cs.CL 版本更新

Dynamic Short Convolutions Improve Transformers

动态短卷积改进Transformer

Oliver Sieberling, Bharat Runwal, Rameswar Panda, Yoon Kim

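Affiliations * Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

AI summary Proposes dynamic short convolutions as a new neural network primitive that augments Transformers with input-dependent filters, consistently improving language modeling over standard Transformers and static short-convolution variants and yielding a compute advantage.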

详情
AI中文摘要

Transformer已成为大型语言模型的主导架构,主要得益于注意力、前馈层、残差连接和归一化的可扩展性和灵活性。本文引入动态短卷积作为改进Transformer的额外神经网络原语。与静态短卷积不同,动态卷积使用输入依赖的滤波器,在保持卷积局部性偏差的同时增加表达能力。动机实验表明,在具有挑战性的关联回忆任务中,对键、查询和值表示应用动态短卷积相比静态卷积变体提升了性能。在从150M到2B参数的语言建模实验中,动态卷积持续优于标准Transformer和用静态短卷积增强的Transformer。拟合缩放定律表明,当动态卷积应用于键、查询和值向量时,相对于计算匹配的Transformer具有1.33倍的计算优势,而在每个线性层后添加动态卷积时优势达到1.60倍。动态卷积还在线性RNN(Mamba-2/Gated DeltaNet)和混合专家架构上带来了改进。我们通过自定义Triton内核使这些增益变得实用,实现了高效的训练和可管理的端到端减速。这些结果表明,动态短卷积是一种可扩展、硬件高效且富有表现力的原语,可用于推进基于Transformer的语言模型。

英文摘要

Transformers have become the dominant architecture for large language models, largely due to the scalability and flexibility of attention, feed-forward layers, residual connections, and normalization. This paper introduces dynamic short convolutions as an additional neural network primitive for improving Transformers. Unlike static short convolutions, dynamic convolutions use input-dependent filters, which preserves the locality bias of convolution while increasing expressivity. Motivating experiments show that applying dynamic short convolutions to key, query, and value representations improves performance on challenging associative recall tasks compared with static convolutional variants. Across language-modeling experiments ranging from 150M to 2B parameters, dynamic convolutions consistently outperform standard Transformers and Transformers augmented with static short convolutions. Fitting scaling laws indicates a 1.33$\times$ compute advantage over compute-matched Transformers when dynamic convolutions are applied to the key, query, and value vectors, and a 1.60$\times$ advantage when adding dynamic convolutions after every linear layer. Dynamic convolutions also offer improvements on linear RNNs (Mamba-2/Gated DeltaNet) and mixture-of-experts architectures. We make these gains practical with custom Triton kernels that enable efficient training with a manageable end-to-end slowdown. These results suggest that dynamic short convolutions are a scalable, hardware-efficient, and expressive primitive for advancing Transformer-based language models.
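To make the static/dynamic distinction concrete, here is a small NumPy sketch of a causal short convolution along the sequence axis. It assumes a filter of length K shared across channels and a linear filter generator `W_f`; both are illustrative assumptions, and the paper's actual parameterization and Triton kernels are not reproduced.

```python
import numpy as np

def static_short_conv(x, w):
    """Causal short convolution with one fixed filter w of length K.
    x: (T, D) float array; y[t] = sum_k w[k] * x[t - k], zero-padded on the left."""
    T, D = x.shape
    K = w.shape[0]
    padded = np.vstack([np.zeros((K - 1, D)), x])
    return sum(w[k] * padded[K - 1 - k : K - 1 - k + T] for k in range(K))

def dynamic_short_conv(x, W_f):
    """Same convolution, but the filter at each position is computed from the input.
    W_f: (D, K) hypothetical projection producing one length-K filter per token."""
    T, D = x.shape
    K = W_f.shape[1]
    filters = x @ W_f  # (T, K): input-dependent filter for every position
    padded = np.vstack([np.zeros((K - 1, D)), x])
    y = np.zeros_like(x)
    for k in range(K):
        # filters[t, k] weights the token k steps behind position t
        y += filters[:, k : k + 1] * padded[K - 1 - k : K - 1 - k + T]
    return y
```

Both variants only look at the current token and the K - 1 tokens before it, which is the locality bias the abstract refers to; only the source of the filter weights differs.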

2606.03817 2026-06-03 cs.CL 版本更新

Rethinking the Idiomaticity Decomposability Hypothesis: Evidence from Distributional Learning

重新思考习语可分解性假说:来自分布学习的证据

Maggie Mi, Golzar Atefi, Atsuki Yamaguchi, Felix Gers, Aline Villavicencio, Nafise Sadat Moosavi

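Affiliations * University of Sheffield; Berliner Hochschule für Technik (BHT); University of Exeter; Federal University of Rio Grande do Norte, Brazil

AI summary Uses contextualised language models as controlled distributional learners to study how idiom decomposability (the extent to which constituent meanings contribute to the figurative whole) relates to syntactic flexibility. Model-derived decomposability correlates weakly with human judgments and shows a small but consistent negative relationship with syntactic flexibility; stabilisation of idiom representations depends jointly on surprisal, decomposability, and frequency, with decomposability showing the strongest training-dependent effect.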

Comments ACL 2026 Main - long paper (9 pages + Appendices)

详情
AI中文摘要

习语可以根据其可分解性进行分析,即可分解性是指构成意义对整体比喻意义的贡献程度。可分解性被认为可以预测句法灵活性。基于用法的解释则将习语行为归因于分布经验,如说话者熟悉度和可预测性。我们使用上下文语言模型作为受控分布学习者来检验这些观点。我们提出了一种模型内部的可分解性度量,并将其与人类评分、句法灵活性和可预测性相关联,同时跟踪预训练期间习语的学习过程。模型导出的可分解性与人类判断弱相关,并且与句法灵活性呈微小但一致的负相关。预训练分析表明,模型中习语表征的稳定化不能仅由频率解释。相反,惊讶度、可分解性和频率都有贡献,其中可分解性显示出最强的训练依赖效应。

英文摘要

Idioms can be analysed in terms of their decomposability, the extent to which constituent meanings contribute to the figurative whole. Decomposability is thought to predict syntactic flexibility. Usage-based accounts instead attribute idiom behaviour to distributional experience, such as speaker familiarity and predictability. We examine these views using contextualised language models as controlled distributional learners. We propose a model-internal measure of decomposability and relate it to human ratings, syntactic flexibility, and predictability while tracking idiom learning during pretraining. Model-derived decomposability correlates weakly with human judgments and shows a small but consistent negative relationship with syntactic flexibility. Pretraining analyses show that stabilisation of idiom representations in models is not explained by frequency alone. Instead, surprisal, decomposability, and frequency all contribute, with decomposability showing the strongest training-dependent effect.

2606.03793 2026-06-03 cs.CL cs.CV 版本更新

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

探索多语言多模态大语言模型的对抗鲁棒性与安全对齐

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

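Affiliations * Mohamed Bin Zayed University of AI, UAE; Khalifa University, UAE; Australian National University, Australia

AI summary Using gradient-based attacks and cross-lingual evaluation, finds a transferable adversarial vulnerability in multilingual multimodal LLMs and shows that low-resource languages can appear safe only because comprehension fails; argues that genuine multilingual safety alignment requires integrating multilingual capability throughout training.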

详情
AI中文摘要

多模态大语言模型将视觉感知整合到语言推理中,引入了一个连续的攻击面,容易受到对抗攻击。先前关于MLLM鲁棒性的工作主要关注以英语为中心的任务,多语言行为尚未被探索。我们通过对12种不同语言的对抗鲁棒性和多模态安全性进行系统研究来填补这一空白,评估通过指令调优获得多语言能力的开源MLLM。基于梯度的攻击揭示了一种可迁移的多语言脆弱性:在一种语言中优化的对抗图像会继续在其他语言中引发失败,表现出强大的跨语言可迁移性。多语言安全性进一步取决于模型检索或解释有害指令的有效性。当有害意图通过文本发出时,语言基础更强的语言更常引发允许滥用的响应,而较弱的语言产生较少的不安全输出。当嵌入图像作为排版内容时,英文脚本被可靠地识别和遵循,而非英文脚本很少被视觉编码器解析。因此,低资源语言可能看起来更安全,但这是理解和视觉基础失败的人为产物,而非真正的对齐,我们将这种现象称为“失败导致的安全”。相比之下,在整个训练阶段(而不仅仅在指令调优阶段)构建多语言能力的MLLM,如Qwen3-VL,表现出真正的跨语言安全性,跨语言保持主动拒绝,而不是掩盖理解失败。浅层多语言适应(例如在翻译的指令数据上进行微调)可能产生表面理解,在低资源语言中造成虚幻的安全感;跨训练阶段的更深层整合才能实现真正的多语言安全对齐。

英文摘要

Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric tasks, leaving multilingual behaviour unexplored. We address this gap through a systematic study of adversarial robustness and multimodal safety across 12 diverse languages, evaluating open-source MLLMs that acquire multilingual capability through instruction tuning. Gradient-based attacks reveal a transferable multilingual vulnerability: adversarial images optimized in one language continue to induce failure in others, demonstrating strong cross-lingual transferability. Multilingual safety further varies with how effectively a model retrieves or interprets harmful instructions. When harmful intent is issued through text, languages with stronger linguistic grounding more often elicit misuse-enabling responses, while weaker languages produce fewer unsafe outputs. When embedded in the image as typographic content, English scripts are reliably recognised and followed, whereas non-English scripts are rarely parsed by the vision encoder. Lower-resource languages may therefore appear safer, but this is an artefact of comprehension and visual-grounding failures rather than genuine alignment, a phenomenon we term safety-by-failure. In contrast, MLLMs that build multilingual capability throughout their training stages rather than only at instruction tuning, such as Qwen3-VL, exhibit genuine cross-lingual safety, maintaining active refusal across languages rather than masking comprehension failure. Shallow multilingual adaptation, such as fine-tuning on translated instruction data, may produce surface-level understanding that creates illusory safety in low-resource languages; deeper integration across training stages leads to genuine multilingual safety alignment.

2606.03782 2026-06-03 cs.CL 版本更新

Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?

基于语法的推理:合成语言推理轨迹能否增强低资源机器翻译?

Renhao Pei, Yihong Liu, Sampo Pyysalo, Hinrich Schütze, Shaoxiong Ji

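Affiliations * ELLIS Institute Finland; University of Turku; Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML)

AI summary Proposes a method for automatically generating linguistic reasoning traces and evaluates their effect on low-resource machine translation under in-context learning, supervised fine-tuning, and reinforcement fine-tuning; the traces help substantially as inference-time guidance but yield limited gains as training data.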

详情
AI中文摘要

大型语言模型(LLMs)通过上下文学习整合语言资源,为极低资源语言的机器翻译(MT)提供了一种有前景的方法。然而,LLMs在翻译过程中往往难以有效应用语法信息。受思维链推理最新进展的启发,我们研究了低资源MT是否能从结构化的中间语言分析和语法推理步骤中受益。我们提出了一种从通用依存树库、词典和语法规则库自动生成逐步语言推理轨迹的流程。我们以锡伯语和Chintang语为测试案例,在三种设置下评估这些轨迹:上下文学习(ICL)、监督微调(SFT)和强化微调(RFT)。我们的结果表明,语言推理轨迹作为推理时指导最为有效:在ICL中,可靠的句子特定轨迹在大多数模型、语言和指标上显著提升了翻译性能。相比之下,将语言推理轨迹作为训练数据使用带来的收益较小且不一致,因为模型学习了轨迹格式但往往生成错误内容。这些发现表明,当提供可靠的语言分析时,LLMs能够利用语法信息进行低资源MT,而学习生成此类分析仍然是主要瓶颈。

英文摘要

Large language models (LLMs) offer a promising approach to machine translation (MT) for extremely low-resource languages by incorporating linguistic resources through in-context learning. However, LLMs often struggle to apply grammatical information effectively during translation. Inspired by recent progress in chain-of-thought reasoning, we investigate whether low-resource MT can benefit from structured intermediate steps of linguistic analysis and grammatical reasoning. We propose a pipeline for automatically generating step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks. We evaluate these traces in three settings: in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT), on Xibe and Chintang as test cases. Our results show that linguistic reasoning traces are most effective as inference-time guidance: in ICL, reliable sentence-specific traces substantially improve translation performance across most models, languages, and metrics. In contrast, using the linguistic reasoning traces as training data yields smaller and less consistent gains, as models learn the trace format but often generate erroneous content. These findings suggest that LLMs can leverage grammatical information for low-resource MT when given reliable linguistic analyses, while learning to generate such analyses remains a major bottleneck.

2606.03780 2026-06-03 cs.CL cs.LG 版本更新

Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models

专家感知的稀疏MoE语言模型中事实回忆的因果追踪

Yuetian Lu, Ali Modarressi, Yihong Liu, Hinrich Schütze

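Affiliations * Center for Information and Language Processing (CIS); Ubiquitous Knowledge Processing Lab (UKP); Munich Center for Machine Learning (MCML)

AI summary Proposes expert-aware causal tracing for sparse mixture-of-experts language models, intervening on expert-level updates to locate the experts that matter for factual recall, and finds that expert-level localization is model- and protocol-dependent.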

Comments Preprint

详情
AI中文摘要

事实回忆的因果追踪主要在密集Transformer语言模型中进行研究,其中干预将信息流定位到层或前馈模块。稀疏混合专家(MoE)语言模型引入了一个更尖锐的问题:当事实预测由路由的MoE块中介时,哪些路由的专家贡献起作用?我们为稀疏MoE语言模型制定了专家感知的因果追踪。使用CounterFact事实,我们首先通过向主题词嵌入添加噪声来破坏模型的事实偏好,然后测试干净的MoE块输出或干净的专家级更新是否恢复了真实与虚假logit对比。对于Qwen3-30B-A3B-Base,层扫描选择并验证了第44层,专家级追踪识别出L44E069作为在干净运行中反复选择的专家,其保留的补丁优于其他活跃的同一层专家补丁。对于Mixtral-8x7B-v0.1,层级追踪验证了中层信号,但该信号并未定位到选定的单个专家;相反,联盟检查通过路由的多专家更新恢复了它。这些结果表明,MoE事实追踪可以做到专家感知,同时也表明专家级定位是依赖于模型和协议的,而非普遍适用。

英文摘要

Causal tracing of factual recall has been studied predominantly in dense transformer language models, where interventions localize information flow to layers or feed-forward modules. Sparse mixture-of-experts (MoE) language models introduce a sharper question: when a factual prediction is mediated by a routed MoE block, which routed expert contributions matter? We formulate expert-aware causal tracing for sparse MoE language models. Using CounterFact facts, we first corrupt the model's factual preference by adding noise to subject-token embeddings, and then test whether clean MoE-block outputs or clean expert-level updates restore the true-vs-foil logit contrast. For Qwen3-30B-A3B-Base, a layer sweep selects and validates layer 44, and expert-level tracing identifies L44E069 as an expert repeatedly selected in the clean run whose held-out patch outperforms other active same-layer expert patches. For Mixtral-8x7B-v0.1, layer-level tracing validates a mid-layer signal, but the signal is not localized to the selected singleton expert; a coalition check instead recovers it with routed multi-expert updates. These results suggest that MoE factual tracing can be made expert-aware, while also showing that expert-level localization is model- and protocol-dependent rather than universal.
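One common way to summarize a patching experiment of this kind is the fraction of the clean-vs-corrupted logit gap that a patch restores. The sketch below uses that normalization as an assumption; the abstract does not state the exact metric, and the expert names and numbers in any usage are hypothetical.

```python
def logit_contrast(logits, true_id, foil_id):
    """True-vs-foil contrast: logit of the correct token minus logit of the foil."""
    return logits[true_id] - logits[foil_id]

def recovery(clean, corrupted, patched):
    """Fraction of the clean-vs-corrupted contrast gap restored by a patch.
    1.0 means full restoration, 0.0 means no effect (assumed normalization)."""
    gap = clean - corrupted
    if gap == 0:
        return 0.0
    return (patched - corrupted) / gap

def rank_experts(clean, corrupted, patched_by_expert):
    """Sort experts by how much of the contrast their clean update restores."""
    scored = [(name, recovery(clean, corrupted, p)) for name, p in patched_by_expert.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Under this reading, a singleton-expert result corresponds to one expert dominating the ranking, while a coalition result corresponds to no single expert restoring much of the gap on its own.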

2606.03773 2026-06-03 cs.CL 版本更新

KletterMix: Climbing Toward High-Quality German Pretraining Data

KletterMix: 攀登高质量德语预训练数据

Maurice Kraus, Ruben Härle, Sebastian Sztwiertnia, Abbas Goher Khan, Mehdi Ali, Michael Fromm, Kristian Kersting

Affiliations * AI & ML Group, TU Darmstadt; Lab1141 Lamarr Institute; Fraunhofer IAIS; hessian.AI; German Research Center for AI (DFKI); Centre for Cognitive Science, TU Darmstadt

AI summary Builds KletterMix, a German pretraining corpus, by translating a high-quality English corpus, and validates its effectiveness on German downstream tasks.

详情
AI中文摘要

高质量的预训练数据是现代语言模型的核心要素,但德语资源的开发远不如英语资源:它们通常规模更小、筛选不仔细、文档不完善,并且很少通过受控训练实验进行验证。我们引入了KletterMix,一个用于语言模型预训练和退火的高质量德语语料库,旨在为自然语言处理和建模社区提供可重复使用的数据集产品。KletterMix通过将最先进的英语预训练语料库翻译成德语构建,同时保留文档边界、元数据、源结构和主题多样性。这种构建方式产生了一个具有现代预训练数据集规模和多样性的德语语料库,同时允许与其英语源进行直接比较。我们通过广泛的语料库级别分析来记录数据集,包括翻译质量、文档长度分布、主题覆盖、源组成和地理元数据。使用COMETKiwi,我们表明翻译后的文档在不同领域实现了高质量,表明仔细的翻译可以保留原始语料库的许多语义和风格丰富性。除了数据集构建,我们还评估了KletterMix作为训练数据。通过针对现有德语语料库的受控预训练和退火消融实验,我们表明在KletterMix上训练的模型在德语下游评估中取得了可衡量的改进。这些结果表明,精心策划的翻译数据可以显著增强德语预训练数据生态系统。

英文摘要

High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source. We document the dataset through a broad set of corpus-level analyses, including translation quality, document length distributions, topic coverage, source composition, and geographic metadata. Using COMETKiwi, we show that the translated documents achieve strong quality across diverse domains, suggesting that careful translation can preserve much of the semantic and stylistic richness of the original corpus. Beyond dataset construction, we evaluate KletterMix as training data. Through controlled pretraining and annealing ablations against established German corpora, we show that models trained on KletterMix achieve measurable improvements on German-language downstream evaluations. These results demonstrate that carefully curated translated data can substantially strengthen the German pretraining data ecosystem.

2606.03768 2026-06-03 cs.CL 版本更新

HybridThinker: Efficient Chain-of-Thought Reasoning via Compressed Memory and Transient Thought Steps

HybridThinker: 通过压缩记忆和瞬态思考步骤实现高效的思维链推理

Xin Liu, Runsong Zhao, Xinyu Liu, Junhao Ruan, Pengcheng Huang, Shichao Dong, Chunyang Xiao, Chenglong Wang, Changliang Li, Jingbo Zhu, Tong Xiao

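Affiliations * School of Computer Science and Engineering, Northeastern University; China Unicom Cloud-Link; NiuTrans Research

AI summary Proposes HybridThinker, which keeps compressed memory while temporarily retaining thought steps and uses a hybrid training scheme, substantially compressing chain-of-thought length while preserving reasoning accuracy.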

Comments 23 pages, 9 figures

详情
AI中文摘要

扩展的思维链(CoT)轨迹提升了LLM的推理能力,但带来了大量的计算和内存成本。现有的CoT压缩方法通过记忆令牌将思考步骤压缩为紧凑表示,并在推理时仅保留这些表示,但细粒度信息的丢失使得后续步骤更容易出错。为了缓解这一问题,我们提出了HybridThinker,除了保留这些表示外,思考步骤也会被临时保留以提供细粒度细节。然而,我们观察到在训练期间天真地让后续步骤可以访问思考步骤,会使模型绕过记忆令牌直接从这些步骤中检索信息,导致模型通过记忆令牌压缩和检索信息的能力训练不足。因此,我们引入了一种混合训练方案,其中只有部分思考步骤可以通过注意力直接访问后续步骤,而其他思考步骤被屏蔽,迫使模型使用记忆令牌进行压缩和检索。在4个推理基准测试中,HybridThinker与未压缩的基线性能相当,在相似的推理时间下,平均准确率比现有的CoT压缩方法提高了5.8个百分点。消融研究证实,临时思考步骤保留和混合训练方案都对这一提升有贡献。

英文摘要

Extended chain-of-thought (CoT) traces improve LLM reasoning but incur substantial computational and memory costs. Existing CoT compression methods mitigate this by condensing thought steps into compact representations via memory tokens and retaining only these representations at inference time, but the loss of fine-grained information makes subsequent steps more error-prone. To alleviate this, we propose HybridThinker, in which, in addition to preserving these representations, thought steps are temporarily retained to provide fine-grained details. However, we observe that naively keeping thought steps accessible to subsequent steps during training lets the model bypass memory tokens by retrieving information directly from these steps, leaving the model's ability to compress and retrieve information through memory tokens insufficiently trained. We therefore introduce a hybrid training scheme, in which only some thought steps are directly accessible to subsequent steps through attention, while the other thought steps are masked, forcing the model to use memory tokens for compression and retrieval. Across 4 reasoning benchmarks, HybridThinker matches the uncompressed baseline, advancing the state of the art in CoT compression by 5.8 points on average accuracy with similar inference time. Ablation studies confirm that both temporary thought-step retention and the hybrid training scheme contribute to these gains.
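The hybrid training scheme amounts to a particular attention mask. The sketch below builds one under an assumed token layout (each step is its thought tokens followed by a single memory token); the layout, the function name, and the way visible steps are chosen are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def hybrid_mask(step_lengths, visible_steps):
    """Boolean attention mask (query row, key column); True means attention is allowed.
    Assumed layout per step: [thought tokens..., one memory token].
    Later steps always see earlier memory tokens, but see earlier thought
    tokens only for steps listed in visible_steps."""
    step_id, is_memory = [], []
    for s, n in enumerate(step_lengths):
        step_id += [s] * (n + 1)
        is_memory += [False] * n + [True]
    step_id = np.array(step_id)
    is_memory = np.array(is_memory)
    L = len(step_id)
    causal = np.tril(np.ones((L, L), dtype=bool))
    same_step = step_id[:, None] == step_id[None, :]
    key_visible = is_memory | np.isin(step_id, list(visible_steps))
    return causal & (same_step | key_visible[None, :])
```

Passing an empty `visible_steps` forces every later step to rely on memory tokens alone, while listing all steps recovers ordinary causal attention; the hybrid scheme sits between the two.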

2606.03761 2026-06-03 cs.CL 版本更新

Framing Migration News with LLMs: Structured CoT as a Support for Human Interpretation

使用LLM构建移民新闻框架:结构化思维链作为人类解释的支持

David Alonso del Barrio, Jing Wen, Daniel Gatica-Perez

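Affiliations * Idiap Research Institute; University of Geneva; EPFL

AI summary Proposes Structured Chain-of-Thought (SCoT) prompting with a locally deployable open-source LLM (Llama3-8B) to assist frame analysis of migration news; it improves classification performance on a single GPU, and a human evaluation assesses the coherence of its reasoning and its influence on interpretation.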

详情
AI中文摘要

移民新闻的框架分析是一项具有社会重要性的任务:研究移民如何被叙述的媒体学者和研究人员需要不仅准确,而且透明、可审计,并在学术研究小组典型的资源限制下可访问的工具。现有的基于LLM的方法依赖于专有API和大模型,这引发了媒体研究人员对数据隐私、可重复性和公平访问的担忧。本研究探讨了本地可部署的开源LLM如何作为辅助工具支持可解释的框架分析。我们引入了一种使用Llama3-8B的结构化思维链(SCoT)提示方法,实现了基于预定义框架类别的逐步推理。这种结构化设计允许用户审计模型输出,并在本质上主观的任务中检查替代解释。我们在一个与移民相关的新闻数据集上评估了我们的方法,结果表明SCoT在单GPU上可行,同时比零样本和少样本基线提高了分类性能。然后,我们进行了一项以人为中心的评估,注释者评估“模型推理”的一致性和影响力。结果表明,SCoT解释通常被认为是逻辑的(平均得分4.1/5,尽管不同文本间存在显著差异),并且可以引发对初始解释的反思,即使存在分歧。我们的发现强调了LLM辅助框架分析的潜力和风险。虽然结构化推理可以增加模型输出的可追溯性并支持批判性解释,但它也可能以微妙的方式影响人类判断。通过支持本地部署并强调人在环中的交互,这项工作为研究具有社会影响力的媒体叙事的负责任和可访问的计算工具做出了贡献。

英文摘要

Frame analysis of migration news is a socially consequential task: media scholars and researchers who study how migration is narrated need tools that are not only accurate, but transparent, auditable, and accessible within the resource constraints typical of academic research groups. Existing LLM-based approaches rely on proprietary APIs and large models that raise concerns about data privacy, reproducibility and equitable access among media researchers. This work studies how a locally deployable open-source LLM can support interpretable frame analysis as an assistive tool. We introduce a Structured Chain-of-Thought (SCoT) prompting approach using Llama3-8B, enabling step-by-step justifications grounded in predefined framing categories. This structured design allows users to audit model outputs and examine alternative interpretations in a task that is inherently subjective. We evaluate our approach on a dataset of migration-related news and show that SCoT improves classification performance over zero-shot and few-shot baselines while remaining feasible on a single GPU. Then, we conduct a human-centered evaluation in which annotators assess the coherence and influence of "the model's reasoning". Results indicate that SCoT explanations are generally perceived as logical (mean score 4.1/5, though with notable variation across texts) and can prompt reflection on initial interpretations, even when disagreement persists. Our findings highlight both the potential and risks of LLM-assisted frame analysis. While structured reasoning can increase the traceability of model outputs and support critical interpretation, it can also influence human judgment in subtle ways. By enabling local deployment and emphasizing human-in-the-loop interaction, this work contributes to discussions on responsible and accessible computational tools for the study of socially impactful media narratives.

2606.03739 2026-06-03 cs.CL cs.IT math.IT 版本更新

Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

熵门:LLM流水线中用于近无损令牌压缩的熵淬火

Justice Owusu Agyemang, Jerry John Kponyo, Kwame Opuni-Boachie Obour Agyekum, Francisca Adoma Acheampong, Kwame Agyeman-Prempeh Agyekum, James Dzisi Gadze

Affiliations * VIA Cybersecurity Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana; Quantum and Assistive Technologies Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

AI summary Proposes Entropy Gate, a framework that uses entropy quenching (a thermodynamic process) to progressively freeze out low-information tokens, reaching 40-60% compression with semantic fidelity above 0.80, and proves that it approaches the information-theoretic limit.

详情
AI中文摘要

LLM流水线在低信息内容上浪费大量令牌预算:重复上下文、冗长响应和冗余模板。我们引入Entropy Gate,一种令牌压缩框架,应用熵淬火——一种热力学过程,逐步冻结低能量令牌同时保持语义保真度。每个令牌获得一个多因素信息能量$E(t)$,结合统计、结构和位置分量。自适应淬火计划$T(\tau) = T_0 / (1 + \alpha \tau)$移除玻尔兹曼生存概率$p_i = \exp(-E_i / kT)$低于阈值的令牌,并有一个保真度门在能量加权相似度低于$\theta$时停止压缩。我们证明按$E(t)$降序选择令牌最大化预期语义保留,淬火产生嵌套生存集,且可达压缩率接近信息论极限$\text{CR} \to 1 - I(P; T)/H(P)$。第一阶段启发式方法在五种提示类别上实现40-60%压缩,同时保持$S_E > 0.80$,能量平方放大$E \to E^2$额外增加10-25个百分点。上下文去重在重复块上节省50-70%。输出端淬火受简洁性提高准确性的发现启发,进一步减少响应开销。结合外部记忆,压缩在代理工作负载上复合乘数达到88-96%。该框架无状态、模型无关,并作为兼容OpenAI的HTTP代理部署。

英文摘要

LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a thermodynamic process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy $E(t)$ combining statistical, structural, and positional components. An adaptive quenching schedule $T(\tau) = T_0 / (1 + \alpha \tau)$ removes tokens whose Boltzmann survival probability $p_i = \exp(-E_i / kT)$ falls below a threshold, with a fidelity gate halting compression when energy-weighted similarity drops below $\theta$. We prove token selection by descending $E(t)$ maximizes expected semantic preservation, that quenching produces nested survival sets, and that achievable compression approaches the information-theoretic limit $\text{CR} \to 1 - I(P; T)/H(P)$. A Phase 1 heuristic achieves 40-60% compression across five prompt categories while maintaining $S_E > 0.80$, with energy-squared amplification $E \to E^2$ adding 10-25 percentage points. Context deduplication adds 50-70% savings on repeated blocks. Output-side quenching, motivated by findings that brevity improves accuracy, further reduces response overhead. Combined with external memory, reduction composes multiplicatively to 88-96% for agentic workloads. The framework is stateless, model-agnostic, and deploys as an OpenAI-compatible HTTP proxy.
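The schedule and survival rule quoted in the abstract can be written down directly. The sketch below applies the two formulas literally with made-up constants; the abstract does not pin down the sign convention that links information energy to which tokens survive, so which tokens are kept here should be read as an assumption. `combined_reduction` likewise assumes that "composes multiplicatively" means the kept fractions multiply.

```python
import math

def temperature(tau, t0=1.0, alpha=0.5):
    """Quenching schedule T(tau) = T0 / (1 + alpha * tau); constants are illustrative."""
    return t0 / (1.0 + alpha * tau)

def survivors(energies, tau, threshold=0.2, k=1.0, t0=1.0, alpha=0.5):
    """Indices whose survival probability exp(-E_i / (k T)) is at least the threshold.
    Applies the abstract's formula literally, for non-negative energies."""
    T = temperature(tau, t0, alpha)
    return {i for i, e in enumerate(energies) if math.exp(-e / (k * T)) >= threshold}

def combined_reduction(*stages):
    """Overall reduction when each stage removes a fraction of what remains."""
    kept = 1.0
    for r in stages:
        kept *= 1.0 - r
    return 1.0 - kept
```

Because the temperature only decreases as the step index grows, each survival set is contained in the previous one for non-negative energies, which matches the nested-survival-set property claimed in the abstract.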

2606.03728 2026-06-03 cs.CL cs.IR 版本更新

Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

通过归因视角对法律问答中的引用质量进行重排序

Mohamed Hesham Elganayni, Selim Saleh

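Affiliations * Technical University of Munich

AI summary To improve citation quality in retrieval-augmented generation for legal QA, trains a lightweight cross-encoder on perturbation-based attribution scores to re-rank candidate passages, substantially improving citation faithfulness and alignment with expert answers.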

Comments 11 pages, 4 tables, 1 figure. Published at ASAIL 2026 (8th Workshop on Automated Semantic Analysis of Information in Legal Text), co-located with ICAIL 2026, Singapore

详情
AI中文摘要

用于法律问答的检索增强生成系统通常基于语义相似度检索段落,并将其提供给语言模型,然后生成带引用的答案。先前的工作假设高排名的段落最有可能被模型有效引用。基于扰动的归因方法(如C-LIME)仅用于事后解释。然而,在AQuAECHR基准测试中,语义相似度与段落归因并不相关。在检索器的候选池中,基于相似度的排序在呈现黄金引用段落方面表现不如随机选择。为了解决这一局限性,我们训练了一个轻量级交叉编码器,基于连续的扰动归因分数在生成前对段落进行重排序。该方法在AQuAECHR基准测试上使用两个语言模型和五折交叉验证进行评估。重排序器显著提高了引用的忠实度以及与专家黄金答案的对齐程度。值得注意的是,在不同模型上独立训练的两个重排序器收敛程度超过了它们的原始归因一致性。这一发现表明,交叉编码器减少了模型特定的噪声,并产生了一个可部分跨模型传递的共享相关性信号,尽管同模型重排序仍然更有效。这些结果表明,基于扰动的归因为引用感知检索提供了一种实用的、模型无关的训练信号。

英文摘要

Retrieval-augmented generation systems for legal question answering typically retrieve passages based on semantic similarity and provide them to a language model, which then generates cited answers. Prior work assumes that highly ranked passages are most likely to be usefully cited by the model. Perturbation-based attribution methods, such as C-LIME, have been used exclusively for post-hoc explanation. However, on the AQuAECHR benchmark, semantic similarity does not correlate with passage attribution. Within a retriever's candidate pool, similarity-based ranking performs worse than random selection at surfacing gold citation paragraphs. To address this limitation, a lightweight cross-encoder is trained on continuous perturbation-based attribution scores to re-rank passages prior to generation. This approach is evaluated on the AQuAECHR benchmark, using two language models and five-fold cross-validation. The re-ranker substantially improves citation faithfulness and alignment with gold expert answers. Notably, two re-rankers trained independently on different models converge beyond their raw attribution agreement. This finding indicates that the cross-encoder reduces model-specific noise and produces a shared relevance signal that partially transfers across models, although same-model re-ranking remains more effective. These results demonstrate that perturbation-based attribution provides a practical, model-agnostic training signal for citation-aware retrieval.
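As a rough illustration of perturbation-based attribution, the sketch below uses plain leave-one-out scoring, which is a simpler stand-in and not C-LIME or the paper's trained cross-encoder. The `score` callable is hypothetical: it represents any function that rates answer quality given a list of passages.

```python
def leave_one_out_attribution(passages, score):
    """Attribution of each passage = drop in score when that passage is removed.
    A simple stand-in for perturbation-based attribution, not C-LIME itself."""
    base = score(passages)
    return [base - score(passages[:i] + passages[i + 1:]) for i in range(len(passages))]

def rerank(passages, attributions):
    """Order passages by attribution, highest first (ties keep original order)."""
    order = sorted(range(len(passages)), key=lambda i: attributions[i], reverse=True)
    return [passages[i] for i in order]
```

In the setting the abstract describes, scores like these serve as continuous training targets for a re-ranker, so that ranking at inference time does not require repeated perturbation runs.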

2606.03695 2026-06-03 cs.CL 版本更新

Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings

不要忘记你的嵌入:通过精确编辑嵌入实现鲁棒的知识擦除

Clara Haya Suslik, Or Shafran, Mor Geva

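Affiliations * Blavatnik School of Computer Science and AI, Tel Aviv University

AI summary Proposes the EMBER module, which uses sparse matrix factorization to precisely erase concept-related features from token embeddings, making existing knowledge-erasure methods more robust and specific.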

详情
AI中文摘要

随着语言模型在现实应用中的广泛部署,从模型中擦除特定知识的能力对安全性和合规性变得至关重要。主流方法通过更新模型参数实现持久移除,但目标知识往往可以通过对抗性提示或重新学习恢复。在这项工作中,我们假设这种局限性部分源于现有方法忽略了嵌入层。为了解决这个问题,我们引入了 EMBedding ERasure (EMBER),一个即插即用的擦除模块,利用稀疏矩阵分解从词嵌入中精确擦除概念相关特征。通过在 Gemma-2-2B-it 和 Llama-3.1-8B-Instruct 上对不同概念的综合评估,我们发现用 EMBER 增强现有方法可以一致地提高擦除效果和特异性,且连贯性损失最小。此外,它显著提高了对重新学习的鲁棒性,将恢复的准确率降低高达 50%,在 Llama 上限制在 35%,而先前方法为 70%-76%。进一步分析表明,连贯性成本是局部的,仅影响一小部分概念专属词元。我们的工作确立了精确的嵌入层干预对于鲁棒的概念擦除是必要的,并证明现有方法可以从这种增强中受益。

英文摘要

As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge can often be recovered through adversarial prompting or relearning. In this work, we hypothesize that this limitation stems in part from existing methods overlooking the embedding layer. To address this, we introduce EMBedding ERasure (EMBER), a plug-and-play erasure module that leverages Sparse Matrix Factorization for precise erasure of concept-related features from token embeddings. Through comprehensive evaluations across diverse concepts on Gemma-2-2B-it and Llama-3.1-8B-Instruct, we find that augmenting existing methods with EMBER consistently improves erasure efficacy and specificity across task formats, with minimal coherence loss. Moreover, it dramatically improves robustness to relearning, reducing regained accuracy by up to 50%, limiting it to 35% on Llama compared to 70%-76% for prior methods. Further analysis shows that the coherence cost is localized, affecting only a small set of concept-exclusive tokens. Our work establishes that precise embedding-level intervention is necessary for robust concept erasure, and demonstrates that existing methods can benefit from such augmentation.

2606.03693 2026-06-03 cs.CL cs.CV 版本更新

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

语言转换会破坏医学视觉语言模型吗?印度尼西亚放射学视觉问答案例研究

Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira

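Affiliations * Intelligent System Laboratory, Faculty of Computer Science, Brawijaya University

AI summary Builds IndoRad-VQA, an Indonesian radiology VQA dataset, to evaluate the robustness of medical vision-language models under non-English clinical language; finds an 8-25% performance gap between English and Indonesian settings, underscoring the need for more inclusive multilingual evaluation.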

Comments accepted to MMFM-BIOMED Workshop @ CVPR 2026

详情
AI中文摘要

医学视觉语言模型(VLM)通常在英语放射学视觉问答基准上进行评估,其在非英语临床语言下的鲁棒性很大程度上未被探索。我们引入了IndoRad-VQA,这是VQA-RAD的印尼语改编版,以评估当问题以印尼语提出时,医学VLM是否保留放射学推理能力。放射学问答对被翻译成印尼语,并通过基于自我评估的质量控制来保持临床意义、术语一致性和答案等价性。我们在英语和印尼语提示设置下评估了通用、东南亚多语言和医学专用VLM。除了准确性,我们量化了英语和印尼语输入之间的语言鲁棒性差距。我们还进行了错误分析,以识别问答的失败模式,例如是/否翻转、侧向性错误和输出语言不匹配。我们的发现表明,在英语医学VQA基准上的强性能并不一定转化为印尼语临床环境中的鲁棒行为。我们观察到英语和印尼语设置之间的性能差距为8%到25%,具体取决于评估指标。这些结果突显了对医学多模态基础模型进行更包容的多语言评估的必要性。数据集可在以下网址获取:https://huggingface.co/datasets/Lab-IS/IndoRad-VQA。

英文摘要

Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.
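A language robustness gap of the kind described can be computed as a simple difference in accuracy between the two prompting settings. The sketch below assumes answers have already been normalized to a shared label set (for example "yes"/"no" regardless of prompt language) and treats the gap as an absolute difference; the paper's exact metric definitions may differ.

```python
def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def robustness_gap(en_preds, id_preds, golds):
    """Accuracy under English prompts minus accuracy under Indonesian prompts."""
    return accuracy(en_preds, golds) - accuracy(id_preds, golds)

def yes_no_flips(en_preds, id_preds):
    """Count items where the answer flips between 'yes' and 'no' across languages."""
    return sum({a, b} == {"yes", "no"} for a, b in zip(en_preds, id_preds))
```

Counting flips separately from the accuracy gap helps distinguish systematic answer reversals from other failure modes such as laterality errors.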

2606.03692 2026-06-03 cs.AI cs.CL 版本更新

SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

SkillPyramid:一种用于自我进化智能体的层次化技能整合框架

Yuan Xiong, Ziqi Miao, Qian Chen, Lijun Li, Yequan Wang, Shizhu He, Jun Zhao, Kang Liu

Affiliations * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Beijing Academy of Artificial Intelligence, Beijing, China

AI summary To address agents' lack of systematic skill construction, accumulation, and transfer, proposes SkillPyramid, a hierarchical skill consolidation framework whose self-evolution mechanism composes, validates, and incorporates new skills during task execution; across three benchmarks it raises average reward by 38.0% and cuts execution steps by 27.7%.

详情
AI中文摘要

最近的AI智能体可以灵活调用技能来解决复杂任务,但其长期改进从根本上受到缺乏系统性技能构建、积累和迁移的限制。特别是,没有统一的技能整合框架,智能体倾向于在不同任务中冗余构建相似能力,无法有效将经验转化为可复用资产,并且难以将任务特定技能泛化到新场景。为了解决这一限制,我们提出了SkillPyramid,一个技能整合框架,它重用现有技能经验以实现更广泛的任务泛化。在层次化技能拓扑上运行,SkillPyramid进一步引入了一种自进化机制,使智能体能够在任务执行过程中组合、验证和吸收新技能。在ALFWorld、WebShop和ScienceWorld上使用四个骨干模型的实验表明,SkillPyramid将平均奖励提高了38.0%,并将执行步骤减少了27.7%。总体而言,我们的方法将技能集合从静态资源池转变为动态进化系统。

英文摘要

Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self-evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.

2606.03648 2026-06-03 cs.CL cs.AI 版本更新

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

微调大语言模型的安全性测量应基于能力

Krishnapriya Vishnubhotla, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko

发表机构 * National Research Council, Canada(加拿大国家研究理事会)

AI总结 通过将微调锚定于特定能力目标,多维度评估微调对模型能力和安全性的影响,发现微调模型对安全提示可能产生不连贯输出、自动安全判断不可靠,且结论因安全基准和评估者而异。

Comments 8 pages plus appendices

详情
AI中文摘要

通过微调将基础大语言模型适应用户的任务或偏好风格可能会损害模型的安全性。先前的研究在有限且看似随机的实验设置中考察了微调对模型安全性的影响。我们认为,将微调锚定于特定的能力目标对于避免任意的经验选择至关重要,这使我们能够得出关于安全性影响的有意义结论,并在一致的基础上比较缓解方法。我们通过关注能力和安全性,对微调对模型行为的影响进行了多维度评估。我们的结果揭示了重要问题:(1) 微调模型可能对安全提示产生不连贯的生成内容,(2) 对于这种不连贯输出,自动安全判断不可靠,(3) 关于微调影响的结论可能因安全基准以及安全评估者的选择而改变。

英文摘要

Adapting foundation large language models to a user's task or preferred style through fine-tuning can compromise the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface three important issues: (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.

2606.03628 2026-06-03 cs.CL cs.AI cs.LG 版本更新

Building Reliable Long-Form Generation via Hallucination Rejection Sampling

通过幻觉拒绝采样构建可靠的长文本生成

Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) DeepMind(深度思维)

AI总结 提出分段幻觉拒绝采样框架SHARS,利用任意幻觉检测器在生成过程中拒绝并重采样幻觉片段,以缓解长文本生成中的幻觉累积问题,提升事实一致性。

Comments accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在开放式文本生成方面取得了显著进展,但仍容易产生不正确或无依据的幻觉内容,这损害了其可靠性。在长文本生成中,由于幻觉雪崩现象(早期错误传播并累积到后续输出),这一问题更加严重。为了解决这一挑战,我们提出了一种新颖的推理时幻觉缓解框架,称为分段幻觉拒绝采样(SHARS),该框架使用任意幻觉检测器在生成过程中识别并拒绝幻觉片段,并重新采样直到生成忠实的内容。通过仅保留可信信息并在此基础上构建后续生成,该框架减轻了幻觉累积并增强了事实一致性。为了实例化该框架,我们采用语义不确定性作为检测器,并引入了若干关键修改以解决其局限性并更好地适应长文本。我们的方法使模型能够自我纠正幻觉,无需外部资源(如网络搜索或知识库),同时保持与这些资源的兼容性以便未来扩展。在标准化幻觉基准上的实证评估表明,我们的方法显著减少了长文本生成中的幻觉,同时保持甚至提高了生成的信息量。代码可在以下网址获取:https://github.com/TreeLLi/hallucination-rejection-sampling。

英文摘要

Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination-rejection-sampling.
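The segment-wise loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `propose`, `is_hallucinated`, `max_retries`, and the choice to stop early when every retry is rejected are all assumptions introduced here, and the toy generator and detector stand in for an LLM and a semantic-uncertainty detector.

```python
import random
from typing import Callable, List


def segmentwise_rejection_sampling(
    propose: Callable[[str], str],
    is_hallucinated: Callable[[str, str], bool],
    prompt: str,
    num_segments: int,
    max_retries: int = 5,
) -> List[str]:
    """Build a long answer segment by segment, resampling rejected segments."""
    accepted: List[str] = []
    for _segment_index in range(num_segments):
        # Later segments are conditioned only on segments that passed the detector.
        context = " ".join([prompt] + accepted)
        chosen = None
        for _attempt in range(max_retries):
            candidate = propose(context)
            if not is_hallucinated(context, candidate):
                chosen = candidate
                break
        if chosen is None:
            # Assumption: stop early rather than keep an unverified segment.
            break
        accepted.append(chosen)
    return accepted


# Toy stand-ins (hypothetical): a random generator and a marker-based detector.
_rng = random.Random(0)
_CANDIDATES = [
    "Paris is the capital of France.",
    "[unsupported] Paris has ninety million residents.",
]


def toy_propose(context: str) -> str:
    return _rng.choice(_CANDIDATES)


def toy_detector(context: str, segment: str) -> bool:
    return segment.startswith("[unsupported]")


result = segmentwise_rejection_sampling(
    toy_propose, toy_detector, "Tell me about Paris.", num_segments=3
)
```

Because the detector is passed in as a plain callable, any black-box detector can be swapped in, which matches the "arbitrary hallucination detector" framing of the abstract.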

2606.03624 2026-06-03 cs.AI cs.CL 版本更新

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

桥接辅助约束以解决大型推理模型中的指令遵循问题

Zhengyi Zhao, Shubo Zhang, Huimin Wang, Zezhong Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Yulan He, Kam-Fai Wong, Xian Wu

发表机构 * The Chinese University of Hong Kong(香港中文大学) University of International Relations(国际关系学院) Tencent Jarvis Lab(腾讯Jarvis实验室) Westlake University(西湖大学) King’s College London(伦敦国王学院)

AI总结 针对大型推理模型难以可靠遵循多重约束的问题,提出约束关系图补全框架,通过显式建模约束关系并发现桥接约束,将约束违反率降低39%。

Comments a pre-MIT Press publication version

详情
AI中文摘要

大型推理模型(LRMs)在许多任务中展现出令人印象深刻的能力,但在可靠地遵循多个指令方面存在困难,要么无法满足单个约束,要么难以同时平衡相互竞争的约束。我们将这一挑战形式化为约束遵循问题(CAP)。本文引入了一个新颖的框架,通过将指令表示为约束的结构化知识图来解决CAP。我们的方法,约束关系图补全(CRGC),显式建模约束之间的关系,识别遵循挑战,并发现“桥接约束”,帮助模型更好地聚焦和协调需求。桥接约束作为辅助指令,使主要约束更加突出和兼容。与通过通用训练方法增强指令遵循的现有方法不同,CRGC通过利用模型自身的知识来创建更好的生成路径,从而专门提高约束满足度。在三个流行的指令遵循数据集上的实验表明,与标准提示相比,我们的方法将约束违反减少了39%,同时保持了大型推理模型的推理能力。

英文摘要

Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models relationships between constraints, identifies adherence challenges, and discovers "bridge constraints" that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make primary constraints more salient and compatible. Unlike existing approaches that enhance instruction following through general training methods, CRGC specifically improves constraint satisfaction by leveraging the model's own knowledge to create better pathways for generation. Experiments across three popular instruction-following datasets demonstrate that our approach reduces constraint violations by 39% compared to standard prompting while maintaining the reasoning abilities of large reasoning models.

2606.03604 2026-06-03 cs.CL 版本更新

Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

超越字面:多模态模因理解中的语用意图分解

Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Luyao Ye, Huimin Wang, Hanqi Yan, Binyang Li, Kam-Fai Wong, Yulan He

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei(华为) Central China Normal University(华中师范大学) Shenzhen University(深圳大学) King’s College London(伦敦国王学院) University of International Relations(国际关系学院)

AI总结 针对大型视觉语言模型(LVLMs)在理解模因时倾向于描述字面内容而非语用意图的问题,提出Intent Projection框架,通过表示、输出和目标三层面的字面-语用分解,在六个基准上超越开源模型并缩小与专有模型的差距。

详情
AI中文摘要

当被问及一个模因或讽刺帖子的含义时,大型视觉语言模型(LVLMs)倾向于描述图像显示的内容,而不是作者试图传达的信息。标准指令调优将帖子的字面内容与其语用意义纠缠在一起,让表面细节污染最终响应。我们将模因理解重新定义为字面-语用分解问题,并提出Intent Projection,这是一个在单个LVLM骨干网络中的表示、输出和目标三个层面分离这两个信号的框架。在表示层面,一个正交投影模块从融合的图像-文本表示中移除主要的单模态方向,仅保留语用残差,同时一个表面真实情感分类器用一个离散标签锚定解码器,该标签命名了极性差距。在输出层面,模型外化一个结构化的推理链,在目标层面,一个对比奖励明确惩罚重复字面描述的答案。在六个多模态基准测试中,Intent Projection始终优于开源基线,并缩小了与专有模型的差距,在字面崩溃最具破坏性的高分歧帖子上取得了最大收益。

英文摘要

When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details contaminate the final response. We reframe meme understanding as a problem of literal-pragmatic decomposition and propose Intent Projection, a framework that separates the two signals at the representation, output, and objective levels within a single LVLM backbone. At the representation level, an orthogonal projection module removes dominant unimodal directions from the fused image-text representation, retaining only the pragmatic residual, while a surface-real affect classifier anchors the decoder with a discrete tag that names the polarity gap. At the output level, the model externalizes a structured reasoning chain, and at the objective level a contrastive reward explicitly penalizes answers that restate the literal description. Across six multimodal benchmarks, Intent Projection consistently outperforms open-source baselines and narrows the gap to proprietary models, with the largest gains on high-divergence posts where literal collapse is most damaging.
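The representation-level step, removing dominant unimodal directions and keeping the residual, amounts to an orthogonal projection. The sketch below shows that generic operation in plain Python; how the paper obtains the unimodal directions is not stated in the abstract, so `directions` is a hypothetical input and the Gram-Schmidt step is an assumption made so that the directions need not be orthogonal.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: turn arbitrary directions into an orthonormal basis."""
    basis: List[Vector] = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([ri / norm for ri in r])
    return basis


def project_out(fused: Vector, directions: List[Vector]) -> Vector:
    """Return the part of `fused` orthogonal to every given direction."""
    residual = list(fused)
    for b in orthonormalize(directions):
        c = dot(residual, b)
        residual = [ri - c * bi for ri, bi in zip(residual, b)]
    return residual


# Toy example: remove two (non-orthogonal) directions from a 3-d vector.
residual = project_out([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

In the toy example only the third coordinate survives, since the two removed directions span the first two axes.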

2606.03603 2026-06-03 cs.CV cs.CL 版本更新

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

世界模型遇见语言模型:论具体推理与抽象推理的互补性

Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 本文提出受控具体推理框架及PF-OPSD方法,通过结合世界模型的视觉模拟与多模态大语言模型的抽象推理,在空间前瞻和开放域物理预测任务上提升性能与鲁棒性。

详情
AI中文摘要

世界模型和多模态大语言模型(MLLMs)为从静态视觉观察预测未来结果提供了互补能力。世界模型可以生成可能未来的具体视觉推演,而MLLMs可以对问题、目标和规则进行抽象推理。然而,生成的推演是随机的,可能在视觉上合理但任务不正确,因此需要确定视觉模拟何时有用、推演是否可信以及它应如何影响最终答案。我们将此问题形式化为受控具体推理,其中模型学习在抽象推理之外调用、验证和整合视觉未来模拟。为了研究这一设置,我们构建了两个人工验证的基准:用于可控空间前瞻的VRQABench和用于开放域物理预测的OpenWorldQA,并提出了特权未来在策略自蒸馏(PF-OPSD)。在训练期间,PF-OPSD仅使用真实未来视频和答案作为教师侧特权上下文来评估在策略具体推理轨迹,而可部署的学生在测试时从未观察到真实未来。实验结果表明,PF-OPSD在VRQABench和OpenWorldQA上分别比基线高出10.6%和10.9%,同时增强了对噪声或冲突推演的鲁棒性。我们的代码和数据集可在以下网址获取:https://github.com/yczhou001/PF-OPSD。

英文摘要

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.

2606.03602 2026-06-03 cs.LG cs.AI cs.CL 版本更新

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

CauTion:知道何时信任LLM进行集成因果发现

Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) Nanjing University(南京大学) Tongji University(同济大学)

AI总结 提出CauTion框架,通过共识过滤和LLM可靠性估计,将LLM领域知识可靠地集成到多个统计因果发现算法中,解决纯统计方法的局限和LLM错误问题。

详情
AI中文摘要

从观测数据进行因果发现仍然具有挑战性,因为纯统计方法存在根本性限制,例如等价类内的统计可区分性和对有限样本量的敏感性。虽然大型语言模型(LLM)提供了有希望的领域知识来源来补充统计推断,但现有的LLM增强方法容易受到LLM错误的影响,并且产生高昂的令牌成本。此外,依赖单一数据驱动算法可能使结果对算法特定偏差敏感。为了解决这些限制,我们提出了CauTion,一个通过共识过滤和LLM可靠性估计将LLM领域知识可靠地集成到统计因果发现算法集成中的框架。CauTion分三个阶段进行。首先,算法集成利用共识投票解决算法一致的最多96%的边,在过滤后的共识边上实现接近完美的准确性。其次,一个信任校准仲裁机制通过无注释的信任校准过程估计LLM和算法的相对可靠性,然后用于控制信任加权投票过程,将LLM仲裁限制在算法证据不可靠的边上。第三,应用循环修复步骤确保最终因果图是有效的无环图。在六个数据集上的实验表明,CauTion在性能上始终优于数据驱动和LLM增强的基线,在更大的图上获得更大的收益,并且对LLM错误具有强大的鲁棒性。代码可在以下网址获取:https://github.com/OpenCausaLab/CauTion。

英文摘要

Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which is then used to govern a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee that the final causal graph is acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
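The first stage, consensus filtering across an algorithm ensemble, can be illustrated with a small voting routine. This is only a sketch under stated assumptions: the function name, the unanimity `threshold`, and the treatment of every ordered pair as a candidate edge are choices made here for illustration and are not taken from the paper or its repository.

```python
from itertools import permutations
from typing import List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def consensus_filter(
    variables: Sequence[str],
    graphs: List[Set[Edge]],
    threshold: float = 1.0,
) -> Tuple[Set[Edge], Set[Edge], Set[Edge]]:
    """Split candidate directed edges into accepted, rejected, and contested.

    An edge is accepted if at least `threshold` of the algorithms output it,
    rejected if at most `1 - threshold` do, and contested otherwise.
    Contested edges are the ones a later arbitration stage would look at.
    """
    accepted: Set[Edge] = set()
    rejected: Set[Edge] = set()
    contested: Set[Edge] = set()
    for edge in permutations(variables, 2):
        share = sum(edge in g for g in graphs) / len(graphs)
        if share >= threshold:
            accepted.add(edge)
        elif share <= 1.0 - threshold:
            rejected.add(edge)
        else:
            contested.add(edge)
    return accepted, rejected, contested


# Toy ensemble of three hypothetical algorithm outputs over three variables.
variables = ["smoking", "tar", "cancer"]
graphs = [
    {("smoking", "tar"), ("tar", "cancer")},
    {("smoking", "tar"), ("tar", "cancer"), ("smoking", "cancer")},
    {("smoking", "tar"), ("cancer", "tar")},
]
accepted, rejected, contested = consensus_filter(variables, graphs)
```

Only the contested set would need the more expensive LLM arbitration, which is the cost-saving idea the abstract describes.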

2606.03544 2026-06-03 cs.AI cs.CL 版本更新

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

SAGE: 智能体生态中社会化演化的定量评估

Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai

发表机构 * Tsinghua University, China(清华大学, 中国) Meituan, China(美团, 中国)

AI总结 提出SAGE框架,通过对比社会演化(SocialEvo)与自我演化(SelfEvo)两种计算条件,在三个领域评估共享经验对智能体性能的影响,发现群体历史并非普遍放大器,但能帮助陷入停滞的智能体取得突破,且社会收益依赖于抽象能力而非暴露量。

Comments 13 pages, 5 figures

详情
AI中文摘要

自我改进的语言智能体通常被孤立评估:一个智能体尝试任务、接收反馈并迭代优化自身行为。然而,智能体越来越多地与同伴一起运作,其策略和结果公开可见。这引发了一个研究不足的问题:共享经验何时能产生自我改进无法单独实现的改进?我们引入了SAGE(社会智能体群体演化),一个评估框架,比较两种计算匹配的条件:SocialEvo,其中来自五个不同模型家族的智能体共同演化,可访问所有同伴的历史;以及SelfEvo,其中每个智能体获得相同数量的任务尝试,但只能看到自己的过去,这是自我改进智能体研究中的常规做法。我们在三个领域实例化SAGE:开放式机器学习研究、长期经济规划和战略多人游戏,并在多个演化轮次中进行评估。我们发现群体历史并非普遍放大器:最强的智能体并未超过其自我演化上限。然而,在自我改进下停滞的智能体,当同伴经验可用时,可以取得重大突破。在竞争环境中,反事实控制显示智能体普遍改进,而非发展针对对手的策略。在不同形式的共享历史中,过滤后的同伴轨迹和反思性摘要通常优于原始日志,表明社会收益依赖于抽象而非暴露量。这些发现表明,同伴历史收益是智能体特定的、领域依赖的,并取决于从公共轨迹中抽象可转移知识的能力。

英文摘要

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

2606.03535 2026-06-03 cs.IR cs.CL 版本更新

Can LLM Rerankers Predict Their Own Ranking Performance?

LLM 重排序器能否预测自身的排序性能?

Shiyu Ni, Keping Bi, Jiafeng Guo, Jingtong Wu, Zengxin Han, Xueqi Cheng

发表机构 * State Key Laboratory of AI Safety(人工智能安全国家重点实验室) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 研究 LLM 重排序器能否通过自一致性或口头化置信度来估计自身生成的排序质量,并提出两种监督方法 Verb-Num 和 Verb-List 以改进校准。

详情
AI中文摘要

检索效果在不同查询间差异显著,因此在获得相关性判断之前估计排序质量非常重要。查询性能预测(QPP)解决了这一需求,但大多数现有方法依赖于检索或重排序后的外部预测器。本文研究重排序器内部 QPP:LLM 重排序器能否估计其刚刚产生的排序的质量?我们探讨了无训练和基于训练的方法。对于无训练估计,我们检查了跨采样排序的特定于度量的自一致性以及由重排序器直接生成的口头化置信度。在 TREC Deep Learning 2019-2022 上使用四个 LLM 的实验表明,自一致性与最先进(SOTA)方法竞争力相当,并且在几乎所有设置下校准更好,而直接口头化置信度严重过度自信。为了改进口头化置信度,我们提出了两种监督方法 Verb-Num 和 Verb-List,使 LLM 重排序器仅需少量额外输出标记即可生成校准的排序质量估计。

英文摘要

Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this paper, we study reranker-internal QPP: can an LLM reranker estimate the quality of the ranking it has just produced? We investigate both training-free and training-based approaches. For training-free estimation, we examine metric-specific self-consistency across sampled rankings and verbalized confidence produced directly by the reranker. Experiments on TREC Deep Learning 2019-2022 with four LLMs show that self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. To improve verbalized confidence, we propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.
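One simple way to picture self-consistency across sampled rankings is to measure how much several sampled rankings of the same candidates agree with each other. The sketch below uses mean pairwise top-k overlap as the agreement measure; that choice, along with the function names, is an assumption made here for illustration, since the abstract does not spell out the paper's metric-specific estimators.

```python
from itertools import combinations
from typing import List, Sequence


def topk_overlap(a: Sequence[str], b: Sequence[str], k: int) -> float:
    """Fraction of the top-k documents shared by two rankings."""
    return len(set(a[:k]) & set(b[:k])) / k


def self_consistency(rankings: List[Sequence[str]], k: int) -> float:
    """Mean pairwise top-k overlap across sampled rankings.

    Values near 1.0 mean the reranker keeps producing the same top-k set,
    which is read here as higher confidence in the ranking.
    """
    pairs = list(combinations(rankings, 2))
    if not pairs:
        raise ValueError("need at least two sampled rankings")
    return sum(topk_overlap(a, b, k) for a, b in pairs) / len(pairs)


# Three hypothetical sampled rankings of the same three documents.
sampled = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d3"],
    ["d1", "d3", "d2"],
]
score = self_consistency(sampled, k=2)
```

Here the three pairs overlap at 1.0, 0.5, and 0.5 in their top two positions, so the score is 2/3.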

2606.03463 2026-06-03 cs.AI cs.CL 版本更新

DMF: A Deterministic Memory Framework for Conversational AI Agents

DMF:对话式AI代理的确定性记忆框架

Matteo Stabile, Enrico Zimuel

发表机构 * Roma Tre University(罗马第三大学)

AI总结 提出一种CPU优先的确定性记忆框架DMF,通过经典NLP分析、向量几何和数学评分替代生成式记忆压缩,实现零令牌成本且与Mem0相当的准确性。

Comments 21 pages, 3 figures

详情
AI中文摘要

对话式AI代理需要在大时间跨度的交互中既具可扩展性又语义连贯的记忆系统。现有方法主要依赖基于大语言模型(LLM)的写入时摘要,这引入了非确定性、令牌成本上升以及剪枝决策不透明等问题。我们提出确定性记忆框架(DMF),一种CPU优先的方法,用完全确定性的流水线替代生成式记忆压缩,该流水线基于经典NLP分析、向量几何和数学评分。DMF为每次对话交互分配一个生存分数$\Omega$,该分数由确定性内容信号、对话线索和结构化来源通过逻辑投影组合计算得出。一个交互计数衰减定律,记为$\Omega_{\mathrm{eff}}(\Delta n)$,控制着相关性随新轮次到达的演变,其中$\Delta n$是较新交互的数量而非实际时间,从而保持完全确定性。我们给出了DMF的数学公式、结构化召回流水线、剪枝决策过程和评估协议。实验在基于LoCoMo和LongMemEval数据集构建的专用基准上进行。我们将DMF与AI代理的流行记忆层Mem0进行比较。DMF在准备记忆上下文时使用零令牌,在整个对话中使用的令牌数少5到242倍,同时达到相当的准确性。这些结果表明,可以从记忆管理循环中消除LLM调用,将令牌成本降至几乎为零,并为对话式AI代理实现确定性记忆系统。

英文摘要

Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\Omega$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\Omega_{\mathrm{eff}}(\Delta n)$, governs how relevance evolves as new turns arrive, where $\Delta n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.
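To make the scoring idea concrete, the sketch below combines three signals through a logistic function and then decays the result by the number of newer interactions. The abstract gives neither the signal weights nor the form of the decay law, so the linear combination, the `weights` and `bias` parameters, and the half-life style exponential decay are all assumptions made here purely for illustration.

```python
import math
from typing import Tuple


def survival_score(
    content: float,
    cue: float,
    provenance: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
    bias: float = 0.0,
) -> float:
    """Logistic projection of a weighted sum of three deterministic signals."""
    z = bias + weights[0] * content + weights[1] * cue + weights[2] * provenance
    return 1.0 / (1.0 + math.exp(-z))


def effective_score(omega: float, delta_n: int, half_life: float = 20.0) -> float:
    """Decay a score by interaction count rather than wall-clock time.

    Assumed form: the score halves every `half_life` newer interactions.
    """
    return omega * 0.5 ** (delta_n / half_life)


omega = survival_score(content=0.8, cue=0.2, provenance=0.4)
decayed = effective_score(omega, delta_n=20)
```

Because both functions depend only on their arguments, replaying the same conversation always yields the same scores, which is the determinism property the abstract emphasizes.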

2606.03437 2026-06-03 cs.CL 版本更新

Large Language Models Are Overconfident in Their Own Responses

大型语言模型对自己的回答过度自信

Mario Sanz-Guerrero, Manuel Mager, Katharina von der Wense

发表机构 * Johannes Gutenberg University Mainz(美因茨约翰内斯·古腾堡大学) University of Colorado Boulder(科罗拉多大学波德分校)

AI总结 研究指令微调与聊天模板导致的大型语言模型校准偏差,发现“所有权偏见”使模型对自己的回答自信度高出26%,并提出通过将模型回答伪装为用户输入来降低过度自信。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

先前工作表明,指令微调的大型语言模型(LLMs)比其基础预训练模型校准更差。然而,关于常用聊天模板对对话型LLM校准的影响知之甚少。在本工作中,我们通过解耦后训练算法和聊天格式的影响,研究了导致这种校准偏差的机制。我们发现,虽然指令微调从根本上损害了校准,但聊天模板通过“所有权偏见”加剧了问题——模型对自己回答的自信度显著高于对用户提供的相同回答的自信度。在六个近期开源权重LLM、三个基准和三种置信度获取方法上的大量实验表明,模型对自己的回答分配的置信度高出26%。利用这一见解,我们提出一种简单的推理时策略:在置信度获取时将模型的回答框定为用户输入。该方法显著降低了过度自信,并将校准提高了26%,无需重新训练,缩小了基础模型与指令微调模型之间的差距。

英文摘要

Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias": models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.
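Calibration claims like these are often quantified with expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The sketch below is the standard equal-width-bin ECE; whether the paper reports this exact metric is not stated in the abstract, so treat it as one common option rather than the authors' measure.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[int]] = [[] for _ in range(num_bins)]
    for i, c in enumerate(confidences):
        # A confidence of exactly 1.0 goes into the last bin.
        bins[min(int(c * num_bins), num_bins - 1)].append(i)
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# An overconfident toy model: 90% stated confidence, 75% actual accuracy.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A model that stated 75% confidence on the same four answers would score an ECE of zero, which is the direction of improvement the abstract reports for the user-framing strategy.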

2606.03412 2026-06-03 cs.CL 版本更新

Lexicons and grammars for language processing: industrial or handcrafted products?

语言处理的词典和语法:工业产品还是手工制品?

Eric Laporte

AI总结 本文分析词典和语法的手工构建与自动化工业化两种趋势,探讨哪种方式或两者结合能获得最佳结果。

详情
Journal ref
Léxico e gramática: dos sentidos à construção da significação, Cultura acadêmica, 2009, Trilhas Lingüísticas, 16, pp.51-84
AI中文摘要

近年来,语言数据在语言处理中的应用逐渐增加。这些数据现在通常被称为语言资源。用于此目的的大多数语言资源是文本集合,如布朗语料库和宾州树库,但电子词典(WordNet、FrameNet、VerbNet、ComLex、词典-语法...)和形式语法(TAG...)也在最近得到发展。词典和语法的大多数构建过程是手动的,而语料库的构建则一直高度自动化。然而,越来越多的语言处理专家认识到,词典和语法的信息内容比语料库更丰富,因此前者可以实现更精细的处理。构建时间的差异可能与信息内容的差异有关:语言学家手工制作词典和语法可能使其比自动生成的数据更具信息性。这种情况可能向两个方向发展:要么语言技术专家逐渐习惯于处理手工构建的资源,这些资源更具信息性且更复杂;要么词典和语法的构建过程被自动化和工业化,这是主流观点。两种演变都在进行中,并且它们之间存在紧张关系。语言学家和计算机科学家之间的关系取决于这些演变的未来,因为前者需要培训和雇佣大量语言学家,而后者主要依赖于计算机工程师提出的解决方案。本文旨在分析所讨论的语言资源的实际例子,并讨论手工制作或工业生成,或两者结合,哪种趋势能产生最佳结果或最现实。

英文摘要

In recent years, the use of linguistic data for language processing has increased progressively. Such data are now commonly called language resources. Most of the language resources used for this purpose are collections of texts, such as the Brown Corpus and the Penn Treebank, but electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) have also been developed recently. Most processes for constructing lexicons and grammars are manual, whereas the construction of corpora has always been highly automated. However, more and more language processing specialists realize that the information content of lexicons and grammars is richer than that of corpora, and hence that the former make more elaborate processing possible. The difference in construction time is likely to be connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve in two directions: either language technology specialists progressively get used to handling manually constructed resources, which are more informative and more complex, or the process of constructing lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the first implies training and hiring numerous linguists, whereas the second depends essentially on solutions elaborated by computer engineers. The aim of this article is to analyse practical examples of the language resources in question, and to discuss which of the two trends, handcrafting or industrial generation, or a combination of both, can give the best results or is the most realistic.

2606.03399 2026-06-03 cs.CL cs.CR 版本更新

Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models

选择性令牌级密码学编辑用于大型语言模型的隐私保护临床部署

Farhan Sheth, Ziyuan Yang, Yongying Lan, Si Yong Yeo

发表机构 * MedVisAI Lab, Singapore(新加坡MedVisAI实验室) Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, China(中国上海交通大学医学院瑞金医院) Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore(新加坡南洋理工大学李光前医学院)

AI总结 提出HERALD框架,通过令牌级密码学编辑仅加密敏感令牌,在保护隐私的同时保持下游模型效用,在分类和医疗问答任务上接近明文性能。

Comments 33 pages, 8 figures, 26 tables

详情
AI中文摘要

尽管大型语言模型(LLMs)越来越多地用于临床应用,但许多现有流程需要将原始敏感健康信息发送到远程服务器进行处理,这增加了隐私泄露的风险。缓解这种风险的一种自然方法是在传输前对数据进行加密。然而,加密整个数据集等直接解决方案会带来巨大的计算、对齐和通信开销,使得大规模实际部署不可行。为了在保护隐私的同时保持可用性,我们提出了通过自适应语言分解的医疗加密与编辑(HERALD),这是一个令牌级密码学编辑框架,通过仅加密敏感令牌同时保留上下文以供下游模型使用,实现了这种平衡。HERALD结合医学命名实体识别器(NER)与基于词性(POS)的策略来选择候选令牌,执行目标词形还原以稳定表面形式,并将每个受保护令牌替换为包裹在显式分隔符中的确定性密文。值得注意的是,HERALD是模型无关的,完全在客户端运行,确保敏感内容在存储、传输和处理过程中保持加密,无需更改下游模型。我们在公开数据集上对分类和医疗问答(MQA)任务评估了HERALD。在不同任务中,实验表明完全安全的基线遭受显著的效用损失,而HERALD始终恢复接近明文的性能。总体而言,HERALD提供了一种新颖的利用流程。

英文摘要

While large language models (LLMs) are increasingly used for clinical applications, many existing pipelines require sending raw sensitive health information to remote servers for processing, which heightens the risk of privacy leakage. A natural approach to mitigate this risk is to encrypt the data before transmission. However, straightforward solutions such as encrypting the entire dataset introduce prohibitive computational, alignment, and communication overheads, rendering large-scale practical deployment infeasible. To preserve privacy while maintaining usability, we present Healthcare Encryption & Redaction via Adaptive Linguistic Decomposition (HERALD), a token-level cryptographic redaction framework designed to achieve this balance by encrypting only sensitive tokens while preserving the surrounding context for downstream model utility. HERALD combines a medical named-entity recognizer (NER) with part-of-speech (POS)-driven policies to select candidate tokens, performs targeted lemmatization to stabilize surface forms, and substitutes each protected token with a deterministic ciphertext wrapped in explicit delimiters. Notably, HERALD is model-agnostic and operates entirely on the client side, ensuring that sensitive content remains encrypted throughout storage, transmission, and processing without requiring changes to downstream models. We evaluated HERALD on both classification and medical question answering (MQA) tasks on public datasets. Across different tasks, experiments illustrate that fully secured baselines suffer significant utility loss, whereas HERALD consistently recovers performance close to plaintext. Overall, HERALD provides a novel utilization pipeline.

2606.03398 2026-06-03 cs.CL cs.AI 版本更新

Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

Transformer建模计数器语言中栈表示的因果证据

Nishit Singh

发表机构 * Birla Institute of Technology and Science, Pilani(博拉理工学院皮拉尼分校)

AI总结 通过线性探针和消融实验,证明Transformer在计数器语言任务中学习的栈表示对其性能具有因果必要性。

Comments 8 pages, 8 figures

详情
AI中文摘要

形式语言已被证明是理解Transformer内部机制的有效途径。以往研究表明,在计数器语言上进行下一个词预测训练的Transformer会学习到与底层栈结构一致的表示。除了表示分析,本文还研究了这些表示的因果作用。我们训练线性探针从模型隐藏状态中预测每个词符处的栈深度,并从探针中提取主表示方向。从模型中消融该方向会导致序列准确率骤降至接近0%,这提供了强有力的经验证据,表明栈表示不仅是学习到的,而且对模型性能具有因果必要性。

英文摘要

Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transformers trained on next token prediction over counter languages learn representations consistent with an underlying stack structure. Beyond representational analysis, this paper investigates the causal role of these representations. Linear probes are trained to predict the stack depth at each token from the model's hidden states, and a principal representation direction is extracted from the probe. Ablation of this direction from the model causes sequential accuracy to collapse to near 0%, providing strong empirical evidence that the stack representation is not just learned, but is causally necessary for model performance.
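Ablating a direction from a hidden state is usually done by subtracting the state's projection onto that direction. The sketch below shows that standard operation on plain Python lists; taking the direction from a trained linear probe, as the abstract describes, is outside the sketch, so `direction` is simply a hypothetical input vector here.

```python
from typing import List


def ablate_direction(hidden: List[float], direction: List[float]) -> List[float]:
    """Remove the component of `hidden` that lies along `direction`.

    Computes h - ((h . d) / (d . d)) * d, so the direction need not be unit length.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy hidden state whose second coordinate plays the role of the probed feature.
ablated = ablate_direction([2.0, 3.0, 4.0], [0.0, 2.0, 0.0])
```

After ablation, a linear readout along that direction returns zero for every input, which is why a collapse in accuracy is read as evidence that the model relied on it.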

2606.03391 2026-06-03 cs.LG cs.AI cs.CL 版本更新

When Model Merging Breaks Routing: Training-Free Calibration for MoE

当模型合并破坏路由:MoE的无训练校准

Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对MoE架构中模型合并导致的路由崩溃问题,提出基于二阶曲率的无训练校准方法HARC,通过闭式解和共轭梯度法高效重对齐路由器,显著提升数学推理和代码生成性能。

详情
AI中文摘要

模型合并已成为一种无需重新训练即可整合多个LLM能力的成本效益方法。然而,现有的合并技术主要基于线性参数算术或优化,在应用于混合专家(MoE)架构时面临困难。我们识别出MoE合并中的一个关键失效模式,称为路由崩溃,其中合并后的路由器无法将令牌分派给合适的专家。路由崩溃源于非线性softmax和离散Top-k路由机制对合并引起的参数扰动的敏感性,这种敏感性进一步被MoE预训练期间施加的负载平衡约束放大。由于微调后的专家表现出不同的专长,即使是适度的错误路由也可能导致严重的性能下降。为解决此问题,我们提出Hessian感知路由器校准(HARC),一种无训练框架,利用二阶曲率信息重新对齐合并后的路由器。该方法采用闭式解,可通过无矩阵共轭梯度法高效求解。在数学推理和代码生成任务上的实验表明,HARC有效缓解了多种MoE合并基线中的路由崩溃,并带来了显著的性能提升。我们的代码可在 https://github.com/huangcb01/HARC 获取。

英文摘要

Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.
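A matrix-free conjugate gradient solver only needs a function that returns matrix-vector products, never the matrix itself, which is what makes it attractive for curvature-based systems. The sketch below is the textbook method for a symmetric positive-definite system; the actual linear system HARC solves is not given in the abstract, so the toy `matvec` is a hypothetical stand-in for a Hessian-vector product.

```python
from typing import Callable, List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(
    matvec: Callable[[Vector], Vector],
    b: Vector,
    tol: float = 1e-10,
    max_iter: Optional[int] = None,
) -> Vector:
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x starting at zero
    p = list(r)  # current search direction
    rs_old = dot(r, r)
    for _ in range(max_iter if max_iter is not None else 10 * len(b)):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def toy_matvec(v: Vector) -> Vector:
    # Implicitly represents A = [[4, 1], [1, 3]] without storing it.
    return [4.0 * v[0] + v[1], v[0] + 3.0 * v[1]]


solution = conjugate_gradient(toy_matvec, [1.0, 2.0])
```

For the toy system the exact answer is (1/11, 7/11), and in exact arithmetic the method reaches it in at most two iterations because the system has two unknowns.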

2606.03363 2026-06-03 cs.CL 版本更新

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

EntSQL:一个将Text-to-SQL置于长上下文企业知识中的基准

Chengxi Liao, Tao Xu, Zulong Chen, Chuanfei Xu, Yiyan Wang, Xinyun Wang, Yanlong Zhang, Xiaojun Chen, Zhibo Yang, Zeyi Wen

发表机构 * HKUST (GZ)(香港科技大学(广州)) Alibaba Group(阿里巴巴集团) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室(深圳))

AI总结 提出EntSQL基准,通过包含1066个跨五个业务领域的中英文对齐示例,评估LLM在长上下文企业文档中基于私有业务知识生成SQL的能力;在英文输入并提供长文档时,最佳系统仅达到15.9%。

详情
AI中文摘要

Text-to-SQL使得通过自然语言访问数据库成为可能,最近的LLM显著提升了其能力。现有的基准如Spider、BIRD和Spider 2.0评估了模式泛化、大规模数据库和现实工作流,但很大程度上忽略了SQL生成依赖于私有业务知识(如内部指标、报告惯例和组织规则)的企业场景。我们引入了EntSQL,一个面向企业的Text-to-SQL基准,用于评估模型基于专有业务文档的长上下文知识落地(grounding)能力。EntSQL包含1066个跨五个业务领域的中英文对齐语义示例,大多数示例需要超越问题和模式的领域知识,并涉及复杂的SQL结构。在英文输入上,当提供长文档时,最佳评估系统仅达到15.9%,突显了将SQL生成落地于企业知识的难度。

英文摘要

Text-to-SQL enables natural language access to databases, and recent LLMs have substantially advanced its capabilities. Existing benchmarks such as Spider, BIRD, and Spider 2.0 evaluate schema generalization, large-scale databases, and realistic workflows, but largely overlook enterprise scenarios where SQL generation depends on private business knowledge, such as internal metrics, reporting conventions, and organizational rules. We introduce EntSQL, an enterprise-oriented Text-to-SQL benchmark for evaluating long-context grounding over proprietary business documents. EntSQL contains 1,066 aligned Chinese-English semantic examples across five business domains, with most examples requiring domain knowledge beyond the question and schema and involving complex SQL structures. On English inputs, the best evaluated system reaches only 15.9% when long-form documents are provided, highlighting the difficulty of grounding SQL generation in enterprise knowledge.

2606.03359 2026-06-03 cs.SD cs.CL cs.LG 版本更新

Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection

基于注意力机制的残差连接LSTM网络的语音情感识别

Daniil Krasnoproshin, Maxim Vashkevich

发表机构 * Institute of Cybernetics and Machine Learning, Belarusian State University(白俄罗斯国立大学控制论与机器学习研究所)

AI总结 提出ResLSTM-SA轻量级架构,在LSTM中集成残差连接和软注意力,在RAVDESS数据集上以46.8k参数达到0.6517 UAR,优于传统基线且适合边缘部署。

Comments 6 pages, 5 figures, DSPA 2026

详情
AI中文摘要

语音情感识别是现代人机交互系统的重要组成部分。然而,许多最先进的方法依赖于具有高计算和内存需求的大型预训练模型,限制了其适用性。本文提出了ResLSTM-SA,一种轻量级架构,在基于LSTM的框架中集成了残差连接和软注意力。在RAVDESS数据集上,在严格的说话人独立划分下进行评估,所提出的模型在未加权平均召回率(UAR)方面优于传统的基于注意力的LSTM基线以及几种先前报道的CNN和混合CNN-LSTM架构。性能最佳的变体(ResLSTM-SA-h64)仅用46.8k可训练参数就达到了0.6517的最大UAR,以比大规模自监督替代方案少三个数量级的参数提供了具有竞争力的准确性,从而能够在边缘设备和实时语音助手上高效部署。源代码可在以下网址获取:https://github.com/Mak-Sim/ResLSTM-SER。

英文摘要

Speech emotion recognition is an important component of modern human-computer interaction systems. However, many state-of-the-art approaches rely on large pretrained models with high computational and memory requirements, limiting their applicability. This paper proposes ResLSTM-SA, a lightweight architecture that integrates residual connections with soft attention within an LSTM-based framework. Evaluated on the RAVDESS dataset under strict speaker-independent partitioning, the proposed model outperforms conventional attention-based LSTM baselines and several previously reported CNN- and hybrid CNN-LSTM architectures in terms of unweighted average recall (UAR). The best-performing variant (ResLSTM-SA-h64) achieves a maximum UAR of 0.6517 with only 46.8k trainable parameters, delivering competitive accuracy with three orders of magnitude fewer parameters than large-scale self-supervised alternatives, thereby enabling efficient deployment on edge devices and real-time voice assistants. The source code is available at https://github.com/Mak-Sim/ResLSTM-SER.
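Since the headline number above is an unweighted average recall (UAR), a small worked example may help show how it differs from plain accuracy. This is a minimal sketch with made-up labels, not the paper's evaluation code; the function name `unweighted_average_recall` and the toy emotion labels are assumptions for illustration.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_average_recall(y_true: Sequence[Hashable],
                              y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recalls, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced labels (illustrative only).
y_true = ["angry", "angry", "angry", "angry", "sad", "happy"]
y_pred = ["angry", "angry", "angry", "sad", "sad", "angry"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(uar, 4), round(accuracy, 4))  # 0.5833 0.6667
```

Here the per-class recalls are 0.75 (angry), 1.0 (sad), and 0.0 (happy). The missed minority class pulls UAR below accuracy, which is why UAR is commonly preferred on imbalanced emotion datasets.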

2606.03357 2026-06-03 cs.CL cs.AI 版本更新

The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs

未抽样的真相:SLM 中的心理测量衡量的是提示伪影,而非心理构念

Nils Schwager, Christoph Hau, Simon Münker, Achim Rettinger

发表机构 * Trier University(特里尔大学)

AI总结 通过提示变异框架分离语义信号与提示伪影,发现小型语言模型在心理测量中主要反映提示遵从性而非模拟心理特质,并提供了诊断工具。

Comments 10 pages, 5 figures, 3 tables

详情
AI中文摘要

当使用小型语言模型进行心理测量评估时,研究人员假设输出反映了语义推理。我们使用一个提示变异框架,在13个开放权重模型(0.6B到14B参数)上评估了这一前提,该框架将语义信号与提示伪影分离。通过系统地改变角色、指令、项目和选项符号,我们发现伪影方差经常压倒语义信号。在这些情况下,模型主要反映提示遵从性,而非模拟的心理特质。虽然这些发现限制了SLM在心理测量中的效用,但我们的框架提供了一个诊断工具,用于识别破坏性伪影并隔离语义理解,以用于未来的前沿模型研究。

英文摘要

When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. We evaluate this premise across 13 open-weights models (0.6B to 14B parameters) using a prompt variation framework that separates semantic signals from prompt artifacts. By systematically varying personas, instructions, items, and option symbols, we find that artifactual variance frequently overpowers the semantic signal. In these cases, models predominantly reflect prompt compliance rather than simulated psychological traits. While these findings limit SLM utility in psychometrics, our framework provides a diagnostic tool to identify destructive artifacts and isolate semantic understanding for future frontier-model research.

2606.03345 2026-06-03 cs.CV cs.CL cs.CY 版本更新

Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

超越语义:从视觉语言数据建模事实与情感感知体验

Youssef Mohamed, Kenneth Ward Church, Mohamed Elhoseiny

发表机构 * KAUST(阿卜杜拉国王科技大学)

AI总结 提出P-Topics建模问题,通过PercepT两阶段架构从图像和标题中无监督发现并映射事实与情感感知体验,在ArtELingo数据集上显著优于基线。

Comments 8 pages

详情
AI中文摘要

我们提出了P-Topics(感知主题)建模,这是一个理解图像如何被情感和文化感知的新问题。目标是(1)在图像和标题数据集中发现并建模不同的感知体验,每个体验由客观事实和主观情感两方面定义;(2)将图像关联到其相关的感知体验。我们引入了**PercepT**(**感知**主题**T**ransformer),一个两阶段架构来处理P-Topics建模。在形成阶段,PercepT通过无监督训练目标发现作为视觉-文本聚类的*P-Topics*,并动态选择聚类数量以匹配数据集的感知丰富度。在映射阶段,它通过注意力池化学习*P-Topic映射函数*,将图像关联到各自的聚类。在ArtELingo上,PercepT的轮廓系数达到**0.97**,而最接近的基线为**0.37**,反映了更好的感知聚类。PercepT的AUC分数达到**0.94**,而基线为**0.77**,显示了更好的感知聚类映射。人工评估证实PercepT捕获了语义上有意义的感知体验,并显著优于现有方法。我们的实现将公开。

英文摘要

We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is defined by an objective factual aspect and a subjective affective aspect, and (2) associate images with their relevant perception experiences. We introduce **PercepT** (**Percep**tion topic **T**ransformer), a two-stage architecture that tackles P-Topics modeling. In the formation stage, PercepT discovers *P-Topics* as visual-textual clusters using an unsupervised training objective, and dynamically selects the number of clusters to match the perceptual richness of the dataset. In the mapping stage, it learns *P-Topic mapping functions* via attention pooling to associate images with their respective clusters. On ArtELingo, PercepT achieves a silhouette score of **0.97** compared to **0.37** for the closest baseline, reflecting better perceptual clusters. PercepT also achieves an AUC score of **0.94** compared to **0.77**, showing better mapping to perceptual clusters. Human evaluation confirms that PercepT captures semantically meaningful perception experiences and significantly outperforms existing methods. Our implementation will be made public.

2606.03334 2026-06-03 cs.CL cs.LG 版本更新

Lingo_Research_Group at SemEval-2026 Task 9: Evaluating Prompt Variants for Polarization Detection

Lingo_Research_Group 在 SemEval-2026 任务 9:评估用于极化检测的提示变体

Pritam Kadasi, Anuj Tiwari, Mayank Singh

发表机构 * Lingo Research Group(Lingo研究组) Indian Institute of Technology Gandhinagar(印度理工学院甘地讷格尔分校) Noida Institute of Engineering and Technology(诺伊达工程与技术学院) ML Collective(ML集体)

AI总结 本文针对 SemEval-2026 任务 9 的多语言极化检测,通过设计 12 种不同提示变体,使用 aya-101 和 Gemma3-27B 模型,在三个子任务上评估提示方法的效果,发现提示方法在粗粒度极化检测上有效,但在细粒度多标签分类上困难增加。

Comments Accepted at the SemEval Workshop, ACL 2026

详情
AI中文摘要

本文介绍我们针对 SemEval-2026 任务 9(多语言文本分类挑战:极化检测)的提交系统,涵盖全部三个子任务:(1) 二元极化检测,(2) 极化类型分类,以及 (3) 极化表现识别。我们对精心设计的短提示进行了系统性研究,考察了 12 种在术语清晰度、定义详细程度、推理指导以及上下文示例使用方面各不相同的提示。实验使用 aya-101 和 Gemma3-27B 进行,后者因性能更优在开发阶段结束时被选用于最终提交。在官方测试集上(22 种语言的平均值),我们的系统在子任务 1、子任务 2 和子任务 3 上的平均宏 F1 得分分别为 0.762、0.587 和 0.444,平均准确率分别为 0.819、0.678 和 0.498。通过跨任务和跨语言分析,我们证明基于提示的方法可以有效检测粗粒度极化,但在细粒度多标签社会语言学分类上困难逐渐增大。

英文摘要

This paper presents our submission to SemEval-2026 Task 9: Multilingual Text Classification Challenge - Polarization Detection, covering all three subtasks: (1) binary polarization detection, (2) polarization type classification, and (3) polarization manifestation identification. We systematically study short designed prompts, considering twelve prompts that differ in terminology clarity, level of detail in the definition, reasoning guidance, and use of in-context examples. The experiments are conducted using aya-101 and Gemma3-27B, with the latter chosen for the submission at the end of development on performance grounds. On the official test set, averaged over 22 languages, our system achieves an average macro F1-score of 0.762 on Subtask 1, 0.587 on Subtask 2, and 0.444 on Subtask 3, with average accuracies of 0.819, 0.678, and 0.498, respectively. Through cross-task and cross-lingual analysis, we demonstrate that prompt-based approaches can effectively detect coarse-grained polarization but encounter increasing difficulty with fine-grained, multi-label sociolinguistic classification.
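The results above report macro F1 alongside accuracy, and the two can diverge noticeably when classes are imbalanced. The following is a minimal sketch of the standard macro-F1 definition on made-up binary labels; it is not the task's official scorer, and the function name `macro_f1` and the toy data are assumptions for illustration.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average the per-class F1 scores with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels: 1 = polarized, 0 = not polarized (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(score, 4), accuracy)  # 0.5636 0.625
```

The positive class has F1 = 0.4 and the negative class F1 = 8/11, so the macro average sits below accuracy. That gap is one reason both numbers are usually reported together.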

2606.03331 2026-06-03 cs.CL cs.AI 版本更新

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

评估大语言模型在真实世界消费设备维修问题上的有效性

Atm Mizanur Rahman, Md Arid Hasan, Syed Ishtiaque Ahmed, Sharifa Sultana

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Toronto(多伦多大学)

AI总结 本文通过引入包含991个真实维修问题的基准测试,评估六种大语言模型在英语和孟加拉语上的正确性、完整性、实用性和安全性,发现模型虽能提供有用帮助,但在高风险维修任务中仍不可靠。

详情
AI中文摘要

消费设备维修是大语言模型(LLMs)一个重要但尚未充分探索的测试平台。维修任务需要对不完整的问题描述、特定硬件的诊断、可操作的故障排除和安全关键决策进行推理,其中错误的建议可能导致设备损坏、电池危险或永久性数据丢失。我们引入了一个包含991个来自Reddit的真实世界维修问题的基准测试,涵盖手机维修、电脑维修和数据恢复,每个问题都配有技术人员编写的参考解决方案,并提供孟加拉语翻译以评估跨语言性能。我们使用四个维修特定标准(正确性、完整性、实用性和安全性)评估了六种最先进的LLMs在英语和孟加拉语上的表现。我们的结果表明,虽然LLMs可以提供有用的维修帮助,但在没有严格评估和明确安全保护措施的情况下,它们在高风险的真实世界维修任务中仍然不可靠。手机维修是最困难且对安全最敏感的领域,所有模型在板级诊断、维修优先级排序和安全恢复程序方面都犯了重大错误。跨领域和模型,孟加拉语回答的表现始终低于英语回答。在评估的模型中,GPT-5.4整体表现最佳。

英文摘要

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

2606.03327 2026-06-03 cs.DB cs.CL 版本更新

CAPER: Clause-Aligned Process Supervision for Text-to-SQL

CAPER: 面向Text-to-SQL的子句对齐过程监督

Lujie Ban, Jiasheng Shi, Jinyang Li, Xiaolin Han, Tsz Nam Chan, Chenhao Ma

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) The University of Hong Kong(香港大学) Northwestern Polytechnical University(西北工业大学) Shenzhen University(深圳大学)

AI总结 提出CAPER方法,通过反事实干预SQL抽象语法树自动推导子句级监督,训练轻量级Clause-PRM模型CAPER-9B,用于策略优化和候选验证,在BIRD和Spider数据集上提升了执行准确率和故障定位能力。

详情
AI中文摘要

Text-to-SQL系统通常通过查询级执行正确性进行评估,但这种终端信号对于哪个中间SQL决策导致成功或失败几乎没有指导作用。Token级密集监督也不适合:SQL token与完整的语义决策不对齐,可能惩罚执行等效的查询,并且难以大规模可靠标记。因此,我们提出CAPER,通过对SQL抽象语法树进行反事实干预自动推导子句级监督,实现用于奖励建模的根因错误定位;所得数据用于训练CAPER-9B,一个轻量级的Clause-PRM,为策略优化和候选验证提供子句边界反馈。在BIRD和Spider上的实验表明,子句对齐监督不仅提高了执行准确率(相对于GPT-5.4实现了高达15.3%的相对EX提升),还增强了故障定位能力,在保留的故障上达到了84.53%的准确率和90.60%的MRR。我们的项目页面位于 https://github.com/banrichard/RL-NL2SQL。

英文摘要

Text-to-SQL systems are typically evaluated by query-level execution correctness, but this terminal signal provides little guidance about which intermediate SQL decision caused success or failure. Token-level dense supervision is also ill-suited: SQL tokens do not align with complete semantic decisions, can penalize execution-equivalent queries, and are difficult to label reliably at scale. We therefore propose CAPER, which automatically derives clause-level supervision via counterfactual intervention on the SQL abstract syntax tree, enabling root-cause error localization for reward modeling; the resulting data is used to train CAPER-9B, a lightweight Clause-PRM that provides clause-boundary feedback for policy optimization and candidate verification. Experiments on BIRD and Spider show that clause-aligned supervision not only improves execution accuracy, achieving up to a 15.3% relative EX improvement over GPT-5.4, but also strengthens failure-localization capability, reaching 84.53% accuracy and 90.60% MRR on held-out failures. Our project page is at https://github.com/banrichard/RL-NL2SQL.
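Failure localization above is reported with both accuracy and MRR. As a hedged illustration of how those two metrics relate, the sketch below ranks candidate faulty clauses for three hypothetical failed queries; the clause names, rankings, and function names are invented for the example and do not come from CAPER.

```python
from typing import List, Sequence


def mean_reciprocal_rank(rankings: Sequence[List[str]],
                         gold: Sequence[str]) -> float:
    """Average of 1 / rank of the gold item; a missing gold item scores 0."""
    total = 0.0
    for ranking, target in zip(rankings, gold):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(gold)


def top1_accuracy(rankings: Sequence[List[str]], gold: Sequence[str]) -> float:
    """Fraction of cases where the top-ranked item is the gold item."""
    return sum(r[0] == g for r, g in zip(rankings, gold)) / len(gold)


# Hypothetical clause rankings from a localizer, most suspicious first.
rankings = [
    ["WHERE", "JOIN", "SELECT"],
    ["JOIN", "GROUP BY", "WHERE"],
    ["SELECT", "WHERE", "JOIN"],
]
faulty_clause = ["WHERE", "GROUP BY", "JOIN"]

mrr = mean_reciprocal_rank(rankings, faulty_clause)
acc = top1_accuracy(rankings, faulty_clause)
print(round(mrr, 4), round(acc, 4))  # 0.6111 0.3333
```

MRR gives partial credit when the true faulty clause is ranked second or third, so it is always at least as high as top-1 accuracy. This matches the ordering of the two figures quoted in the abstract.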

2606.03304 2026-06-03 cs.CL cs.LG 版本更新

From Script to Semantics: Prompting Strategies for African NLI

从文字到语义:非洲自然语言推理的提示策略

Anuj Tiwari, Terry Oko-odion, Hannah Nwokocha

AI总结 本研究系统评估了五种提示策略在斯瓦希里语、约鲁巴语和豪萨语的自然语言推理任务上的表现,发现对比提示策略最可靠且能显著提升模型性能。

Comments Accepted at the RAIL Workshop, LREC 2026

详情
AI中文摘要

大型语言模型(LLMs)在多语言环境中的评估日益增多,但它们在低资源非洲语言中的推理行为仍未得到充分探索,尤其是在无微调的纯提示设置下。我们使用AfriXNLI基准,对斯瓦希里语、约鲁巴语和豪萨语的自然语言推理(NLI)提示策略进行了系统研究。我们评估了五种提示策略:基线(零样本)、文字系统感知(Script-Aware)、语言特定、对比和原生标签自翻译(NL-STP),使用了两个中等规模的开放权重模型(Llama3.2-3B和Gemma3-4B)。为隔离提示设计的影响,我们的研究中排除了少样本示例和思维链推理。我们发现不同策略在各类别上的性能存在显著差异,某些配置中出现严重的中性类崩溃和高预测偏斜。对比提示被证明是最可靠的策略,在不同语言和模型上持续带来改进,类别行为更均衡,整体准确率也有所提升。值得注意的是,精心构建的提示足以击败使用少样本提示和思维链提示的更强基线。我们发现提示表述对于低资源语言的多语言NLI至关重要,并且语言感知的决策结构可以有效增强资源受限环境下的鲁棒性。

英文摘要

Large language models (LLMs) are increasingly evaluated in multilingual settings, yet their inference behavior in low-resource African languages remains underexplored, especially under pure prompting without fine-tuning. We present a systematic study of prompting strategies for Natural Language Inference (NLI) in Swahili, Yoruba, and Hausa using the AfriXNLI benchmark. We evaluate five prompting strategies, Baseline (zero-shot), Script-Aware, Language-Specific, Contrastive, and Native-Label Self-Translation (NL-STP), across two mid-sized open-weight models (Llama3.2-3B and Gemma3-4B). To isolate the effect of prompt design, we exclude few-shot examples and Chain-of-Thought reasoning from our study. We find significant differences in class-wise performance across strategies, with severe neutral-class collapse and high prediction skew in some configurations. Contrastive prompting proves to be the most reliable strategy, improving consistently across languages and models, with better-balanced class behavior and gains in overall accuracy. Notably, well-constructed prompts are sufficient to beat stronger baselines that use few-shot and Chain-of-Thought prompting. We find that prompt formulation is essential to multilingual NLI in low-resource languages and that language-aware decision structuring can meaningfully enhance robustness in resource-constrained settings.

2606.03301 2026-06-03 cs.CL cs.CV 版本更新

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

SagaQA:面向电视剧长篇叙事理解的多跳推理基准

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

发表机构 * IRIT, University of Toulouse, France(法国图卢兹大学IRIT中心) Agency for Science, Technology and Research (A*STAR), Singapore(新加坡科技研究局) CNRS, IRIT, France(法国CNRS与IRIT)

AI总结 提出SagaQA基准,通过跨剧集的多跳推理任务评估模型对完整电视剧多模态叙事的高层次理解,并比较并行、顺序和混合三种规划策略的性能。

详情
AI中文摘要

我们介绍了SagaQA,一个用于对完整电视剧进行多跳推理的长视频基准。现有的视频推理基准通常强调对相邻帧或片段的局部理解。SagaQA通过要求对整个电视剧中扩展的多模态叙事进行高层次理解来弥补这一空白。SagaQA的一个显著特征是其推理步骤的粒度。我们的数据集需要长距离推理跳跃来连接完全不同的剧集之间的信息。这要求模型对整个事件和动作进行推理,需要在多模态层面上深入理解剧集的叙事和进展。受近期智能体方法进展的启发,我们进一步研究了不同的规划策略如何处理这种复杂推理。我们将这些方法分为三类——并行规划器、顺序规划器和混合规划器——并评估它们生成连贯且完整推理计划的能力。我们在SagaQA上的结果表明,混合规划器始终能产生更高质量的计划,并在电视剧复杂、高层次叙事理解方面表现出更强的能力。

英文摘要

We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narration and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (Parallel, Sequential, and Hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.

2606.03291 2026-06-03 cs.CL 版本更新

Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility

大型语言模型中的多语言遗忘:迁移、动态与可逆性

Chaoyi Xiang, Olga Ohrimenko, Benjamin I. P. Rubinstein, Lea Frermann

发表机构 * The University of Melbourne(墨尔本大学)

AI总结 本文通过扩展TOFU基准到五种语言,研究大型语言模型中的多语言遗忘,发现遗忘迁移在不同语言间高度可变,且遗忘主要作用于后期解码层而非共享的跨语言潜在空间,因此可以通过推理时的单一引导方向逆转大部分遗忘效果。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)能够记忆敏感事实,这促使了遗忘方法的发展,旨在无需昂贵重训练的情况下移除特定知识。然而,遗忘研究仍然高度以英语为中心。我们通过将TOFU基准扩展到五种语言来研究多语言遗忘,并使用不同的语言组合对模型进行微调、遗忘和查询。我们发现遗忘迁移(即遗忘模型在非遗忘语言中“忘记”事实的能力)高度可变:例如,在共享文字和语系的语言之间最强,并且我们表明遗忘语言能够预测哪些查询语言最可能产生最强的迁移。逐层分析揭示,遗忘在早期层中基本保持了共享的跨语言潜在空间完整,而是主要作用于后期解码层。这表明遗忘并未真正擦除知识,而是引发了表面的抑制。利用这一结构,单一推理时的引导方向可以逆转大部分这种抑制,跨语言恢复50%(Qwen)和90%(Gemma)的遗忘知识。

英文摘要

Large language models (LLMs) can memorize sensitive facts, motivating unlearning methods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily English-centric. We study multilingual unlearning by extending the TOFU benchmark to five languages, and fine-tune, unlearn, and query our models with different permutations of languages. We find that unlearning transfer, the ability of an unlearned model to "forget" facts in languages other than the unlearning language, is highly variable: e.g., it is strongest between languages sharing scripts and families, and we show that the unlearning language predicts which query languages are most likely to yield the strongest transfer. Layer-wise analysis reveals that unlearning leaves the shared cross-lingual latent space largely intact in early layers, instead operating primarily in later decoding layers. This suggests that unlearning does not truly erase knowledge, but rather induces superficial suppression. Exploiting this structure, a single inference-time steering direction reverses much of this suppression across languages, recovering 50% (Qwen) and 90% (Gemma) of the unlearned knowledge.
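The abstract does not say how the inference-time steering direction is obtained. One common construction, assumed here purely for illustration, is a difference of mean hidden states between the original and the unlearned model, added back to a hidden state with a scale `alpha`. All function names, the toy 2-dimensional vectors, and the choice of `alpha` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(original_states: Sequence[Vector],
                       unlearned_states: Sequence[Vector]) -> Vector:
    """Unit vector pointing from the unlearned mean to the original mean."""
    mu_orig = mean_vector(original_states)
    mu_unl = mean_vector(unlearned_states)
    diff = [a - b for a, b in zip(mu_orig, mu_unl)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy hidden states for the same prompts under both models (assumed data).
original = [[1.0, 2.0], [3.0, 2.0]]
unlearned = [[1.0, -1.0], [3.0, -3.0]]

direction = steering_direction(original, unlearned)
restored = steer([2.0, -2.0], direction, alpha=4.0)
print(direction, restored)  # [0.0, 1.0] [2.0, 2.0]
```

In this toy case the two models differ only along the second coordinate, so a single direction is enough to move the unlearned mean back onto the original mean. This is the kind of low-dimensional suppression the abstract's finding points to, though real models would need the direction estimated per layer from actual activations.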

2606.03284 2026-06-03 cs.CL 版本更新

SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

SEA-NLI:将自然语言推理作为理解东南亚文化的透镜

Peerawat Chomphooyod, Jian Gang Ngui, Yosephine Susanto, Attapol T. Rutherford, Alham Fikri Aji, Sarana Nutanong, Can Udomcharoenchaikit, Peerat Limkonchotiwat

AI总结 提出SEA-NLI基准,通过自然语言推理评估模型对东南亚文化的理解,发现现有模型表现不佳,文化适应和提示可提升性能。

详情
AI中文摘要

前沿LLM在西方语境中表现良好,但在东南亚等代表性不足的文化中测试不足。现有的NLI基准大多以西方为中心、源自翻译或单语,限制了其衡量文化基础推理的能力。我们引入了SEA-NLI,一个原生的、基于文化的NLI基准,涵盖八个东南亚国家的英语和本地区域语言,并由母语者验证。在17个编码器和解码器模型中,我们观察到所有模型表现较低,尤其是在语言和科技等知识密集型类别中。我们的分析表明,失败案例主要源于缺乏东南亚文化知识:适应东南亚的模型和文化感知提示提升了性能,而思维链提示带来的提升有限。

英文摘要

Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual, limiting their ability to measure culturally grounded reasoning. We introduce SEA-NLI, a native, culturally grounded NLI benchmark covering eight SEA countries in English and native regional languages, verified by native speakers. Across 17 encoder and decoder models, we observe a low performance from all models, especially for knowledge-intensive categories such as Languages and Science and Technology. Our analysis shows that failure cases mainly stem from missing SEA cultural knowledge: SEA-adapted models and culture-aware prompting improve performance, while CoT prompting offers limited gains.

2606.03273 2026-06-03 cs.CV cs.AI cs.CL 版本更新

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

VistaHop: 视觉深度搜索的多跳视觉推理基准

Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin

发表机构 * East China Normal University(华东师范大学) Meituan(美团) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出VistaHop基准,通过多跳问答任务评估多模态大推理模型在视觉深度搜索中的迭代图像检查、视觉锚点定位和跨证据链推理能力,实验表明现有模型表现有限。

详情
AI中文摘要

视觉深度搜索要求多模态大推理模型(MLRM)智能体通过反复检查图像区域、将中间推理锚定在视觉证据上,并跨长推理链连接细粒度线索来回答复杂的视觉查询。然而,现有基准主要关注单步视觉理解或静态图像问答,对迭代图像检查、视觉锚点定位和多跳证据整合的评估有限。在这项工作中,我们引入了VistaHop,一个用于评估视觉深度搜索中以视觉为中心的搜索和多跳视觉推理的基准。VistaHop包含300张高分辨率图像、25个视觉搜索场景和350个多跳QA任务,这些任务要求模型跟随从视觉锚点出发的证据链,或融合跨多个基于图像的推理路径的信息。我们进一步开发了VistaArena,一个统一的评估环境,支持带有文本搜索、图像搜索、图像裁剪和基于证据的答案验证的工具增强推理。在七个代表性MLRM上的实验表明,当前模型远未解决VistaHop:最佳模型SenseNova-MARS-32B仅达到24.31%的Pass@1。这些结果揭示了在视觉定位、证据重访、长链推理和多锚点信息融合方面的持续局限性,凸显了对更强基准和训练方法的需求,以推动视觉深度搜索的发展。

英文摘要

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.

2606.03259 2026-06-03 cs.CL 版本更新

Beyond "To whom it may concern": Tailoring Machine Translation to Audience and Intent

超越“敬启者”:面向受众和意图的机器翻译定制化

Raphael Merx, Ekaterina Vylomova, Trevor Cohn

发表机构 * The University of Melbourne(墨尔本大学) Google(谷歌)

AI总结 本文通过系统评估50种语言、5种模型规模和8个文本领域,研究了大型语言模型在机器翻译中利用显式指令实现目的驱动翻译的能力,发现指令能显著提升翻译适应性,但传统指标无法评估适应质量。

详情
AI中文摘要

翻译质量取决于目的:同一源文本根据受众、语气和交际意图需要不同的翻译。然而,机器翻译模型和指标将翻译视为从源语言到目标语言的固定映射。大型语言模型使用户能够明确指定目的以及源文本,但这一能力尚未得到大规模评估。我们引入了一种跨50种语言、5种模型规模和8个文本领域的目的驱动翻译的系统评估。我们发现:(1) 显式指令显著提高了翻译的适应性,在非正式领域(对话、社交媒体)、较大模型规模和高资源语言中提升更大;(2) 指令优于语义匹配的少样本示例和段落级上下文;(3) 传统机器翻译指标无法捕捉适应质量,通常惩罚适应性翻译;(4) 当没有精心设计的指令时,模型可以从周围文档上下文中自我生成指令,缩小高达80%的与精心设计指令的适应性差距。我们的结果表明,目的适应型机器翻译是大型语言模型的一种可行且可衡量的能力,同时强调了需要目的感知的指标。

英文摘要

Translation quality depends on purpose: the same source text demands different translations depending on audience, tone, and communicative intent. Yet MT models and metrics treat translation as a fixed mapping from source to target. LLMs enable users to explicitly specify purpose alongside source text, yet this capability has not been evaluated at scale. We introduce a systematic evaluation of purpose-driven MT across 50 languages, 5 model sizes and 8 text domains. We find that (1) explicit instructions substantially improve translation adaptedness, with larger gains on informal domains (conversation, social media), for larger model sizes and for higher-resource languages; (2) instructions outperform semantically-matched few-shot examples and paragraph-level context; (3) traditional MT metrics fail to capture adaptation quality, often penalizing adapted translations; (4) when curated instructions are unavailable, models can self-generate them from surrounding document context, closing up to 80% of the adaptedness gap to curated instructions. Our results establish that purpose-adapted MT is a viable and measurable capability of LLMs, while highlighting the need for purpose-aware metrics.

2606.03250 2026-06-03 cs.CL 版本更新

The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP

词与道:德语医学NLP中领域特定BERT预训练的策略

Henry He, Johann Frei, Raphael Schmitt

发表机构 * School of Computation, Information and Technology(计算、信息与技术学院) Technical University of Munich(慕尼黑工业大学) Chair of IT Infrastructure for Translational Medical Research(转化医学研究信息技术基础设施教席) Faculty of Applied Computer Science(应用计算机科学学院) University of Augsburg(奥格斯堡大学) Institute of General Practice(全科医学研究所) Faculty of Medicine and Medical Center(医学院与医学中心)

AI总结 本文提出ChristBERT系列模型,通过对比持续预训练、从头训练和领域词汇适应三种策略,在德语医学NLP任务中实现最优性能,并建立新的基准。

Comments Under revision at BMC Medical Informatics and Decision Making

详情
AI中文摘要

数字医疗产生大量临床文本,可支持AI辅助应用,但德语生物医学语言模型仍受限于较旧的架构或受限的训练数据。我们提出了ChristBERT(临床与健康相关议题及主题调优BERT),这是一个基于德语RoBERTa的领域特定语言模型家族,在包含科学出版物、临床文本、健康相关网络内容和翻译临床资源的13.5GB语料库上训练。为了探究领域适应策略在德语临床NLP中的影响,我们比较了持续预训练、从头训练和领域词汇适应。所得模型在三个医学命名实体识别任务和两个文本分类任务上进行了评估。ChristBERT在五个基准中的四个上持续优于现有的通用和医学德语语言模型,并为德语临床语言建模建立了新的最先进水平。我们的结果表明,最优适应策略取决于任务:在我们的评估中,从头训练对高度专业化的临床文本特别有效,而持续预训练在更常见的医学文本上表现良好。所有模型均已公开发布,以支持德语医学NLP的未来研究和应用。

英文摘要

Digital healthcare generates vast amounts of clinical text that can support AI-assisted applications, yet German biomedical language models remain limited by older architectures or restricted training data. We present ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT), a family of domain-specific German RoBERTa-based language models trained on a 13.5GB corpus of scientific publications, clinical texts, health-related web content, and translated clinical resources. To investigate the impact of domain adaptation strategies in German clinical NLP, we compare continued pre-training, training from scratch, and domain-specific vocabulary adaptation. The resulting models are evaluated on three medical named entity recognition tasks and two text classification tasks. ChristBERT consistently outperforms existing general-purpose and medical German language models on four of five benchmarks and establishes a new state of the art for German clinical language modeling. Our results show that the optimal adaptation strategy is task-dependent: in our evaluation, training from scratch is particularly effective for highly specialized clinical texts, whereas continued pre-training performs well on more commonly written medical texts. All models are publicly released to support future research and applications in German medical NLP.

2606.03247 2026-06-03 cs.CL cs.IR 版本更新

Structures Facilitate Retrieve, Rerank, and Generate

结构促进检索、重排序和生成

Yeqin Zhang, Haomin Fu, Xujie Zhang, Cam-Tu Nguyen

AI总结 提出SF-Re2G方法,通过利用文档结构信息改进段落表示、构建结构增强的重排序器并融入子图上下文,以提升文档对话系统的检索、重排序和生成性能。

详情
AI中文摘要

文档对话系统(DGDS)利用外部文档中的知识来回答特定领域的用户问题。现有解决方案通常将文档划分为独立的段落进行检索和响应生成。然而,这种方法既没有充分利用文档内的结构信息,也没有为知识选择和响应提供足够的(文档)上下文。本文提出SF-Re2G来系统地解决这些问题。首先,我们通过将段落与同一章节的其他段落进行对比来改进段落表示,从而提高检索性能。其次,构建了一个结构增强的重排序器,利用同一对话轮次的多个基础段落往往位于同一邻近区域的事实。具体来说,来自检索的候选者根据文档结构被分组为子图。重排序器将结合其组信息对候选者重新评分。最后,选中的段落用于生成响应,同时考虑子图上下文以改进生成。在两个DGDS数据集上的实验结果验证了我们的方法在中文和英文上的有效性。

英文摘要

Document-grounded dialogue systems (DGDS) utilize knowledge from external documents to answer domain-specific user questions. Existing solutions typically divide documents into independent passages for retrieval and response generation. This approach, however, neither makes good use of structural information within documents nor provides enough (document) context for knowledge selection and responses. This paper proposes SF-Re2G to address such issues systematically. Firstly, we seek to improve a passage representation by contrasting it with others of the same section, thus improving the retrieval performance. Secondly, a structure-enhanced reranker is built, leveraging the fact that multiple grounding passages of one dialog turn tend to be in the same neighborhood. Specifically, candidates from the retrieval are grouped into subgraphs according to the document structure. The reranker will rescore the candidate integrating its group information. Finally, the chosen passages are used for responses, taking into account the subgraph context for better generation. Experimental results on two DGDS datasets validate our method for both Chinese and English.
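The reranking step described above groups retrieved candidates by document structure and rescores each one using information from its group. The abstract does not give the scoring rule, so the sketch below assumes a simple one: blend each candidate's retrieval score with the mean score of its section. The function name, the `weight` parameter, the section ids, and the scores are all hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (passage_id, section_id, retrieval_score)


def structure_aware_rerank(candidates: List[Candidate],
                           weight: float = 0.3) -> List[Tuple[str, float]]:
    """Blend each score with its section's mean score, then sort descending."""
    groups = defaultdict(list)
    for _, section, score in candidates:
        groups[section].append(score)
    rescored = []
    for pid, section, score in candidates:
        group_mean = sum(groups[section]) / len(groups[section])
        rescored.append((pid, (1 - weight) * score + weight * group_mean))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Assumed retrieval output: p1 scores highest alone, but its section is weak.
candidates = [
    ("p1", "sec-A", 0.80),
    ("p2", "sec-B", 0.78),
    ("p3", "sec-B", 0.76),
    ("p4", "sec-A", 0.20),
]

ranked = structure_aware_rerank(candidates)
print([pid for pid, _ in ranked])  # ['p2', 'p3', 'p1', 'p4']
```

Two neighboring passages from the same section reinforce each other and overtake an isolated high scorer. This mirrors the observation that grounding passages for one dialogue turn tend to sit in the same neighborhood.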

2606.03244 2026-06-03 cs.CL 版本更新

When Does Complexity Conditioning Help a Frozen Sentence Embedding? A Controlled Study of Per-Sentence and Pair-Level Difficulty Adaptation

何时复杂度调节对冻结句子嵌入有帮助?基于逐句和句子对难度适配的受控研究

Suhwan Hwang

发表机构 * Suhwan Hwang

AI总结 通过受控实验研究冻结句子编码器后接轻量适配器时,基于句子级和句子对级难度信号的调节效果,发现句子对级残差门控在较大和分级任务上持续提升性能,而句子级方法无效。

Comments 13 pages, 3 figures, 2 tables

详情
AI中文摘要

一个常见的直觉是句子嵌入应适应输入的难度。我们在受控的多随机种子设置中测试这一直觉:一个轻量后编码器适配器附加到冻结的Qwen3-Embedding-0.6B编码器上,仅访问其最终池化嵌入,并在四个释义和语义相似度任务(PAWS、MRPC、QQP、STS-B)上评估。该想法的朴素形式失败:基于表面的逐句复杂度与冻结基线误差几乎不相关(Pearson约0.05),且相比常数或打乱对照无优势,同时降低饱和基线。即使目标与非循环的句子对难度信号对齐,逐句门控仍无法可靠捕获难度,因为难度主要是句子对的属性,而非单个句子。相比之下,由留出的交叉编码器难度信号门控的小型句子对级残差在较大和分级任务上持续提升,包括STS-B上+0.022 Spearman和QQP上+0.037,同时所有随机种子均锚定于冻结基线。由于这种有用形式操作于句子对而非单个句子,所得模型最好理解为缓存冻结嵌入上的轻量重排序器,而非替代的单向量嵌入;我们不声称达到最先进。我们的贡献是对难度感知适配何时有帮助何时失败的受控说明,以及预测可用余量的预训练诊断。

英文摘要

A common intuition is that sentence embeddings should adapt to the difficulty of the input. We test this intuition in a controlled, multi-seed setting: a lightweight post-encoder adapter attaches to a frozen Qwen3-Embedding-0.6B encoder, accessing only its final pooled embedding, and is evaluated on four paraphrase and semantic-similarity tasks (PAWS, MRPC, QQP, STS-B). The naive form of the idea fails: surface-based per-sentence complexity is nearly uncorrelated with frozen-baseline error (Pearson approximately 0.05) and provides no advantage over constant or shuffled controls, while degrading a saturated baseline. Even when the target is aligned to a non-circular pair-difficulty signal, the per-sentence gate still cannot reliably capture difficulty because difficulty is primarily a property of the pair, not the individual sentence. In contrast, a small pair-level residual gated by a held-out cross-encoder difficulty signal yields consistent gains on the larger and graded tasks, including +0.022 Spearman on STS-B and +0.037 on QQP, while remaining anchored to the frozen baseline across all seeds. Because this useful form operates on sentence pairs rather than individual sentences, the resulting model is best understood as a lightweight re-ranker over cached frozen embeddings, not a replacement single-vector embedding; we make no state-of-the-art claim. Our contribution is a controlled account of when difficulty-aware adaptation helps and when it fails, together with a pre-training diagnostic that predicts the available headroom.
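To make the pair-level idea concrete, here is a minimal sketch of a gated residual on top of a frozen cosine similarity: the base score stays fixed, and a correction is let through only to the extent that a difficulty signal opens the gate. The sigmoid gate, its parameters, and every name below are assumptions for illustration; the abstract does not specify the adapter's functional form.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_pair_score(emb_a: Sequence[float], emb_b: Sequence[float],
                     difficulty: float, residual: float,
                     gate_weight: float = 4.0, gate_bias: float = -2.0) -> float:
    """Frozen cosine similarity plus a difficulty-gated pair-level correction."""
    base = cosine(emb_a, emb_b)                     # frozen-encoder score
    gate = sigmoid(gate_weight * difficulty + gate_bias)
    return base + gate * residual


emb_a, emb_b = [1.0, 0.0], [1.0, 1.0]               # toy cached embeddings
score = gated_pair_score(emb_a, emb_b, difficulty=0.5, residual=-0.2)
print(round(score, 4))  # 0.6071
```

With `difficulty=0.5` the gate equals 0.5, so half of the residual is applied to the base similarity of about 0.7071. When the residual is zero the score falls back to the frozen baseline exactly, which is one way to read "anchored to the frozen baseline".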

2606.03241 2026-06-03 cs.CL eess.AS 版本更新

Benchmarking Speech-to-Speech Translation Models

语音到语音翻译模型基准测试

Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo

发表机构 * Sony Group Corporation, Japan(日本索尼集团公司) Carnegie Mellon University, USA(美国卡内基梅隆大学)

AI总结 提出统一可复现的基准框架COMPASS,集成46个指标评估语音到语音翻译模型,通过相关性过滤将指标缩减至10个,并验证了领域特定指标与人类判断的高度相关性。

Comments Paper under submission

详情
AI中文摘要

语音到语音翻译(S2ST)已取得快速进展,但离线评估缺乏统一协议:研究报告非重叠的指标子集,阻碍了直接比较。我们引入COMPASS,一个统一且可复现的基准测试框架,集成了跨八个维度的46个指标,并将其部署在来自FLEURS和CVSS的1,248个模型-语言配置上,涵盖级联和端到端架构的十种语言对。架构表现出互补优势:最佳与最差之间的差距在自然度和说话人保留方面超过30%,但在翻译质量上仅相差几个百分点,因此单一指标排名系统地歪曲了系统质量。相关性过滤将46个指标减少到每个方向10个,其中三个轴在X→EN和EN→X上需要不同的指标(例如,TER/UTMOS vs. ChrF++/NISQA-MOS);这些子集保留了排名(Spearman's ρ>0.80),同时将评估时间减少了约2.5倍。在配音、播客和医学领域的人类验证表明,独立的MOS预测器无法预测听众偏好,而顶级领域特定指标与人类判断相关(ρ≥0.90)。我们发布COMPASS作为领域感知S2ST评估的基础。

英文摘要

Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X→EN and EN→X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's ρ > 0.80) while cutting evaluation time by ≈2.5×. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment (ρ ≥ 0.90). We release COMPASS as a foundation for domain-aware S2ST evaluation.
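The correlation-filtering step described in the abstract can be illustrated with a small sketch. The abstract does not give the actual COMPASS procedure, so the greedy rule, the 0.9 threshold, the function names, and the toy scores below are all assumptions made for illustration: a metric is kept only if its Spearman rank correlation with every already-kept metric stays below the threshold.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for constant inputs."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def filter_metrics(scores, threshold=0.9):
    """Greedy filter (hypothetical rule): keep a metric only if it is not
    strongly rank-correlated with any metric that was kept before it."""
    kept = []
    for name in scores:
        if all(abs(spearman(scores[name], scores[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy per-system scores (invented numbers, one value per system).
toy_scores = {
    "BLEU": [10, 20, 30, 40],
    "ChrF++": [11, 19, 33, 41],      # ranks the systems exactly like BLEU
    "UTMOS": [3.9, 2.1, 3.5, 2.8],   # ranks them differently
}
selected = filter_metrics(toy_scores)
```

In this toy case `ChrF++` is dropped because it orders the four systems identically to `BLEU`, while `UTMOS` survives. A greedy filter depends on the iteration order, so a real pipeline would need a principled ordering of candidate metrics.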

2606.03239 2026-06-03 cs.CL 版本更新

ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

ARBOR: 通过可复用评分标准缓冲区的在线过程奖励用于搜索智能体

Zheng Liu, Longxiang Zhang, Xintong Wang, Zhiang Xu, Shaoxiong Zhan, Xin Shan, Wen Huang, Tao Dai, Shu-Tao Xia, Chengfu Huo, Liang Ding

发表机构 * Tsinghua University(清华大学) Alibaba Group(阿里巴巴集团) Peking University(北京大学) Shenzhen University(深圳大学)

AI总结 针对基于LLM的搜索智能体训练中结果奖励缺乏过程监督的问题,提出ARBOR框架,通过维护跨查询共享的评分标准记忆库,利用对比轨迹生成局部草案并整合为通用评分标准,以稀疏成对判断提供过程级梯度,在四个多跳QA基准上优于GRPO和DAPO基线。

详情
AI中文摘要

基于LLM的搜索智能体主要使用结果奖励进行训练,搜索过程本身无监督。这种信号在结果同质组(所有采样轨迹共享相同正确性)中退化,产生零组内优势和无梯度。现有的过程监督要么训练昂贵的验证器,要么生成每个查询的评分标准,这些评分标准在查询间不一致且使用一次后丢弃。我们提出ARBOR(自适应评分标准缓冲区用于在线奖励),一种可复用的过程奖励框架,维护跨查询共享的评分标准记忆。由对比轨迹诱导的查询局部草案被接纳、整合为跨查询通用评分标准,并随策略演化而淘汰。一小部分活跃的通用评分标准通过稀疏成对判断对轨迹评分,所得分数加到基础奖励上,即使在结果奖励一致时也能提供过程级梯度。ARBOR在四个多跳QA基准上持续优于GRPO和DAPO基线,将LLM评判准确率平均提高最多4.2个百分点,并将最多42%的原本零梯度训练组转化为信息丰富的组。

英文摘要

LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.
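The degenerate case the abstract mentions (zero within-group advantage on outcome-homogeneous groups) can be seen in a few lines. This is a minimal sketch of the standard group-normalized advantage used in GRPO-style training, not ARBOR's implementation; the rubric scores, the `weight` parameter, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shaped_rewards(outcome_rewards, rubric_scores, weight=0.5):
    """Add a weighted process score to each outcome reward (assumed form)."""
    return [o + weight * s for o, s in zip(outcome_rewards, rubric_scores)]


# All four sampled trajectories answered correctly: the outcome reward is uniform.
outcome = [1.0, 1.0, 1.0, 1.0]
flat = group_advantages(outcome)  # every advantage is 0.0, so no gradient

# Hypothetical rubric scores for the same four trajectories.
rubric = [0.2, 0.8, 0.5, 0.5]
informative = group_advantages(shaped_rewards(outcome, rubric))
```

With the process term added, the trajectory with the best rubric score receives a positive advantage and the worst a negative one, even though all outcomes are identical.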

2606.03237 2026-06-03 cs.AI cs.CL cs.CY cs.LG cs.MA 版本更新

Solipsistic Superintelligence is Unlikely to be Cooperative

唯我论超级智能不太可能合作

Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo

发表机构 * DeepMind University of Cambridge(剑桥大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文指出,基于唯我论方法设计的超级智能(极端能力的任务求解器)因忽视部署引发的内生非平稳性而难以合作,呼吁将相互依存作为核心设计原则的非唯我论研究范式。

Comments 24 pages, 1 figure, Accepted at Proceedings of the 43rd International Conference on Machine Learning, 2026

详情
AI中文摘要

AI的核心挑战正从能力转向共存。AI研究的主导范式侧重于开发将世界视为外生且平稳反馈源的强大智能体。我们认为,源于这种唯我论AI设计方法的超级智能(极端能力的任务求解器)不太可能合作。部署AI系统会引发内生非平稳性,导致训练-测试-部署差距,即历史分布与部署环境相偏离。我们称此为单边优化的自我削弱属性。缩小这一差距需要参与合作的AI:即多个行为体导航其相互依存的均衡选择过程。我们呼吁一种非唯我论的研究范式,将这种相互依存作为核心设计原则,而非将合作视为待解决的任务。这需要构建涉及自适应对手方的动态评估测试平台,将制度视为设计原语,并保留人类能动性作为我们构建系统的结构性特征。

英文摘要

AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.

2606.03220 2026-06-03 cs.CL cs.AI 版本更新

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

WebRISE: 面向MLLM生成Web工件的需求诱导状态评估

Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

发表机构 * Tsinghua University(清华大学) Huawei Noah’s Ark Lab(华为诺亚实验室) East China Normal University(华东师范大学) Tongji University(同济大学) Institute of Artificial Intelligence, Beihang University(北京航空航天大学人工智能研究院)

AI总结 提出WebRISE框架,通过交互契约图(ICG)将任务需求转化为可观察状态、用户意图转换和DOM/视觉断言,以评估MLLM生成的Web工件的功能正确性,实验表明ICG评分检测状态错误率是检查点评估的2-16倍。

详情
AI中文摘要

现有的MLLM生成Web工件基准通过局部证据评估交互,忽略了决定页面是否正常工作的需求诱导状态和转换。我们提出WebRISE,它将任务需求编译成交互契约图(ICG),包含可观察状态、用户意图转换以及DOM/视觉断言,以实现与实现无关的浏览器执行。WebRISE涵盖五种输入模态(文本、Markdown、草图、图像、视频)下的442个任务,包含5,495个转换和5,271个需求检查,将用户声明的功能与隐式的产品级约束分开。在14个MLLM中,即使最强的模型也仅达到65.6%的转换有效性和66.3%的需求覆盖率,且视觉质量不能代表行为(Qwen3.6-35B-A3B在Markdown上:V=80.8但T=15.5)。视频提供了最强的交互信号(隐式覆盖率比文本高10.6个百分点),而隐式约束仍然存在;缺陷注入表明,基于ICG的评分检测状态错误的速率是检查点评估的2-16倍。

英文摘要

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.

2606.03219 2026-06-03 cs.CL cs.LG 版本更新

Sample-Size Scaling of the African Languages NLI Evaluation

非洲语言自然语言推理评估的样本量缩放

Anuj Tiwari, Oluwapelumi Ogunremu, Terry Oko-odion, Jesujuwon Egbewale, Hannah Nwokocha

发表机构 * Noida Institute of Engineering and Technology(诺伊达工程技术学院) ML Collective(机器学习集体)

AI总结 本研究通过AfriXNLI基准对16种非洲语言进行系统样本量缩放实验,发现NLI性能随样本量增加并非单调提升,而是呈现语言敏感且非单调的缩放行为,表明数据量不足以保证稳定收益,需语言敏感的数据集和更强多语言建模策略。

Comments Accepted at the AfricaNLP Workshop, EACL 2026

详情
AI中文摘要

非洲语言标注数据非常少,且增加标注数据量是否能可靠提升下游性能尚不明确。本研究基于AfriXNLI基准,对16种非洲语言进行了自然语言推理(NLI)的系统样本量缩放研究。在受控条件下,测试了两个约0.6B参数的多语言Transformer模型(在XNLI上微调的XLM-R Large和AfroXLM-R Large),样本量从50到500个标注示例不等,并在随机子采样运行中平均结果。与通常认为的随数据增加性能单调提升相反,我们发现了一种强烈语言敏感且通常非单调的缩放行为。一些语言在低资源场景下表现出早期饱和或性能下降,以及高方差。这些结果表明,数据量不足以保证非洲NLI的稳定收益,因此需要创建语言敏感的数据集和更强的多语言建模策略。

英文摘要

African languages have very little labelled data, and it is unclear whether increasing the quantity of annotated data reliably improves downstream performance. We present a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters, XLM-R Large fine-tuned on XNLI and AfroXLM-R Large, are tested on sample sizes between 50 and 500 labelled examples, with results averaged across random subsampling runs. Contrary to the usual assumption that performance increases monotonically with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or a decrease in performance as sample size grows, as well as high variance in low-resource regimes. These results indicate that data volume alone is not enough to guarantee stable gains for African NLI, motivating the creation of language-sensitive datasets and stronger multilingual modelling strategies.

2606.03198 2026-06-03 cs.CL cs.AI 版本更新

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

AI评分者的区分能力取决于复杂临床决策中的评分协议

Sangwon Baek, Kyu Yeon Hur, Kyunga Kim

发表机构 * Asclep Korea Inc.(Asclep韩国公司) Center for Data Science, New York University(纽约大学数据科学中心) Division of Endocrinology and Metabolism, Department of Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine(成均馆大学医学院内分泌与代谢科,三星医疗中心) Biomedical Statistics Center, Samsung Medical Center(三星医疗中心生物医学统计中心) Department of Digital Health, SAIHST(SAIHST数字健康科) Department of Data Convergence & Future Medicine, Sungkyunkwan University(成均馆大学数据融合与未来医学科)

AI总结 通过因子研究,发现基于评分标准的协议能放大AI评分者区分能力,而无评分标准协议则抑制这种区分,支持在临床AI评估中使用评分标准锚定。

Comments 11 pages, 4 main figures, 8 supplementary figures, 9 supplementary tables

详情
AI中文摘要

临床AI评估越来越多地委托给大型语言模型(LLMs)作为AI评分者进行评分,但其在不同评估条件下的评分行为尚未被定量表征。我们通过一项因子研究填补了这一空白,该研究关注成人2型糖尿病(T2D)药物治疗在12个月门诊随访中的AI评分者行为,这是一项涉及复杂决策的临床任务,通过七个评估问题操作化。四个开源LLMs同时作为临床决策支持系统(CDSS)模型和AI评分者。每个CDSS输出在两种评分协议下评分:基于评分标准的Gold Rubric(GR)协议(包含患者特定评分标准)和无评分标准的Non Gold Rubric(Non-GR)协议。线性混合效应模型将评分协议因子与五个设计因子(CDSS模型、CDSS提示配置(文档参考生成[DRG] vs. 基线)、评分者模型、提示字符和提示类型)交叉,并估计主效应及其协议交互。在所有问题中,AI评分者在Non-GR下始终给出非常窄范围内的更高分数(平均74-78分),而GR下的平均分数低7.69至49.64分,四分位距宽1.68至3.67倍。在每个问题内,GR将AI评分者对DRG和基线CDSS输出的区分能力放大了1.76至5.10倍,同时揭示了Non-GR抑制的评分者模型间的显著行为变异。这些发现支持评分标准锚定作为保留临床AI评估区分能力的评分协议;当问题需要患者特定或司法管辖区特定标准,而评分者模型无法仅从参数知识推断时,无评分标准评分无法替代。

英文摘要

Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors, namely CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type, and estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74-78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.

2606.03197 2026-06-03 cs.CL 版本更新

MemTrain: Self-Supervised Context Memory Training

MemTrain:自监督上下文记忆训练

Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(通用人工智能国家重点实验室,智能科学与技术学院,北京大学) Samsung Research, Beijing, China(三星研究院,北京,中国)

AI总结 提出自监督框架MemTrain,通过两个耦合代理任务(端到端掩码重建和中间记忆召回)利用无标签维基百科语料增强LLM代理的上下文记忆能力,显著提升下游长文本推理性能。

详情
AI中文摘要

记忆是长时程LLM代理不可或缺的能力,使其能够保留和利用跨扩展交互积累的信息。现有的记忆代理方法通常在下游任务上通过强化学习进行端到端训练。然而,为记忆密集型场景收集高质量标注问题成本高昂,且所得训练数据往往缺乏足够的多样性以覆盖通用记忆行为。在这项工作中,我们提出MemTrain,一种自监督训练框架,用于普遍增强LLM代理的上下文记忆能力,以实现更有效的下游后训练。MemTrain在无标签维基百科语料上引入两个耦合代理任务:(1)端到端掩码重建目标,要求模型在多轮记忆更新后恢复掩码实体,从而从最终结果角度鼓励记忆维持;(2)中间记忆召回目标,要求模型利用中间记忆状态重建掩码历史信息,鼓励整个交互过程中的忠实压缩和记忆完整性。两个目标通过GRPO联合优化。在长文本QA和基于搜索的QA基准上的大量实验表明,MemTrain在不同模型上持续改善下游记忆密集型推理性能,相比直接的任务特定后训练,提升高达17.67个点。

英文摘要

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework for generally enhancing the context-memory capability of LLM agents for more effective downstream post-training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks demonstrate that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.

2606.03180 2026-06-03 cs.CV cs.CL cs.LG 版本更新

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

GLINT:面向细粒度放射学表征的稀疏门控视觉-语言对齐

Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi

AI总结 针对放射学图像-报告全局对齐与局部病灶尺度不匹配的问题,提出GLINT框架,通过稀疏门控对齐和密集特征正则化实现零样本分类、定位和分割。

详情
AI中文摘要

放射学中的视觉-语言模型(VLM)通过利用临床工作流程中自然产生的图像-报告对,已成为一种可扩展的范式。然而,这种配对揭示了尺度上的不匹配:每个病灶仅占据图像的一小部分区域,但监督仅在全局图像-报告级别提供。这带来了一个核心挑战:先前的方法将权重密集地分布到所有补丁上,而不是集中在与给定查询相关的稀疏子集上。为了解决这个问题,我们提出了GLINT(门控语言-图像对齐)框架,该框架显式建模这种稀疏对应关系。在对齐方面,我们引入了稀疏门控对齐,这是一种新颖的架构,其中在单独的门控嵌入空间上的sigmoid门仅激活与每个文本查询相关的补丁,强制执行显式稀疏性。在表征方面,我们添加了密集特征正则化,将可训练编码器的中间特征锚定到冻结的自监督学习(SSL)教师模型上,从而保留门控所依赖的细粒度补丁特征。相同的方案适用于2D胸部X光片(CXR)和3D胸部计算机断层扫描(CT),分别基于DINOv3和V-JEPA 2.1构建。GLINT支持从自由文本查询进行零样本分类、定位和分割,据我们所知,这是首次在没有掩码监督的情况下在3D CT体积上展示零样本分割。值得注意的是,最显著的增益出现在零样本定位和分割上,这些任务需要稀疏的、特定于查询的定位,这与我们的设计意图一致。在下游评估中,GLINT在分类、报告生成和分割方面均优于SSL编码器和医学VLM。

英文摘要

Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.
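To make the gating idea concrete, here is a minimal sketch of a per-patch sigmoid gate, written under several assumptions: the gate logit is a scaled dot product between a patch's gate embedding and the query's gate embedding plus a negative bias, and the gated patches are pooled by a weighted average. The `scale` and `bias` values, the function names, and the toy vectors are invented for illustration and are not taken from GLINT.

```python
import math


def sigmoid(x):
    """Logistic function (fine for the small logits used in this toy example)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_pool(patch_gate_embs, patch_feats, query_gate_emb, scale=4.0, bias=-2.0):
    """Score each patch independently with a sigmoid gate, then pool the
    patch features with the gates as weights (hypothetical formulation)."""
    gates = [sigmoid(scale * dot(p, query_gate_emb) + bias) for p in patch_gate_embs]
    total = sum(gates)
    dim = len(patch_feats[0])
    pooled = [
        sum(g * feat[d] for g, feat in zip(gates, patch_feats)) / total
        for d in range(dim)
    ]
    return gates, pooled


# Three toy patches: the first matches the query direction, the others do not.
patch_gate_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
patch_feats = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query_gate_emb = [1.0, 0.0]

gates, pooled = gated_pool(patch_gate_embs, patch_feats, query_gate_emb)
```

Unlike a softmax over patches, each sigmoid gate is computed independently, so a query can switch on one patch, several, or none; the negative bias keeps gates closed by default, which is one simple way to encourage sparsity.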

2606.03179 2026-06-03 cs.CL 版本更新

HyperPatch: Sequential Knowledge Editing Under n-ary Structural Drift

HyperPatch: n元结构漂移下的顺序知识编辑

Yu-Kai Chan, Wen-Sheng Lien, Dong-Ting Yao, Bo-Kai Ruan, Kwan-Yeung Lin, Hong-Han Shuai, Meng-Fen Chiang

发表机构 * National Yang Ming Chiao Tung University

AI总结 针对非平稳环境中n元事件顺序更新引发的结构漂移问题,提出HyperPatch框架,通过超图流形上的稳定性建模,实现参数保留的知识编辑,在MQuAKE-CF和MQuAKE-T基准上分别取得96.24%和21.06%的跳步准确率相对提升。

Comments Accepted to Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

大型语言模型(LLMs)依赖知识编辑(KE)来维持时间有效性,然而现实世界中的知识本质上是n元的。我们证明,在非平稳环境中,对复杂关系的顺序更新会引发n元结构漂移,这是一种将n元事件二元化为三元组会破坏关系原子性的现象。这导致结构条件知识迁移失败,即检索器系统性错误接地,常被误诊为参数幻觉。为解决此问题,我们提出HyperPatch,一个参数保留框架,将顺序KE重新表述为超图流形上的稳定性问题。HyperPatch通过三个阶段保持事件完整性:(i)结构先验初始化,通过在超图神经网络(HGNN)上进行对比学习建立拓扑感知的嵌入空间,以捕获高阶相关性;(ii)顺序拓扑编辑,利用双阶段机制,采用基于SimHash的拓扑对齐进行快速冲突解决,以及拓扑LoRA自适应来跟踪漂移而无需骨干重训练;(iii)结构条件推理,整合来自融合语言和结构流形的全局一致证据。在MQuAKE-CF和MQuAKE-T基准上,HyperPatch在跳步准确率(H-Acc)上分别比最强基线相对提升96.24%和21.06%。进一步的消融实验表明,在连续n元更新流下具有卓越的可靠性,而标准基于KG的变体因结构错位导致H-Acc下降高达88.3%。

英文摘要

Large Language Models (LLMs) rely on Knowledge Editing (KE) to maintain temporal validity, yet real-world knowledge is inherently n-ary. We demonstrate that in non-stationary environments, sequential updates to complex relations induce N-ary Structural Drift, a phenomenon where the binary reification of n-ary events into triples fractures relational atomicity. This precipitates Structure-Conditioned Knowledge Transfer Failure, a systematic mis-grounding of the retriever frequently misdiagnosed as parametric hallucination. To tackle this, we propose HyperPatch, a parameter-preserving framework that reformulates sequential KE as a stability problem over hypergraph manifolds. HyperPatch preserves event integrity through three phases: (i) Structural Prior Initialization, establishing a topology-aware embedding space via contrastive learning on a Hypergraph Neural Network (HGNN) to capture high-order correlations; (ii) Sequential Topology Editing, utilizing a dual-stage mechanism that employs SimHash-based Topological Alignment for rapid conflict resolution and Topological LoRA Adaptation to track drift without backbone retraining; and (iii) Structure-Conditioned Reasoning, which integrates globally consistent evidence from fused linguistic and structural manifolds. On the MQuAKE-CF and MQuAKE-T benchmarks, HyperPatch achieves relative gains in Hop-wise Accuracy (H-Acc) of 96.24% and 21.06% over the strongest baseline, respectively. Further ablations demonstrate superior reliability under continuous n-ary update streams, whereas the standard KG-based variant suffers H-Acc collapses of up to 88.3% due to structural misalignment.
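The abstract names SimHash as the basis for its conflict-resolution step but does not describe how it is applied. The following is a generic SimHash sketch, not HyperPatch's Topological Alignment: each n-ary event is treated as a bag of string features (the role-prefixed feature format is an assumption), and two events are compared through the Hamming distance of their fingerprints.

```python
import hashlib


def simhash(features, bits=64):
    """Classic SimHash: every feature votes +1/-1 on each bit position,
    and the fingerprint keeps the bits with a positive total."""
    totals = [0] * bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for b in range(bits):
            totals[b] += 1 if (value >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if totals[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical n-ary events written as role:value features.
event_old = ["subject:Alice", "role:CEO", "org:Acme", "start:2020"]
event_new = ["subject:Alice", "role:CEO", "org:Acme", "start:2024"]

fp_old = simhash(event_old)
fp_new = simhash(event_new)
distance = hamming(fp_old, fp_new)
```

Events that share most of their features tend to produce fingerprints with a small Hamming distance, which is what makes SimHash useful as a cheap first-pass check for candidate conflicts; any distance threshold would have to be tuned on real data.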

2606.03165 2026-06-03 cs.CL cs.AI 版本更新

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

大型语言模型中词汇对齐和偏好阶段转变的完全自动识别

Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez

发表机构 * University of Washington(华盛顿大学)

AI总结 本文提出两种无需人工干预的评估指标——词汇对齐分数和三角化偏好转变,用于自动识别大型语言模型中的词汇过度使用及其与人类偏好学习的关联。

Comments 16 pages, 2 figures, 10 tables

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 6116-6131
AI中文摘要

数字聊天助手(如ChatGPT)使用的语言可能与人类预期存在偏差(不对齐)。主要针对科学英语的研究已经描述了出现的偏差以及在一定程度上解释了原因,将其与人类偏好学习的训练阶段联系起来。然而,现有方法依赖于人工筛选。本文引入了两种无需筛选、假设较少的评估指标:词汇对齐分数(识别词汇过度使用)和三角化偏好转变(量化此类转变中有多少可归因于人类偏好学习)。使用PubMed摘要,生成了续写,并通过六个模型系列(Falcon、Gemma、Llama、Mistral、OLMo、Yi)的滑动窗口文档频率进行测量。该过程无需人工干预即可识别过度使用的词汇,如'suggest'、'additionally'和'strategy',并估计它们与偏好学习的关联。我们的发现重复了先前的工作,并且在参数设置、随机种子以及进一步数据的评估中保持稳定。该方法易于扩展,能够系统研究科学英语之外以及跨语言的词汇(不)对齐,因此,这些指标有潜力为未来模型改进对齐并理解其起源做出贡献。

英文摘要

The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.

2606.03156 2026-06-03 cs.CL 版本更新

A cross-domain tropical species dataset with Chinese vernacular names and CITES source links

一个包含中文俗名和CITES来源链接的跨领域热带物种数据集

Jeff Wang

发表机构 * NEXLY LLC, United States(NEXLY LLC美国)

AI总结 构建了一个覆盖410,499个活跃热带物种的跨领域数据集,整合多源分类标识符,添加跨领域本体、中文俗名层(覆盖率99.50%)和CITES来源链接层,并报告了初步内部审查结果。

Comments 25 pages, 4 figures, 4 tables. Dataset descriptor for the Tropical Species Encyclopedia. Companion to the methodology paper arXiv:2606.00994. Dataset deposited at Zenodo (doi:10.5281/zenodo.20377811); canonical preprint-of-record at Zenodo (doi:10.5281/zenodo.20424981)

详情
AI中文摘要

我们描述了一个版本化的跨领域数据集,包含410,499个活跃热带物种(工作快照2026-04-20),涵盖三个应用子领域(tropical_plants、tropical_aquatic和tropical_pets),这些领域共享商业和监管生命周期,但分布在按界组织的生物多样性基础设施中。该资源整合了来自GBIF、世界在线植物志、iNaturalist、NCBI分类学、生命目录和生命百科全书的分类标识符,并添加了三个原始层:一个跨领域本体,根据贸易和饲养背景重新划分分类群;一个中文俗名层,在排除未经验证的机器生成建议的类型学下,提供明确的每个名称来源;以及一个CITES来源链接层,将每个分类群连接到其Species+条目。中文俗名覆盖率(即带有与科学双名法不同的中文名称的分类群比例)达到99.50%(410,499个中的408,456个;全量计数)。覆盖率表征完整性,而非名称翻译准确性;后者由四级来源类型学界定,并且是此处报告的初步内部审查的主题,其中盲法外部审计被确定为主要的未决事项。上游内容仅通过稳定标识符引用原始贡献层,支持CC-BY 4.0重用。该数据集存放在Zenodo上(https://doi.org/10.5281/zenodo.20377811)。本预印本是数据集当前状态的规范v1.0描述;未来的数据描述符提交是可预期的,但取决于“局限性”中列出的验证和发布工程事项。

英文摘要

We describe a versioned cross-domain dataset of 410,499 active tropical species (working snapshot 2026-04-20) spanning three applied subdomains (tropical_plants, tropical_aquatic, and tropical_pets) that share a commercial and regulatory life cycle but are distributed across kingdom-organised biodiversity infrastructures. The resource joins taxonomic identifiers from GBIF, Plants of the World Online, iNaturalist, NCBI Taxonomy, the Catalogue of Life and the Encyclopedia of Life, and adds three original layers: a cross-domain ontology that re-segments taxa along trade and husbandry contexts; a Chinese vernacular layer with explicit per-name provenance under a typology that excludes unverified machine-generated proposals; and a CITES source-linkage layer connecting each taxon to its Species+ entry. Chinese vernacular coverage, the proportion of taxa carrying a CJK Chinese name distinct from the scientific binomial, reaches 99.50% (408,456 of 410,499; full-population count). Coverage characterises completeness, not name-translation accuracy; the latter is bounded by the four-level provenance typology and is the subject of a preliminary internal review reported here, with a blind external audit identified as the principal open item. Upstream content is referenced by stable identifier only for the original-contribution layers, supporting CC-BY 4.0 reuse. The dataset is deposited on Zenodo (10.5281/zenodo.20377811). This preprint is the canonical v1.0 description of the dataset's current state; future Data Descriptor submission is anticipated but is contingent on the validation and release-engineering items listed in the Limitations.

2606.03143 2026-06-03 cs.LG cs.CL 版本更新

FederatedSkill: Federated Learning for Agentic Skill Evolution

FederatedSkill: 面向智能体技能演化的联邦学习

Jingbo Yang, Guanyu Yao, Yang Zhang, Ramana Rao Kompella, Gaowen Liu, Shiyu Chang

发表机构 * UC Santa Barbara(加州大学圣巴巴拉分校) MIT-IBM Watson AI Lab(麻省理工-IBM Watson人工智能实验室) Cisco Research(思科研究院)

AI总结 提出FederatedSkill框架,通过语义技能差异作为通信单元,在保护隐私的同时实现个性化技能演化,相比自演化基线成功率提升44.4%,计算成本降低37.5%。

详情
AI中文摘要

现代LLM智能体越来越依赖技能库来处理复杂任务,使得技能演化成为自我改进的主要驱动力。然而,孤立的单用户任务流缺乏构建全面技能所需的多样性。虽然跨用户协作可以克服这一数据瓶颈,但当前的轨迹共享方法会损害用户隐私,并强加一个统一的全局库,无法适应客户端的异质性。我们引入了FederatedSkill,一个用于协作智能体演化的隐私保护框架。FederatedSkill超越了原始轨迹共享,利用语义技能差异(即对本地库的结构化补丁)作为通信的基本单位。在服务器端,一个演化智能体聚合这些补丁,动态建模客户端特定的能力边界,促进严格个性化的技能演化,而不是次优的全局平均。在20个不同的智能体任务族上评估,FederatedSkill相比自演化基线表现出显著提升,成功率最高提高44.4%,计算成本降低37.5%。

英文摘要

Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self-improvement. However, isolated single-user task streams lack the diversity required to build comprehensive skills. While cross-user collaboration can overcome this data bottleneck, current trajectory-sharing approaches compromise user privacy and impose a uniform global library that fails to accommodate client heterogeneity. We introduce FederatedSkill, a privacy-preserving framework for collaborative agent evolution. Moving beyond raw trajectory sharing, FederatedSkill utilizes semantic skill diffs, structured patches over local libraries, as the fundamental unit of communication. On the server side, an evolution agent aggregates these patches to dynamically model client-specific capability boundaries, facilitating strictly personalized skill evolution rather than a suboptimal global average. Evaluated across 20 distinct agent task families, FederatedSkill demonstrates substantial gains over self-evolving baselines, achieving up to a 44.4% increase in success rate and a 37.5% reduction in computational cost.

2606.03136 2026-06-03 cs.CR cs.CL 版本更新

PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations

PsychoPass: 多轮对抗性LLM对话的几何轮廓分析

Muberra Ozmen, Subhabrata Majumdar

发表机构 * Coveo Montreal, QC, Canada(加拿大蒙特利尔 Coveo) Indian Institute of Management Bangalore(班加罗尔印度管理学院)

AI总结 提出PsychoPass框架,通过提取对话轨迹在嵌入空间中的几何特征,在有害内容生成前预测多轮越狱攻击,并发现早期几何信号具有鲁棒性。

详情
AI中文摘要

对大型语言模型(LLM)的多轮越狱攻击揭示了当前防护措施的不匹配:它们作用于单个轮次,而攻击则作为跨对话的轨迹展开。我们提出从内容转向动态,将对话建模为表示空间中的路径,并询问对抗意图是否在其几何形状中早期编码。我们引入了PsychoPass,一个从嵌入空间中的对话轨迹提取几何特征以在有害内容生成前预测潜在攻击的框架。这些特征在朴素分类器中实现了近乎完美的性能,这很大程度上可以通过包含轮次数作为特征来解释。去除这一混淆因素后,仍保留了一个较小但一致的几何信号,其分类性能不显著依赖于编码器选择。关键的是,该信号在对话早期出现:仅凭短前缀,对攻击结果的预测就能保持高于随机水平,且比基线防护更可靠。一项支持性理论分析通过长度与形状的分解、基于前缀长度的检测界限以及编码器不变性解释了这些发现。综合来看,这些结果表明对抗性对话留下了早期、表示鲁棒的几何指纹,适用于在线监控。

英文摘要

Multi-turn jailbreak attacks on large language models (LLMs) reveal a mismatch in current guardrails: they operate on individual turns, while attacks unfold as trajectories across conversations. We propose a shift from content to dynamics, modeling conversations as paths in representation space and asking whether adversarial intent is encoded early in their geometry. We introduce PsychoPass, a framework that extracts geometric features from conversation trajectories in embedding space to predict a potential attack before harmful content is produced. These features achieve near-perfect performance in naïve classifiers, which is largely explained by the inclusion of number of turns as a feature. After removing this confound, a smaller but consistent geometric signal remains, with classification performance that does not depend meaningfully on encoder choice. Crucially, this signal appears early in the conversation: attack outcomes remain above chance from short prefixes alone, more reliably than baseline guardrails. A supporting theoretical analysis explains these findings via a decomposition of length and shape, a detection bound based on prefix length, and encoder invariance. Together, these results show that adversarial conversations leave an early, representation-robust geometric fingerprint suitable for online monitoring.
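As an illustration of treating a conversation as a path in embedding space, the sketch below computes a few simple trajectory descriptors from per-turn embeddings. The abstract does not list PsychoPass's actual features, so the ones here (path length, net displacement, and their ratio) and the function names are assumptions; the ratio is included because it describes shape rather than length, in the spirit of the length-versus-shape decomposition the abstract mentions.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def trajectory_features(turn_embeddings):
    """Compute simple geometric descriptors of a conversation path.
    Expects at least two turn embeddings of equal dimension."""
    steps = [_sub(b, a) for a, b in zip(turn_embeddings, turn_embeddings[1:])]
    path_length = sum(_norm(s) for s in steps)
    displacement = _norm(_sub(turn_embeddings[-1], turn_embeddings[0]))
    straightness = displacement / path_length if path_length > 0 else 0.0
    return {
        "path_length": path_length,      # grows with the number of turns
        "displacement": displacement,    # how far the conversation ended up
        "straightness": straightness,    # in [0, 1]; a shape-only descriptor
    }


# Toy 2-D "embeddings": a conversation that drifts steadily in one direction
# versus one that wanders out and returns to where it started.
drifting = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
returning = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]

drift_features = trajectory_features(drifting)
return_features = trajectory_features(returning)

# Early detection would apply the same function to a prefix of the turns.
prefix_features = trajectory_features(drifting[:2])
```

Both toy conversations have the same path length, yet their straightness differs (1.0 versus 0.0), showing how a shape feature can separate trajectories that a turn count or length feature cannot.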

2606.03132 2026-06-03 cs.CL 版本更新

DMT-CBT: Longitudinal Therapeutic State Modeling for CBT Counseling

DMT-CBT:面向CBT咨询的纵向治疗状态建模

Chang Liu, Shuyi Zhang, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Bin Hu

发表机构 * School of Information Science and Engineering, Lanzhou University(信息科学与工程学院,兰州大学)

AI总结 提出DMT-CBT框架,通过跨会话结构化治疗状态、多模态行为基础与工具增强干预,解决现有方法将CBT咨询简化为局部回复生成的问题,实现纵向治疗状态动态建模。

详情
AI中文摘要

大型语言模型(LLM)在认知行为疗法(CBT)咨询中展现出日益增长的潜力。然而,现有方法大多将咨询视为局部回复生成问题,专注于短文本、仅文本或单次会话交互中的共情回复。我们认为这种表述从根本上与真实心理治疗的本质不符。在临床CBT中,治疗是一个纵向过程,治疗师需要跨会话持续推断、更新和干预不断演变的治疗状态。真实的CBT还涉及多模态推理和延迟的跨会话干预效果,要求模型在部分可观测性下捕捉纵向治疗状态的演变。我们提出DMT-CBT,一个用于CBT咨询中治疗状态动态建模的框架。DMT-CBT跨会话维护结构化的治疗状态,同时整合多模态行为基础和工具增强的干预,以支持适应性治疗推理。基于该框架,我们构建了DMTCorpus,一个合成的多会话多模态CBT咨询数据集,具有演变的治疗状态、图像基础的患者行为以及跨会话干预的连续性。实验结果表明,与事后提取方法相比,DMT-CBT提高了咨询保真度和治疗联盟,产生了更有利的纵向情感轨迹,并更忠实地保留了治疗状态。

英文摘要

Large language models (LLMs) have shown growing potential for Cognitive Behavioral Therapy (CBT) counseling. However, most existing approaches still formulate counseling as a local response generation problem, focusing on empathetic replies within short, text-only, or single-session interactions. We argue that this formulation fundamentally mismatches the nature of real psychotherapy. In clinical CBT, therapy is a longitudinal process in which therapists continuously infer, update, and intervene on evolving therapeutic states across sessions. Realistic CBT further involves multimodal inference and delayed cross-session intervention effects, requiring models to capture longitudinal therapeutic state evolution under partial observability. We propose DMT-CBT, a framework for Dynamic Modeling of evolving Therapeutic states in CBT counseling. DMT-CBT maintains structured therapeutic states across sessions while incorporating multimodal behavioral grounding and tool-augmented intervention to support adaptive therapeutic reasoning. Based on this framework, we construct DMTCorpus, a synthetic multi-session multimodal CBT counseling dataset featuring evolving therapeutic states, image-grounded client behaviors, and cross-session intervention continuity. Experimental results show that DMT-CBT improves counseling fidelity and therapeutic alliance, produces more favorable longitudinal affective trajectories, and preserves therapeutic states more faithfully than post-hoc extraction approaches.

2606.03128 2026-06-03 cs.CR cs.AI cs.CL cs.LG 版本更新

Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation

解耦式智能合约审计:通过蒸馏与聚合的轻量级LLM框架

Bagus Rakadyanto Oktavianto Putra, Muhamad Risqi Utama Saputra, Widyawan, Guntur Dharma Putra

发表机构 * University of Indonesia(印度尼西亚大学)

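AI summary: Proposes a decoupled smart contract audit framework built on lightweight open-source LLMs (0.6B-4B parameters) that uses rsLoRA, knowledge distillation, and a Chain-of-Verification aggregation strategy, reaching 98.25% accuracy in vulnerability detection and outperforming 7B-34B parameter models.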

Comments 12 pages, 4 figures, 5 tables. Accepted to IEEE ICWS 2026

详情
AI中文摘要

智能合约面临关键安全挑战,需要在去中心化网络服务中进行彻底审计。虽然大型语言模型(LLMs)在自动漏洞检测中展现出潜力,但现有方法缺乏严重性评估和可操作的修复建议,且计算开销过大。在本研究中,我们引入了一个高效的端到端智能合约安全审计框架,利用轻量级、高度优化的开源LLMs(0.6B-4B参数)。我们的框架将综合审计任务解耦为四个相互关联的组件:漏洞检测、解释、严重性分类和修复建议。为了在无需庞大参数量的情况下保持高准确性,我们实现了秩稳定低秩适配器(rsLoRA)、知识蒸馏以及自定义链式验证(CoVe)聚合策略,系统性地筛选并整合模型生成的多个草稿响应,形成高准确度的审计报告。实验结果表明,我们的轻量级流水线持续优于最先进的开源代码密集LLMs(7B至34B参数),在漏洞检测中达到98.25%的准确率,在生成解释任务中达到0.4375的对齐分数。此外,我们广泛的消融研究实证验证了我们的解耦审计过程相对于统一提示的优越性,并揭示了一种新颖的严重性中心性偏差,为未来LLM辅助审计研究建立了关键基准。

英文摘要

Smart contracts face critical security challenges that require thorough auditing in decentralized web services. While Large Language Models (LLMs) have shown promise in automated vulnerability detection, existing approaches lack severity evaluations with actionable remediation and demand unnecessarily massive computational overhead. In this study, we introduce an efficient end-to-end smart contract security audit framework utilizing lightweight, highly optimized open-source LLMs (0.6B-4B parameters). Our framework decouples comprehensive audit tasks into four interconnected components: vulnerability detection, explanation, severity classification, and remediation recommendation. To maintain high accuracy without massive parameters, we implement Rank-Stabilized Low-Rank Adapters (rsLoRA), knowledge distillation, and a custom Chain-of-Verification (CoVe) aggregation strategy to systematically screen and consolidate multiple draft responses from the model into a highly accurate audit report. Experimental results demonstrate that our lightweight pipeline consistently outperforms state-of-the-art open-source coder dense LLMs (7B to 34B parameters), achieving 98.25% accuracy in vulnerability detection and an alignment score of 0.4375 in generative explanation tasks. Furthermore, our extensive ablation studies empirically validate the superiority of our decoupled audit processes over unified prompting and uncover a novel severity centrality bias, establishing a critical benchmark for future research in LLM-assisted auditing.
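The abstract names Rank-Stabilized LoRA (rsLoRA) as one ingredient. As a minimal sketch, not the authors' code: rsLoRA as usually described scales the low-rank update by alpha / sqrt(r) instead of the alpha / r used by standard LoRA, so the update shrinks more slowly as the rank grows. The function names and the example alpha and rank values are illustrative assumptions.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized scaling: divide by sqrt(rank) instead of rank."""
    return alpha / math.sqrt(rank)


# Illustrative comparison at alpha = 16 (an assumed value, not from the paper).
for rank in (4, 16, 64):
    print(rank, lora_scale(16.0, rank), rslora_scale(16.0, rank))
```

At rank 64 the standard factor is 0.25 while the rank-stabilized factor is 2.0, which is the gap the rank-stabilized variant is meant to close.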

2606.03113 2026-06-03 cs.CL 版本更新

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

基于强化学习的经验驱动动态退出机制用于大语言模型

Yanyu Zhu, Hoilam Pao, Niu Hu, Wei Guo, Shaoxiong Zhan, Boyu Lai, Zitai Wang, Yongqin Zeng, Hai-Tao Zheng

发表机构 * arXiv

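AI summary: To address slow autoregressive inference in large language models, proposes the LEDE framework, which uses offline reinforcement learning to dynamically select the exit layer and speculation length, achieving a 2.0-2.7x speedup over autoregressive decoding and an additional 17% speedup over static speculative baselines.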

详情
AI中文摘要

大语言模型遭受缓慢的自回归推理。虽然自我推测解码加速了这一过程,但其效率受到静态配置(如固定退出层和推测长度)的阻碍。我们将此优化重新构建为马尔可夫决策过程,并提出LEDE,一个使用离线强化学习的框架。LEDE学习一个策略,根据每一步生成序列的局部上下文动态选择最优退出层和推测长度,平衡计算成本和草稿质量。在Llama-2和Llama-3模型上的全面评估表明,LEDE相比自回归解码实现了高达2.0倍至2.7倍的加速,并且比静态推测基线额外提供了17%的加速。

英文摘要

Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations such as fixed exit layers and speculation lengths. We reframe this optimization as a Markov Decision Process and propose LEDE, a framework that uses offline reinforcement learning. LEDE learns a policy to dynamically select the optimal exit layer and speculation length based on the local context of the generated sequence at each step, balancing computational cost and draft quality. Comprehensive evaluations on Llama-2 and Llama-3 models show that LEDE achieves speedups of up to $2.0\times$ to $2.7\times$ over autoregressive decoding and provides an additional 17% speedup over static speculative baselines.

2606.03102 2026-06-03 cs.CL 版本更新

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

小规模RL控制器,大语言模型:基于RL引导的自适应采样用于测试时扩展

Runpeng Dai, Tong Zheng, Rui Liu, Chengsong Huang, Hongtu Zhu

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) University of Maryland College Park(马里兰大学学院市分校) Washington University in St. Louis(圣路易斯华盛顿大学)

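AI summary: Formulates adaptive sampling as a Markov decision process and trains a lightweight controller with reinforcement learning to balance answer correctness, latency, and computation cost.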

详情
AI中文摘要

测试时扩展提高了大语言模型的推理性能,但显著增加了总计算量和延迟。现有的自适应采样方法通过动态决定何时停止采样来部分缓解这一问题,但通常依赖启发式规则或分布假设。在这项工作中,我们将自适应采样建模为马尔可夫决策过程(MDP)。我们使用强化学习(RL)训练一个轻量级采样控制器,以联合平衡答案正确性、延迟和计算成本。在每一轮中,控制器决定停止采样或获取更多样本。我们的方法轻量级,仅依赖最终答案的统计信息,并且可以在CPU上训练和部署。我们进一步表明,该框架可以解释为具有显式预算约束的约束优化问题的拉格朗日松弛。与ASC和ESC等强基线相比,实验表明我们的方法在答案正确性、采样轮数和所需总样本数之间实现了更好的权衡。

英文摘要

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
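The stop-or-continue loop described in this abstract can be sketched as follows. This is a hedged illustration, not the paper's controller: `threshold_policy` is a hypothetical hand-written stand-in for the learned RL policy, the state uses only final-answer statistics (top-answer share and sample count), and the penalty weights in `lagrangian_reward` are assumed values chosen only to show the shape of a reward that trades correctness against latency and compute.

```python
from collections import Counter


def answer_stats(answers):
    """State built from final answers only: top answer, its share, sample count."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers), len(answers)


def threshold_policy(top_share, n_samples, min_samples=4, stop_share=0.75):
    """Hypothetical stand-in for the learned controller; True means stop."""
    return n_samples >= min_samples and top_share >= stop_share


def adaptive_sample(draw_answer, batch_size=2, max_rounds=8, policy=threshold_policy):
    """Draw answers in rounds until the policy says stop or the budget runs out."""
    answers, rounds, top_answer = [], 0, None
    while rounds < max_rounds:
        answers.extend(draw_answer() for _ in range(batch_size))
        rounds += 1
        top_answer, top_share, n_samples = answer_stats(answers)
        if policy(top_share, n_samples):
            break
    return top_answer, rounds, len(answers)


def lagrangian_reward(correct, rounds, n_samples, lam_latency=0.05, lam_compute=0.01):
    """Correctness minus penalties on rounds (latency) and samples (compute)."""
    return float(correct) - lam_latency * rounds - lam_compute * n_samples
```

In this sketch, `draw_answer` would wrap one LLM call that returns a final answer string; a trained policy would replace `threshold_policy` while keeping the same inputs.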

2606.03099 2026-06-03 cs.CL cs.AI 版本更新

PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

PhotoCraft: 具有层次自进化记忆的深度图像搜索代理推理

Kailin Lyu, Zhiqiang Yuan, Jianwei He, Qiwei Yan, Xuanbo Su, Nanxing Hu, Yang Liu, Ce Hao, Shengqian Qin, Lianyu Hu, Jinchao Zhang, Jie Zhou

发表机构 * Pattern Recognition Center, WeChat AI, Tencent Inc.(微信AI模式识别中心,腾讯公司) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Zhongguancun Academy(中关村学院) Nanyang Technological University(南洋理工大学) Shanghai Jiao Tong University(上海交通大学)

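AI summary: Proposes PhotoCraft, a training-free hierarchical memory system that augments multimodal large language models with working, episodic, and semantic memory to support multi-step reasoning and knowledge transfer in deep image search, improving retrieval performance on DISBench by up to 18.5%.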

详情
AI中文摘要

深度图像搜索需要对丰富的上下文线索(如时间、地点和事件关系)进行多步推理。然而,现有的大语言模型代理大多是无状态和反应式的,缺乏持久记忆来维持长期上下文或跨任务迁移经验,这常常导致执行漂移和经验隔离。为了解决这些限制,我们提出了PhotoCraft,一种无需训练的分层记忆系统,用于照片搜索代理。受人类认知启发,PhotoCraft为多模态大语言模型配备了工作记忆、情景记忆和语义记忆,这些记忆在推理过程中被动态调用,以在多步推理和答案生成中保持逻辑一致性和知识可迁移性。在DISBench上的大量实验表明,PhotoCraft在不同多模态大语言模型骨干上持续改善了上下文感知检索,取得了高达18.5%的性能提升,并有效缓解了无记忆深度图像搜索中的关键瓶颈,为可靠且可泛化的多模态搜索代理提供了一条实用路径。

英文摘要

Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain long-horizon context or transfer experience across tasks, which often leads to execution drift and experience isolation. To address these limitations, we propose PhotoCraft, a training-free, hierarchical memory system for photo-search agents. Inspired by human cognition, PhotoCraft equips MLLMs with working, episodic, and semantic memory, which are dynamically invoked during reasoning to preserve logical consistency and knowledge transferability throughout multi-step reasoning and answer generation. Extensive experiments on DISBench demonstrate that PhotoCraft consistently improves context-aware retrieval across diverse MLLM backbones, achieving gains of up to 18.5% and effectively mitigating key bottlenecks in memoryless deep image search, offering a practical path toward reliable and generalizable multimodal search agents.

2606.03080 2026-06-03 cs.CL cs.AI 版本更新

Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding

遗憾预训练:桥接先验与后验视角以增强知识基础

Mingkuan Zhao, Xiayu Sun, Wentao Hu, Suquan Chen, Jiaxuan Li, Xiaoyan Zhu, Xin Lai, Jiayin Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

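AI summary: Proposes the Regret Pre-training framework, which uses a dual-view architecture to exploit future information when training causal language models, raising average accuracy on the OLMoE-1B-7B architecture to 33.9% (from a 30.2% baseline).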

详情
AI中文摘要

因果语言模型仅使用前文上下文对序列概率进行分解,在训练过程中尽管未来信息在训练数据中可用,但仍未被利用。本文介绍了遗憾预训练,这是一个基于学习使用特权信息范式的自监督框架。该框架采用双视角架构,其中单个模型同时生成因果学生分布和未来条件教师分布。训练目标通过遗憾损失增强标准语言建模,该损失最小化从教师到学生的KL散度,将未来感知信号传递到因果表示中。我们在OLMoE-1B-7B架构上研究了两种教师配置:LocalRegret,它将注意力扩展一个未来标记;以及GlobalRegret,它以目标位置被掩码的双向上下文为条件。在40亿个标记的训练后,对九个下游任务的实验表明,两种配置均持续优于基线。平均而言,GlobalRegret和LocalRegret分别达到33.9%和32.2%的准确率,超过了基线的30.2%。最值得注意的是,GlobalRegret将BoolQ性能提高了18.1个百分点(61.0%对42.9%)。该框架不引入额外参数,每个训练步骤仅需一次额外的推理模式前向传播。

英文摘要

Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pre-training, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture: LocalRegret, which extends attention by one future token, and GlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average, GlobalRegret and LocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline's 30.2%. Most notably, GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step.
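The objective described in this abstract, standard language modeling plus a KL term from teacher to student, can be illustrated for a single token position. This is a sketch under stated assumptions: the KL direction is taken to be KL(teacher || student), the teacher is treated as a constant (matching the "inference-mode" forward pass), and `regret_weight` is a hypothetical hyperparameter that the abstract does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def regret_pretraining_loss(student_logits, teacher_logits, target_index, regret_weight=1.0):
    """Cross-entropy on the causal student plus a KL(teacher || student) regret term."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)  # assumed fixed: no gradient flows through it
    lm_loss = -math.log(student[target_index])
    regret = kl_divergence(teacher, student)
    return lm_loss + regret_weight * regret
```

When the student already matches the teacher, the regret term vanishes and the loss reduces to ordinary next-token cross-entropy.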

2606.03078 2026-06-03 cs.CL 版本更新

G^2C-MT: Graph-Guided Context Selection for Document-Level Machine Translation

G^2C-MT: 面向文档级机器翻译的图引导上下文选择

Baijun Ji, Zixuan Zhou, Xiangyu Duan, Yu Liu, Longbo Sun, Rupu Wei, Bohong Zhao

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Trip.com Group(Trip.com集团)

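AI summary: Proposes G^2C-MT, which builds a lightweight discourse graph and samples context paths with a depth-biased random walk, selecting context for document-level machine translation in a structured way and improving translation quality and robustness.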

Comments 9 pages, 2 figures; IJCAI2026

详情
AI中文摘要

有效的文档级机器翻译(DocMT)需要捕捉长距离篇章依赖。近期工作探索了基于检索和篇章感知的上下文选择方法。然而,这些方法通常缺乏对文档中远距离段落间结构化篇章依赖关系的显式建模。本文提出G^2C-MT(图引导的机器翻译上下文),将DocMT上下文选择视为轻量级篇章图上的结构化路径发现问题,而非检索非结构化上下文集或依赖昂贵的基于LLM的篇章建模。具体地,我们将每个段落表示为节点,并基于语义相似性、邻接性和关键词重叠建模每对节点间的关系。进一步,我们提出在图上进行深度偏置随机游走,为每个目标段落采样一条后向上下文路径。该上下文路径将用于提示大语言模型(LLM)进行翻译。该框架自然支持多路径上下文采样,通过聚合多样化的翻译候选来提升对篇章歧义输入的鲁棒性。跨多个领域的实验表明,G^2C-MT在多个LLM(包括DeepSeek-V3、Gemini-2.5-Flash-lite和Qwen-2.5/3系列)上均优于强基线。

英文摘要

Effective document-level machine translation (DocMT) requires capturing long-range discourse dependencies. Recent work has explored retrieval-based and discourse-aware context selection. However, these approaches often lack an explicit mechanism for modeling structured discourse dependencies between distant paragraphs in a document. In this paper, we propose G^2C-MT (Graph-Guided Context for Machine Translation), which views DocMT context selection as a structured path discovery problem on a lightweight discourse graph, rather than retrieving unstructured context sets or relying on expensive LLM-based discourse modeling. In detail, we represent each paragraph as a node and model the relationship between each pair of nodes, considering their semantic similarity, adjacency, and keyword overlap. Furthermore, we propose a depth-biased random walk over the graph to sample a backward context path for each target paragraph. The context path will be used to prompt a large language model (LLM) for translation. This framework naturally supports multi-path context sampling, which can improve robustness by aggregating diverse translation candidates for discourse-ambiguous inputs. Experiments conducted across various domains show that G^2C-MT outperforms strong baselines on multiple LLMs, including DeepSeek-V3, Gemini-2.5-Flash-lite, and the Qwen-2.5/3 series.
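The graph construction and backward walk described in this abstract can be sketched in a few lines. Everything specific here is an assumption for illustration: the linear mix of similarity, adjacency, and keyword overlap, the weights `w_sim`, `w_adj`, `w_kw`, and in particular the reading of "depth-biased" as a multiplicative bonus for jumping further back. The paper's actual scoring and bias may differ.

```python
import random


def edge_weight(i, j, similarity, keywords, w_sim=0.6, w_adj=0.2, w_kw=0.2):
    """Combine semantic similarity, adjacency, and keyword overlap (Jaccard)."""
    adjacency = 1.0 if abs(i - j) == 1 else 0.0
    union = keywords[i] | keywords[j]
    overlap = len(keywords[i] & keywords[j]) / len(union) if union else 0.0
    return w_sim * similarity[i][j] + w_adj * adjacency + w_kw * overlap


def sample_context_path(target, similarity, keywords, max_len=3, depth_bias=1.5, rng=None):
    """Walk backward from the target paragraph, sampling earlier paragraphs."""
    rng = rng or random.Random(0)
    path, current = [], target
    while current > 0 and len(path) < max_len:
        candidates = list(range(current))  # only paragraphs before the current one
        weights = [
            edge_weight(current, j, similarity, keywords) * depth_bias ** (current - j - 1)
            for j in candidates
        ]
        if sum(weights) == 0:
            break
        current = rng.choices(candidates, weights=weights, k=1)[0]
        path.append(current)
    return sorted(path)  # document order, ready to place in a prompt
```

Calling `sample_context_path` several times with different random generators would give the multiple context paths that the abstract aggregates over.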

2606.03063 2026-06-03 cs.LO cs.CL 版本更新

ZX-Calculus: Trace-Indexed Dependent Types and Epistemic Semantics

ZX-演算:迹索引依赖类型与认知语义

Peng Chen

发表机构 * School of Information Science, Beijing Language and Culture University(北京语言文化大学信息科学学院)

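AI summary: Proposes ZX-Calculus, a conservative extension of Martin-Löf dependent type theory with trace-indexed types, presheaf non-monotone semantics, and constructive AGM belief revision, accompanied by a Coq mechanisation.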

详情
AI中文摘要

我们提出ZX-演算(知识演化演算),它是Martin-Löf依赖类型论(MLTT)的保守扩展,集成了迹索引类型、预层非单调语义和构造性AGM信念修正。本文附带Coq机械化证明(34个完整证明;两个核心结果零未完成)。(I)迹类型。FinTrace(s0,sn)是一个带类型的执行迹的归纳族。FinTrace和Star(Step)作为路径类型同构,但判断上不相等;TraceElim显式暴露事件标签e:Event,为事件驱动归纳提供了更符合人体工程学的接口。我们证明了迹可达性对应、确定性重放以及通过可归约候选(带传输引理,RC-elim推迟;所有其他核心结果经Coq验证)的规范性框架。(II)层语义。迹索引命题是自由迹偏序范畴Tf上的逆变层。分离定理(显式反模型)区分了证明论单调性和语义非单调性。项模型是初始CwF(句法泛性质,非经典完备性)。(III)AGM信念修正。我们给出了一个显式的构造性部分交收缩算法,经(C1)-(C4)验证。所有八条AGM公设(R1)-(R8)都是定理。R7和R8的证明使用了析取加固引理,并给出了自包含的构造性推导。(IV)集成。B^AGM在顺序修正中不满足层复合律BP-comp(显式反模型,Coq验证)。我们引入单步修正系统(SSRS),证明B^AGM是有效的SSRS(Coq验证),并表明这足以处理迹态射、收缩刻画和修正见证。BP-comp失败揭示了路径依赖信念修正与函子一致性之间的基本张力,此前未被识别。

英文摘要

We propose ZX-Calculus (Knowledge Evolution Calculus), a conservative extension of Martin-Löf Dependent Type Theory (MLTT) integrating trace-indexed types, presheaf non-monotone semantics, and constructive AGM belief revision. A Coq mechanisation accompanies the paper (34 complete proofs; zero admits for the two central results). (I) Trace types. FinTrace(s0,sn) is an inductive family of typed execution traces. FinTrace and Star(Step) are isomorphic as path types but not judgementally equal; TraceElim exposes the event label e:Event explicitly, giving a more ergonomic interface for event-driven induction. We prove the Trace-Reachability Correspondence, Deterministic Replay, and a canonicity framework via reducibility candidates with a Transport Lemma (RC-elim deferred; all other Core results are Coq-verified). (II) Sheaf semantics. Trace-indexed propositions are contravariant sheaves over the free trace partial-order category Tf. A Separation Theorem (explicit countermodel) distinguishes proof-theoretic monotonicity from semantic non-monotonicity. The term model is an initial CwF (syntactic universal property, not classical completeness). (III) AGM belief revision. We give an explicit constructive partial meet contraction algorithm verified against (C1)-(C4). All eight AGM postulates (R1)-(R8) are theorems. Proofs of R7 and R8 use the Disjunctive Entrenchment Lemma, given a self-contained constructive derivation. (IV) Integration. B^AGM fails the sheaf composition law BP-comp for sequential revision (explicit countermodel, Coq-verified). We introduce Single-Step Revision Systems (SSRS), prove B^AGM is a valid SSRS (Coq-verified), and show this suffices for trace morphisms, retraction characterisation, and revision witnesses. The BP-comp failure reveals a fundamental tension between path-dependent belief revision and functor consistency, not previously identified.

2606.03043 2026-06-03 cs.CL 版本更新

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

LLM作为评判者的几何学:为什么LLM间共识不等于人类对齐

Sourabrata Mukherjee, Hamna Hamna, Kalika Bali, Sunayana Sitaram

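AI summary: By measuring geometric quantities, finds that LLM judges agree strongly with one another but align poorly with humans: their score subspace is nearly orthogonal to the human subspace, so consensus stems from subspace collapse rather than human alignment.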

详情
AI中文摘要

LLM作为评判者现已普遍使用,但评判者之间高度一致,与人类却只有弱一致性。我们通过测量四个社区构建的印度语言数据集、八种印度语言和41个LLM评判者的四个几何量(分数分布、有效秩、与人类子空间的主角、评判者与人类之间的堆叠相关性,均带有自助置信区间)来检验这是共享信号还是共享偏差。在主观评分标准上,评判者使用的分数范围不到人类的一半($\sigma_J / \sigma_H \approx 0.3$--$0.5$)。他们的评估轴几乎与人类正交,且明显比人类彼此之间更远离人类($87^\circ$--$89^\circ$ 对比 $78^\circ$--$81^\circ$)。LLM间一致性超过LLM-人类一致性($r_{LL} \approx 0.35$ 对比 $r_{LH} \approx 0.27$--$0.32$)。在具有可验证事实答案的评分标准上,相同的诊断指标回落到人类范围内(轴 $58.5^\circ$;$r_{LH} = 0.519$)。微调和偏好优化恢复了分数分布($0.32 \rightarrow 1.08$),但几乎不改变轴(仍为 $87^\circ$--$88^\circ$)。只有在小的人类锚定集上的后验校准才能同时改善所有四个社区健康评分标准,使校准后的24B印度语言评判者($r = 0.184$)优于GPT-5.5($r = 0.123$),但仍未达到人类可靠性(在可验证评分标准上人类-人类 $r = 0.474$)。我们认为,只有当对评判者评分子空间的直接几何检查通过时,LLM间一致性才应被视为人类对齐的证据;否则,共识反映的是坍缩子空间内的一致性。

英文摘要

LLMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard LLM-as-judge stack across four community-built Indic datasets, eight Indic languages, and 41 LLM judges: score spread, effective rank, principal angle to the human subspace, and stacked correlations among judges and humans, all with bootstrap confidence intervals. On subjective rubrics, judges use less than half the human score range ($\sigma_J / \sigma_H \approx 0.3$--$0.5$). Their evaluation axis is nearly orthogonal to the human one and noticeably further from humans than humans are from each other ($87^\circ$--$89^\circ$ versus $78^\circ$--$81^\circ$). Inter-LLM agreement exceeds LLM-human agreement ($r_{LL} \approx 0.35$ versus $r_{LH} \approx 0.27$--$0.32$). On a rubric with a verifiable factual answer, the same diagnostics fall back into the human range (axis $58.5^\circ$; $r_{LH} = 0.519$). Fine-tuning and preference optimization recover spread ($0.32 \rightarrow 1.08$) but barely move the axis (still $87^\circ$--$88^\circ$). Only post-hoc calibration on a small human-anchored set improves all four community-health rubrics together, placing a calibrated 24B Indic judge ($r = 0.184$) ahead of GPT-5.5 ($r = 0.123$), yet still short of human reliability (human-human $r = 0.474$ on the verifiable rubric). We argue that inter-LLM agreement should be considered evidence of human alignment only when a direct geometric check on the judge's score subspace passes; otherwise, the consensus reflects agreement within a collapsed subspace.
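Two of the diagnostics named in this abstract, the spread ratio and the angle between evaluation axes, are easy to compute in the simplest case. The sketch below is an illustration only: it handles a single axis per group (a one-dimensional subspace), whereas principal angles between higher-dimensional subspaces need an SVD, and the function names and the choice of population standard deviation are assumptions rather than the paper's definitions.

```python
import math
from statistics import pstdev


def spread_ratio(judge_scores, human_scores):
    """Ratio of judge to human score spread (sigma_J / sigma_H)."""
    return pstdev(judge_scores) / pstdev(human_scores)


def axis_angle_degrees(u, v):
    """Angle between two one-dimensional axes, in degrees within [0, 90]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = min(1.0, abs(dot) / norm)  # sign is ignored: an axis has no direction
    return math.degrees(math.acos(cosine))
```

A ratio well below 1 means the judges compress the score range, and an angle near 90 degrees means the judge axis carries almost none of the variation that humans score along.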

2606.03032 2026-06-03 cs.CL 版本更新

The Deliberative Illusion: Diagnosing Factual Attrition and Stance Homogenization in Multi-Agent LLM Deliberation

深思幻觉:诊断多智能体LLM讨论中的事实损耗与立场同质化

Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Ningnan Wang, Nancy F. Chen, Min-Yen Kan

发表机构 * Xi’an Jiaotong University(西安交通大学) National University of Singapore(新加坡国立大学) Yunnan University(云南大学) Agency for Science, Technology and Research (A*STAR), Singapore(新加坡科技研究局(A*STAR))

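AI summary: Proposes the DelibTrace framework, which decomposes each issue into atomic facts and tracks their survival, and finds that multi-agent LLM discussion loses up to 72% of issue-critical facts alongside stance homogenization, revealing the risk that agents agree more while knowing less.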

详情
AI中文摘要

多智能体LLM系统常将共识视为成功交互的证据。然而,对于深思性问题,可靠性取决于智能体是否保留了理解问题所需的事实和观点。我们识别出深思幻觉:讨论导致(1)事实损耗,即问题关键事实的逐渐丢失,以及(2)立场同质化,即不同立场向共识的崩溃。为了衡量这一过程,我们引入了DelibTrace框架,该框架将每个问题分解为原子事实,标记关键事实,将其分布在智能体之间,并追踪它们在讨论轮次中的存留。在涉及三个代表性LLM家族的伦理和新闻类深思中,多智能体讨论抹去了高达72%的关键事实。这种损失是严重的:保留的证据可能误导性地重构问题,最终立场仍锚定于基础模型先验,且单个恶意智能体可将错误信息注入不断缩小的共享上下文。这些结果揭示了一个更尖锐的风险:智能体可以在知道更少的情况下达成更多一致。我们呼吁进行评估,衡量哪些事实、不确定性和合理分歧在交互中得以存留。

英文摘要

Multi-agent LLM systems often treat consensus as evidence of successful interaction. For deliberative problems, however, reliability depends on whether agents preserve the facts and viewpoints needed to interpret an issue. We identify the deliberative illusion: discussion produces (1) factual attrition, the progressive loss of issue-critical facts, alongside (2) stance homogenization, the collapse of diverse positions toward consensus. To measure this process, we introduce DelibTrace, a framework that decomposes each issue into atomic facts, labels issue-critical ones, distributes them across agents, and tracks their survival across discussion rounds. Across ethical and news-based deliberation with three representative LLM families, multi-agent discussion erases up to 72% of issue-critical facts. This loss is consequential: retained evidence can reconstruct the issue misleadingly, final stances remain anchored in base-model priors, and a single malicious agent can inject misinformation into the shrinking shared context. These results reveal a sharper risk: agents can agree more while knowing less. We call for evaluations that measure which facts, uncertainties, and legitimate disagreements survive interaction.
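The survival tracking described in this abstract reduces, in its simplest form, to set intersections over fact identifiers. The sketch below assumes the hard part is already done: each round's discussion has been mapped to the identifiers of the atomic facts it still mentions (in practice this matching would need a model or annotators). The function names and the identifier format are hypothetical.

```python
def fact_survival(critical_facts, rounds):
    """Fraction of issue-critical facts still present in each discussion round."""
    critical = set(critical_facts)
    return [len(critical & set(mentioned)) / len(critical) for mentioned in rounds]


def factual_attrition(critical_facts, rounds):
    """Share of issue-critical facts lost by the final round."""
    return 1.0 - fact_survival(critical_facts, rounds)[-1]
```

For example, with four critical facts of which only one is mentioned in the last round, `factual_attrition` returns 0.75.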

2606.03029 2026-06-03 cs.CL cs.AI 版本更新

Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates

基于研究者指定协变量的条件假设生成用于LLM文本分析

Paiheng Xu, Jing Liu, Wei Ai

发表机构 * University of Maryland, College Park(马里兰大学学院市分校)

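AI summary: Proposes a conditional hypothesis generation framework that incorporates researcher-specified covariates to steer LLMs toward differences that hold within relevant subgroups rather than globally, and addresses stratum imbalance and sign reversal with two methods: feature-covariate interactions, and within-stratum demeaning with inverse-frequency reweighting.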

详情
AI中文摘要

计算社会科学的一个核心目标是发现语言如何随感兴趣的结果(如政治派别或教学质量)变化的可解释差异。最近的基于LLM的假设生成方法用自然语言描述这些差异,但选择的是全局判别模式,而没有考虑基于研究者领域知识塑造数据的协变量。当忽略协变量时,所选模式可能反映混杂因素而非实质性感兴趣的差异。我们引入了条件假设生成,这是一个纳入研究者指定协变量的框架,以将假设发现引导至相关子组内成立的差异。出现了两个挑战:目标子组可能代表性不足(分层不平衡),并且差异的方向可能在子组间反转(符号反转)。我们提出了两种受计量经济学启发的方法:一种引入特征-协变量交互以检测符号反转,另一种应用分层内去均值和逆频率重加权以平衡代表性不足的分层。合成实验表明,每种方法在其目标设置中均优于全局基线,对两个真实世界数据集的专家评估证实,协变量感知的生成在相关子组内产生了更有用的假设。

英文摘要

A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers' domain knowledge. When covariates are ignored, selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature--covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to equalize underrepresented strata. Synthetic experiments show each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.
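The second method in this abstract, within-stratum demeaning with inverse-frequency reweighting, can be sketched directly. This is a generic illustration of those two operations, not the authors' pipeline: the normalization that gives every stratum the same total weight is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def within_stratum_demean(values, strata):
    """Subtract each stratum's mean so only within-stratum variation remains."""
    counts = Counter(strata)
    sums = defaultdict(float)
    for value, stratum in zip(values, strata):
        sums[stratum] += value
    return [value - sums[stratum] / counts[stratum] for value, stratum in zip(values, strata)]


def inverse_frequency_weights(strata):
    """Weights that sum to 1 overall and give every stratum equal total weight."""
    counts = Counter(strata)
    n_strata = len(counts)
    return [1.0 / (n_strata * counts[stratum]) for stratum in strata]
```

With three examples in stratum "a" and one in "b", the single "b" example receives weight 1/2 and each "a" example 1/6, so the small stratum is not drowned out.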

2606.03027 2026-06-03 cs.CL 版本更新

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

SEA-Embedding:面向东南亚的开放且可复现的文本嵌入

Peerat Limkonchotiwat, Raymond Ng, Sarana Nutanong, Jian Gang Ngui

发表机构 * AI Singapore(AI新加坡) School of Information Science and Technology, VISTEC(信息科学与技术学院,VISTEC)

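AI summary: Proposes SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages, studies three core factors (data composition, training objective, and base encoder initialization), and achieves state-of-the-art results on SEA-BED.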

详情
AI中文摘要

文本嵌入对许多下游应用至关重要,因此鲁棒性对于现实世界的NLP很重要。然而,最近最先进的嵌入模型大多不可复现,因为它们依赖于封闭或未公开的训练数据,并且对于东南亚语言来说仍然不够鲁棒。我们提出了SEA-Embedding,一个完全开放且可复现的东南亚语言文本嵌入管道,仅使用公开可用的数据进行训练,并用它来研究鲁棒嵌入设计的三个核心因素:数据组成、训练目标和基础编码器初始化。SEA-Embedding在SEA-BED上取得了最先进的结果,同时能够对该地区的鲁棒文本嵌入进行系统且可复现的分析。

英文摘要

Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages. We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.

2606.03022 2026-06-03 cs.CL cs.AI 版本更新

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

幻觉作为正交噪声:通过动态上下文正交化实现推理时流形对齐

Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li

发表机构 * Xi’an Jiaotong University(西安交通大学) Xingchen AGI Lab(星辰AGI实验室) China Telecom AI Technology (Beijing) Co., Ltd.(中国电信人工智能技术(北京)有限公司) Institute of Artificial Intelligence, China Telecom(中国电信人工智能研究院) University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

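AI summary: Proposes a geometric framework grounded in the linear representation hypothesis that interprets LLM hallucinations as noise orthogonal to the semantic manifold of the residual stream, and introduces Dynamic Contextual Orthogonalization (DCO), an inference-time intervention that selectively attenuates outlier orthogonal components through layer-wise Z-score suppression, improving contextual faithfulness while preserving knowledge recall.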

详情
AI中文摘要

大语言模型(LLMs)中的幻觉——即生成与上下文事实或逻辑约束不一致的内容——仍然是可靠部署面临的持续挑战。在这项工作中,我们通过基于线性表示假设的几何框架来解决这个问题。我们提出,幻觉表现为相对于残差流语义流形的正交噪声。具体来说,我们假设虽然注意力头理想地传播与上下文子空间一致的信息,但当特定头引入与该子空间正交的分量时,就会产生幻觉,破坏潜在表示的一致性。基于这一表述,我们引入了动态上下文正交化(DCO),一种推理时干预方法。DCO利用输入残差流作为动态上下文锚点,对注意力头输出进行正交分解。为了区分上下文对齐的语义更新和发散噪声,DCO采用层间Z分数抑制机制,根据统计分布选择性地衰减异常正交分量。在XSum、NQ-Swap和IFEval等基准上对Llama-3-8B和70B的评估表明,与最先进的干预基线相比,DCO实现了更优的上下文忠实度。此外,DCO在TriviaQA和TruthfulQA等知识密集型任务上保持高性能,有效缓解了现有方法中常见的幻觉抑制与参数知识保留之间的权衡。我们的发现验证了幻觉的几何解释,并将DCO确立为一种计算高效的流形对齐方法。代码可在 https://github.com/Harry-Miral/DCO 获取。

英文摘要

Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment. Our code is available at https://github.com/Harry-Miral/DCO
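The decomposition and suppression steps described in this abstract can be sketched on plain vectors. This is a simplified reading, not the released DCO code: the anchor is a single vector standing in for the input residual stream, the Z-score is computed across the heads of one layer using the norm of each head's orthogonal component, and `z_threshold` and `attenuation` are assumed values.

```python
import math
from statistics import mean, pstdev


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decompose(head_output, anchor):
    """Split a head output into parts parallel and orthogonal to the anchor."""
    scale = dot(head_output, anchor) / dot(anchor, anchor)
    parallel = [scale * a for a in anchor]
    orthogonal = [h - p for h, p in zip(head_output, parallel)]
    return parallel, orthogonal


def suppress_outlier_heads(head_outputs, anchor, z_threshold=1.5, attenuation=0.1):
    """Attenuate orthogonal components whose norm is a Z-score outlier in the layer."""
    parts = [decompose(h, anchor) for h in head_outputs]
    norms = [math.sqrt(dot(orth, orth)) for _, orth in parts]
    mu, sigma = mean(norms), pstdev(norms)
    adjusted = []
    for (parallel, orthogonal), norm in zip(parts, norms):
        z = (norm - mu) / sigma if sigma > 0 else 0.0
        factor = attenuation if z > z_threshold else 1.0
        adjusted.append([p + factor * o for p, o in zip(parallel, orthogonal)])
    return adjusted
```

Heads whose orthogonal component is typical for the layer pass through unchanged; only the statistical outliers are scaled down, which is how the sketch keeps context-aligned updates intact.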

2606.03021 2026-06-03 cs.CL 版本更新

Hint-Guided Diversified Policy Optimization for LLM Reasoning

提示引导的多样化策略优化用于大语言模型推理

Zhiyu Cao, Kaixin Wu, Mingjie Zhong, Peifeng Li, Xiaobo Li, Can Ye, Qiaoming Zhu

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Ant Group(蚂蚁集团)

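AI summary: Proposes Hint-Guided Diversified Policy Optimization (HDPO), which incentivizes the model to generate diverse and reliable solutions along a "propose-select-think" trajectory, improving reasoning ability, candidate-solution diversity, and the ability to identify reliable solutions.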

详情
AI中文摘要

大语言模型(LLMs)的最新发展展示了令人印象深刻的推理能力,其中可验证奖励的强化学习(RLVR)是一种有前景的增强策略。然而,现有的奖励机制局限于结果层面的正确性,缺乏明确的信号来引导模型考虑多样化的解决方案。相比之下,人类问题解决通常涉及评估多种潜在方法并选择最可靠的解决方案,而当前的RLVR框架并未明确激励这种认知过程。受此启发,我们提出了提示引导的多样化策略优化(HDPO),允许模型首先列出所有潜在的候选解决方案大纲作为提示,然后选择最可靠的一个进行进一步推理。HDPO包括两个阶段:结构化推理的冷启动和提示引导的多样化强化学习,以激励模型遵循“提出-选择-思考”轨迹生成多样且可靠的解决方案。实验结果表明,HDPO有效提升了LLM的推理能力,增强了候选解决方案的多样性以及LLM识别可靠解决方案的能力。

英文摘要

Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), which allows the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, which incentivize the model to generate diverse and reliable solutions following the "propose-select-think" trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.

2606.02994 2026-06-03 cs.AI cs.CL 版本更新

Inducing Reasoning Primitives from Agent Traces

从智能体轨迹中归纳推理原语

Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

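AI summary: Proposes Reasoning Primitive Induction, which mines and clusters recurrent reasoning steps from ReAct agent traces into a library of pseudo-tools and substantially improves performance on several reasoning tasks.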

Comments 22 pages including appendices

详情
AI中文摘要

ReAct风格的LLM智能体经常跨问题重新发现相同的推理例程,但这些例程被困在瞬时的草稿板中。我们引入了推理原语归纳,一种单次通过的方法,挖掘成功的ReAct轨迹,聚类循环出现的推理动作,并将最频繁的动作转换为一个紧凑的类型化伪工具库。每个伪工具由一个自然语言文档字符串指定,在调用时由LLM解释,标准的ReAct循环在测试时组合这些原语。核心结果是,归纳出的库优于生成其轨迹的原始智能体:在RuleArena NBA上提高44个百分点(30 -> 74),在MuSR团队分配上提高30个百分点(38 -> 68),在NatPlan会议规划上提高22个百分点(7 -> 29)。在涵盖叙事推理、规则应用和约束满足规划的五个可比较子任务中,单个固定配置在每个子任务上优于零样本思维链,匹配或超过专家编写的分解,并以更低的平均推理成本优于AWM。

英文摘要

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

2606.02991 2026-06-03 cs.CL cs.AI 版本更新

Pretraining Language Models on Historical Text

在历史文本上预训练语言模型

Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

发表机构 * University of Waterloo(滑铁卢大学) Vector Institute(向量研究所) AIML, Adelaide University(AIML,阿德莱德大学) Department of Engineering Science, University of Oxford(牛津大学工程科学系) Oxford Centre for Economic and Social History, University of Oxford(牛津大学经济与社会史中心) Department of Computer Science, University College London(伦敦大学学院计算机科学系)

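AI summary: Proposes TypewriterLM, a 7.24B historical language model trained only on English text predating 1913, and addresses data quality, temporal leakage, training, and evaluation challenges by building the TypewriterCorpus corpus, a lexically grounded instruction tuning framework, and the History-Event benchmark suite.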

详情
AI中文摘要

我们介绍了TypewriterLM,一个仅在1913年前英文文本上训练的7.24B历史语言模型。开发历史语言模型需要解决数据质量和可用性、防止时间泄漏、设计时间一致的后训练流程以及构建可靠评估等挑战。为了解决这些问题,我们构建了TypewriterCorpus,一个54B词元的历史语料库,收集自多样化的档案和语言标注来源,并进行了广泛的数据清洗和泄漏缓解措施。此外,我们引入了词汇基础指令微调,一种后训练框架,限制响应直接基于历史源文档。使用该框架,我们构建了两个历史指令微调数据集:History-LIMA和History-SelfInstruct。为了评估能力和时间一致性,我们引入了History-Event,一个用于评估能力、时间基础和泄漏的基准套件。我们发布了TypewriterLM及所有相关资源,以支持未来对历史语言模型的研究。

英文摘要

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

2606.02983 2026-06-03 cs.CL 版本更新

A Locally Deployed RAG-Based Academic Advising System for Course Selection

基于本地部署RAG的课程选择学术咨询系统

Feng Li, Yoritaka Iwata

AI总结 提出一种本地部署的RAG学术咨询系统,利用大语言模型和结构化课程大纲检索,以隐私保护方式支持课程选择、先修课程理解和个性化学习规划。

Comments to be published in Elsevier's Procedia Computer Science (KES 2026)

详情
AI中文摘要

基于课程之间先修关系的正确课程顺序对于学生全面发展知识和技能至关重要。然而,学生孤立地制定这一顺序时,常常因认知局限和信息过载而困惑。同时,教育机构由于教育资源有限,在提供关于正确顺序的充分学术建议方面遇到困难。为解决这些挑战,我们提出一种基于课程大纲信息的本地部署RAG学术咨询系统。通过将大语言模型与结构化课程大纲数据的检索相结合,该系统旨在以隐私保护的方式支持课程选择、先修课程理解和个性化学习规划。

英文摘要

The correct sequence of courses in a curriculum, based on the prerequisites between courses, is of great importance for students to develop their knowledge and skills holistically. However, students crafting this sequence in isolation frequently struggle with cognitive limitations and information overload, which lead to confusion. At the same time, educational institutions encounter difficulties in providing adequate academic advice on the correct sequence due to limited educational resources. To address these challenges, we propose a locally deployed RAG-based academic advising system grounded in syllabus information. By combining large language models with retrieval from structured syllabus data, the system is designed to support course selection, prerequisite understanding, and personalized study planning in a privacy-preserving manner.

2606.02981 2026-06-03 cs.CL 版本更新

Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

从标注验证集输出统计量预测推理时缩放增益

Luyang Zhang, Jingyan Li

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出一种基于标注验证集输出统计量的轻量级方法,通过三个核心特征(提示级一致性扩散、标签辅助的首个正确样本位置、完成长度方差)结合熵特征,使用岭回归预测best-of-N推理缩放增益,达到Spearman ρ=0.90的相关性。

详情
AI中文摘要

Best-of-$N$ 推理缩放(从语言模型中抽取 $N$ 个候选答案,并返回奖励模型评分最高的一个)能提高准确性,但提升幅度因模型而异,而预先预测该幅度目前需要端到端运行整个过程。先前的工作将模型采样输出的廉价统计量与验证集正确性(样本一致性、多样性、模型置信度以及正确样本出现的位置)与模型行为联系起来,但并未确定其中哪些能构成稳定、紧凑的 best-of-$N$ 增益预测器。我们基于单次标注验证集采样过程中计算的特征拟合岭回归预测器,使用 bootstrap-Lasso 对候选特征集进行稳定性分析,并给出带有显式线性近似残差的集中性分析。在三个基础模型族、六种后训练方法以及数学和推理任务领域上,稳定性分析识别出一个严格的三特征核心,包括提示级一致性扩散、标签辅助的首个正确样本位置和完成长度方差;基于该核心加上熵扩展构建的紧凑岭回归预测器,在奖励模型验证器下与实际 best-of-$N$ 增益的 Spearman 相关系数达到 $\rho = 0.90$。预期用途是在支付完整的奖励模型评分成本之前,利用标注验证集对候选配置进行筛选。

英文摘要

Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance currently requires running the procedure end-to-end. Prior work links cheap statistics of a model's sampled outputs and validation-set correctness (how often samples agree, how diverse they are, how confident the model is, and where correct samples appear) to model behavior, but does not isolate which of these form a stable, compact predictor of best-of-$N$ gain. We fit ridge predictors on features computed from a single labeled validation-set sampling pass, use bootstrap-Lasso as a stability analysis of the candidate feature set, and give a concentration analysis with an explicit linear-approximation residual. Across three base-model families, six post-training methods, and math and reasoning task domains, the stability analysis identifies a strict three-feature core spanning prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance; a compact ridge predictor built from this core plus an entropy add-on reaches Spearman $\rho = 0.90$ with actual best-of-$N$ gain under a reward-model verifier. The intended use is labeled validation-set screening of candidate configurations before paying the full reward-model scoring cost.
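
As a rough illustration of the quantity being predicted, the sketch below computes a best-of-N gain from labeled samples and a Spearman rank correlation between two lists of gains. The function names and the input layout (a list of `(reward_score, is_correct)` pairs per prompt) are hypothetical, and taking the gain to be best-of-N accuracy minus mean single-sample accuracy is an assumption of this sketch, not a definition given in the abstract.

```python
from statistics import mean


def best_of_n_gain(prompts):
    """prompts: list of prompts, each a list of (reward_score, is_correct) samples.

    Gain is taken here as best-of-N accuracy minus mean single-sample
    accuracy (an assumed definition, for illustration only).
    """
    best_acc = mean(
        1.0 if max(samples, key=lambda s: s[0])[1] else 0.0 for samples in prompts
    )
    base_acc = mean(
        mean(1.0 if ok else 0.0 for _, ok in samples) for samples in prompts
    )
    return best_acc - base_acc


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In use, `spearman(predicted_gains, actual_gains)` would be evaluated over a set of candidate model configurations, with the predicted gains coming from whatever regressor is fitted on the validation-set features.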

2606.02976 2026-06-03 cs.CL 版本更新

Memory Retrieval for Changing Preferences

针对偏好变化的记忆检索

Yuehan Qin, Li Li, Linxin Song, Wei Yang, Jiate Li, Yuqing Yang, Yue Zhao

发表机构 * University of Southern California(南加州大学)

AI总结 提出基于贝叶斯因子的统一框架,通过量化历史轮次对潜在偏好状态的证据强度,实现长上下文对话系统中的记忆访问与选择。

详情
AI中文摘要

长上下文对话系统必须决定何时访问记忆以及交互历史的哪些部分是相关的。现有方法通常依赖启发式检索信号或始终开启的记忆使用,未能考虑用户偏好的变化性和潜在不一致性。在这项工作中,我们提出了一个基于偏好变化的记忆访问与选择统一框架。我们将个性化记忆检索表述为识别哪些历史轮次提供了关于用户潜在偏好状态的证据,而不是依赖表面语义相似性。为此,我们使用贝叶斯因子量化每个记忆轮次的效用,定义为当该轮次包含在上下文中时模型参考响应似然的改进。这提供了证据强度的原则性度量,以及用于记忆访问和选择的统一信号。通过将记忆检索视为效用估计,模型学会识别显著轮次并根据预期效用调节记忆使用。在四个异构记忆基准上的实验表明,我们的方法在需要建模偏好变化的长上下文、偏好密集型任务上优于现有的基于嵌入的检索,同时在语义相似性足够的低密度场景中保持竞争力。

英文摘要

Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to account for the changing and potentially inconsistent nature of user preferences. In this work, we propose a unified framework for memory access and selection based on changing preferences. We formulate personalized memory retrieval as identifying which historical turns provide evidence about a user's latent preference state, rather than relying on surface-level semantic similarity. To this end, we quantify the utility of each memory turn using a Bayes factor, defined as the improvement in the model's likelihood of the reference response when the turn is included in context. This provides a principled measure of evidence strength and a unified signal for both memory access and selection. By framing memory retrieval as utility estimation, the model learns to identify salient turns and regulate memory usage based on expected utility. Experiments on four heterogeneous memory benchmarks show that our approach outperforms existing embedding-based retrieval on long-context, preference-intensive tasks where modeling changing preferences is essential, while remaining competitive in low-density regimes where semantic similarity suffices.
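
The utility signal described above can be sketched as a log Bayes factor: the log-likelihood of the reference response with a turn in context, minus the log-likelihood without it. In the sketch below, `loglik` is a hypothetical scoring callable standing in for a language model, and the threshold and `top_k` values are illustrative assumptions rather than settings from the paper.

```python
def turn_utilities(loglik, context, turns, reference):
    """Log Bayes factor of each memory turn for the reference response.

    loglik(context, reference) -> log p(reference | context); a hypothetical
    scorer supplied by the caller. context is a list of strings.
    """
    base = loglik(context, reference)
    return [loglik(context + [turn], reference) - base for turn in turns]


def select_memory(utilities, turns, threshold=0.0, top_k=2):
    """Keep at most top_k turns whose utility exceeds the threshold.

    An empty result corresponds to not accessing memory at all, so the same
    signal covers both the access decision and the selection decision.
    """
    ranked = sorted(zip(utilities, turns), key=lambda pair: pair[0], reverse=True)
    return [turn for utility, turn in ranked[:top_k] if utility > threshold]
```

A toy scorer (for example, one that counts how many reference words appear in the context) is enough to exercise the two functions; a real system would replace it with token-level log-probabilities from the dialogue model.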

2606.02973 2026-06-03 cs.CL 版本更新

Chatbots Output Meaningful (but Problematic) Language

聊天机器人输出有意义(但有问题)的语言

Matthew Stone, Una Stojnić

发表机构 * University of Parma(帕尔马大学) University of Cambridge(剑桥大学) University of Pittsburgh(匹兹堡大学) Franklin and Marshall College(弗兰克林与马歇尔学院)

AI总结 本文论证大型语言模型(LLM)的输出是有意义的,但无需假设其具有心理状态或意图,并探讨了这一观点对语言理论和AI伦理的影响。

Comments 49 pages

详情
AI中文摘要

AI聊天机器人的话语有意义吗?具体来说,如果用户问Anthropic的智能体Claude:“西班牙的首都是什么?”Claude回答:“马德里是西班牙的首都。”这句话是否具有其通常的意义——并且表达了一个真实的命题?大多数普通用户以及AI工程师认为答案显然是“是”。然而,许多认知科学家、语言学家和语言哲学家认为,关于语言和意义的主流意向主义理论得出了相反的结论。因此,更同情普通用户直觉的理论家主张对语言进行激进的“去拟人化”,修正我们对心理状态、意图和语义内容的理解,以捕捉LLM输出有意义的直觉。我们采取不同的方法。虽然我们也认为LLM的输出是有意义的,但我们认为,适当的人类语言理论已经适用于当前的聊天机器人。意义是一个低门槛:声称LLM输出有意义并不需要假设心理状态、意图、理性或LLM中交流所需的认知能力——实际上,也不需要任何其他拟人化假设。人们确实有交流意图(通常是成功的),但即便如此,在人类中,语言产出也可能偏离说话者的想法。我们的观点对于我们应该如何理论化——并批判性地参与——人类语言输出和合成生成的文本具有重要影响。特别是,说聊天机器人产生有意义的文本绝不意味着认可它们的输出,或假设该技术是(或不是)好的、强大的、合适的或有用的。

英文摘要

Are utterances by AI chatbots meaningful? Concretely, if a user asks, say, Anthropic's agent Claude, "What is the capital of Spain?" and Claude answers, "Madrid is the capital of Spain," does that sentence have its ordinary meaning -- and does it express a true proposition? Most ordinary users, as well as AI engineers, take the answer to be trivially "yes." However, many cognitive scientists, linguists, and philosophers of language argue that dominant intentionalist accounts of language and meaning deliver the opposite conclusion. Theorists more sympathetic to ordinary users' intuitions have therefore advocated a radical "de-anthropomorphization" of language, revising our understanding of mental states, intentions, and semantic content to capture the intuition that the outputs of LLMs are meaningful. We take a different approach. While we, too, argue that LLM outputs are meaningful, we contend that a proper theory of human language already applies, as is, to current chatbots. Meaning is a low bar: claiming that LLM outputs are meaningful does not require positing mental states, intentions, rationality, or the cognitive capacities requisite for communication in LLMs -- or, indeed, making any other anthropomorphic assumptions. People do have communicative intentions (typically successful ones), but nevertheless, even in humans, language production can depart from what the speaker has in mind. Our view has important consequences for how we should theorize about -- and critically engage with -- both human linguistic output and synthetically generated text. In particular, to say that chatbots produce meaningful text is not by any means to endorse what they output, or to assume that the technology is (or is not) good, powerful, appropriate, or useful.

2606.02971 2026-06-03 cs.CL 版本更新

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

EURO-5K:领域预训练何时重要?用于欧盟报告义务提取的Transformer基准测试

Marios Koniaris, Vasileios Kotronis, Eugenia Giannini, Panayiotis Tsanakas

发表机构 * Division of Computer Science, School of Electrical and Computer Engineering(计算机科学系,电气与计算机工程学院) National Technical University of Athens(雅典国立技术大学) Department of Humanities Social Sciences and Law, School of Applied Mathematical and Physical Sciences(人文社会科学与法律系,应用数学与物理科学学院)

AI总结 本文构建了EURO-5K数据集,通过对比判别式与生成式模型在欧盟报告义务提取上的表现,发现领域预训练在参数高效微调时收益显著,且模型可作为专用提取器。

详情
AI中文摘要

从欧盟立法中提取报告义务对于评估和减少监管报告负担至关重要。然而,区分报告要求与结构相似的条款需要专门的法律理解。当前的法律NLP方法缺乏具有明确指南和提取范式及领域适应策略比较评估的专门数据集。我们整理了EURO-5K,一个包含来自136项欧盟立法法案的句子级报告义务和具有挑战性的负例的语料库。在该数据集上,我们训练并比较了判别式标记分类模型(BERT风格)和生成式跨度提取模型(LLM),针对基线(基于模式和依赖关系的提取、少样本提示)评估了全微调和参数高效的QLoRA。结果表明,全微调的通用和法律BERT模型实现了相似的性能(0.89 F1),而微调的LLM在句子级提取上达到了编码器的准确度。法律预训练对生成式模型仅带来微小提升。相反,当适应能力受限时,法律预训练明显有益,因为参数高效微调的法律BERT优于其通用对应版本。学习曲线分析表明,法律预训练在数据极少时加速了早期学习。所有方法在大约3000个样本时收敛,之后收益递减,验证了数据集的充分性。在两个外部监管语料库上的跨数据集评估表明,我们的模型表现为专门的报告义务提取器,而非通用监管分类器。我们发布了EURO-5K、训练好的模型以及一个带有可解释性可视化和结构化RDF导出的交互式演示。这些表明,两种范式和参数高效训练为监管合规自动化提供了实用工具。

英文摘要

Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal understanding. Current legal NLP methods lack specialised datasets with clear guidelines and comparative evaluation of extraction paradigms and domain adaptation strategies. We curate EURO-5K, a corpus of sentence-level reporting obligations and challenging negative examples from 136 EU legislative acts. On this dataset, we train and compare discriminative token-classification models (BERT-style) and generative span-extraction models (LLMs), evaluating both full fine-tuning and parameter-efficient QLoRA against baselines (pattern and dependency-based extraction, few-shot prompting). Results show that fully fine-tuned generic and legal BERT models achieve similar performance (0.89 F1), while fine-tuned LLMs match encoder accuracy for sentence-level extraction. Legal pretraining offers only small gains for generative models. In contrast, it is clearly beneficial when adaptation capacity is constrained, as parameter-efficient tuning of Legal-BERT outperforms its generic counterpart. Learning curve analysis demonstrates that legal pretraining accelerates early learning with minimal data. All approaches converge around 3K samples with diminishing returns thereafter, validating dataset sufficiency. Cross-dataset evaluation on two external regulatory corpora shows that our models behave as specialised reporting obligation extractors rather than generic regulatory classifiers. We release EURO-5K, trained models, and an interactive demo with explainability visualizations and structured RDF export. These demonstrate that both paradigms and parameter-efficient training provide practical tools for regulatory compliance automation.

2606.02964 2026-06-03 cs.AR cs.CL cs.LG 版本更新

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

多段注意力:实现高效KV缓存管理以加速大型语言模型服务

Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, Bin Cui

发表机构 * Peking University(北京大学)

AI总结 提出AsymCache,一种计算延迟感知的KV缓存管理系统,通过多段注意力、缓存驱逐策略和自适应分块调度器,在保持无损精度的同时显著降低TTFT和TPOT。

详情
AI中文摘要

大型语言模型(LLM)推理依赖键值(KV)缓存以避免冗余的注意力计算。虽然近似KV缓存保留技术通过牺牲模型精度来减少内存使用,但无损方法则从GPU内存中驱逐KV缓存块并按需重建以保留精确输出。现有的无损KV缓存管理系统主要基于访问频率或位置启发式做出驱逐决策,而不考虑不同KV缓存块如何影响GPU注意力内核的执行效率。在本文中,我们提出了AsymCache,一种用于LLM推理的计算延迟感知KV缓存管理系统,它明确地将缓存驻留决策与GPU注意力内核性能对齐,包括三个关键组件:用于高效非连续KV上下文处理的多段注意力(MSA)、联合优化命中率和位置感知重计算成本的缓存驱逐策略,以及用于高硬件利用率的自适应分块调度器。实验表明,与最新基线相比,AsymCache将TTFT降低了高达1.90-2.03倍,每输出令牌时间(TPOT)降低了1.62-1.71倍,证实了该方法在常见工作负载中的有效性,并验证了其平衡计算效率与缓存命中率的设计目标。此外,AsymCache的低级设计允许无缝集成到诸如Continuum的代理服务系统中,进一步将平均作业延迟降低高达18.1%。

英文摘要

Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without considering how different KV cache blocks affect the execution efficiency of GPU attention kernels. In this paper, we propose AsymCache, a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance, including three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization. Experiments show that AsymCache reduces time-to-first-token (TTFT) by up to 1.90-2.03x and time-per-output-token (TPOT) by 1.62-1.71x over the latest baselines, confirming the effectiveness of the method in common workloads and validating its design goal of balancing computational efficiency with cache hit rate. Moreover, the low-level design of AsymCache allows seamless integration into agent serving systems such as Continuum, where it further reduces average job latency by up to 18.1%.

2606.02953 2026-06-03 cs.CL 版本更新

Linguistic Productivity in Large Language Models: Models Coerce, but do not Preempt

大型语言模型中的语言生产力:模型强制但不抢占

Claire Bonial, Claire Benet Post, Laura Michaelis, Harish Tayyar Madabushi

发表机构 * Georgetown University(乔治城大学) University of Colorado Boulder(科罗拉多大学博尔德分校) University of Bath(巴斯大学)

AI总结 通过测试大型语言模型是否受固化(高频使用)和抢占(未观察到结构)两种统计信号影响,发现模型能识别强制情况下的构式生产力,但无法利用负面证据避免过度泛化。

详情
AI中文摘要

基于使用的语法理论认为,语言的创造性生产力受到两种不同频率信号的增强和约束:固化(源于高频使用)和抢占(源于在期望出现特定语言结构的语境中从未观察到该结构)。大型语言模型也是基于使用的,因为语言结构是通过接触大量文本而习得的。在这里,我们测试固化和抢占这两种对立的统计力量是否也鼓励和约束了LLM中的语言生产力。我们跨模型架构证明,较大的模型在强制情况下能够识别并用非词再现构式生产力(固化),其中更广泛的构式语境强制了对词汇项的非典型解释。然而,我们也表明,即使最大的模型也不会将负面证据扩展到新语言,并且统计抢占不能使模型避免对语义上合适但从未在数据中观察到的模式进行过度泛化。

英文摘要

Usage-based theories of grammar posit that creative productivity of the structures of language is both bolstered and constrained by two distinct frequency signals: entrenchment, stemming from high-frequency usage, and preemption, stemming from having never observed a particular linguistic structure in a context where one might expect that structure to appear. Large Language Models are also usage-based, in the sense that the structures of language are learned through exposure to vast amounts of text. Here, we test whether the opposing statistical forces of entrenchment and preemption also encourage and constrain linguistic productivity in LLMs. We demonstrate across model architectures that larger models recognize and can reproduce with nonce words constructional productivity (entrenchment) in cases of coercion, wherein the broader constructional context coerces an atypical interpretation of a lexical item. However, we also show that even the largest models do not extend negative evidence to novel language, and statistical preemption does not enable models to avoid overgeneralization of patterns that are semantically felicitous but never observed in data.

2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC 版本更新

SCOPE: Real-Time Natural Language Camera Agent at the Edge

SCOPE:边缘实时自然语言相机代理

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

发表机构 * Armada AI

AI总结 提出SCOPE模块化代理,用于自然语言控制的PTZ相机,在边缘部署实现实时感知、规划与控制,并通过仿真和物理实验评估延迟、准确性和错误模式。

Comments 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16--19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

详情
Journal ref Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026
AI中文摘要

在机器人领域部署语言驱动的代理需要能够反映现实任务需求的评估:自然语言指令与可重复的结果。此类代理必须将语言模型连接到可调用的感知和控制工具,并使用部署关键指标(包括延迟、准确性和错误模式)进行评估。我们提出了SCOPE(用于感知和评估的仿真与相机操作),这是一个模块化代理,用于自然语言、开放词汇的云台变焦(PTZ)相机控制和视觉场景理解,专门为边缘部署设计。SCOPE既可在基于Blender的仿真环境中运行,也可在物理PTZ相机上运行,所有感知、规划和控制均在部署现场使用边缘可访问的计算资源本地执行。我们发布了一个包含536个任务的基准测试,涵盖问答、单步和多步命令、计数、空间推理、描述以及光学字符识别,在基于Blender的仿真环境中提供逼真的PTZ控制功能。执行轨迹与LM作为评判器结合,以评估延迟、准确性和错误模式。我们评估了19种规划器-感知模型组合,将Qwen3小语言模型(SLM)与Moondream和Qwen视觉语言模型(VLM)配对。更强的SLM显著减少了幻觉并改善了工具路由,从而实现了更可靠的闭环行为。一旦使用了足够强大的SLM,感知就成为主要的性能瓶颈。在规划和感知方面,混合专家模型在延迟和内存占用与更小网络相当的情况下,始终匹配或超过密集替代方案。量化在精度损失最小的情况下提供了额外的效率提升,为实时、边缘可行的语言驱动PTZ控制确定了一个实用的、从仿真到现实验证的设计点。

英文摘要

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2606.02911 2026-06-03 cs.CL 版本更新

The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

幽灵标注者:通过共形预测探索内容审核中人类标签变异的框架

Mirko Lai, Alessandra Urbinati, Simona Frenda, Fabiana Vernero, Marco Antonio Stranisci

发表机构 * Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern University(生物与社会技术系统建模实验室,东北大学) Heriot-Watt University(赫瑞-瓦特大学) aequa-tech Università del Piemonte Orientale(东皮埃蒙特大学) Università degli Studi di Torino(都灵大学)

AI总结 提出结合共形预测与协同过滤式标注者表征的框架,通过幽灵预测度量和幽灵标注者表征量化模型预测与所有人类标注的分歧,并发现模型在标注者分歧时不确定性增加,但大型模型对无人类对齐文本更自信,且存在结构性人口统计偏差。

详情
AI中文摘要

当前研究主要关注模型性能,而对不确定性估计的关注相对较少,特别是在LLM越来越多地用于生成标注数据的场景中。我们引入了一个框架,将共形预测与协同过滤式的标注者表征相结合,以建模LLM相对于人类标注者的行为,并分析一致与分歧的模式。利用非一致性分数,我们引入了幽灵预测度量和幽灵标注者表征,以量化模型预测与所有可用人类标注不一致的情况。我们计算余弦相似度度量,以探索模型行为在不同社会人口统计轴上的差异。我们在四个内容审核数据集上评估了四种不同规模和家族的LLM。我们的发现表明,虽然所有模型的不确定性随着标注者分歧的增加而增加,但较大的模型在对与任何人类标注不一致的文本进行分类时往往更自信。最后,幽灵标注者框架揭示了一致且稳健的人口统计错位模式,表明可能存在源于预训练语料库的结构性偏见。

英文摘要

Current research primarily focuses on model performance, while comparatively less attention has been devoted to uncertainty estimation, particularly in settings where LLMs are increasingly used to generate annotated data. We introduce a framework combining conformal prediction with Collaborative Filtering-style annotator representations to model LLM behavior in relation to human annotators and to analyze patterns of agreement and disagreement. Using Non-Conformity Scores, we introduce the Ghost Prediction metric and the Ghost Annotator representation to quantify cases in which model predictions diverge from all available human annotations. We compute cosine similarity measures to explore differences in model behavior across sociodemographic axes. We evaluate four LLMs of different sizes and families across four content moderation datasets. Our findings show that, while the uncertainty of all models increases with annotator disagreement, larger models tend to be more confident in the classification of texts that are not aligned with any human annotation. Finally, the Ghost Annotator framework reveals a consistent and robust pattern of demographic misalignment, suggesting a structural bias likely rooted in pretraining corpora.
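
To make the ingredients concrete, here is a minimal sketch of standard split conformal prediction with the non-conformity score 1 - p(true label), plus a check for predictions that match no human annotation. The function names are hypothetical, and reading a "ghost prediction" as "the model's top label is absent from every human annotation of the item" is an assumption based on the abstract's wording, not the paper's formal definition.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from non-conformity scores 1 - p(true label).

    cal_probs: list of {label: probability} dicts for calibration items.
    cal_labels: the reference label of each calibration item.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points: every label is kept
    return scores[k - 1]


def prediction_set(probs, qhat):
    """Labels whose non-conformity score does not exceed the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


def is_ghost_prediction(probs, human_labels):
    """True when the model's top label matches no human annotation.

    This reading of "ghost prediction" is an assumption of the sketch.
    """
    top = max(probs, key=probs.get)
    return top not in set(human_labels)
```

Larger prediction sets indicate higher model uncertainty, so comparing set sizes against annotator disagreement is one way to probe the relationship the abstract reports.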

2606.02908 2026-06-03 cs.CL cs.AI 版本更新

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

WRIT:面向多轮用户代理的写-读密集型轨迹合成

Hengrui Gu, Xiaotian Han, Kaixiong Zhou

发表机构 * North Carolina State University(北卡罗来纳州立大学) Case Western Reserve University(凯斯西储大学)

AI总结 针对多轮用户代理在信息收集和决策中面临的证据负担挑战,提出WRIT方法,通过合成写密集型和读密集型轨迹,训练代理在信息负载下做出基于证据的决策,仅用2K轨迹即可提升性能并减少推理时token使用。

详情
AI中文摘要

多轮用户代理必须从不完整的请求中推断用户意图,通过对话和工具收集缺失信息,并执行有效操作。训练轨迹将此过程记录为用户消息、代理响应、工具调用等的交错序列。合成足够复杂的轨迹已成为训练代理的核心途径:现有流程通常通过将多个用户请求组合成更长的任务来增加难度,产生训练顺序执行的写密集型轨迹。我们认为,当代理必须在收集和比较大量读工具证据后才能确定其参数时,单个写决策本身可能很困难,这是仅靠写密集型数据无法解决的挑战。基于这一见解,我们提出WRIT(写-读密集型轨迹合成),这是一个沿两个复杂度轴合成多轮代理训练轨迹的流程:任务中写决策的数量和每个决策的证据负担。WRIT首先生成写密集型和读密集型任务。然后,它多样化用户行为指令以反映真实的对话变化,最后在可执行环境中模拟代理-用户交互以生成完整的训练轨迹。由此产生的数据不仅训练代理执行更长的任务,而且在高信息负载下做出稳健的、基于证据的决策。仅用2K合成轨迹,在WRIT上训练的4B模型在$\tau^2$-bench上优于GPT-5.1 no-think,并大幅减少推理时token使用,表明紧凑的SFT数据可以将部分昂贵的测试时推理转化为高效的代理行为。

英文摘要

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

2606.02871 2026-06-03 cs.CL cs.AI 版本更新

Adaptive Latent Agentic Reasoning

自适应潜在智能推理

Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen

发表机构 * University of California, Davis(加州大学戴维斯分校) University of Waterloo(滑铁卢大学) Greenshoe, Inc.(Greenshoe公司)

AI总结 提出双模式框架ALAR,在常规决策步骤使用紧凑潜在推理,仅在需要深入思考时切换至显式思维链,在保持或提升任务准确率的同时显著减少生成令牌数。

详情
AI中文摘要

大型推理模型通过生成扩展的思维链推理来提升性能,但当应用于LLM智能体时,这种行为变得低效。当前的LLM智能体通常在每一步决策中生成冗长的文本推理,并在各轮次中几乎均匀地分配推理努力,导致多轮智能体轨迹中的严重低效。我们提出自适应潜在智能推理(ALAR),一种双模式框架,在常规轮次中使用紧凑的潜在推理,并在需要更深思熟虑时选择性地升级为显式思维链。ALAR通过使用智能体的动作作为监督锚点来学习潜在推理,并进一步优化以在潜在推理足以完成任务时使用它,保留显式CoT用于更困难的决策。在智能体搜索和工具使用基准上的实验表明,ALAR在保持相当或更好任务准确率的同时,显著减少了生成的令牌数,在搜索中最多减少43.6%,在工具使用中最多减少84.6%。这些结果表明,ALAR通过减少不必要的文本推理,同时保留显式思考用于更困难的决策步骤,改善了LLM智能体的准确率-效率权衡。

英文摘要

Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

2606.02866 2026-06-03 cs.AI cs.CL cs.MA 版本更新

When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

当帮助有害时以及如何修复:多智能体辩论用于数据清洗

Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, Shweta Medhekar

发表机构 * Meta Platforms, Inc.(Meta公司)

AI总结 研究多智能体辩论在数据清洗中的效果,发现其会降低生成性能但提升错误检测,通过推导辩论收益条件并采用对抗性分离的辩论配置,首次在生成任务上显著超越单智能体。

Comments 27 pages, 4 figures, 12 tables. Includes appendix with full experimental results, prompt templates, and dataset statistics

详情
AI中文摘要

多智能体辩论何时有助于数据清洗,何时有害?在三个基准、四个模型家族和超过6000个任务-条件对中,我们发现辩论的效果会反转:通过批评引发的混淆(CIC),即生成器不加批判地接受幻觉性的批评反馈,辩论在所有四个模型上降低了生成性能(-1.6至-15.5个百分点),但提升了错误检测(F1提高27.4个百分点,d=1.0)。我们推导出一个辩论收益条件:当挽救错误输出的概率(由可修复性加权的批评者验证几率)超过破坏正确输出的概率时,辩论有帮助。一个析因实验证明对抗性分离至关重要:使用相同工具的自我验证失败,而一个独立的批评者,结合代码执行基础和证据门控生成,产生了第一个在生成任务上显著超过单智能体的辩论配置(+5.3个百分点,p<0.05)。该条件正确预测了所有九种任务类型,并在七个领域的19个已发表比较中实现了零假阳性泛化。

英文摘要

When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.
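
One way to read the debate benefit condition is as a sign test on the expected change in accuracy after a critique round. The sketch below is an illustrative reading only: the parameter names are hypothetical, and weighting the rescue and destruction probabilities by the single-agent accuracy is an assumption of this sketch, since the abstract states only that the rescue probability must exceed the destruction probability.

```python
def debate_net_gain(p_correct, p_rescue, p_destroy):
    """Expected accuracy change from adding a Critic round (assumed model).

    p_correct: single-agent accuracy before debate.
    p_rescue:  chance that a wrong output is fixed after critique.
    p_destroy: chance that a correct output is broken after critique.
    """
    return (1.0 - p_correct) * p_rescue - p_correct * p_destroy


def debate_helps(p_correct, p_rescue, p_destroy):
    """Debate helps when rescued errors outweigh destroyed correct outputs."""
    return debate_net_gain(p_correct, p_rescue, p_destroy) > 0.0
```

Under this reading, a strong generator leaves few wrong outputs to rescue and many correct ones to damage, so even a modest rate of accepted bad critiques can make the net effect negative.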

2606.02859 2026-06-03 cs.CL cs.AI cs.MA 版本更新

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

思维经济:具有经济交互的涌现多智能体智能

Zhenting Qi, Huangyuan Su, Ao Qu, Chenyu Wang, Yu Yao, Han Zheng, Kushal Chattopadhyay, Guowei Xu, Zihan Wang, Weirui Ye, Vijay Janapa Reddi, Ju Li, Paul Pu Liang, Himabindu Lakkaraju, Sham Kakade, Yilun Du

发表机构 * Harvard University(哈佛大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 受哈耶克经济理论启发,通过拍卖和财富积累的简单经济信号实现去中心化信用分配,使弱智能体群体涌现出多步推理策略,在五个智能体任务中超越强单体基线。

详情
AI中文摘要

一群智能体如何在没有集中控制的情况下自我协调和自适应,形成更强的集体智能?受弗里德里希·哈耶克关于市场中去中心化协调的经济理论启发,我们通过一个智能体经济体来研究这个问题,其中智能体通过拍卖竞争行动权、交换支付,并从环境奖励中积累财富。这些简单的经济信号引出去中心化的信用分配,在没有全局编排或显式通信协议的情况下驱动规划。群体通过经济选择进化:有效的智能体积累财富并通过利用变异,而无效的智能体破产并通过探索被替换。我们表明,从弱智能体初始化,经济体产生涌现的多步推理策略,并在五个智能体任务中超越更强的单体基线,包括数学推理、金融研究、科学研究、加速器设计和分布式系统优化。我们进一步提供了关于经济动态如何塑造智能体行为的理论见解,将局部激励与长期全局表现联系起来。我们的结果指向了多智能体智能的一条新路径:与其设计协调,不如设计去中心化的激励结构,在这种结构下协调会自动涌现。

英文摘要

How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.

2606.02837 2026-06-03 cs.CL cs.AI 版本更新

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

修复FOLIO和MALLS:经过验证的标注和LLM辅助的框架以聚焦人工重新标注

Andrea Brunello, Cristian Curaba, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

发表机构 * University of Udine(乌迪内大学)

AI总结 通过人工检查发现FOLIO和MALLS数据集中存在大量形式化错误,提出基于LLM的框架引导人工审核,显著减少所需审核量并提高数据集准确性。

详情
AI中文摘要

从自然语言到一阶逻辑(NL-to-FOL)的准确翻译是神经符号AI系统和自然语言推理(NLI)的基础,因此NL-to-FOL基准的质量至关重要——然而这些数据集从未经过严格审计。我们的第一个贡献是对FOLIO的验证集和MALLS测试实例子集进行系统性人工检查,发现分别约有39%和36%的条目包含错误的FOL形式化(即真实标签),此外还有一定比例的歧义NL句子(分别为16.4%和48%)以及FOLIO中错误的NLI标签(8.4%)。我们的第二个贡献是开发并发布了这些数据集的修正真实标签,并展示了标注错误如何扭曲参考基准任务上的模型评估:使用修正后的真实标签测试三个最先进的LLM(Gemma 4 31B-it、Qwen3-30B-A3B和GPT-4o-mini),准确率提升了9到22个百分点。受这些发现启发,我们提出了一个基于LLM的框架,以支持人工审查NL-to-FOL数据集。通过将审查者引导至最易出错的实例,我们实验证明,在审查少于24%的实例后即可达到90%的数据集准确率,而无引导的审查则需要审查超过70%的实例。我们发布了所有经过人工验证的标注以及框架代码。

英文摘要

Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential -- yet these datasets have never been rigorously audited. Our first contribution is a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for these datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
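
The benefit of guided review can be illustrated with a small simulation: rank items by an error-likelihood score, review the riskiest first, and measure how many reviews are needed to reach a target accuracy. The function names and the per-item risk scores are hypothetical, and the sketch assumes that every reviewed item ends up with a correct annotation; it is not the paper's framework.

```python
def accuracy_after_review(is_wrong, risk_scores, budget):
    """Dataset accuracy after reviewing (and fixing) the `budget` riskiest items.

    is_wrong: list of bools, True if the item's annotation is incorrect.
    risk_scores: hypothetical per-item error-likelihood scores, e.g. from an LLM.
    Ties keep the original item order, which models an unguided review.
    """
    order = sorted(range(len(is_wrong)), key=lambda i: risk_scores[i], reverse=True)
    reviewed = set(order[:budget])
    remaining = sum(1 for i, wrong in enumerate(is_wrong) if wrong and i not in reviewed)
    return 1.0 - remaining / len(is_wrong)


def budget_for_target(is_wrong, risk_scores, target=0.9):
    """Smallest number of reviewed items that reaches the target accuracy."""
    for budget in range(len(is_wrong) + 1):
        if accuracy_after_review(is_wrong, risk_scores, budget) >= target:
            return budget
    return len(is_wrong)
```

Passing constant risk scores reproduces an unguided pass through the dataset in its original order, which gives a baseline against which an informative scoring function can be compared.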

2606.02814 2026-06-03 cs.IR cs.AI cs.CL 版本更新

Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors

神经检索器是否偏好某些文档?学习到的相关性先验的证据

Francisco Valentini, Edgar Altszyler, Martin Fajcik

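AI summary: By analyzing the query-independent signal that supervised bi-encoder retrievers encode in document embeddings, the paper finds that models learn a document-level relevance prior from annotated data, so low-prior documents are harder to retrieve even when relevant, exposing a structural limitation of supervised retrieval.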

详情
AI中文摘要

神经检索器通过标注的查询-文档对训练来估计查询-文档相关性。然而,标注协议可能并不纯粹反映相关性:它们只选择一部分文档进行标注,并且这种选择可能偏向某些文档类型。我们研究监督双编码器检索器是否隐式学习了一个文档级相关性先验:一个查询无关的信号,作为在标注数据上训练的副作用编码在其表示空间中。我们通过在冻结的文档嵌入上训练简单分类器来估计这个先验,并在多个IR基准上评估三个最先进的检索器。我们发现监督神经检索器编码了能泛化到未见文档且跨模型一致的相关性先验。这些先验造成了可发现性差距:先验较低的文档即使真正相关也更难被检索。这种效应在监督密集检索器中出现,但在BM25中较弱且不一致,并在受控的匹配文档比较下持续存在。利用基于LLM的解释,我们发现被判定为相关的文档往往是主流主题的全面、自包含的摘要,而小众、零碎或高度技术性的内容通常未被评判。检索器内化了这种偏见,将具有这些偏好特征的文档排得比缺乏这些特征的文档更高,而与它们的实际相关性无关。我们的发现揭示了监督检索的结构性局限:在标注数据上训练的模型不仅学习相关性,还学习其训练数据中的隐式文档偏好。

英文摘要

Neural retrievers are trained to estimate query-document relevance from annotated query-document pairs. Yet annotation protocols may not purely reflect relevance: they select only a subset of documents for labeling, and this selection can favor certain document types over others. We investigate whether supervised bi-encoder retrievers implicitly learn a document-level relevance prior: a query-independent signal encoded in their representation space as a side effect of training on annotated data. We estimate this prior by training simple classifiers on frozen document embeddings and evaluate three state-of-the-art retrievers across multiple IR benchmarks. We find that supervised neural retrievers encode relevance priors that generalize to unseen documents and are consistent across models. These priors create a findability gap: documents with lower prior are systematically harder to retrieve, even when genuinely relevant. This effect appears in supervised dense retrievers but is weaker and less consistent in BM25, and it persists under controlled matched-document comparisons. Using LLM-based explanations, we find that judged-relevant documents tend to be comprehensive, self-contained summaries of mainstream topics, while niche, fragmentary, or highly technical content is often left unjudged. Retrievers internalize this bias, ranking documents with these favored features higher than documents that lack them, independently of their actual relevance. Our findings expose a structural limitation of supervised retrieval: models trained on annotated data do not just learn relevance, but also the implicit document preferences in their training data.
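
A minimal sketch of the probing idea described above: a simple classifier is fit on frozen document embeddings to predict whether a document was judged relevant, with no query involved. The logistic-regression probe, the plain SGD loop, and the 2-D toy embeddings are assumptions for illustration; the abstract does not specify which classifiers the authors used.

```python
import math


def train_prior_probe(embeddings, judged_relevant, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with plain SGD; embeddings stay frozen."""
    dim = len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, judged_relevant):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def prior_score(w, b, x):
    """Query-independent probability that a document looks 'judged relevant'."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical frozen document embeddings and their annotation status.
embeddings = [[1.0, 0.2], [0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 1.0], [0.3, 0.8]]
judged_relevant = [1, 1, 1, 0, 0, 0]

w, b = train_prior_probe(embeddings, judged_relevant)
high_prior = prior_score(w, b, [0.95, 0.15])  # unseen document near the judged cluster
low_prior = prior_score(w, b, [0.15, 0.95])   # unseen document near the unjudged cluster
```

A findability gap in the abstract's sense would then be measured by comparing retrieval metrics for relevant documents with high versus low `prior_score`.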

2606.02812 2026-06-03 cs.AI cs.CL 版本更新

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Traj-Evolve:用于肺癌早期检测中患者轨迹建模的自进化多智能体系统

Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Affiliations: University of Washington; Fred Hutch Cancer Center; Google

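AI summary: Proposes Traj-Evolve, a self-evolving multi-agent system that combines an experience pool with multi-agent reinforcement learning; through similar-patient retrieval and parametric optimization, it outperforms 9 strong baselines on early lung cancer detection.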

详情
AI中文摘要

从纵向电子健康记录(EHR)中建模患者轨迹需要对稀疏、嘈杂且长上下文的多模态序列进行推理。现有的基于LLM的多智能体系统解决了上下文长度问题,但孤立地处理患者,未能模拟临床医生如何利用从类似先前病例中积累的经验。我们提出了Traj-Evolve,一个具有两种互补进化机制的自进化多智能体系统。首先,经验池(ExPool)作为非参数记忆,索引拒绝采样的推理轨迹,以检索相似患者作为少样本上下文。其次,通过奖励排序微调的多智能体强化学习(MARL)参数化地优化智能体间和智能体-记忆协作。留一法交叉检索策略统一了这两种机制,在检索增强下对齐训练和推理时的行为。在利用长达五年的多模态EHR的肺癌预测任务中,Traj-Evolve在整体人群和具有挑战性的从不吸烟人群中均优于9个强基线模型。对进化动态的分析揭示了三个关键发现:(1)扩展ExPool将最优检索从多样本转向特定样本;(2)在MARL下,管理智能体的预测损失快速收敛,而工作智能体的时间推理继续受益于更多经过验证的患者;(3)这两种机制在预测风险上互补,ExPool提高特异性,而MARL提高敏感性。

英文摘要

Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.
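
The leave-one-out retrieval idea mentioned above can be sketched as follows: when retrieving few-shot examples for a patient who is already in the pool, that patient's own entry is excluded. The vector representation, the cosine-similarity ranking, and all names here are assumptions for illustration, not the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_similar(pool, query_vec, k=2, exclude_id=None):
    """Return the ids of the k most similar pool entries.

    Passing `exclude_id` gives leave-one-out retrieval: a patient already
    stored in the pool never retrieves their own trace.
    """
    scored = [
        (cosine(query_vec, vec), pid)
        for pid, vec in pool.items()
        if pid != exclude_id
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


# Hypothetical experience pool: patient id -> embedding of a stored trace.
pool = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0], "p4": [0.7, 0.7]}

train_time = retrieve_similar(pool, pool["p1"], k=2, exclude_id="p1")
inference_time = retrieve_similar(pool, [1.0, 0.0], k=1)
```

With the exclusion in place, a pooled patient sees only other patients' traces during training, which mirrors inference, where a new patient is not in the pool at all.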

2606.02806 2026-06-03 cs.CL 版本更新

Translating Classical Poetry into Modern Prose

将古典诗歌翻译成现代散文

Chalamalasetti Kranti, Sowmya Vajjala

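AI summary: Introduces Padyam2Gadyam, a dataset for translating 13th-17th century classical Telugu poetry into contemporary Telugu and English prose, and evaluates 5 large language models, finding considerable room for improvement.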

Comments Preprint

详情
AI中文摘要

我们介绍了Padyam2Gadyam,一个用于诗歌到散文翻译任务的数据集,涵盖13至17世纪泰卢固语古典诗歌到当代泰卢固语和英语散文的翻译。该数据集包含600首诗歌及其人工验证的泰卢固语和英语散文翻译。我们评估了5种当代大语言模型(LLMs)在将诗歌翻译成泰卢固语和英语散文方面的能力。结果表明,尽管不同LLMs之间存在差异,但它们的整体表现在两种语言中仍有很大的改进空间。通过定性分析,我们讨论了当代机器翻译评估方法在此任务中的能力和局限性。

英文摘要

We introduce Padyam2Gadyam, a dataset for the task of poem-to-prose translation from 13th-17th century Telugu classical poetry to contemporary Telugu and English prose. The dataset consists of 600 poems and their human-verified Telugu and English prose translations. We evaluated 5 contemporary Large Language Models (LLMs) on their ability to do poem-to-prose translation into Telugu and English. Our results indicate that while there are differences across LLMs, their overall performance leaves substantial room for improvement in both languages. Through qualitative analysis, we discuss the capabilities and limitations of contemporary MT evaluation approaches for this task.

2606.02741 2026-06-03 cs.CL cs.CY 版本更新

Greener Than Humans? Environmental Attitudes in Large Language Models

比人类更绿色?大语言模型中的环境态度

Stefanie Kunkel, Tilman Hartwig, Marcus Voss, Emma K. Schütt, Angelika Gellrich

Affiliations: University of Tübingen

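AI summary: A benchmark evaluating the environmental cognition, affect, and behavioural recommendations of 31 LLMs finds that many models lean more pro-environmental than German survey respondents, but show contextual sensitivity and sycophantic shifts.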

Comments Code can be found at https://gitlab.opencode.de/uba-ki-lab/llm-questionnaire-benchmarking-framework ; benchmark data and results can be found at https://zenodo.org/records/20445903

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于可持续发展相关的决策支持、报告和公共传播,但关于其输出中嵌入的环境态度的系统性证据很少。本文开发了一个基准,用于评估LLMs的环境认知、情感和行为建议,并将其应用于31个广泛使用的专有和开源模型。利用既有环境意识调查中的问题以及额外的可持续发展相关行为测量,我们比较了1)模型之间的LLM响应,以及2)模型与来自德国的人类调查基准之间的响应。我们评估了它们在提示条件下的稳健性。我们发现,许多LLMs比平均调查受访者更倾向于环境进步态度,表现出更高的环境情感和认知水平,并推荐与大量潜在二氧化碳减排相关的行为。同时,我们观察到可持续导向的响应与模型来源、大小或发布背景之间没有系统性的关系。然而,模型表现出语境敏感性,受基于角色的提示控制,并显示出反映用户指定意识形态立场的谄媚偏移,这引发了关于现实部署中可操纵性和规范可靠性的担忧。我们的发现提供了一个可重复使用的评估框架,用于评估LLMs中与可持续发展相关的价值对齐,并强调了随着AI系统日益嵌入可持续转型和公共决策中,治理、透明度和关键监督的重要性。

英文摘要

Large language models (LLMs) are increasingly used in sustainability-related decision support, reporting, and public communication, yet little systematic evidence exists on the environmental attitudes embedded in their outputs. This paper develops a benchmark for evaluating environmental cognition, affect, and behavioural recommendations in LLMs and applies it to 31 widely used proprietary and open-weight models. Drawing on questions from established environmental awareness surveys and additional sustainability-related behavioural measures, we compare LLM responses 1) among models and 2) between models and human survey benchmarks from Germany. We assess their robustness across prompting conditions. We find that many LLMs align more closely with environmentally progressive attitudes than the average survey respondent, exhibiting higher levels of environmental affect and cognition and recommending behaviours associated with substantial potential CO2 reductions. At the same time, we observe no systematic relationship between sustainability-oriented responses and model origin, size, or release context. However, models exhibit contextual sensitivity, controlled by persona-based prompting, and show sycophantic shifts mirroring user-specified ideological positions, which raises concerns about steerability and normative reliability in real-world deployments. Our findings provide a reusable evaluation framework for assessing sustainability-related value alignment in LLMs and highlight the importance of governance, transparency, and critical oversight as AI systems become increasingly embedded in sustainability transformations and public decision-making.

2606.02737 2026-06-03 cs.IR cs.AI cs.CL 版本更新

Attention Calibration for Position-Fair Dense Information Retrieval

面向位置公平的密集信息检索的注意力校准

Andrianos Michail, Elias Schuhmacher, Juri Opitz, Simon Clematide, Rico Sennrich

Affiliations: Department of Computational Linguistics, University of Zurich

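AI summary: To address positional bias in dense retrieval models, proposes inference-time attention calibration with a strength coefficient λ that interpolates between the original and fully calibrated attention distributions, improving positional fairness without retraining or loss of overall retrieval effectiveness; experiments across several datasets and models show that partial calibration often beats full calibration, and a default configuration is provided.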

详情
AI中文摘要

密集检索模型存在位置偏差:当相关信息出现在段落较后位置时,检索效果会下降(Zeng et al., 2025)。我们探究是否可以在推理时减少这种偏差,无需重新训练且不牺牲整体检索效果。为此,我们将推理时的注意力校准(Schuhmacher et al., 2026)适配到下游检索,并引入强度系数λ,在原始注意力分布和完全校准的注意力分布之间进行插值。在SQuAD-PosQ和FineWeb-PosQ上的三个嵌入模型上,我们考察了篮子大小、校准层集和强度如何影响位置公平性与检索效果之间的权衡,发现部分校准通常优于完全校准。单个配置(B=128, λ=0.5, 50%层深度)在FineWeb-PosQ上提升了所有三个模型跨位置组的nDCG@10的调和平均值,无需逐模型调参,并且适用于<s>-池化和最后token池化两种架构。该默认配置无需修改即可迁移到PosIR(涵盖10种语言和31个领域),在所有16种长度四分位×模型×检索设置组合中降低了位置敏感指数,同时保持或提升了整体nDCG@10。我们在以下网址发布扩展后的代码库:https://github.com/impresso/fair-sentence-transformers

英文摘要

Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference-time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient lambda that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD-PosQ and FineWeb-PosQ, we examine how basket size, calibrated layer set, and strength affect the trade-off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. A single configuration (B=128, lambda=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb-PosQ for all three models without per-model tuning, and applies to both <s>-pooled and last-token-pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length-quartile x model x retrieval-setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at https://github.com/impresso/fair-sentence-transformers
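
A minimal sketch of the two quantities named above: a strength coefficient that blends the original and fully calibrated attention distributions, and the harmonic mean of a metric across positional groups. Reading "interpolates" as a linear blend followed by renormalization is an assumption, as are the function names and toy numbers; how the calibrated distribution itself is computed is not covered here.

```python
def interpolate_attention(original, calibrated, lam=0.5):
    """Blend two attention distributions over the same tokens.

    lam=0 keeps the original weights, lam=1 uses the fully calibrated ones.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = [(1.0 - lam) * o + lam * c for o, c in zip(original, calibrated)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize against rounding drift


def harmonic_mean(values):
    """Harmonic mean; it is pulled toward the weakest group."""
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical attention over three tokens: the original favors the first token.
original = [0.7, 0.2, 0.1]
calibrated = [0.4, 0.3, 0.3]
mixed = interpolate_attention(original, calibrated, lam=0.5)

# Hypothetical nDCG@10 for early-position and late-position relevant content.
balanced = harmonic_mean([0.6, 0.6])
skewed = harmonic_mean([0.8, 0.4])
```

Both toy settings have the same arithmetic mean (0.6), but the harmonic mean is lower for the skewed pair, which is why it is a natural summary when fairness across positional groups matters.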

2606.02628 2026-06-03 cs.LG cs.CL 版本更新

Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

幻觉可从量化LLM中间层隐藏状态线性解码

Aizierjiang Aiersilan

Affiliations: University of Macau

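AI summary: Studies whether mid-layer hidden states of open-source LLMs under 4-bit quantization encode a linearly separable truthfulness signal; a single-layer linear probe reaches 0.904-1.000 AUROC, outperforming sampling-based methods, and the signal is approximately linear.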

详情
AI中文摘要

我们研究开源LLM是否在其隐藏状态中编码线性可分的真实性信号,以及该信号在网络哪一层最强。在三个7B-8B指令微调模型(Llama-3.1-8B、Mistral-7B、Qwen2.5-7B)以4位NF4量化加载的情况下,我们在四个幻觉基准(TruthfulQA、HaluEval-QA、FEVER和一个受控合成集)上提取每层隐藏状态,并比较四种检测方法:线性探针、MLP探针、INSIDE EigenScore、自一致性和注意力熵。单个中间网络层的线性探针在保留分割上达到0.904-1.000 AUROC,而基于采样的检测器在相同协议下不超过0.541 AUROC。真实性信号近似线性:MLP探针很少超过线性探针0.01 AUROC。在自然语言基准上,峰值探测层落在模型家族的一致范围内——Llama和Mistral的32层中第13-18块,Qwen的28层中第19-25块。第一块注意力熵在知识基础设置中提供互补信号(HaluEval-QA上0.866-0.941 AUROC),且无额外推理成本。该协议下采样方法的低区分性反映了配对标签评估与这些方法访问信息之间的结构性不匹配,而非这些方法的固有限制。代码和数据已发布,可在单个8 GB GPU上完全复现。

英文摘要

We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three 7B-8B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in 4-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves 0.904-1.000 AUROC on held-out splits, while sampling-based detectors do not exceed 0.541 AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than 0.01 AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks -- blocks 13-18 of 32 for Llama and Mistral, and blocks 19-25 of 28 for Qwen. First-block attention entropy provides a complementary signal in knowledge-grounded settings (0.866-0.941 AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single 8 GB GPU.
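
For readers unfamiliar with the metric, the sketch below shows how a linear probe's scores translate into AUROC: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The probe weights, hidden states, and labels are toy values invented for illustration; they are not from the paper.

```python
def linear_probe_score(hidden_state, weights, bias=0.0):
    """Score of a linear probe: a dot product plus a bias."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias


def auroc(scores, labels):
    """AUROC via pairwise comparison; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical 2-D hidden states from one layer; label 1 = truthful answer.
hidden_states = [[0.9, 0.1], [0.2, 0.3], [0.8, 0.9], [0.1, 0.2]]
labels = [1, 1, 0, 0]
weights = [1.0, -0.5]  # hypothetical probe weights

scores = [linear_probe_score(h, weights) for h in hidden_states]
probe_auroc = auroc(scores, labels)
chance_auroc = auroc([0.5, 0.5, 0.5, 0.5], labels)
```

A constant scorer lands at 0.5, the chance level against which figures such as the 0.541 reported for sampling-based detectors can be read.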

2606.02584 2026-06-03 cs.CL cs.AI cs.IR 版本更新

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

IdiomX:习语理解、检索和解释的多语言基准

Ayman Ali Sharara

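AI summary: Proposes IdiomX, a large-scale multilingual idiom benchmark built with a reproducible multi-stage pipeline, covering 190K+ contextualized examples and 12K+ idioms and defining four tasks (detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, idiom interpretation); experiments show that contextual transformer models improve detection, hybrid retrieval with reranking strengthens monolingual and cross-lingual retrieval, and idiom interpretation can be modeled as a semantic retrieval task.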

Comments 12 pages, 21 figures. Includes dataset and code. Resources available on HuggingFace, Kaggle, and GitHub

详情
AI中文摘要

习语表达仍然是自然语言处理中的持续挑战,因为它们的含义通常是非组合性的、依赖于上下文的,并且难以跨语言对齐。现有的习语资源在规模、上下文多样性或多语言覆盖方面往往有限,限制了它们对现代语言模型的实用性。我们介绍了IdiomX,一个用于习语理解、检索和解释的大规模多语言基准,通过可复现的多阶段流水线构建,结合词汇资源提取、大规模归一化、受控的大语言模型丰富和结构化验证。生成的数据集包含超过190K个上下文示例,涵盖12K+习语,具有对齐的英语、阿拉伯语和法语语义表示、习语和字面用法标签以及丰富的语言元数据。基于这一资源,我们定义了一个统一的四任务基准,涵盖习语检测、上下文到习语检索、阿拉伯语到英语习语检索和习语解释,将评估从比喻识别扩展到语义基础和可解释的含义检索。实验表明,上下文Transformer模型显著提高了习语检测,而混合检索和重排序架构则显著增强了单语和跨语言习语检索。结果进一步表明,习语解释可以有效地建模为语义检索任务,将可解释性作为基准的补充维度。总体而言,IdiomX提供了一个可扩展的基准,用于研究从检测到检索和语义解释的习语语言进展,并提供了一个模块化框架,可扩展到其他语言和比喻推理任务。

英文摘要

Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often limited in scale, contextual diversity, or multilingual coverage, restricting their utility for modern language models. We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata. Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval. Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension. Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks.

2606.02461 2026-06-03 cs.AI cs.CL 版本更新

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

AGENTCL:面向语言代理持续学习的严格评估

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su

Affiliations: The Ohio State University; Johns Hopkins University; Intuit AI Research

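AI summary: Proposes the AgentCL evaluation framework, which uses controlled task streams and transfer-gain metrics to rigorously evaluate continual learning in language agents, and develops the MemProbe probing method to diagnose the effect of memory design.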

Comments 10 pages in the main text, 26 pages in total

详情
AI中文摘要

语言代理在解决单个任务上花费大量推理时间,但一个回合中获得的经验在后续回合中往往未被充分利用。持续学习期望代理在任务流中积累可重用经验,随时间改进,并避免无关经验的干扰。不幸的是,现有基准难以严格评估语言代理中的持续学习。大多数工作侧重于长上下文对话或文档的检索和推理,而最近的生命周期适应基准通常依赖于简单的任务流,对跨任务关系的分析有限,使得难以理解代理随时间学习和重用的内容。本文提出了一个用于代理持续学习的评估框架AGENTCL,其核心是受控任务流和迁移增益指标。AGENTCL构建了组合流,其中早期的子解决方案、证据或工作流有意在后续任务中可重用,并与不保证这种可重用性的简单流形成对比。我们使用该基准评估用于持续学习的非参数记忆设计。为了诊断记忆设计选择如何影响持续学习,我们开发了MemProbe,一种探针方法,存储交互、洞察和技能,同时在整合过程中过滤不可靠的经验。跨编码、深度研究和语言理解/推理任务的实证分析表明,简单流区分记忆设计的能力有限,而受控流更清晰地区分其可塑性。同时,简单和保留设置通常产生有限的增益,并可能暴露记忆引起的退化。这些结果突显了需要更强的记忆设计,以平衡可塑性和稳定重用。

英文摘要

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.

2606.02437 2026-06-03 cs.LG cs.CL 版本更新

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

论PEFT的扩展:迈向万亿参数百万个性化模型

Mind Lab: Vin Bo, Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Wenhao Li, Zhihui Li, Allen Lin, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Shiyang Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Adrian Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

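AI summary: Studies parameter-efficient fine-tuning (PEFT) as persistent local state on top of shared foundation models, analyzes its viability as a substrate for personal models along three scaling axes (up, down, out), and presents MinT as infrastructure for managing the adapter lifecycle.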

详情
AI中文摘要

参数高效微调(PEFT)通常被视为全微调的廉价替代方案。我们研究了一个更广泛的作用:将小型可训练适配器作为强大共享基础模型之上的持久局部状态。在这种框架下,基础模型提供共享能力,而适配器承载实例特定行为,如偏好、技能、工具习惯和类似记忆的更新。我们围绕三个扩展轴组织问题:向上扩展,更强的共享先验使小型局部更新更有用;向下扩展,研究适配器可以多小同时保持可靠性;向外扩展,许多持久化适配实例共存。MinT提供了一个管理适配器身份、修订、来源、评估和服务驻留的基础设施示例。综合来看,结果表明PEFT可以成为持久个性化模型的紧凑基质,而不仅仅是全微调的预算替代方案。

英文摘要

Parameter-efficient fine-tuning (PEFT) is usually treated as a cheaper alternative to full fine-tuning. We study a broader role: small trainable adapters as persistent local state on top of strong shared foundation models. In this framing, the base model provides shared competence while adapters carry instance-specific behavior such as preferences, skills, tool habits, and memory-like updates. We organize the problem around three scaling axes: Scale Up, where stronger shared priors make small local updates more useful; Scale Down, where we study how small adapters can be while remaining reliable; and Scale Out, where many persistent adapted instances coexist. MinT provides one infrastructure example for managing adapter identity, revision, provenance, evaluation, and serving residency. Together, the results suggest that PEFT can be a compact substrate for persistent personal models rather than only a budget substitute for full fine-tuning.

2606.02332 2026-06-03 cs.AI cs.CL cs.LG 版本更新

Forget Attention: Importance-Aware Attention Is All You Need

忘记注意力:重要性感知注意力即你所需

Suhyeong Shin, Yeongwook Yang

Affiliations: Department of Computer Engineering

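AI summary: Proposes SISA, which folds a state space model's importance signal directly into the attention score computation (score-level fusion), combining global retrieval with importance ranking in language modeling.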

Comments 20 pages, 6 figures, 25 tables

详情
AI中文摘要

将注意力的全局检索与状态空间模型(SSM)的顺序重要性信号相结合是混合语言建模的开放挑战。Transformer能看见所有位置但无法区分优先级;SSM知道什么重要但无法重新访问。现有混合模型——Jamba(块级)和Hymba(头级)——将两者置于独立模块,因此在注意力计算过程中彼此无法相互影响。我们提出SISA(SSM引导的Softmax注意力),该方法在注意力分数内部直接添加SSM导出的重要性项,并通过在增强的查询/键向量上执行单个SDPA调用来实现完整操作——无需循环状态,无需自定义内核。在152M/5B token上,SISA在LAMBADA-greedy上达到17.3%(对比Transformer的13.9和Mamba-3的15.5),并从第1K步起实现NIAH 100%,比Transformer的检索收敛速度快7倍;在369M规模下,Mamba-3在LAMBADA上领先,而SISA保持完美的NIAH和标准SDPA执行。因此,SISA为SSM-注意力混合模型定义了第三个设计轴——分数级融合——超越了此前主导该领域的块级和头级范式。

英文摘要

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
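
One way to read "an importance term inside the attention score, realized as a single SDPA call on augmented query/key vectors" is the identity [q; 1] · [k; s] = q · k + s. The sketch below checks that identity on toy numbers. The purely additive form, the per-key scalar importance, and the omission of the usual 1/sqrt(d) scaling are assumptions for illustration; the abstract does not give SISA's exact formulation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(query, keys):
    """Plain softmax attention weights (unscaled, for clarity)."""
    return softmax([dot(query, k) for k in keys])


def biased_attention_weights(query, keys, importance):
    """Attention with an explicit additive importance term per key."""
    return softmax([dot(query, k) + s for k, s in zip(keys, importance)])


def augmented_attention_weights(query, keys, importance):
    """Same result from ordinary attention on augmented vectors."""
    q_aug = query + [1.0]                              # append a constant 1
    k_aug = [k + [s] for k, s in zip(keys, importance)]  # append importance
    return attention_weights(q_aug, k_aug)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
importance = [0.0, 2.0, 0.0]  # hypothetical SSM-derived scores, one per key

base = attention_weights(query, keys)
biased = biased_attention_weights(query, keys, importance)
augmented = augmented_attention_weights(query, keys, importance)
```

Because the bias is absorbed into the vectors, the augmented form needs nothing beyond a standard attention kernel, which is consistent with the abstract's "no custom kernel" remark.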

2606.02240 2026-06-03 cs.CR cs.AI cs.CL cs.ET 版本更新

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

AgentRedBench: 针对SaaS集成的LLM代理的动态红队测试与集成感知防御

Hiskias Dingeto, William Leeney

Affiliations: StackOne Technologies

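AI summary: To address indirect prompt injection against tool-using LLM agents, proposes the dynamic red-teaming benchmark AGENTREDBENCH (24 enterprise integrations, 5 attack types) and AGENTREDGUARD, a guard model trained on an integration-diverse corpus, which cuts attack success rate from 69.9% to 2.4% at a false-positive rate of only 0.37%.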

详情
AI中文摘要

工具使用代理中的间接提示注入是一个具体的生产威胁:LLM代理读取来自集成(通过工具调用访问的第三方服务,如Gmail、Salesforce或Jira)的响应内容,用户既未编写也无法控制这些内容。现有基准低估了该威胁:大多数仅覆盖少量集成,且每次运行重复相同的攻击载荷,而开源防护模型是在聊天风格数据而非工具响应内容上训练的。我们引入了AGENTREDBENCH,这是一个动态的LLM驱动的红队测试基准,包含215个微妙的未明确授权场景(在用户请求授权边界上的攻击),涵盖9个功能家族、24个企业集成和5种攻击类型。在八模型面板(Anthropic、OpenAI、Google)上,无防护的攻击成功率(ASR)范围从32%(Claude Sonnet 4.6)到81%(Gemini 3 Flash)。为了保持场景集不在训练语料中,并随时间保持标题ASR的意义,我们开源了代码库、集成模式和AGENTREDGUARD模型;规范场景通过维护者中介渠道进行评估,具有不可变版本控制。我们随基准发布了AGENTREDGUARD:一个在集成多样化的对抗性工具响应内容语料上训练的防护模型。AGENTREDGUARD将面板ASR从69.9%降至2.4%,误报率为0.37%,在两个指标上均优于所有具有非平凡检测能力的开源基线(Llama Guard、PromptGuard 2、ProtectAI)。跨集成和跨攻击类型的保留测试均证实了增益在训练子集之外具有迁移性。

英文摘要

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce AGENTREDBENCH, a dynamic LLM-driven redteaming benchmark of 215 subtle underspecified authorization (attacks at the boundary of what the user's request authorises) scenarios across 24 enterprise integrations in nine functional families and five attack types. Across an eight-model panel (Anthropic, OpenAI, Google), no-guard ASR (attack success rate) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). To keep the scenario set out of training corpora and preserve headline ASR meaning over time, we release the codebase, integration schemas, and AGENTREDGUARD model openly; the canonical scenarios are evaluated through a maintainer-mediated channel with immutable versioning. We release AGENTREDGUARD alongside the benchmark: a guard trained on an integration-diverse corpus of adversarial tool-response content. AGENTREDGUARD cuts panel ASR from 69.9% to 2.4% at 0.37% false-positive rate, outperforming every open-source baseline with non-trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross-integration and cross-attack type holdouts both confirm the gain transfers beyond the training subset.

2606.02091 2026-06-03 cs.CL 版本更新

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

DFlare: 扩展块扩散推测解码的草稿容量

Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li

Affiliations: School of Computer Science, Peking University; Tencent

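AI summary: Proposes DFlare, which scales up draft-model capacity through a lightweight layer-wise fusion mechanism, reaching an average 5.52x speedup on Qwen3-4B across six benchmarks.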

Comments 12 pages, 3 figures

详情
AI中文摘要

块扩散推测解码通过同时预测一个块内的所有令牌,供目标模型并行验证,从而加速LLM推理。一次性预测整个块需要足够强大的草稿模型和有效利用目标模型的内部知识。然而,最先进的方法DFlash限制所有草稿层共享仅从少数目标层导出的单个融合表示,限制了逐层表达能力并阻碍了草稿容量的进一步扩展。在本文中,我们提出DFlare,通过轻量级逐层融合机制扩展DFlash的狭窄条件瓶颈:每个草稿层关注其自身可学习的广泛目标层组合,开销可忽略,同时注入更丰富的目标知识并为每个草稿层提供不同的输入。这种增强的逐层表达能力使得草稿模型能够扩展到更深的架构并获得一致的性能提升。我们进一步将训练数据从80万样本扩展到240万样本,以充分利用扩大的容量。在涵盖数学推理、代码生成和对话的六个基准上,DFlare在Qwen3-4B上平均加速5.52倍,在Qwen3-8B上平均加速5.46倍,在GPT-OSS-20B上平均加速3.91倍,分别比DFlash提高约11%、8%和5%。我们的代码可在https://github.com/Tencent/AngelSlim获取。

英文摘要

Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present DFlare, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
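
The contrast between one shared fused representation and a per-draft-layer combination of target layers can be sketched as below. Modeling "its own learnable combination" as a softmax-weighted sum over target-layer hidden vectors is an assumption, and the logits and vectors are toy values; the actual fusion mechanism is not described in the abstract.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_target_layers(target_states, logits):
    """Weighted sum of target-layer hidden vectors with softmax(logits) weights."""
    weights = softmax(logits)
    dim = len(target_states[0])
    return [
        sum(w * state[i] for w, state in zip(weights, target_states))
        for i in range(dim)
    ]


# Hypothetical hidden vectors from three target-model layers (one token).
target_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# One set of mixing logits per draft layer (these would be learned).
per_layer_logits = [
    [0.0, 0.0, 0.0],  # draft layer 0: uniform mix of all target layers
    [5.0, 0.0, 0.0],  # draft layer 1: mostly the first target layer
]

# Each draft layer receives its own fused input instead of a shared one.
draft_inputs = [fuse_target_layers(target_states, l) for l in per_layer_logits]
```

The only extra parameters in this sketch are the mixing logits, one per (draft layer, target layer) pair, which is in the spirit of the "negligible overhead" the abstract mentions.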

2606.02004 2026-06-03 cs.CL cs.LG 版本更新

Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

将零售产品名称编码为消费者价格类别的机器学习:基于规则加词袋的流水线,结合可靠性加权的人工参与标注

Vladimir Beskorovainyi

Affiliations: Besk Tech; Moscow Institute of Physics and Technology (MIPT)

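AI summary: Proposes a pipeline combining rules and bag-of-words models, with a reliability-weighted human-in-the-loop labeling protocol, to map retail product names to consumer-price categories (e.g. UN COICOP); experiments show bag-of-words models nearly saturate the task (F1 about 0.99), while reliability-weighted voting only barely beats plain majority voting.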

Comments 11 pages, 3 tables. Methodology paper; illustrative experiments only, no proprietary data

详情
AI中文摘要

消费者价格测量越来越多地依赖替代数据源——扫描仪、网络抓取和交易/收据数据。一个反复出现的障碍是,这些来源中的产品描述简短、嘈杂且缩写,没有标准产品代码,因此每个项目必须首先映射到消费分类(例如,联合国COICOP方案),然后才能比较价格。本文将该映射作为一种通用的、可重复的方法进行研究。流水线包括:(i) 对嘈杂项目名称进行文本归一化和分词;(ii) 基于每类关键词和停用词的前缀树(trie)规则预分类器;(iii) 每个类别的二元确认模型,决定一个项目是否属于暂定分配的类别。对于大规模标注,我们使用人工参与协议,其中标注者给出二元有效/拒绝判断,通过动态更新的可靠性权重进行聚合;模型加入相同的规则,实现持续微调。我们的实证发现是通货紧缩的:在一个受控、无泄漏的研究中(一个类别,真实正例与困难负例,五个随机种子),词袋模型基本上饱和了任务(F1约0.99)——线性分类器匹配多层感知器,显式词序(n-gram)特征没有增加任何价值,约67个标注样本已经足够。标注协议的蒙特卡洛研究表明,可靠性加权投票勉强超过简单多数投票(其加性权重饱和),而Dawid-Skene方法明显更好地恢复标签。我们还讨论了价格层面的质量控制和统计办公室考虑交易数据时的设计经验。所有数字均为示意性;未复制任何机密数据、代码或文档。

英文摘要

Consumer-price measurement increasingly draws on alternative data sources -- scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a consumption classification (e.g., the UN COICOP scheme) before prices can be compared. This paper studies that mapping as a general, reproducible method. The pipeline is: (i) text normalization and tokenization of noisy item names; (ii) a prefix-tree (trie) rule-based pre-classifier driven by per-category key-phrases and stop-phrases; and (iii) a per-category binary confirmation model deciding whether an item belongs to a tentatively assigned category. For labels at scale we use a human-in-the-loop protocol in which annotators give a binary valid/reject judgment, aggregated by a dynamically updated reliability weight; the model joins the same rule, enabling continual fine-tuning. Our empirical finding is deflationary: in a controlled, leakage-free study (one category, real positives vs. hard negatives, five seeds), bag-of-words models essentially saturate the task (F1 about 0.99) -- a linear classifier matches a multilayer perceptron, explicit word-order (n-gram) features add nothing, and about 67 labeled examples already suffice. A Monte-Carlo study of the labeling protocol shows the reliability-weighted vote barely beats plain majority (its additive weights saturate) while Dawid-Skene recovers labels markedly better. We also discuss price-level quality control and design lessons for statistical offices considering transaction data. All figures are illustrative; no confidential data, code, or documentation is reproduced.
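
The labeling protocol described above (binary valid/reject votes aggregated with a dynamically updated reliability weight) can be sketched as follows. The additive update with clipping to [0, 1], the step size, and all names are assumptions chosen to illustrate why additive weights can saturate; the paper's actual update rule is not given in the abstract.

```python
def majority_vote(votes):
    """True if strictly more than half of the annotators voted 'valid'."""
    return sum(votes.values()) * 2 > len(votes)


def weighted_vote(votes, reliability):
    """True if the reliability mass behind 'valid' exceeds that behind 'reject'."""
    valid = sum(reliability[a] for a, v in votes.items() if v)
    reject = sum(reliability[a] for a, v in votes.items() if not v)
    return valid > reject


def update_reliability(reliability, votes, decision, step=0.1):
    """Additive update: reward agreement with the decision, penalize disagreement.

    Weights are clipped to [0, 1], so a consistently agreeing annotator
    saturates at 1.0 and stops gaining influence.
    """
    updated = dict(reliability)
    for annotator, vote in votes.items():
        delta = step if vote == decision else -step
        updated[annotator] = min(1.0, max(0.0, updated[annotator] + delta))
    return updated


# Hypothetical annotators: one trusted, two less reliable.
reliability = {"a": 0.9, "b": 0.3, "c": 0.3}
votes = {"a": True, "b": False, "c": False}  # True = "valid", False = "reject"

plain = majority_vote(votes)
weighted = weighted_vote(votes, reliability)

# Two rounds of updates against the weighted decision.
after_one = update_reliability(reliability, votes, weighted)
after_two = update_reliability(after_one, votes, weighted)
```

Here the trusted annotator overrules the majority, and after two rounds that annotator's weight is pinned at the cap, a toy version of the saturation effect the abstract reports.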

2606.01904 2026-06-03 cs.CL cs.AI 版本更新

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

KliniskVestBERT: 针对挪威临床文本特化的BERT模型

Christian Autenried, Cosimo Persia

Affiliations: helse-vest-ikt

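AI summary: Builds the KliniskVestBERT models by continuing pre-training of existing BERT models on Norwegian clinical text; they consistently outperform their baseline models on clinical NLP tasks.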

详情
AI中文摘要

自然语言处理(NLP)在医疗保健中的应用日益增长,这要求语言模型特别适应临床语言的复杂性。本文介绍了KliniskVestBERT,这是一套基于BERT的编码器模型套件,在来自Helse Vest的大量真实、去标识化的挪威临床文本语料库上进行预训练。我们在专门的临床数据集上继续预训练现有的语言模型Nb-BERT-large、NorBERT3-large和ModernBERT。该数据集基于Helse Vest患者的代表性人群。包含的文档类型经过精心策划,涵盖了bokmål和nynorsk的广泛临床范围,包括出院小结、手术报告、护理记录等,确保全面代表挪威医疗环境中的语言景观。在三个合成挪威临床基准数据集和两个真实世界问题上的评估表明,每个临床特化模型都持续优于其基线对应模型,突显了领域特定预训练对临床领域NLP任务的显著益处。该项目由所有Helse Vest实体(Helse Bergen、Helse Fonna、Helse Førde和Helse Stavanger)与DIPS在Helse Vest ICT的项目领导下共同完成。

英文摘要

The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pre-training the existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset. This dataset is based on a representative population of Helse Vest patients. The included document types are carefully curated to encompass a broad clinical spectrum in bokmål and nynorsk, including discharge summaries, surgical reports, and nursing notes, among others, ensuring comprehensive representation of the linguistic landscape within Norwegian healthcare settings. Evaluation on three synthetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms its baseline counterpart, highlighting the significant benefit of domain-specific pre-training for NLP tasks within the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde and Helse Stavanger) with DIPS under the project lead of Helse Vest ICT.

2606.01849 2026-06-03 cs.LG cs.CL cs.CR 版本更新

ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?

ContinuousBench: 差分隐私合成文本能否提升能力?

Peihan Liu, Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, Lillian Tsai, Alex Bie

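Affiliations: Columbia University; NYU; Google Research; University of Waterloo; Vector Institute; Google

AI summary: Proposes ContinuousBench, a benchmark built on continuously refreshed datasets that tests whether differentially private synthetic text can transfer new knowledge from the original corpus. Experiments show that non-private synthesis transfers knowledge effectively, whereas state-of-the-art DP synthesis methods largely fail even at high privacy budgets.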

Comments For datasets, see https://huggingface.co/ContinuousBench; for the evaluation harness, see https://github.com/plau666/ContinuousBenchEval; for an accompanying blog post, see https://peihanliu.com/posts/continuousbench.html

详情
AI中文摘要

差分隐私(DP)文本合成有望为模型训练解锁敏感语料库,但目前尚不清楚DP合成数据是否能传递仅存在于这些语料库中的真正新知识和能力。这是因为现有评估依赖于无需训练即可几乎解决的任务,因此强大的基准性能并不能证明DP合成可以替代原始数据访问。为此,我们引入了ContinuousBench,一个持续自动更新的基准,用于衡量DP合成文本带来的能力提升。每季度发布一次新版本,配对一个从未见过的训练语料库和一个衍生的问答集,其构建满足:(1)无语料库无法解决;(2)在DP下可学习,因为测试的知识由数百条独立记录支持。研究人员从训练语料库生成DP合成数据,并在其合成数据上运行我们的标准化训练和评估框架以衡量增益。我们实例化两个轨道:Geminon,一个关于虚构生物的程序生成数据集;以及News,一个新爬取的公共新闻文章流。尽管标准基准几乎饱和,但在ContinuousBench上,我们发现非隐私合成从原始语料库转移了大量知识,而最先进的DP合成方法即使在高隐私预算(ε=100)下也基本无法做到。

英文摘要

Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly solvable without training, so strong benchmark performance does not establish that DP synthesis can substitute for access to the original data. We therefore introduce ContinuousBench, a continuously and automatically regenerated benchmark that measures capability gain from DP synthetic text. Each quarter, a new release pairs a never-before-seen training corpus with a derived QA set, constructed to be: (1) unsolvable without the corpus; and (2) learnable under DP, as the tested knowledge is supported by hundreds of independent records. Researchers produce DP synthetic data from the training corpus and run our standardized training and evaluation harness on their synthetic data to measure gains. We instantiate two tracks: Geminon, a procedurally generated dataset about fictional creatures; and News, a stream of newly crawled public news articles. Although standard benchmarks are nearly saturated, on ContinuousBench we find that non-private synthesis transfers substantial knowledge from the original corpus, while state-of-the-art DP synthesis methods generally fail to do so, even at $\varepsilon=100$.
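To put the privacy budget in context, the standard definition of $(\varepsilon,\delta)$-differential privacy is recalled below. This is textbook background rather than something stated in the abstract; the mechanism $M$, neighboring datasets $D, D'$, and output set $S$ are the usual symbols.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all measurable } S .
% At \varepsilon = 100 the multiplicative factor is e^{100} \approx 2.7 \times 10^{43},
% so the bound constrains the output distribution only very weakly.
```

Read this way, $\varepsilon=100$ is a very loose guarantee, which is what makes a failure to transfer knowledge at that budget notable.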

2606.01629 2026-06-03 cs.CL 版本更新

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

基准测试LLM作为裁判在长文本输出评估中的应用

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai

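Affiliations: Department of Computer Science and Technology, Tsinghua University; University of Science and Technology Beijing

AI summary: Proposes the LongJudgeBench benchmark to systematically assess the reliability of LLM judges on long-form outputs, and finds a substantial reliability gap in current models.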

详情
AI中文摘要

随着大语言模型(LLM)越来越多地用于长文本生成,可靠地评估长文本输出已成为一个关键挑战。LLM作为裁判提供了一种可扩展的替代人工评估的方法,但其在长文本输出评估中的可靠性仍未得到充分检验:现有的元评估基准主要关注短文本输出。与短文本评估相比,长文本评估不仅仅是输出长度的问题;它通常要求裁判处理更复杂的文档级需求。在这项工作中,我们引入了LongJudgeBench,这是一个全面的基准,用于评估LLM裁判在跨多种真实场景和评判协议下的长文本输出表现。我们系统地评估了广泛的LLM裁判,涵盖了多个基础模型和评判设置。我们的结果揭示了显著的可靠性差距:当前的LLM裁判在不同场景下仍然不稳定,而评分标准或参考虽然有所帮助,但并非总是足够。我们希望LongJudgeBench能够支持未来在更稳健、上下文感知且与人类对齐的LLM-as-a-judge方法上的研究。我们的代码可在https://anonymous.4open.science/r/LongJudgeBench-F782获取。

英文摘要

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to make more complex document-level assessments of overall organization, task-relevant coverage and depth, cross-section consistency, and scenario-specific quality criteria. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://github.com/cjj826/LongJudgeBench.

2606.01166 2026-06-03 cs.CR cs.CL 版本更新

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

BraveGuard: 从开放世界威胁到更安全的计算机使用代理

Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding, Ming Wen, Yixu Wang, Yuxiang Xie, Baihui Zheng, Yingshui Tan, Yige Li, Yutao Wu, Kerui Cao, Wenke Huang, Yanming Guo, Xingjun Ma, Yu-Gang Jiang

Affiliations: Fudan University; Ant Group; Hunan Institute of Advanced Technology; Alibaba Group; Singapore Management University; Deakin University; Nanyang Technological University; Shanghai Innovation Institute

AI summary: Proposes the BraveGuard framework, which trains guard models on open-world threat signals and realistic agent trajectories to detect safety problems at the trajectory level, substantially improving the safety of computer-use agents.

详情
AI中文摘要

计算机使用代理将语言模型从文本生成扩展到与文件、终端、浏览器和外部工具的持续交互。这种转变带来了安全风险,这些风险难以从孤立的提示或最终响应中检测出来,因为危害通常只在多步执行轨迹中显现,而单个动作在局部看似无害。我们引入了BraveGuard,一个自我进化的防御框架,用于从开放世界威胁信号和真实代理轨迹中训练防护模型。BraveGuard挖掘近期研究来源以识别新兴风险和攻击模式,将其实例化为可执行的计算机使用任务,收集代理轨迹,并为防护模型训练提供轨迹级别的监督。随着新威胁和验证失败的出现,可以重复该流程,形成一个自适应防御循环,而不是静态的、基准驱动的训练过程。我们通过训练多个防护骨干模型(包括Qwen3-Guard和Llama-Guard变体)来实例化BraveGuard,并在轨迹级别的代理安全基准上评估生成的防护模型。BraveGuard在计算机使用轨迹上持续提高了安全检测能力。在AgentHazard上,与现成的防护模型相比,它显著提高了检测准确性,在平均防护模型设置下,准确率从38.79%提升到82.38%。这些结果表明,基于开放世界威胁发现和真实代理执行的防护监督可以超越固定分类法和合成提示级数据,改进安全监控。BraveGuard为面对不断变化的现实世界风险的计算机使用代理提供了一条可扩展的自适应防御路径。

英文摘要

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.

2606.01075 2026-06-03 cs.CL 版本更新

On the Generalization Gap in Self-Evolving Language Model Reasoning

自进化语言模型推理中的泛化差距

Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan, Andrew Tomkins, Tu Vu, Da-Cheng Juan, Cyrus Rashtchian

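Affiliations: Google Research; Harvard University; Google; Virginia Tech

AI summary: Studies the gap between self-evolution algorithms and oracle supervision in a strict closed-loop setting. Multi-turn critic-revision strategies can approach oracle-supervised performance, but self-evolution still leaves a non-trivial gap.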

Comments Published at ICML 2026

详情
AI中文摘要

近期研究表明,大型语言模型(LLMs)可以通过自进化(SE)来改进,即使用模型自身生成的监督信号。在这项工作中,我们提出疑问:在严格的闭环设置下,自进化算法只能访问未标记的提示集和基础模型,内部生成的监督能多大程度接近完美监督训练?我们在统一的离线自进化框架中分析了四种代表性策略:单轮验证、带反馈的多轮修正、迭代训练和课程学习。我们的主要实验使用骑士与无赖(KK)逻辑推理任务,该任务提供确定性解决方案、可控难度级别以及易于泛化的干净测试平台。我们首先表明,自进化始终优于基础模型,但在投入过多训练计算后趋于平稳,最终仍与完美监督存在非平凡差距。我们发现,使用大型模型的多轮批评-修正可以达到强大的自进化性能,其中Gemma 12B几乎与完美监督训练相匹配。除了骑士与无赖任务,我们还在现实世界的推理基准上评估了自进化,其增益也较为有限。总体而言,我们的结果描述了闭环自进化何时能够提供帮助,并表明在这种最小化设定下,内部生成的监督仍然不足。

英文摘要

Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training? We analyze four representative strategies in a unified offline self-evolution framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. Our primary experiments use Knights and Knaves (KK) logical reasoning tasks, which provide deterministic solutions, controlled difficulty levels, and a clean testbed for easy-to-hard generalization. We first show that self-evolution consistently improves over the base model, but plateaus after excessive training compute is invested, and eventually still leaves a non-trivial gap to oracle supervision. We find that multi-turn critic-revision with large models can reach strong self-evolution performance, with Gemma 12B nearly matching oracle-supervised training. Beyond Knights and Knaves, we also evaluate self-evolution on real-world reasoning benchmarks, where gains are also modest. Overall, our results characterize when closed-loop self-evolution can help and show how internally generated supervision remains insufficient under this minimal formulation.
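The "deterministic solutions" property of Knights and Knaves puzzles can be made concrete with a brute-force checker. The sketch below is an illustration only: the puzzle encoding (a list of `(speaker, claim)` pairs) and the example puzzle are assumptions, not the benchmark's actual format.

```python
from itertools import product
from typing import Callable, List, Tuple

Roles = Tuple[bool, ...]  # roles[i] is True if person i is a knight
Statement = Tuple[int, Callable[[Roles], bool]]  # (speaker index, claim)


def solve(n_people: int, statements: List[Statement]) -> List[Roles]:
    """Enumerate all role assignments consistent with every statement.

    A knight's claim must be true and a knave's claim must be false, so a
    claim is consistent exactly when its truth value equals the speaker's role.
    """
    return [
        roles
        for roles in product([True, False], repeat=n_people)
        if all(claim(roles) == roles[speaker] for speaker, claim in statements)
    ]


# Hypothetical puzzle: A says "B is a knave"; B says "A and I are the same kind".
A, B = 0, 1
puzzle = [
    (A, lambda r: not r[B]),
    (B, lambda r: r[A] == r[B]),
]
solutions = solve(2, puzzle)
print(solutions)  # [(True, False)]: A is a knight, B is a knave
```

Because every puzzle can be checked exhaustively like this, a proposed answer is either provably right or provably wrong, which is what makes the task usable as an oracle for comparison. Difficulty can be varied by the number of people, since the search space has 2^n assignments.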

2605.31381 2026-06-03 cs.CL 版本更新

LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories

LLM 法官在安全标准和危害类别上不一致地存在分歧

Krishnapriya Vishnubhotla, Sowmya Vajjala, Akriti Vij, Isar Nejadgholi

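Affiliations: National Research Council, Canada; IMDA, Singapore

AI summary: Evaluates how consistently automated judges perform multi-dimensional safety evaluation in a reference-free setup. LLMs are unreliable at identifying safety issues in machine-generated advice in regulated domains such as finance; the degree of inconsistency varies with the safety criterion, content language, and style, and different judges disagree strongly with one another.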

Comments 8 pages plus appendices, under review

详情
AI中文摘要

我们评估了自动法官在无参考设置下进行多维安全评估的一致性。结果表明,大型语言模型在识别金融等受监管领域中机器生成建议的安全问题时不可靠,尽管它们在识别更明显的危险/有害内容(如暴力)时更为可靠。模型判断的不一致程度可能因所选安全标准而有显著差异,并且可能受内容语言及其语言风格的影响。最后,对于相同的输出,不同法官之间在领域、安全标准和语言上存在高度分歧。这些发现为使用LLM作为评估者的实践提供了新见解,并为从业者如何在实际场景中使用自动法官提供了若干建议。

英文摘要

We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues related to machine-generated advice in regulated domains such as finance, although they are more reliable at identifying more overt forms of unsafe/harmful content such as violence. The degree of inconsistency in a model's judgments can vary significantly by the chosen safety criteria and can be impacted by the language of the content and its linguistic style as well. Finally, there is high disagreement among different judges for the same output, across domains, safety criteria, and languages. These findings provide new insights on the practice of using LLMs as evaluators and offer several recommendations for practitioners on how to use automated judges in practical scenarios.

2605.28910 2026-06-03 cs.CL cs.AI 版本更新

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

基于幻觉检测的偏好优化用于临床摘要生成

Shamanth Kuthpadi Seethakantha, Dung Ngoc Thai, Vara Prasad Gudi, Simran Tiwari, Rami Matar, Avijit Mitra, Wenlong Zhao, Andrew McCallum, Wael Salloum

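Affiliations: Manning College of Information and Computer Sciences; Ensemble HP; Columbia University; University of Massachusetts Amherst

AI summary: Proposes an inference-time method in which a hallucination detector guides iterative revision, together with preference-learning fine-tuning built on it; both substantially reduce hallucinations in clinical summaries.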

详情
AI中文摘要

大型语言模型(LLM)在摘要任务上展现出潜力,但常产生幻觉,即无依据或不正确的陈述,限制了其在专业医疗应用中的可靠性。我们引入幻觉检测引导的自我修正(Hallucination Detection Guided Self-Refinement,HDSR),一种推理时方法,利用幻觉检测器指导迭代摘要修正以实现事实更正。在此基础上,我们提出用于偏好学习的HDSR(HDSR-PL),将检测器引导的修正轨迹转化为偏好对以进行模型微调。大量实验表明,我们的方法在总结来自MIMIC-IV-Note v2.2的真实临床笔记时,显著减少了Llama和Gemma模型的幻觉。例如,HDSR在Llama-3.1-8B-Instruct上减少了24%的幻觉,而HDSR-PL减少了48%。重要的是,根据人类专家和LLM-Jury评估,两种方法都保持了摘要的流畅性、连贯性和相关性。这些结果共同表明,检测信息驱动的修正和偏好学习为提高临床摘要的事实准确性提供了一种自动化解决方案。

英文摘要

Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce Hallucination Detection Guided Self-Refinement (HDSR), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, we propose HDSR for Preference Learning (HDSR-PL), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV-Note v2.2. For example, on Llama-3.1-8B-Instruct, HDSR reduces hallucinations by 24% and HDSR-PL by 48%. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.

2603.29123 2026-06-03 cs.CL 版本更新

Learning Concepts, Not Tokens: Self-Supervised Semantic Alignment for Language Models

学习概念,而非词元:语言模型的自我监督语义对齐

Christine Zhang, Dan Jurafsky, Chen Shani

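Affiliations: Stanford University

AI summary: Proposes a self-supervised framework that replaces next-token prediction with prediction of sets of semantically equivalent tokens (concepts), improving semantic alignment in language models and yielding better performance on classification, clustering, and reranking, with comparable or stronger downstream reasoning.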

详情
AI中文摘要

下一个词元预测(NTP)目标训练语言模型在每一步预测单个词元,尽管许多后续表达可以传达相同含义。例如,在句子“this sticker can be placed here”中,positioned、attached 或 put 都是合理的替代。虽然标准 NTP 训练将这些替代视为互斥目标,但我们探索了一种自我监督框架,鼓励模型预测概念(近似为语义等价词元集合)。使用这种概念监督训练的模型在人类相似性判断上对齐更好,改进了分类、聚类和重排序性能,并实现了相当或更强的下游推理。这些提升伴随着语义有意义词元(第 3.2 节)上更低的困惑度,且全局困惑度仅最小增加,表明概念在保持语言建模质量的同时增强了语义对齐。我们的代码可在 https://anonymous.4open.science/r/learning-concepts-9025 获取。

英文摘要

The next-token prediction (NTP) objective trains language models to predict a single token at each step, even though many continuations can express the same meaning. For example, in the sentence "this sticker can be placed here", positioned, attached, or put are all plausible alternatives. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a self-supervised framework that encourages models to predict concepts, approximated as sets of semantically equivalent tokens. Models trained with this concept supervision align better with human similarity judgments, improve classification, clustering, and reranking performance, and achieve comparable or stronger downstream reasoning. These gains come with lower perplexity on semantically meaningful words (Section 3.2) and only minimal increases in global perplexity, suggesting that concepts enhance semantic alignment while preserving language modeling quality. Our code is available at https://anonymous.4open.science/r/learning-concepts-9025.
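The contrast between a single-token target and a set-valued target can be shown with a toy calculation. The abstract does not give the paper's actual objective, so the set-marginal loss below (negative log of the total probability assigned to the concept set) is only one plausible reading and is labeled as an assumption; the function names and the probabilities are made up for illustration.

```python
import math
from typing import Dict, Iterable


def ntp_loss(probs: Dict[str, float], target: str) -> float:
    """Standard next-token loss: only the single observed token counts."""
    return -math.log(probs[target])


def concept_loss(probs: Dict[str, float], concept: Iterable[str]) -> float:
    """Assumed set-level loss: credit any token in the equivalence set."""
    return -math.log(sum(probs[token] for token in concept))


# Made-up model distribution for the slot in "this sticker can be ___ here".
probs = {"placed": 0.30, "positioned": 0.25, "attached": 0.20,
         "put": 0.15, "eaten": 0.10}
concept = ["placed", "positioned", "attached", "put"]

print(round(ntp_loss(probs, "placed"), 3))     # 1.204
print(round(concept_loss(probs, concept), 3))  # 0.105
```

Under the single-token loss the model is penalized for spreading mass over "positioned", "attached", and "put"; under the set-level loss that mass counts toward the target, and only the probability on "eaten" is penalized.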

2512.24000 2026-06-03 cs.CL 版本更新

WISE: Web Information Satire and Fakeness Evaluation

WISE:网络信息讽刺与虚假评估

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

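Affiliations: Texas State University San Marcos

AI summary: Addresses the difficulty of separating fake news from satire with the WISE framework, which evaluates lightweight Transformer models on the Fakeddit dataset. MiniLM attains the highest accuracy and RoBERTa-base the highest ROC-AUC, showing that lightweight models are effective in resource-constrained settings.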

Comments This is the author's preprint. Accepted to WEB&GRAPH 2026 (co-located with WSDM 2026), Boise, Idaho, USA, Feb 26, 2026. Final version will appear in WSDM 2026 Companion Proceedings. Conf: https://wsdm-conference.org/2026/ Workshop: https://aiimlab.org/events/WSDM_2026_WEB_and_GRAPH_2026_Workshop_on_Web_and_Graphs_Responsible_Intelligence_and_Social_Media.html

详情
AI中文摘要

区分虚假或不真实的新闻与讽刺或幽默,由于它们重叠的语言特征和不同的意图,带来了独特的挑战。本研究开发了WISE(网络信息讽刺与虚假评估)框架,该框架在来自Fakeddit的20000个样本的平衡数据集上,对八个轻量级Transformer模型以及两个基线模型进行了基准测试,这些样本被标注为假新闻或讽刺。使用分层5折交叉验证,我们在包括准确率、精确率、召回率、F1分数、ROC-AUC、PR-AUC、MCC、Brier分数和期望校准误差在内的综合指标上评估模型。我们的评估显示,轻量级模型MiniLM在所有模型中达到了最高的准确率(87.58%),而RoBERTa-base达到了最高的ROC-AUC(95.42%)和强大的准确率(87.36%)。DistilBERT以86.28%的准确率和93.90%的ROC-AUC提供了出色的效率-准确率权衡。统计检验确认了模型之间的显著性能差异,配对t检验和McNemar检验提供了严格的比较。我们的发现强调,轻量级模型可以匹配或超过基线性能,为在现实世界资源受限的环境中部署错误信息检测系统提供了可行的见解。

英文摘要

Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.
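Of the metrics listed above, the Brier score and Expected Calibration Error are the least familiar, so a small worked example may help. This is a minimal sketch for a binary classifier; the equal-width binning on the confidence of the predicted class is one common ECE variant and an assumption here, since the abstract does not say which variant was used. The probabilities and labels are made up, not data from the paper.

```python
from typing import List


def brier_score(probs: List[float], labels: List[int]) -> float:
    """Mean squared gap between predicted P(class 1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(
    probs: List[float], labels: List[int], n_bins: int = 10
) -> float:
    """Weighted mean of |accuracy - mean confidence| over confidence bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            ece += len(b) / len(labels) * abs(accuracy - mean_conf)
    return ece


probs = [0.95, 0.85, 0.35, 0.65]  # made-up P(class 1) for four samples
labels = [1, 1, 0, 0]

print(round(brier_score(probs, labels), 4))                 # 0.1425
print(round(expected_calibration_error(probs, labels), 4))  # 0.125
```

Both metrics are lower-is-better. A model can have high accuracy and still score poorly on them if its probabilities are over- or under-confident, which is why they are reported alongside accuracy and ROC-AUC.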

2605.23055 2026-06-03 cs.LG cs.AI cs.CL 版本更新

Decomposing and Measuring Evaluation Awareness

分解与度量评估意识

Changling Li, Terry Jingchen Zhang, Jie Zhang, Zhijing Jin, Sahar Abdelnabi, Maksym Andriushchenko

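Affiliations: ETH Zürich; ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center; University of Toronto & Vector Institute; EuroSafeAI

AI summary: Drawing on social psychology, decomposes evaluation awareness into an environment component and a model component. Across nine frontier models and four benchmarks, recognition rates depend on the pairing of model and benchmark, recognition rarely changes behavior, and safety evaluations are more affected than capability evaluations. The factor-controlled EvalAwareBench benchmark is proposed to study which trigger factors each model is sensitive to.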

详情
AI中文摘要

前沿语言模型有时会意识到自己正在被评估并调整行为,从而破坏基准结果的有效性。然而,该领域的研究缺乏共同基础,混淆了评估属性与模型属性,以及检测与行为响应。我们将评估意识植根于社会心理学,将其分解为环境组件(任务的可识别程度)和模型组件(将识别与行动倾向分离)。我们通过八个分类触发因素(如占位符实体和评分式输出格式)来操作化环境组件,并通过思维链监控研究识别和行为。在九个前沿模型和四个基准上,识别率取决于模型和基准的具体配对,而非单独一方。识别很少导致行为改变,即使发生,方向也取决于所感知的评估类型。模型对安全评估的敏感性高于能力评估,使安全基准的有效性面临更大风险。为了研究每个模型对哪些因素敏感以及它们如何相互作用,我们提出了EvalAwareBench,这是一个包含100个配对安全-能力任务的因子控制基准,其中八个因素中的每一个都可以独立切换,在保持底层请求不变的同时改变评估信号。通过EvalAwareBench,我们发现没有单一因素能统一影响所有模型,但叠加因素会逐步提高所有模型的评估意识。我们的框架和EvalAwareBench提供了度量、归因和缓解评估意识的工具,指出在识别下的行为一致性是一条有前景的前进道路。

英文摘要

Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining the validity of benchmark results. Yet the field studies this phenomenon without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.

2605.18740 2026-06-03 cs.CV cs.AI cs.CL cs.LG 版本更新

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD:通过在线策略自蒸馏学习多模态大语言模型的精细细节

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

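Affiliations: Tsinghua University

AI summary: Proposes Vision-OPD, an on-policy self-distillation framework that transfers the model's own region-level perception to its full-image policy, improving the fine-grained visual understanding of multimodal LLMs.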

Comments Project page: https://github.com/VisionOPD/Vision-OPD

详情
AI中文摘要

多模态大语言模型(MLLMs)在细粒度视觉理解方面仍然存在困难,答案往往依赖于全图中微小但决定性的证据。我们观察到一种区域到全局的感知差距:当以证据为中心的裁剪图像为条件时,同一MLLM回答细粒度问题的准确率高于以对应全图为条件,这表明许多失败源于难以聚焦于相关证据,而非局部识别能力不足。受此观察启发,我们提出Vision-OPD(视觉在线策略蒸馏),一种区域到全局的自蒸馏框架,将模型自身特权的区域感知迁移到其全图策略。Vision-OPD从同一MLLM实例化两个条件策略:一个以裁剪图像为条件的教师和一个以全图为条件的学生。学生生成在线策略轨迹,Vision-OPD沿这些轨迹最小化教师和学生下一个词元分布之间的词元级差异。这使得模型能够内化视觉放大的好处,而无需外部教师模型、真实标签、奖励验证器或推理时工具使用。在多个细粒度视觉理解基准上的实验表明,Vision-OPD模型在性能上可与更大的开源、闭源以及“思考图像”智能体模型相媲美或更优。

英文摘要

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models. The code is available at https://github.com/VisionOPD/Vision-OPD.
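The token-level objective described above can be sketched numerically. The abstract does not say which divergence is used, so the reverse KL, KL(student || teacher), chosen here is an assumption, as are the function names and the toy logits; in a real setup both sets of logits would come from the same MLLM conditioned on the full image (student) and on the crop (teacher) at each position of a student-sampled rollout.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rollout_divergence(
    student_logits: List[List[float]], teacher_logits: List[List[float]]
) -> float:
    """Mean per-token KL(student || teacher) along one rollout."""
    per_token = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy rollout of two tokens over a three-token vocabulary (made-up logits).
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 2.5]]
teacher = [[2.5, 0.2, 0.0], [0.3, 0.3, 2.5]]

print(rollout_divergence(student, teacher))  # positive: step 1 differs
print(rollout_divergence(student, student))  # 0.0: identical distributions
```

In training, this quantity would be minimized with respect to the student's parameters while the teacher's outputs are treated as fixed targets; the sketch only evaluates the loss and does not compute gradients.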

2405.07006 2026-06-03 cs.CL 版本更新

Word-specific tonal realizations in Mandarin

普通话中特定词的声调实现

Yu-Ying Chuang, Melanie J. Bell, Yu-Hsiang Tseng, R. Harald Baayen

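Affiliations: National Taiwan Normal University; Anglia Ruskin University; Eberhard Karl University of Tübingen

AI summary: Using corpus analysis and computational modeling, shows that the tonal realization of Mandarin two-character words is partly determined by word meaning, and that word type is a stronger predictor of tonal realization than all previously established word-form-related predictors combined.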

详情
Journal ref Language 102 (2026) 1-45
AI中文摘要

普通话双字词的音高轮廓通常被认为是由组成单字词的底层声调所决定,并与语速、相邻声调协同发音、音段构成和可预测性等因素施加的发音约束相互作用。本研究表明,声调实现也部分由词义决定。我们首先基于台湾普通话自然对话语料库,使用广义加性回归模型,并聚焦于升降调模式,发现在控制说话者和语境效应后,词类型对声调实现的预测力比所有先前建立的词形相关预测因子之和更强。重要的是,添加关于语境中词义的信息进一步提高了预测准确性。然后,我们通过使用上下文特定的词嵌入进行计算建模,表明在保留数据上,标记特定的音高轮廓以50%的准确率预测词类型,而上下文敏感的标记特定嵌入以40%的准确率预测音高轮廓的形状。这些准确率比随机水平高一个数量级,表明词音高轮廓与其词义之间的关系足够强,可能对语言使用者具有功能性。讨论了这些实证发现的理论意义。

英文摘要

The pitch contours of Mandarin two-character words are generally understood as being shaped by the underlying tones of the constituent single-character words, in interaction with articulatory constraints imposed by factors such as speech rate, co-articulation with adjacent tones, segmental make-up, and predictability. This study shows that tonal realization is also partially determined by words' meanings. We first show, on the basis of a corpus of Taiwan Mandarin spontaneous conversations, using a generalized additive regression model, and focusing on the rise-fall tone pattern, that after controlling for effects of speaker and context, word type is a stronger predictor of tonal realization than all the previously established word-form related predictors combined. Importantly, the addition of information about meaning in context improves prediction accuracy even further. We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data, and that context-sensitive, token-specific embeddings can predict the shape of pitch contours with 40% accuracy. These accuracies, which are an order of magnitude above chance level, suggest that the relation between words' pitch contours and their meanings is sufficiently strong to be potentially functional for language users. The theoretical implications of these empirical findings are discussed.

2512.18552 2026-06-03 cs.SE cs.AI cs.CL cs.LG 版本更新

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

通过自我对弈SWE-RL训练超级智能软件代理

Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, Sida Wang

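Affiliations: Meta FAIR; University of Illinois Urbana-Champaign; Meta TBD Lab; Carnegie Mellon University

AI summary: Proposes Self-play SWE-RL (SSR), which trains a single LLM agent with reinforcement learning in a self-play setting to iteratively inject and repair software bugs in real codebases, without human-labeled issues or tests. SSR achieves notable self-improvement on the SWE-bench benchmarks and outperforms the human-data baseline.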

Comments Accepted to ICML 2026

详情
AI中文摘要

尽管当前由大型语言模型(LLM)和智能体强化学习(RL)驱动的软件代理能够提高程序员的生产力,但其训练数据(例如GitHub问题和拉取请求)和环境(例如通过-通过和失败-通过测试)严重依赖人类知识或整理,这构成了通向超级智能的根本障碍。在本文中,我们提出了自我对弈SWE-RL(SSR),这是迈向超级智能软件代理训练范式的第一步。我们的方法仅需最小的数据假设,只需访问带有源代码和已安装依赖项的沙盒化仓库,无需人工标注的问题或测试。基于这些真实世界的代码库,单个LLM代理通过强化学习在自我对弈环境中进行训练,以迭代地注入和修复复杂度逐渐增加的软件缺陷,每个缺陷由测试补丁而非自然语言问题描述正式指定。在SWE-bench Verified和SWE-Bench Pro基准上,SSR实现了显著的自我改进(分别提升+10.4和+7.8分),并在整个训练轨迹中持续优于人类数据基线,尽管其评估的是自我对弈中未出现的自然语言问题。我们的结果虽然尚处于早期阶段,但表明了一条路径,即代理可以从真实软件仓库中自主收集广泛的学习经验,最终实现超越人类能力的超级智能系统,在理解系统构建方式、解决新挑战以及从头开始自主创建新软件方面超越人类。

英文摘要

While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach makes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.

2603.00029 2026-06-03 cs.CL 版本更新

Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models

拥抱各向异性:将大规模激活转化为大型语言模型的可解释控制旋钮

Youngji Roh, Hyunjin Cho, Jaehyung Kim

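Affiliations: Yonsei University

AI summary: Proposes a training-free, magnitude-based method for identifying domain-critical dimensions, which act as interpretable semantic detectors; applying activation steering only to these dimensions outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.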

Comments ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)表现出高度各向异性的内部表示,通常以大规模激活为特征,即一小部分特征维度具有显著大于其余维度的量级。虽然先前的工作主要将这些极端维度视为需要处理的伪影,但我们提出了一个不同的视角:这些维度作为由领域专业化产生的内在可解释功能单元。具体来说,我们提出了一种简单的基于幅度的标准,以无训练方式识别领域关键维度。我们的分析表明,这些维度表现为符号/定量模式或领域特定术语的可解释语义检测器。此外,我们引入了关键维度引导,仅对识别出的维度应用激活引导。实证结果表明,在领域适应和越狱场景中,该方法优于传统的全维度引导。

英文摘要

Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.

2605.15572 2026-06-03 cs.CL 版本更新

Measuring Maximum Activations in Open Large Language Models

测量开放大型语言模型中的最大激活值

Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Linghe Kong, Dawei Yin

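Affiliations: Shanghai Jiao Tong University; Baidu Inc.; Nankai University

AI summary: Using a unified pipeline, measures global and layerwise maximum activations across 27 checkpoints from 8 open LLM families. The maxima vary widely across families, architectures, and training stages and co-vary with low-bit quantization error; the authors recommend reporting this metric with open-weight releases.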

详情
AI中文摘要

激活值的动态范围是低比特量化、激活缩放和稳定LLM推理的一阶约束。先前的工作表征了2024年前LLaMA风格模型上的异常特征和巨大激活值,而下游的激活量化堆栈继承了这一图景,未针对后LLaMA开放模型繁荣进行重新审视。我们提出面向部署的问题:现代开放LLM中激活值能有多大,这种幅度如何随家族、代际和训练阶段变化?在统一流程下(5000样本多领域语料库、家族特定分词、嵌入层、隐藏状态、注意力、MLP/MoE、SwiGLU门和最终归一化层的相同钩子),我们测量了来自8个开放家族的27个检查点(涵盖密集、MoE、视觉语言、中间训练和指令微调变体)的全局和逐层最大值。我们发现:(i) 在可比较的参数数量下,全局最大值跨越近四个数量级,Qwen3.5和MoE检查点在10^2到10^3范围,而Gemma3-27B-it达到约7×10^5;(ii) 跨家族和跨代际比较打破了简单的单调缩放;(iii) MoE检查点的峰值比同等规模的密集对应物低14.0-23.4倍,而残差流在22/24个检查点中承载全局最大值。一个轻量级INT-8完整性检查表明,测量的最大值通过激活尺度选择与低比特重建误差共变。我们得出结论,最大激活幅度是一个与家族、架构和训练阶段相关的模型属性——而不是规模的简单副产品——并且应在低比特部署前与任何开放权重发布一起测量和报告。代码公开于https://github.com/clx1415926/Max_act_llm。

English abstract

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
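
To illustrate why the measured maximum matters for low-bit deployment, here is a small sketch of symmetric per-tensor INT8 quantization, where the scale is set from the largest magnitude. The activation values and function names are hypothetical, and this is the standard absmax scheme rather than the paper's own sanity check.

```python
def quantize_int8(values, max_abs):
    """Symmetric per-tensor INT8: the scale is set by the largest magnitude."""
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return [q * scale for q in quantized]  # dequantized reconstruction


def mean_abs_error(values, max_abs):
    """Average reconstruction error after a quantize/dequantize round trip."""
    recon = quantize_int8(values, max_abs)
    return sum(abs(a - b) for a, b in zip(values, recon)) / len(values)


# Hypothetical typical activations; the outlier that sets max_abs is left out
# of the list so the error reflects only the ordinary values.
typical = [0.3, -1.2, 0.8, 2.5, -0.7]
err_small_peak = mean_abs_error(typical, max_abs=1e2)  # peak around 10^2
err_large_peak = mean_abs_error(typical, max_abs=7e5)  # peak around 7 x 10^5
```

With the larger peak, one quantization step is thousands of units wide, so every typical value rounds to zero and the reconstruction error equals the mean magnitude of the inputs.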

2605.14589 2026-06-03 cs.CL Updated version

EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

EndPrompt: 通过终端锚定实现高效长上下文扩展

Han Tian, Luxuan Chen, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Jinman Zhao, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Linghe Kong, Dawei Yin

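Affiliations Nankai University; Baidu Inc.; Shanghai Jiao Tong University; Independent Researcher

AI summary Proposes EndPrompt, which extends the context window through terminal anchoring using only short training sequences; on LLaMA models it extends the context from 8K to 64K and outperforms full-length fine-tuning and other methods.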

AI Chinese abstract

扩展大型语言模型的上下文窗口通常需要在目标长度序列上进行训练,这会带来二次方内存和计算成本,使得长上下文适应变得昂贵且难以复现。我们提出EndPrompt,一种仅使用短训练序列即可实现有效上下文扩展的方法。核心洞察是,让模型暴露于长程相对位置距离并不需要构建完整长度的输入:我们将原始短上下文保留为完整的第一个片段,并附加一个简短的终端提示作为第二个片段,为其分配接近目标上下文长度的位置索引。这种两段式结构在短物理序列中引入了局部和长程相对距离,同时保持了训练文本的语义连续性——这是将连续上下文分割的基于块的模拟方法所缺乏的特性。我们提供了基于旋转位置编码和伯恩斯坦不等式的理论分析,表明位置插值对注意力函数施加了严格的平滑性约束,共享的Transformer参数进一步抑制了到未观察中间距离的不稳定外推。应用于将LLaMA系列模型的上下文窗口从8K扩展到64K,EndPrompt实现了平均RULER分数76.03,并在LongBench上获得最高平均分,超越了LCEG(72.24)、LongLoRA(72.95)和全长度微调(69.23),同时所需计算量大幅减少。这些结果表明,长上下文泛化可以从稀疏位置监督中诱导出来,挑战了密集长序列训练对于可靠上下文窗口扩展是必要的普遍假设。代码可在https://github.com/clx1415926/EndPrompt获取。

English abstract

Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
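
The two-segment construction can be sketched as a position-index assignment. This is an illustration under stated assumptions: the terminal prompt is placed at the very end of the target window, the lengths are toy values, and the function names are hypothetical; the abstract only says the prompt receives indices "near the target context length".

```python
def endprompt_position_ids(context_len, prompt_len, target_len):
    """Positions for a short context followed by a terminal prompt that is
    placed at the end of the target window (assumed placement)."""
    first = list(range(context_len))
    second = list(range(target_len - prompt_len, target_len))
    return first + second


def relative_distances(position_ids):
    """Set of query-key distances seen under causal attention."""
    return {
        q - k
        for i, q in enumerate(position_ids)
        for k in position_ids[: i + 1]
    }


# Toy sizes: 6 context tokens, 2 prompt tokens, target window of 64 positions.
ids = endprompt_position_ids(context_len=6, prompt_len=2, target_len=64)
dists = relative_distances(ids)
```

The physical sequence has only 8 tokens, yet the model sees local distances (0 to 5) and long-range ones (57 to 63). Distances in between are never observed, which is the gap the abstract's smoothness argument is meant to cover.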

2604.22891 2026-06-03 cs.LG cs.AI cs.CL Updated version

Quantifying and Mitigating Self-Preference Bias of LLM Judges

量化与缓解LLM评判者的自我偏好偏差

Jinming Yang, Zheng Hu, Chuxian Qiu, Zhenyu Deng, Xinshan Jiao, Tao Zhou

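Affiliations CompleX Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

AI summary Proposes an automated framework for quantifying the self-preference bias of LLM judges, and a multi-dimensional evaluation strategy based on cognitive load decomposition that reduces the bias by 31.5% on average.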

AI Chinese abstract

LLM-as-a-Judge已成为自动评估系统中的主导方法,在模型对齐、排行榜构建、质量控制等方面发挥关键作用。然而,该方法的可扩展性和可信度可能因自我偏好偏差(SPB)而严重失真,SPB是一种定向评估偏差,即LLM在评估时系统性地偏好或排斥自身生成的输出。现有测量方法依赖昂贵的人工标注,并将生成能力与评估立场混为一谈,因此不适用于实际系统中的大规模部署。为解决此问题,我们引入了一个完全自动化的框架来量化和缓解SPB,该框架构建质量差异可忽略的等质量回答对,从而在无需人工黄金标准的情况下,从偏差倾向中统计分离出可区分性。对20个主流LLM的实证分析表明,先进能力通常与低SPB不相关,甚至负相关。为缓解此偏差,我们提出了一种基于认知负荷分解的结构化多维评估策略,平均降低SPB 31.5%。

English abstract

LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework for quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5% on average.

2605.10328 2026-06-03 cs.CL Updated version

ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models

ANCHOR: 基于层次编排的溯因网络构建用于大型语言模型中可靠概率推断

Wentao Qiu, Guanran Luo, Zhongquan Jian, Jingqi Gao, Meihong Wang, Qingqiang Wu

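AI summary Addresses two problems in LLM probability estimation, sparse factor spaces that yield "unknown" predictions and added noise as factors grow, by proposing the ANCHOR framework, which builds a hierarchical factor space, performs hierarchical retrieval and refinement, and augments Naïve Bayes with a causal Bayesian network, markedly reducing "unknown" predictions and improving the reliability of probability estimates.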

Comments Accepted by ICML 2026

AI Chinese abstract

在不完全信息下的大规模决策中,一个核心挑战是估计可靠的概率。最近的方法使用大型语言模型(LLM)生成解释因子和粗粒度概率估计,然后通过朴素贝叶斯模型在因子组合上进行精炼。然而,稀疏的因子空间常常产生“未知”预测,而扩展因子会增加噪声和虚假相关性,削弱条件独立性并降低可靠性。为了解决这些局限性,我们提出了ANCHOR,一个在层次因子空间上的聚合贝叶斯推断框架。它通过迭代生成和聚类构建密集的因子层次结构,通过层次检索和精炼映射上下文,并通过因果贝叶斯网络增强朴素贝叶斯以建模潜在因子依赖关系。实验表明,ANCHOR显著减少了“未知”预测,并产生了比直接LLM基线更可靠的概率估计,实现了最先进的性能,同时大幅减少了时间和令牌开销。

English abstract

A central challenge in large-scale decision-making under incomplete information is estimating reliable probabilities. Recent approaches use Large Language Models (LLMs) to generate explanatory factors and coarse-grained probability estimates, which are then refined by a Naïve Bayes model over factor combinations. However, sparse factor spaces often yield "unknown" predictions, while expanding factors increases noise and spurious correlations, weakening conditional independence and degrading reliability. To address these limitations, we propose ANCHOR, an aggregated Bayesian inference framework over a hierarchical factor space. It constructs dense factor hierarchies through iterative generation and clustering, maps contexts via hierarchical retrieval and refinement, and augments Naïve Bayes with a Causal Bayesian Network to model latent factor dependencies. Experiments show that ANCHOR markedly reduces "unknown" predictions and produces more reliable probability estimates than direct LLM baselines, achieving state-of-the-art performance while significantly reducing time and token overhead.

2602.22480 2026-06-03 cs.AI cs.CL cs.LG Updated version

VeRO: A Harness for Agents to Optimize Agents

VeRO: 用于优化智能体的智能体框架

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Samuel Marc Denton

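Affiliations arXiv

AI summary Proposes the VeRO harness and the VeRO-Bench benchmark, which support optimizing agent code through versioned snapshots, budget-controlled evaluation, and structured execution traces, and empirically compares how well different optimizers improve target agents.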

Comments Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026

AI Chinese abstract

编码智能体的一个重要新兴应用是智能体框架优化:通过编辑和评估目标智能体的代码来迭代改进它。尽管具有相关性,但社区对编码智能体在此任务上的表现缺乏系统理解。框架优化与传统软件工程不同:智能体框架将确定性代码与随机 LLM 完成交错,需要结构化捕获中间执行轨迹和下游结果。为了解决这些挑战,我们引入了 (1) VeRO(版本化、奖励和观察),一个外部框架,提供目标框架的版本化快照、预算控制评估和结构化执行轨迹,以及 (2) VeRO-Bench,一个包含参考评估程序的目标智能体和任务的基准套件。使用 VeRO,我们进行了一项实证研究,比较了不同任务上的优化器,并分析了哪些修改能可靠地改进目标智能体框架。我们发布 VeRO 以支持作为编码智能体核心能力的智能体优化研究。代码可在 https://github.com/scaleapi/vero 获取。

English abstract

An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing and evaluating its code. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Harness optimization differs from conventional software engineering: agent harnesses interleave deterministic code with stochastic LLM completions, requiring structured capture of both intermediate execution traces and downstream outcomes. To address these challenges, we introduce (1) VeRO (Versioning, Rewards, and Observations), an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces of target harnesses, and (2) VeRO-Bench, a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizers across tasks and analyzing which modifications reliably improve target agent harnesses. We release VeRO to support research on agent optimization as a core capability for coding agents. Code is available at https://github.com/scaleapi/vero.

2605.05629 2026-06-03 stat.ML cs.CL cs.LG Updated version

Spherical Flows for Sampling Categorical Data

用于分类数据采样的球面流

Jannis Chemseddine, Gregor Kornhardt, Gabriele Steidl

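Affiliations Technische Universität Berlin

AI summary Proposes generative modeling of discrete sequences on the sphere using the von Mises-Fisher distribution; radial symmetry reduces the continuity equation to a scalar ODE, and posterior-weighted tangent sums combined with predictor-corrector sampling enable efficient sampling.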

AI Chinese abstract

我们研究了在连续嵌入空间中学习离散序列生成模型的问题。以往的方法通常在欧几里得空间或概率单纯形上操作,而我们则在球面$\mathbb S^{d-1}$上工作。在那里,von Mises-Fisher (vMF)分布诱导了一个自然的噪声过程,并允许闭式条件得分。条件速度通常是难以处理的。利用vMF密度的径向对称性,我们将$\mathbb S^{d-1}$上的连续性方程简化为关于余弦相似度的标量ODE,其唯一有界解决定了速度。$\mathbb S^{d-1}$上的边际速度和边际得分都分解为后验加权的切线和,仅因每个token的标量权重不同。这提供了ODE和预测-校正(PC)采样两种途径。后验是唯一需要学习的对象,通过交叉熵损失训练。实验将vMF路径与测地线和欧几里得替代方案进行了比较。vMF与PC采样的结合显著改善了数独和语言建模的结果。

English abstract

We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
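
As a small illustration of why a closed-form score is available: the vMF log-density is kappa * <mu, x> plus a constant, so its gradient along the sphere is kappa * mu projected onto the tangent space at x. The sketch below computes only that quantity for a fixed mean direction. How the paper parameterizes its noise path over time is not stated in the abstract, and the function name is hypothetical.

```python
def vmf_score(x, mu, kappa):
    """Riemannian gradient of log vMF(x; mu, kappa) on the unit sphere:
    kappa * (mu - <mu, x> x), i.e. kappa * mu projected onto the tangent at x."""
    dot = sum(m * xi for m, xi in zip(mu, x))
    return [kappa * (m - dot * xi) for m, xi in zip(mu, x)]


# Unit vectors in R^3: the current point x and the mean direction mu.
x = [1.0, 0.0, 0.0]
mu = [0.0, 1.0, 0.0]
score = vmf_score(x, mu, kappa=5.0)
tangency = sum(s * xi for s, xi in zip(score, x))  # zero for a tangent vector
```

The score depends on x only through the tangent direction toward mu and the cosine similarity <mu, x>, which is consistent with the abstract's reduction to a scalar equation in the cosine similarity.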

2605.01386 2026-06-03 cs.CL Updated version

MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents

MemORAI: 通过自适应图智能实现LLM对话代理的记忆组织与检索

Hung Pham Van, Nguyen Manh Hieu, Khang Pham Tran Tuan, Nam Le Hai, Linh Ngo Van, Nguyen Thi Ngoc Diep, Trung Le

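Affiliations Independent Researcher; Hanoi University of Science and Technology; VNU University of Engineering and Technology; Monash University

AI summary Proposes MemORAI, which combines selective memory filtering, a provenance-enriched multi-relational graph store, and query-adaptive subgraph retrieval to address missing memory, information dilution, and imprecise retrieval in long-term LLM conversations, achieving state-of-the-art performance on the LOCOMO and LongMemEval benchmarks.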

Comments ACL Findings

AI Chinese abstract

大型语言模型(LLM)缺乏用于长期个性化对话的持久记忆。现有的基于图的记忆系统存在信息稀释、缺乏来源追踪以及忽略查询上下文的统一检索问题。我们引入MemORAI(通过自适应图智能实现记忆组织与检索),该框架整合了三个创新:具有双层压缩的选择性记忆过滤以保留用户个性相关内容;富来源追踪的多关系图,在对话轮次级别跟踪事实来源;以及查询自适应子图检索,采用动态加权PageRank应用查询条件化的边权重。在LOCOMO和LongMemEval基准上评估,MemORAI在记忆检索和个性化响应生成方面达到了最先进的性能,表明选择性存储、富表示和自适应检索对于连贯、个性化的LLM代理至关重要。

English abstract

Large Language Models (LLMs) lack persistent memory for long-term personalized conversations. Existing graph-based memory systems suffer from information dilution, absent provenance tracking, and uniform retrieval that ignores query context. We introduce MemORAI (Memory Organization and Retrieval via Adaptive Graph Intelligence), a framework that integrates three innovations: selective memory filtering with dual-layer compression to retain user-persona-relevant content, a provenance-enriched multi-relational graph tracking factual origins at the turn level, and query-adaptive subgraph retrieval with Dynamic Weighted PageRank that applies query-conditioned edge weighting. Evaluated on LOCOMO and LongMemEval benchmarks, MemORAI achieves state-of-the-art performance in memory retrieval and personalized response generation, demonstrating that selective storage, enriched representation, and adaptive retrieval are essential for coherent, personalized LLM agents.
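
The retrieval step can be pictured with ordinary personalized PageRank on a weighted graph, where the edge weights stand in for query-conditioned relevance. This is a generic sketch and not the paper's Dynamic Weighted PageRank, whose details the abstract does not give; the graph, the weights, and the function name are hypothetical.

```python
def weighted_pagerank(edges, personalization, damping=0.85, iters=100):
    """Power iteration for personalized PageRank on a weighted directed graph.

    edges: {source: {target: weight}}; personalization: {node: restart mass}.
    """
    nodes = set(personalization) | set(edges)
    for targets in edges.values():
        nodes |= set(targets)
    total = sum(personalization.values())
    restart = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            out = edges.get(src, {})
            out_sum = sum(out.values())
            if out_sum == 0:
                # Dangling node: return its mass to the restart distribution.
                for n in nodes:
                    nxt[n] += damping * rank[src] * restart[n]
            else:
                for dst, weight in out.items():
                    nxt[dst] += damping * rank[src] * weight / out_sum
        rank = nxt
    return rank


# Hypothetical memory graph; weights mimic relevance to the current query.
edges = {
    "query": {"likes_hiking": 0.9, "owns_cat": 0.1},
    "likes_hiking": {"trip_to_alps": 1.0},
    "owns_cat": {"vet_visit": 1.0},
}
scores = weighted_pagerank(edges, personalization={"query": 1.0})
```

Changing the weights on the edges out of `"query"` changes which memories rank highest, which is the sense in which retrieval adapts to the query.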

2605.01374 2026-06-03 cs.CL Updated version

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation

MTA:面向大型语言模型蒸馏的多粒度轨迹对齐

Pham Khanh Chi, Quoc Phong Dao, Thuat Nguyen, Linh Ngo Van, Trung Le, Thanh Hong Nguyen

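Affiliations Hanoi University of Science and Technology; Monash University; University of Oregon

AI summary Proposes Multi-Granular Trajectory Alignment (MTA), which aligns the layer-wise transformation trajectories of teacher and student models with a layer-adaptive strategy and combines a Dynamic Structural Alignment loss with a Hidden Representation Alignment loss to improve knowledge distillation.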

Comments ACL 2026

AI Chinese abstract

知识蒸馏是压缩大型语言模型(LLMs)的关键技术,但现有方法大多在固定层或token级输出上对齐表示,忽略了表示随深度演化的过程。因此,学生在蒸馏过程中仅能弱化地捕捉教师的内部关系结构,限制了知识迁移。为解决这一局限,我们提出多粒度轨迹对齐(MTA)框架,该框架沿层间变换轨迹对齐师生表示。MTA采用层自适应策略:低层在词级别对齐以保留词汇信息,而高层在短语级跨度(如名词短语和动词短语)上操作以捕获组合语义。我们通过动态结构对齐损失实例化这一思想,该损失匹配每层内语义单元之间的相对几何关系。该设计基于Transformer表示随深度逐渐抽象的实验发现,也与语言学观点一致,即高层意义通过低层词汇单元的组合涌现。我们进一步引入隐藏表示对齐损失以直接对齐选定的师生层。实验表明,MTA在标准基准上持续优于最先进的基线,消融实验证实了每个组件的贡献。

English abstract

Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or token-level outputs, ignoring how representations evolve across depth. As a result, the student is only weakly guided to capture the teacher's internal relational structure during distillation, which limits knowledge transfer. To address this limitation, we propose Multi-Granular Trajectory Alignment (MTA), a framework that aligns teacher and student representations along their layer-wise transformation trajectory. MTA adopts a layer-adaptive strategy: lower layers are aligned at the word level to preserve lexical information, while higher layers operate on phrase-level spans (e.g., noun and verb phrases) to capture compositional semantics. We instantiate this idea through a Dynamic Structural Alignment loss that matches the relative geometry among semantic units within each layer. This design is motivated by empirical findings that Transformer representations become increasingly abstract with depth, and is also consistent with linguistic views in which higher-level meaning emerges through the composition of lower-level lexical units. We further incorporate a Hidden Representation Alignment loss to directly align selected teacher-student layers. Experiments show that MTA consistently outperforms state-of-the-art baselines on standard benchmarks, with ablations confirming the contribution of each component.

2605.01205 2026-06-03 cs.CL Updated version

SRA: Span Representation Alignment for Large Language Model Distillation

SRA: 面向大型语言模型蒸馏的跨度表示对齐

Quoc Phong Dao, Hoang Son Nguyen, Pham Khanh Chi, Tung Nguyen, Linh Ngo Van, Nguyen Thi Ngoc Diep, Trung Le

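Affiliations Hanoi University of Science and Technology; VNU University of Engineering and Technology; Monash University

AI summary Proposes SRA, which moves the unit of distillation alignment from tokens to center-of-mass representations of spans and adds geometric regularization and aligned span logit distillation, substantially improving cross-tokenizer knowledge distillation.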

Comments ACL 2026

AI Chinese abstract

跨分词器知识蒸馏(CTKD)使得大型语言模型与较小的学生模型之间能够进行知识迁移,即使它们使用不同的分词器。现有方法主要关注token级别的对齐策略,这些策略往往脆弱且对分词器之间的差异敏感。我们认为,在蒸馏之前将token聚合成更稳健的表示同样重要。本文提出SRA(面向大型语言模型蒸馏的跨度表示对齐),这是一个通过多粒子动力系统的物理视角重新定义CTKD的新框架。SRA将对齐的基本单元从token转移到稳健、与分词器无关的跨度。我们将每个跨度建模为一个粒子簇,并通过其质心(CoM)——一种捕捉丰富语义信息的注意力加权平均值——来表示其状态。我们利用跨度质心的概念,结合注意力导出的权重来优先考虑最显著的跨度。此外,我们采用几何正则化器来保持表示空间的结构完整性,并引入对齐跨度logit蒸馏以增强跨模型的知识迁移。在具有挑战性的跨架构蒸馏实验中,SRA始终显著优于最先进的CTKD基线,验证了我们基于物理的方法的有效性。

English abstract

Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce SRA (Span Representation Alignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM), an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.
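
The Center of Mass described above is an attention-weighted average of the token vectors in a span. A minimal sketch with toy numbers follows; the vectors, the weights, and the function name are hypothetical, and in practice teacher and student hidden sizes differ, so some projection (not shown, and not specified in the abstract) would be needed before comparing them.

```python
def span_center_of_mass(token_vectors, attention_weights):
    """Attention-weighted average of the token vectors in one span."""
    total = sum(attention_weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for vec, w in zip(token_vectors, attention_weights)) / total
        for d in range(dim)
    ]


# The same text span split differently by two hypothetical tokenizers.
teacher_span = [[1.0, 0.0], [0.0, 1.0]]              # 2 tokens
student_span = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # 3 tokens
teacher_com = span_center_of_mass(teacher_span, [0.5, 0.5])
student_com = span_center_of_mass(student_span, [0.25, 0.5, 0.25])
```

Both spans reduce to a single vector regardless of how many tokens each tokenizer produced, which is what makes the span a tokenizer-agnostic unit of alignment.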

2604.27232 2026-06-03 cs.CL Updated version

Targeted Linguistic Analysis of Sign Language Models with Minimal Translation Pairs

基于最小翻译对的手语语言模型目标语言分析

Serpil Karabüklü, Kanishka Misra, Shester Gueuwou, Diane Brentari, Greg Shakhnarovich, Karen Livescu

Affiliations Toyota Technological Institute at Chicago; Linguistics Department, The University of Texas at Austin; Linguistics Department, The University of Chicago

AI summary For sign language translation models, builds the ASL Minimal Translation Pairs dataset (ASL-MTP) and ablates input cues to analyze how well a model captures different sign language phenomena, especially non-manual cues.

Comments Accepted to the CVPR 2026 Workshop GenSign: Generative AI for Sign Language

AI Chinese abstract

手语模型历来落后于口语(文本和语音)模型。最近的工作极大地提高了它们在手语翻译和孤立手语识别等任务上的性能。然而,现有模型在多大程度上捕捉了手语的各种语言现象,以及它们如何利用手语中多个发音器官(手、上半身、面部)的线索,仍不清楚。我们引入了一个新的美国手语基准数据集,ASL最小翻译对(ASL-MTP),分为多种手语现象类型和相应的最小翻译对,用于进行此类语言分析。作为案例研究,我们使用ASL-MTP分析了一个最先进的ASL到英语翻译模型。我们通过在训练和推理期间消融各种输入线索,并在ASL-MTP的现象上进行评估,对模型进行了目标分析。我们的结果表明,虽然模型在大多数现象上表现高于随机水平,但它强烈依赖手动线索,而常常错过关键的非手动线索。

English abstract

Models of sign language have historically lagged behind those for spoken language (text and speech). Recent work has greatly improved their performance on tasks like sign language translation and isolated sign recognition. However, it remains unclear to what extent existing models capture various linguistic phenomena of sign language, and how well they use cues from the multiple articulators used in sign language (hands, upper body, face). We introduce a new benchmark dataset for American Sign Language, ASL Minimal Translation Pairs (ASL-MTP), divided into multiple types of sign language phenomena and corresponding minimal pairs of translations, for performing such linguistic analyses. As a case study, we use ASL-MTP to analyze a state-of-the-art ASL-to-English translation model. We conduct a targeted analysis of the model by ablating various input cues during training and inference and evaluating on the phenomena in ASL-MTP. Our results show that, while the model performs above chance level on most of the phenomena, it relies strongly on manual cues while often missing crucial non-manual cues.

2604.25928 2026-06-03 cs.CL Updated version

CogRAG: Tackling Heterogeneous Cognitive Demands in RAG via Stratified Retrieval and Reasoning

CogRAG:通过分层检索与推理应对RAG中的异构认知需求

Xudong Wang, Zilong Wang, Kui Su, Zhaoyan Ming

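Affiliations School of Computer and Computing Science, Hangzhou City University; Innovation Center of Yangtze River Delta, Zhejiang University; Institute of Digital Twin, Eastern Institute of Technology

AI summary Proposes CogRAG, which predicts the cognitive load of a query following Bloom's Taxonomy and uses it to coordinate Cognition-Adaptive Evidence Refinement and Cognition-Stratified Structured Reasoning, addressing the heterogeneous cognitive demands of different RAG tasks and raising Qwen3-8B accuracy on the Registered Dietitian qualification examination to 85.8%.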

AI Chinese abstract

检索增强生成(RAG)框架通常通过一刀切的流水线处理所有查询,忽略了不同任务的异构认知需求。这种认知盲区方法导致两种失败模式:当低层级事实缺口引发幻觉推理时的级联错误,以及高阶分析任务中的推理-答案不一致。我们提出CogRAG,一个无需训练、领域无关的框架,通过分层检索与推理应对这些异构认知需求。受布鲁姆分类法启发,CogRAG使用查询的预测认知负荷作为中央控制信号,协调两个模块:认知自适应证据精炼通过以事实为中心或以选项为中心的路径补充缺失上下文,以及认知分层结构化推理用认知对齐的推理模板替代无约束的思维链。我们在一个高要求的专业测试平台——注册营养师资格考试上评估CogRAG。CogRAG有效减少了早期阶段的事实错误并消除了推理-答案不一致,在单选题模式下将Qwen3-8B准确率从73.4%提升至85.8%,在场景模式下从63.3%提升至80.5%。这些结果突显了认知分层控制作为大型语言模型中可靠复杂推理的一种有效且可泛化的范式。

English abstract

Retrieval-Augmented Generation (RAG) frameworks typically process all queries through a one-size-fits-all pipeline, ignoring the heterogeneous cognitive demands of different tasks. This cognitive-blind approach causes two failure modes: cascading errors when low-level factual gaps trigger hallucinated reasoning, and reasoning-answer inconsistency in higher-order analytical tasks. We introduce CogRAG, a training-free, domain-agnostic framework that tackles these heterogeneous cognitive demands via stratified retrieval and reasoning. Inspired by Bloom's Taxonomy, CogRAG uses the predicted cognitive load of a query as a central control signal that coordinates two modules: Cognition-Adaptive Evidence Refinement supplements missing context via fact-centric or option-centric paths, and Cognition-Stratified Structured Reasoning replaces unconstrained chain-of-thought with cognition-aligned reasoning templates. We evaluate CogRAG on a demanding professional testbed, the Registered Dietitian qualification examination. CogRAG effectively reduces early-stage factual errors and eliminates reasoning-answer inconsistency, raising Qwen3-8B accuracy from 73.4% to 85.8% in single-choice mode and from 63.3% to 80.5% in scenario mode. These results highlight cognitive-stratified control as an effective, generalizable paradigm for reliable complex reasoning in large language models.

2604.24374 2026-06-03 cs.CL Updated version

MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

MIPIC: 通过自蒸馏内部关系与渐进信息链的套娃表示学习

Phung Gia Huy, Hai An Vu, Minh-Phuc Truong, Thang Duc Tran, Linh Ngo Van, Thanh Hong Nguyen, Trung Le

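Affiliations Hanoi University of Science and Technology; University of Oregon; Monash University

AI summary Proposes MIPIC, which uses Self-Distilled Intra-Relational Alignment and Progressive Information Chaining to give nested embeddings cross-dimensional structural consistency and depth-wise semantic consolidation, with marked gains at low dimensionality.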

Comments ACL Findings

AI Chinese abstract

表示学习是NLP的基础,但构建在不同计算预算下均表现良好的嵌入具有挑战性。套娃表示学习(MRL)通过嵌套嵌入提供了灵活的推理范式;然而,学习此类结构需要明确协调信息在嵌入维度和模型深度上的排列方式。本文提出MIPIC(通过自蒸馏内部关系对齐与渐进信息链的套娃表示学习),一个统一的训练框架,旨在生成结构连贯且语义紧凑的套娃表示。MIPIC通过自蒸馏内部关系对齐(SIA)促进跨维度结构一致性,该对齐利用top-k CKA自蒸馏,对齐完整表示与截断表示之间的token级几何和注意力驱动关系。互补地,它通过渐进信息链(PIC)实现深度语义整合,这是一种支架式对齐策略,逐步将成熟的深层任务语义迁移到浅层。在STS、NLI和分类基准(涵盖从TinyBERT到BGEM3、Qwen3的模型)上的大量实验表明,MIPIC生成的套娃表示在所有容量下均具有高度竞争力,并在极端低维下展现出显著的性能优势。

English abstract

Representation learning is fundamental to NLP, but building embeddings that work well at different computational budgets is challenging. Matryoshka Representation Learning (MRL) offers a flexible inference paradigm through nested embeddings; however, learning such structures requires explicit coordination of how information is arranged across embedding dimensionality and model depth. In this work, we propose MIPIC (Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining), a unified training framework designed to produce structurally coherent and semantically compact Matryoshka representations. MIPIC promotes cross-dimensional structural consistency through Self-Distilled Intra-Relational Alignment (SIA), which aligns token-level geometric and attention-driven relations between full and truncated representations using top-k CKA self-distillation. Complementarily, it enables depth-wise semantic consolidation via Progressive Information Chaining (PIC), a scaffolded alignment strategy that incrementally transfers mature task semantics from deeper layers into earlier layers. Extensive experiments on STS, NLI, and classification benchmarks (spanning models from TinyBERT to BGEM3, Qwen3) demonstrate that MIPIC yields Matryoshka representations that are highly competitive across all capacities, with significant performance advantages observed under extreme low-dimensional settings.
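
For readers unfamiliar with nested embeddings, the inference-time use of a Matryoshka representation can be sketched as follows: keep a prefix of the vector, renormalize, and compare. The embeddings below are made-up numbers and the function names are hypothetical; the sketch shows only the truncation step, not MIPIC's training objectives.

```python
import math


def truncate_and_normalize(embedding, k):
    """Keep the first k dimensions and rescale to unit length."""
    prefix = embedding[:k]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-d embeddings whose leading dimensions carry most of the signal.
doc_a = [0.9, 0.4, 0.1, 0.05]
doc_b = [0.8, 0.5, -0.1, 0.02]
sim_full = cosine(doc_a, doc_b)
sim_2d = cosine(truncate_and_normalize(doc_a, 2), truncate_and_normalize(doc_b, 2))
```

A well-trained Matryoshka model aims for `sim_2d` to preserve the ranking induced by `sim_full`, so a smaller prefix can be used when the compute budget is tight.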

2508.06165 2026-06-03 cs.CL cs.AI Updated version

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

UR$^2$:通过强化学习统一检索增强生成与推理

Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu

Affiliations Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China; School of Management Science & Information Engineering, Hebei University of Economics and Business, Hebei, China

AI summary Proposes UR$^2$, a reinforcement learning framework that dynamically coordinates retrieval and reasoning with a difficulty-aware curriculum and a hybrid knowledge access strategy; it outperforms existing baselines on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks and approaches the performance of GPT-4o-mini and GPT-4.1-mini.

AI Chinese abstract

大型语言模型(LLM)通过两种互补范式展现了强大能力:用于知识基础的检索增强生成(RAG)和用于复杂推理的可验证奖励强化学习(RLVR)。然而,现有统一这些范式的尝试范围狭窄,通常局限于具有固定检索设置的开放域问答,限制了向更广泛领域的泛化。为解决这一局限,我们提出UR$^2$(统一RAG与推理),一个通用的强化学习框架,动态协调检索与推理。UR$^2$引入了两个关键设计:一个难度感知课程,仅对困难实例选择性调用检索;以及一个混合知识访问策略,结合领域特定的离线语料库和即时生成的LLM摘要。这些组件共同缓解了检索与推理之间的不平衡,并提高了对噪声信息的鲁棒性。在开放域问答、MMLU-Pro、医学和数学推理任务上的实验表明,基于Qwen-2.5-3/7B和LLaMA-3.1-8B构建的UR$^2$持续优于现有RAG和RL基线,并在多个基准上达到与GPT-4o-mini和GPT-4.1-mini相当的性能。我们的代码可在https://github.com/Tsinghua-dhy/UR2获取。

English abstract

Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address this limitation, we propose UR$^2$ (Unified RAG and Reasoning), a general reinforcement learning framework that dynamically coordinates retrieval and reasoning. UR$^2$ introduces two key designs: a difficulty-aware curriculum that selectively invokes retrieval only for challenging instances, and a hybrid knowledge access strategy that combines domain-specific offline corpora with on-the-fly LLM-generated summaries. Together, these components mitigate the imbalance between retrieval and reasoning and improve robustness to noisy information. Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR$^2$, built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks. Our code is available at https://github.com/Tsinghua-dhy/UR2.

2604.20183 2026-06-03 cs.CL Updated version

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

双簇记忆智能体:解决优化问题求解中的多范式歧义

Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang, Bifan Wei, Jun Liu

Affiliations School of Computer Science and Technology, Xi’an Jiaotong University; Ministry of Education Key Laboratory of Intelligent Networks and Network Security, China; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, China

AI summary Proposes the Dual-Cluster Memory Agent (DCM-Agent), which builds a dual-cluster memory from historical solutions without training and uses memory-augmented inference to navigate solution paths dynamically, improving average performance by 11% to 21% on seven optimization benchmarks.

AI Chinese abstract

大型语言模型(LLMs)在优化问题中常面临结构歧义,即单个问题存在多个相关但冲突的建模范式,阻碍了有效解的生成。为解决此问题,我们提出双簇记忆智能体(DCM-Agent),以无训练方式利用历史解决方案提升性能。其核心是双簇记忆构建:该智能体将历史解决方案分配到建模和编码两个簇中,然后将每个簇的内容提炼为三种结构化类型:方法、检查表和陷阱。该过程推导出可泛化的指导知识。此外,该智能体引入记忆增强推理,以动态导航解路径、检测并修复错误,并利用结构化知识自适应切换推理路径。在七个优化基准上的实验表明,DCM-Agent平均性能提升11%-21%。值得注意的是,我们的分析揭示了“知识继承”现象:由更大模型构建的记忆可以引导较小模型获得更优性能,凸显了该框架的可扩展性和效率。

English abstract

Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. Experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11% to 21%. Notably, our analysis reveals a "knowledge inheritance" phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.

2604.19005 2026-06-03 cs.CL Updated version

Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection

辩论未尽之言:基于角色锚定的多智能体推理用于半真半假检测

Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K. H. Tung

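Affiliations National University of Singapore; Shanghai Jiao Tong University

AI summary Proposes RADAR, which uses role-anchored multi-agent debate (a Politician and a Scientist reasoning adversarially, with a neutral Judge ruling) and dual-threshold early termination to detect half-truths, claims that mislead by omitting context, under noisy retrieval.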

AI Chinese abstract

半真半假,即事实正确但因省略上下文而具有误导性的主张,对于专注于显式虚假的事实核查系统而言仍然是一个盲点。解决这种基于省略的操纵需要不仅推理所说内容,还要推理未说内容。我们提出RADAR,一个角色锚定的多智能体辩论框架,用于在现实噪声检索下进行省略感知的事实核查。RADAR为政治家和科学家分配互补角色,他们在共享检索证据上进行对抗性推理,并由中立法官主持。双阈值早停控制器自适应地决定何时达到足够推理以做出裁决。实验表明,RADAR在数据集和骨干网络上始终优于强单智能体和多智能体基线,提高了遗漏检测准确性,同时降低了推理成本。这些结果表明,具有自适应控制的角色锚定、基于检索的辩论是揭示事实核查中缺失上下文的有效且可扩展的框架。

English abstract

Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification.
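
One plausible reading of a dual-threshold early termination controller is sketched below: the debate stops as soon as the Judge's confidence leaves an undecided band. The thresholds, the confidence signal, the fallback rule, and the function name are all assumptions for illustration; the abstract does not describe how the paper's controller is actually defined.

```python
def run_debate(judge_confidences, upper=0.8, lower=0.2, max_rounds=5):
    """Stop as soon as the judge's confidence leaves the undecided band.

    judge_confidences: per-round confidence that the claim is a half-truth.
    Returns (verdict, rounds_used).
    """
    confidence = 0.5
    rounds_used = 0
    for confidence in judge_confidences[:max_rounds]:
        rounds_used += 1
        if confidence >= upper:
            return "half-true", rounds_used
        if confidence <= lower:
            return "not half-true", rounds_used
    # Budget exhausted: fall back to the nearer side of 0.5.
    verdict = "half-true" if confidence >= 0.5 else "not half-true"
    return verdict, rounds_used


early_accept = run_debate([0.55, 0.85, 0.90])
early_reject = run_debate([0.50, 0.10])
budget_hit = run_debate([0.60, 0.60, 0.60], max_rounds=2)
```

In the first call the third round is never run, which is where the saving in reasoning cost would come from under this reading.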

2604.18995 2026-06-03 cs.CL cs.AI cs.LG 版本更新

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

$R^2$-dLLM: 通过时空冗余减少加速扩散大语言模型

Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

AI总结 提出 $R^2$-dLLM 框架,通过推理和训练两阶段减少扩散大语言模型解码中的空间和时间冗余,实现高达 88% 的解码步数减少并保持生成质量。

详情
AI中文摘要

扩散大语言模型(dLLMs)通过并行令牌预测成为自回归生成的有前途的替代方案。然而,实际的 dLLM 解码仍然遭受高推理延迟,限制了部署。在这项工作中,我们观察到这种低效率的很大一部分来自解码过程中反复出现的冗余,包括由置信度聚类和位置模糊性引起的空间冗余,以及由重复重新掩蔽已经稳定的预测引起的时间冗余。受这些模式的启发,我们提出了 $R^{2}$-dLLM,一个从推理和训练两个角度减少解码冗余的统一框架。在推理时,我们引入了无需训练的解码规则,聚合局部置信度和令牌预测,并最终确定时间稳定的令牌以避免冗余解码步骤。我们进一步提出了一个冗余感知的监督微调流程,使模型与高效解码轨迹对齐,并减少对手动调整阈值的依赖。实验表明,与现有解码策略相比,$R^{2}$-dLLM 一致地将解码步数减少高达 88%,同时在不同模型和任务上保持有竞争力的生成质量。这些结果验证了解码冗余是 dLLMs 的一个核心瓶颈,明确减少它能够带来显著的实用效率提升。我们的代码和模型可在 https://github.com/GATECH-EIC/R2-dLLM 获取。

英文摘要

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^{2}$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^{2}$-dLLM consistently reduces the number of decoding steps by up to 88% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains. Our code and models are available at https://github.com/GATECH-EIC/R2-dLLM.

2604.08782 2026-06-03 cs.CL 版本更新

MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

MT-OSC:解决大语言模型在多轮对话中迷失的路径

Jyotika Singh, Fang Tu, Miguel Ballesteros, Weiyi Sun, Sandip Ghoshal, Michelle Yuan, Yassine Benajiba, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结 提出MT-OSC框架,通过后台自动压缩对话历史(Condenser Agent)减少token量,提升多轮对话性能,在13个LLM上验证有效性。

详情
AI中文摘要

大型语言模型(LLMs)在用户指令和上下文分布在多个对话轮次中时,性能会显著下降,然而多轮(MT)交互主导着聊天界面。将完整聊天历史附加到提示中的常规方法会迅速耗尽上下文窗口,导致延迟增加、计算成本升高,并且随着对话延长收益递减。我们引入了MT-OSC,一种一次性顺序压缩框架,可以在不干扰用户体验的情况下,高效自动地在后台压缩聊天历史。MT-OSC采用了一个压缩代理,该代理使用基于少样本推理的压缩器和轻量级决策器,选择性地保留必要信息,在10轮对话中减少高达72%的token数量。在13个最先进的LLM和多样化的多轮基准测试中评估,MT-OSC持续缩小了多轮性能差距——在数据集上保持或提高了准确性,同时对干扰项和无关轮次保持鲁棒性。我们的结果确立了MT-OSC作为多轮聊天的可扩展解决方案,能够在受限的输入空间内实现更丰富的上下文,降低延迟和运营成本,同时平衡性能。

英文摘要

Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending full chat history to prompts rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that uses a few-shot inference-based Condenser and a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap - yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces, reducing latency and operational cost, while balancing performance.

2604.16029 2026-06-03 cs.CL cs.LG 版本更新

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

减少损失!学习早期剪枝路径以实现高效并行推理

Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Shenzhen Loop Area Institute(深圳环形区研究所) USTB(北京科技大学) DualityRL

AI总结 提出一种基于可学习内部信号的路径剪枝方法STOP,通过前缀级剪枝减少并行推理中的无效路径,显著提升大型推理模型的效率与性能。

Comments 9 pages, 7 figures

详情
AI中文摘要

并行推理增强了大型推理模型(LRMs),但由于早期错误导致的无效路径,其成本高昂。为了缓解这一问题,前缀级的路径剪枝至关重要,然而现有研究缺乏标准化框架,较为零散。在这项工作中,我们提出了第一个系统的路径剪枝分类法,根据信号来源(内部与外部)和可学习性(可学习与不可学习)对方法进行分类。这种分类揭示了可学习内部方法的未开发潜力,促使我们提出STOP(用于剪枝的超级令牌)。在参数规模从1.5B到20B的LRMs上的广泛评估表明,与现有基线相比,STOP在效果和效率上均表现出优越性。此外,我们在不同计算预算下严格验证了STOP的可扩展性——例如,在固定计算预算下,将GPT-OSS-20B在AIME25上的准确率从84%提升至近90%。最后,我们将发现提炼为形式化的经验指南,以促进最优的实际部署。代码、数据和模型可在 https://bijiaxihh.github.io/STOP 获取。

英文摘要

Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP

2604.15744 2026-06-03 cs.CL 版本更新

Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand

语言、地点与社交媒体:新西兰的地理方言对齐

Sidney Wong

发表机构 * Computer Science and Software Engineering, University of Canterbury(坎特伯雷大学计算机科学与软件工程系) Department of Linguistics, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校语言学系) School of Psychology, Speech and Hearing, University of Canterbury(坎特伯雷大学心理学、言语与听力学院) Department of Linguistics, University of Canterbury(坎特伯雷大学语言学系)

AI总结 本研究通过整合用户感知的定性分析与计算方法,探究新西兰Reddit社区中地理方言对齐现象,发现地点相关社区形成连续言语社区,且高级语言建模揭示了语义变异与变化。

Comments PhD thesis

详情
AI中文摘要

本论文研究了基于地点的社交媒体社区中的地理方言对齐现象,重点关注新西兰相关的Reddit社区。通过将用户感知的定性分析与计算方法相结合,研究考察了语言使用如何反映地点身份,以及基于用户提供的词汇、形态句法和语义变量的语言变异与变化模式。研究结果表明,用户通常将语言与地点联系起来,地点相关社区形成了一个连续的言语社区,尽管地理方言社区与地点相关社区之间的对齐仍然复杂。包括静态和历时Word2Vec语言嵌入在内的高级语言建模揭示了基于地点的社区之间的语义变异,以及新西兰英语中有意义的语义变化。研究创建了一个包含42.6亿未处理单词的语料库,为未来研究提供了宝贵资源。总体而言,结果凸显了社交媒体作为社会语言学自然实验室的潜力。

英文摘要

This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.

2604.15097 2026-06-03 cs.SE cs.CL 版本更新

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

从程序技能到策略基因:迈向经验驱动的测试时进化

Junjie Wang, Yiming Ren, Haoyang Zhang

发表机构 * Infinite Evolution Lab, EvoMap(无限进化实验室,EvoMap) Tsinghua University(清华大学)

AI总结 本文通过45个科学代码求解场景中的4590次受控试验,研究如何将经验表示为紧凑、面向控制且可迭代进化的对象(Gene),相比文档导向的Skill包,Gene在一次性控制和迭代积累中均表现更优。

Comments Technical Report

详情
AI中文摘要

这份测试版技术报告探讨了可重用经验应如何表示,以便作为有效的测试时控制和迭代进化的基础。我们在45个科学代码求解场景中的4590次受控试验中研究了这一问题。我们发现,面向文档的Skill包提供了不稳定的控制:它们的有用信号稀疏,将紧凑的经验对象扩展为更完整的文档包通常无助于甚至可能降低整体平均值。我们进一步表明,表示本身是一个首要因素。紧凑的Gene表示产生了最强的整体平均值,在显著的结构扰动下仍具有竞争力,并优于预算匹配的Skill片段,而重新附加面向文档的材料通常会削弱而非改进它。除了单次控制外,我们还表明Gene是迭代经验积累的更好载体:附加的失败历史在Gene中比在Skill或自由格式文本中更有效,可编辑的结构比内容本身更重要,并且失败信息在提炼为紧凑警告时比简单附加更有用。在CritPt上,基因进化系统相对于其配对基础模型从9.1%提升至18.57%,从17.7%提升至27.14%。这些结果表明,经验复用的核心问题不是如何提供更多经验,而是如何将经验编码为紧凑、面向控制、可进化的对象。

英文摘要

This beta technical report asks how reusable experience should be represented so that it can function as effective test-time control and as a substrate for iterative evolution. We study this question in 4,590 controlled trials across 45 scientific code-solving scenarios. We find that documentation-oriented Skill packages provide unstable control: their useful signal is sparse, and expanding a compact experience object into a fuller documentation package often fails to help and can degrade the overall average. We further show that representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Beyond one-shot control, we show that Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. These results suggest that the core problem in experience reuse is not how to supply more experience, but how to encode experience as a compact, control-oriented, evolution-ready object.

2604.07123 2026-06-03 cs.CL 版本更新

Language Bias under Conflicting Information in Multilingual LLMs

多语言大模型中冲突信息下的语言偏见

Robert Östling, Murathan Kurfalı

发表机构 * Stockholm University(斯德哥尔摩大学) RISE Research Institutes of Sweden(瑞典RISE研究机构)

AI总结 本研究通过扩展“干草堆中的冲突针”范式至多语言环境,评估了不同规模的多语言大模型在回答问题时对冲突信息中不同语言的偏好,发现模型普遍存在语言偏见,尤其是对俄语的普遍偏见和对中文的偏好,且提示语言与信息语言匹配时更受青睐。

详情
AI中文摘要

大型语言模型(LLMs)在整合冲突信息回答问题过程中已被证明存在偏见。本文探讨这种偏见是否也存在于冲突信息所使用的语言上。为此,我们将“干草堆中的冲突针”范式扩展到多语言环境,并使用五种不同语言的自然新闻领域数据,对一系列不同规模的多语言LLMs进行了全面评估。我们发现,所有测试的LLMs,包括GPT-5.2,在绝大多数情况下都会忽略冲突,并自信地只断言其中一个可能的答案。此外,在模型和提示语言之间,存在一致的语言偏好偏见,普遍对俄语存在偏见,而在最长上下文长度下,则偏好中文。语言偏好在中国大陆内外训练的模型之间一致,但前者稍强。模型还普遍倾向于优先考虑与提示语言匹配的信息。我们希望让多语言LLMs的用户和开发者意识到这类偏见,以促进对其成因及可能缓解方法的进一步研究。

英文摘要

Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models and prompting languages in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. The language preferences are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category. There is also a general tendency among models to prioritize information that matches the language used for prompting. We hope to make users and developers of multilingual LLMs aware of this category of biases, to spur further research on their causes and possible mitigation.

2604.01410 2026-06-03 cs.CL 版本更新

Assessing Pause Thresholds for empirical Translation Process Research

评估实证翻译过程研究中的停顿阈值

Devi Sri Bandaru, Michael Carl, Xinyue Ren

AI总结 本文比较了五种计算暂停阈值的方法,并提出了一种新的生产单元断点计算方法,以区分自动化与反思性翻译过程。

Comments In Proceedings of "Translation in Transition 8", 2026

详情
AI中文摘要

文本生产(及翻译)以打字片段的形式进行,其间被按键暂停打断。通常认为,快速打字反映了无挑战/自动化的翻译生产,而较长(或更长)的打字暂停则表明翻译问题、障碍或困难。基于关于确定区分自动化与推测性反思翻译过程的暂停阈值的长期讨论(O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann 2016),本文比较了计算这些暂停阈值的五种方法,并提出并评估了一种计算生产单元断点的新方法。

英文摘要

Text production (and translation) proceeds in stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann 2016), this paper compares five approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.
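
To make the notion of a pause threshold concrete, the following is a minimal sketch that splits a keystroke log into typing stretches wherever the gap between two keystrokes exceeds a fixed threshold. The function name `production_units`, the toy timestamps, and the 1000 ms threshold are illustrative assumptions; the abstract does not state the five compared approaches or the proposed Production Unit Break method, so this shows only the generic fixed-threshold idea.

```python
from typing import List, Sequence


def production_units(timestamps_ms: Sequence[int], pause_threshold_ms: int) -> List[List[int]]:
    """Group sorted keystroke timestamps into stretches of typing.

    A new stretch starts whenever the gap to the previous keystroke is
    longer than `pause_threshold_ms` (a hypothetical, fixed threshold).
    """
    units: List[List[int]] = []
    for t in timestamps_ms:
        if units and t - units[-1][-1] <= pause_threshold_ms:
            units[-1].append(t)   # short gap: continue the current stretch
        else:
            units.append([t])     # first keystroke or long pause: new stretch
    return units


# Toy keystroke log in milliseconds (made-up values, not data from the paper).
log = [0, 150, 320, 2400, 2550, 6000]
units = production_units(log, pause_threshold_ms=1000)
```

With a larger threshold the same log yields fewer, longer stretches, which is why the choice of threshold matters for separating automated from reflective phases.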

2603.26738 2026-06-03 cs.CV cs.AI cs.CL 版本更新

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

SleepVLM:基于视觉语言模型的可解释且规则驱动的睡眠分期

Guifeng Deng, Pan Wang, Mengfan Niu, Jiquan Wang, Shuying Rao, Junyi Xie, Xi'ang Chen, Sha Zhao, Gang Pan, Wanjun Guo, Tao Li, Haiteng Jiang

AI总结 提出SleepVLM,一种基于规则驱动的视觉语言模型,通过多通道PSG波形图像进行睡眠分期,并生成符合AASM评分标准的临床可读解释,在保持高准确率的同时提升可解释性。

Comments Under review

详情
AI中文摘要

尽管自动睡眠分期已达到专家级准确率,但其临床采用因缺乏可审计的推理而受阻。我们提出了SleepVLM,一种基于规则驱动的视觉语言模型(VLM),它通过多通道多导睡眠图(PSG)波形图像进行睡眠分期,并基于美国睡眠医学学会(AASM)评分标准生成临床可读的理由。利用波形感知预训练和规则驱动的监督微调,SleepVLM在保留测试集(MASS-SS1)上实现了0.767的Cohen's kappa,在外部队列(ZUAMHCS)上实现了0.743,达到了最先进的性能。两位经过训练的睡眠技术专家的独立评估进一步验证了模型的推理质量,在两个数据集上,事实准确性、证据全面性和逻辑连贯性的平均得分在3.75-3.96之间(满分5分)。通过将竞争性性能与透明、基于规则的解释相结合,SleepVLM可以提高临床工作流程中自动睡眠分期的可信度和可审计性。为了促进可解释睡眠医学的进一步研究,我们发布了MASS-EX,一个新颖的专家注释数据集。

英文摘要

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) that stages sleep from multi-channel polysomnography (PSG) waveform images and generates clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen's kappa of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Independent expert evaluation by two trained sleep technologists further validated the model's reasoning quality, with mean scores of 3.75-3.96 out of 5 across factual accuracy, evidence comprehensiveness, and logical coherence on both datasets. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
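
For readers unfamiliar with the reported metric, here is a small sketch of Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The stage labels below are toy values chosen for illustration and are not taken from MASS-SS1 or ZUAMHCS; the handling of the degenerate single-label case is a convention assumed here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two equally long label sequences."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumed convention: both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy epoch labels (W = wake, N2 = stage N2, REM); illustrative only.
model = ["W", "W", "N2", "N2", "REM", "REM", "N2", "W"]
human = ["W", "N2", "N2", "N2", "REM", "W", "N2", "W"]
kappa = cohens_kappa(model, human)
```

In this toy case the raw agreement is 6/8, while kappa is lower because part of that agreement would be expected by chance.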

2603.29346 2026-06-03 cs.CL 版本更新

L-ReLF: A Framework for Lexical Dataset Creation

L-ReLF:词汇数据集创建框架

Anass Sedrati, Mounir Afifi, Reda Benkhadra

AI总结 本文提出L-ReLF框架,通过可复现的技术流程为低资源语言创建高质量结构化词汇数据集,以解决标准化术语缺失问题,并支持下游NLP应用。

Comments Accepted to the 2026 International Conference on Natural Language Processing (ICNLP). 6 pages, 1 figure

详情
Journal ref
Proc. 2026 8th International Conference on Natural Language Processing (ICNLP), pp. 261-265, IEEE
AI中文摘要

本文介绍了L-ReLF(低资源词汇框架),一种新颖、可复现的方法论,用于为服务不足的语言创建高质量、结构化的词汇数据集。以摩洛哥达里贾语为例,标准化术语的缺乏对维基百科等平台的知识公平构成了关键障碍,常常迫使编辑者依赖不一致的临时方法在其语言中创造新词。我们的研究详细阐述了为克服这些挑战而开发的技术流程。我们系统地解决了处理低资源数据的困难,包括源识别、利用光学字符识别(OCR)尽管其偏向现代标准阿拉伯语,以及严格的后处理以纠正错误并标准化数据模型。最终的结构化数据集与维基数据词位完全兼容,作为一项重要的技术资源。L-ReLF方法论设计具有通用性,为其他语言社区提供了清晰的路径,以构建用于下游NLP应用(如机器翻译和形态分析)的基础词汇数据。

英文摘要

This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.

2603.26791 2026-06-03 cs.DL cs.AI cs.CL cs.CY 版本更新

Crystal: Characterizing Relative Impact of Scholarly Publications

Crystal: 表征学术出版物的相对影响力

Hannah Collison, Benjamin Van Durme, Daniel Khashabi

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出Crystal方法,利用大语言模型对引用论文进行联合排序,通过多数投票消除位置偏差,以更准确地区分高影响力引用,在人工标注数据集上准确率提升9.5%,F1提升8.3%。

详情
AI中文摘要

评估被引论文的影响力通常是通过在施引论文中单独分析其引用上下文来完成的。虽然这聚焦于最直接相关的文本,但它阻止了对一篇论文引用的所有作品进行相对比较。我们提出Crystal,它使用大语言模型(LLMs)联合排序施引论文中的所有被引论文。为了减轻LLMs的位置偏差,我们以随机顺序对每个列表进行三次排序,并通过多数投票聚合影响力标签。这种联合方法利用了完整的引用上下文,而不是独立评估引用,从而更可靠地区分有影响力的参考文献。Crystal在人工标注的引用数据集上,准确率比先前最先进的影响力分类器高出9.5%,F1高出8.3%。Crystal通过更少的LLM调用进一步提高了效率,并使用开放权重模型优于先前的基线,实现了可扩展、成本效益高的引用影响力分析。在对ACL时间检验奖获奖论文的案例研究中,我们发现Crystal的影响力特征与长期科学认可高度一致。我们发布了Crystal-Bank,一个包含46.8k篇论文的排名和影响力标签的数据集,以及代码。

英文摘要

Assessing a cited paper's impact is typically done by analyzing its citation context in isolation within the citing paper. While this focuses on the most directly relevant text, it prevents relative comparisons across all the works a paper cites. We propose Crystal, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs' positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting. This joint approach leverages the full citation context, rather than evaluating citations independently, to more reliably distinguish impactful references. Crystal outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a dataset of human-annotated citations. Crystal further gains efficiency through fewer LLM calls and outperforms prior baselines using an open-weight model, enabling scalable, cost-effective citation impact analysis. In a case study of ACL Test-of-Time award-winning papers, we find that Crystal's impact characterizations align closely with long-term scientific recognition. We release Crystal-Bank, a 46.8k-paper dataset with rankings and impact labels, along with code.
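
The aggregation step described above (label the same list three times under randomized orderings, then take a majority vote) can be sketched as follows. This is a minimal illustration, not Crystal's implementation: `label_fn` is a hypothetical stand-in for one LLM ranking call, and `position_biased_labeler` is a toy function invented here to mimic positional bias.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence


def majority_vote_labels(
    cited: Sequence[str],
    label_fn: Callable[[List[str]], Dict[str, str]],
    rounds: int = 3,
    seed: int = 0,
) -> Dict[str, str]:
    """Label every cited paper `rounds` times under shuffled orders; keep the most common label."""
    rng = random.Random(seed)
    votes: Dict[str, Counter] = {paper: Counter() for paper in cited}
    for _ in range(rounds):
        order = list(cited)
        rng.shuffle(order)              # a fresh random ordering per round
        labels = label_fn(order)        # hypothetical stand-in for one LLM call
        for paper, label in labels.items():
            votes[paper][label] += 1
    # With an odd number of rounds and two labels there are no ties.
    return {paper: counts.most_common(1)[0][0] for paper, counts in votes.items()}


def position_biased_labeler(order: List[str]) -> Dict[str, str]:
    """Toy labeler: always marks 'A' impactful, and wrongly favors whatever is listed first."""
    return {
        paper: "impactful" if (paper == "A" or index == 0) else "incidental"
        for index, paper in enumerate(order)
    }


result = majority_vote_labels(["A", "B", "C", "D"], position_biased_labeler)
```

Because the spurious first-position boost lands on different papers in different rounds, it is usually outvoted, while a label that holds regardless of position survives every round.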

2603.20508 2026-06-03 cs.MA cs.AI cs.CL 版本更新

Measuring Weak-to-Strong Legibility of Reasoning Models

衡量推理模型的弱到强可读性

Dani Roytburg, Shreya Sridhar, Daphne Ippolito

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对推理语言模型在多智能体场景中生成的中间思维链,提出“弱到强可读性”概念,并设计衡量指标以评估强模型输出对弱模型的易理解性。

Comments Accepted to Trustworthy AI4GOOD Workshop @ ICML 2026

详情
AI中文摘要

推理语言模型及其生成的中间思维链在多智能体设置(如模型间监控或蒸馏到较小模型)中扮演着越来越核心的角色。当不同能力层级的智能体必须合作时,强模型需要产生能被弱模型消化的轨迹。我们将此目标称为“弱到强可读性”。大模型的可信度部分依赖于这种可读性属性。特别是在安全监督方面,采用弱监控器可能成为健康预算下可靠性支架的标准。可读性要求这些决策轨迹的形状采取某种弱监控器可访问的形式。现有的基于效率的可读性指标未能捕捉“彻底性”,而是侧重于简洁性。

英文摘要

Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups such as inter-model monitoring or distillation into smaller models. When agents at different capability tiers must cooperate, strong models need to produce traces digestible by weaker ones. We refer to this goal as "weak-to-strong legibility". Trustworthiness of large models depends in part on this legibility property. For safety oversight in particular, adoption of weak monitors may become a standard for reliability scaffolds on a healthy budget. Legibility requires that the shape of these decision-making traces takes some form accessible to weaker monitors. Existing efficiency-based metrics for legibility fail to capture "thoroughness", instead focusing on conciseness.

2603.19250 2026-06-03 cs.CL 版本更新

Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams

结构线索能否拯救LLM?评估大规模文档流中的语言模型

Yukyung Lee, Yebin Lim, Woojun Jung, Wonjun Choi, Susik Yoon

发表机构 * Boston University(波士顿大学) Korea University(韩国大学)

AI总结 本文提出StreamBench基准,通过主题聚类、时序问答和摘要任务评估语言模型在混合多事件的文档流中的表现,发现结构线索能提升聚类和时序QA性能,但时序推理仍是挑战。

Comments KDD 2026

详情
AI中文摘要

评估流式环境中的语言模型至关重要,但尚未充分探索。现有基准要么关注单个复杂事件,要么为每个查询提供精心策划的输入,并且未评估模型在多个并发事件混合在同一文档流中产生的冲突下的表现。我们引入了StreamBench,这是一个基于2016年和2025年主要新闻故事构建的基准,包含605个事件和15,354个文档,涵盖三个任务:主题聚类、时序问答和摘要。为了诊断模型失败的原因,我们比较了有无结构线索(按事件组织关键事实)的性能。我们发现,结构线索提高了聚类(最高+4.37%)和时序问答(最高+9.63%)的性能,帮助模型定位相关信息并区分不同事件。虽然时序推理仍然是当前LLM固有的开放挑战,但任务中的一致改进表明,结构线索是未来大规模文档流研究的一个有希望的方向。

英文摘要

Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.

2602.07075 2026-06-03 physics.chem-ph cs.AI cs.CL cs.LG 版本更新

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

LatentChem: 从文本思维链到化学推理中的潜在思考

Xinwu Ye, Yicheng Mao, Yuxuan Liao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Zehong Wang, Zhiyuan Liu, Zhenfei Yin, Li Yuan, Philip Torr, Huan Sun, xiangxiang Zeng, Mengdi Wang, Le Cong, Shenghua Gao, Xiangru Tang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对化学大语言模型依赖显式思维链导致的模态不匹配问题,提出LatentChem推理接口,通过连续思维向量和动态感知解耦化学逻辑与语言生成,在ChemCoTBench上以59.88%非平局胜率超越强CoT基线,并实现平均10.84倍推理步骤开销降低(5.96倍实际加速)。

Comments Accepted at ICML 2026

详情
AI中文摘要

当前的化学大语言模型主要依赖显式的思维链来解决复杂推理问题。然而,将非语言的隐性化学逻辑强制转化为离散的自然语言,造成了根本性的“模态不匹配”,为推理带来了人为瓶颈。我们提出了LatentChem,一种将化学逻辑与语言生成解耦的推理接口,使模型能够通过连续思维向量和动态感知来处理信息。我们的研究揭示了一个关键涌现行为:自发内化,这里定义为在仅结果优化下的自我选择。当为任务成功进行优化时,模型放弃冗长的文本推导,转而采用隐式的潜在计算,这表明模型将连续流形视为化学逻辑更自然的载体。这一范式转变也被证明是一种更优的计算策略:在严格的ChemCoTBench基准上,LatentChem对强CoT基线取得了59.88%的非平局胜率,同时在所有评估基准上实现了平均10.84倍的推理步骤开销降低(5.96倍实际加速)。我们的结果提供了经验证据,表明化学推理更自然、更有效地实现为连续潜在动力学,而非离散的语言轨迹。

英文摘要

Current chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) to solve complex reasoning problems. However, forcing nonverbal tacit chemical logic into discrete natural language imposes a fundamental "modality mismatch," creating an artificial bottleneck for reasoning. We introduce LatentChem, a reasoning interface that decouples chemical logic from linguistic generation, enabling the model to process information via continuous thought vectors and dynamic perception. Our investigation reveals a pivotal emergent behavior: spontaneous internalization, defined here as self-selected under outcome-only optimization. When optimized for task success, the model abandons verbose textual derivations in favor of implicit latent computation, suggesting that it identifies the continuous manifold as a more native substrate for chemical logic. This paradigm shift also proves to be a superior computational strategy: LatentChem achieves a 59.88% non-tie win rate against the strong CoT baseline on the rigorous ChemCoTBench, while delivering a broad 10.84× average reduction in reasoning step overhead (5.96× wall-clock speedup) across all evaluated benchmarks. Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics rather than discretized linguistic trajectories.

2603.05207 2026-06-03 cs.IR cs.CL 版本更新

Core-based Hierarchies for Efficient GraphRAG

基于核心的高效图RAG层次结构

Jakir Hossain, Ahmet Erdem Sarıyüce

发表机构 * University at Buffalo(布法罗大学)

AI总结 针对图RAG中Leiden聚类不可复现的问题,提出用k-core分解替代,构建确定性、密度感知的层次结构,并设计轻量级启发式方法,在保证连接性的同时降低LLM成本,提升答案全面性和多样性。

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

检索增强生成(RAG)通过引入外部知识增强了大型语言模型。然而,现有的基于向量的方法通常无法处理需要跨多个文档推理的全局理解任务。GraphRAG通过将文档组织成具有层次化社区的知识图谱来解决这一问题,这些社区可以被递归总结。当前的GraphRAG方法依赖Leiden聚类进行社区检测,但我们证明,在平均度数为常数且大多数节点度数较低的稀疏知识图谱上,模块度优化允许指数级数量的近似最优划分,使得基于Leiden的社区本质上不可复现。为了解决这个问题,我们提出用k-core分解替代Leiden,它在线性时间内产生确定性的、密度感知的层次结构。我们引入一组轻量级启发式方法,利用k-core层次结构构建大小有界、保持连接性的社区用于检索和总结,同时采用一种令牌预算感知的采样策略来降低LLM成本。我们在包括金融收益报告、新闻文章和播客在内的真实世界数据集上评估了我们的方法,使用三个LLM进行答案生成,并由五个独立的LLM裁判进行逐项比较评估。跨数据集和模型,我们的方法一致地提高了答案的全面性和多样性,同时减少了令牌使用量,证明了基于k-core的GraphRAG是一种有效且高效的全局理解框架。

英文摘要

Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge. However, existing vector-based methods often fail on global sensemaking tasks that require reasoning across many documents. GraphRAG addresses this by organizing documents into a knowledge graph with hierarchical communities that can be recursively summarized. Current GraphRAG approaches rely on Leiden clustering for community detection, but we prove that on sparse knowledge graphs, where average degree is constant and most nodes have low degree, modularity optimization admits exponentially many near-optimal partitions, making Leiden-based communities inherently non-reproducible. To address this, we propose replacing Leiden with k-core decomposition, which yields a deterministic, density-aware hierarchy in linear time. We introduce a set of lightweight heuristics that leverage the k-core hierarchy to construct size-bounded, connectivity-preserving communities for retrieval and summarization, along with a token-budget-aware sampling strategy that reduces LLM costs. We evaluate our methods on real-world datasets including financial earnings transcripts, news articles, and podcasts, using three LLMs for answer generation and five independent LLM judges for head-to-head evaluation. Across datasets and models, our approach consistently improves answer comprehensiveness and diversity while reducing token usage, demonstrating that k-core-based GraphRAG is an effective and efficient framework for global sensemaking.
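
As background for the k-core decomposition mentioned above, the sketch below computes core numbers by repeatedly peeling a node of minimum remaining degree. It is a simple quadratic-time illustration written for this digest, not the linear-time bucket algorithm and not the paper's community-construction heuristics; the function name and the toy graph are assumptions. Core numbers are uniquely determined by the graph, which is the sense in which the resulting hierarchy is deterministic.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def core_numbers(edges: Iterable[Tuple[Hashable, Hashable]]) -> Dict[Hashable, int]:
    """Return the core number of every node in an undirected graph."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    remaining = set(adj)
    core: Dict[Hashable, int] = {}
    k = 0
    while remaining:
        # Peel a node of minimum remaining degree (ties broken by repr for a stable order).
        node = min(remaining, key=lambda n: (degree[n], repr(n)))
        k = max(k, degree[node])   # the core level never decreases while peeling
        core[node] = k
        remaining.remove(node)
        for neighbor in adj[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
cores = core_numbers([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])
```

Nodes with the same core number form nested shells (the 2-core here is the triangle), which gives a density-aware hierarchy without any random initialization.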

2603.03612 2026-06-03 cs.LG cs.CC cs.CL cs.FL 版本更新

Why Are Linear RNNs More Parallelizable?

为什么线性RNN更易于并行化?

William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal

发表机构 * GitHub

AI总结 本文通过将RNN类型与标准复杂度类紧密关联,揭示了线性RNN(LRNN)因可视为对数深度算术电路而易于并行化,而非线性RNN因能解决L-完全问题而存在并行化障碍。

Comments To appear at ICML 2026

详情
AI中文摘要

社区越来越多地探索线性RNN(LRNN)作为语言模型,受其表达能力和并行化能力的驱动。虽然先前的工作确立了LRNN相对于Transformer的表达优势,但尚不清楚是什么使得LRNN——而非传统的非线性RNN——在实践中与Transformer一样易于并行化。我们通过提供RNN类型与标准复杂度类之间的紧密联系来回答这个问题。我们表明,LRNN可以看作是对数深度(有界扇入)算术电路,相对于Transformer所允许的对数深度布尔电路,这仅代表轻微深度开销。此外,我们表明非线性RNN可以解决$\mathsf{L}$-完全问题(甚至在多项式精度下解决$\mathsf{P}$-完全问题),揭示了将它们与Transformer一样高效并行化的根本障碍。我们的理论还识别了近期流行LRNN变体之间的细粒度表达差异:置换对角LRNN是$\mathsf{NC}^1$-完全的,而对角加低秩LRNN更具表达性($\mathsf{PNC}^1$-完全)。我们通过将每种RNN类型与它可以模拟的相应自动机理论模型相关联,提供了进一步见解。总之,我们的结果揭示了非线性RNN与不同LRNN变体之间的基本权衡,为设计在表达性和并行性之间实现最佳平衡的LLM架构提供了基础。

英文摘要

The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
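
One way to see why a linear recurrence parallelizes is that each step h_t = a_t * h_{t-1} + b_t is an affine map, and composing affine maps is associative, so all prefixes can be combined in a logarithmic number of rounds. The sketch below illustrates this with a scalar recurrence and a Hillis-Steele style scan; it is a generic textbook-style illustration under these assumptions, not the circuit construction from the paper, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Affine = Tuple[int, int]  # (a, b) represents the map h -> a * h + b


def compose(first: Affine, second: Affine) -> Affine:
    """Return the single affine map equal to applying `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def sequential_states(steps: Sequence[Affine], h0: int) -> List[int]:
    """Reference: run the recurrence one step at a time."""
    states, h = [], h0
    for a, b in steps:
        h = a * h + b
        states.append(h)
    return states


def scan_states(steps: Sequence[Affine], h0: int) -> Tuple[List[int], int]:
    """Compute all states with a doubling scan; returns (states, number of rounds)."""
    prefix = list(steps)
    offset, rounds = 1, 0
    while offset < len(prefix):
        # Every position in a round depends only on the previous round,
        # so these updates could all run in parallel.
        prefix = [
            compose(prefix[i - offset], prefix[i]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
        rounds += 1
    return [a * h0 + b for a, b in prefix], rounds


steps = [(2, 1), (3, 0), (1, 5), (-1, 2), (2, 2)]
states, rounds = scan_states(steps, h0=1)
```

A nonlinear update such as h_t = f(a_t * h_{t-1} + b_t) does not in general collapse into a single map of the same fixed-size form under composition, so this particular shortcut is unavailable to it.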

2602.17063 2026-06-03 cs.LG cs.AI cs.CL cs.CV 版本更新

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

符号锁定:随机初始化的权重符号持续存在并成为亚比特模型压缩的瓶颈

Akira Sakai, Yuma Ichikawa

发表机构 * Fujitsu Limited(富士通株式会社) Tokai University(东海大学) Riken Center for AIP(理化学研究所AIP研究中心)

AI总结 研究亚比特模型压缩中符号位的瓶颈问题,通过符号锁定理论解释权重符号的随机性来源,并提出一种从头开始的低秩符号模板训练方法以突破该瓶颈。

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

亚比特模型压缩的目标是将每个权重的存储降至1比特以下;当幅度被激进压缩时,符号位成为固定成本的瓶颈。在Transformer、CNN和MLP中,学习到的符号矩阵抵抗低秩近似,并且在频谱上与i.i.d. Rademacher基线无法区分。这种随机性导致了亚比特模型压缩的下界——1比特墙。尽管存在这种明显的随机性,大多数权重仍保留其初始化符号;翻转主要通过罕见的近零边界穿越发生,表明符号模式的随机性很大程度上继承自初始化。我们通过符号锁定理论形式化了这一行为,这是对SGD噪声下符号翻转的停时分析。在有界更新和零的小邻域内罕见重新进入的条件下,有效符号翻转的数量呈现几何尾部。基于这一机制,我们引入了一种从头开始的低秩符号模板训练方法,以防止这种1比特墙的出现。

英文摘要

Sub-bit model compression targets storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. This randomness gives rise to the lower bound of sub-bit model compression -- the one-bit wall. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood of zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a from-scratch low-rank sign-template training method that prevents the emergence of this one-bit wall.

2602.14279 2026-06-03 cs.LG cs.AI cs.CL cs.SI 版本更新

Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions

为谁查询什么:通过多轮LLM交互的自适应群体征询

Ruomeng Ding, Tianwei Gao, Thomas P. Zollo, Eitan Bachmat, Richard Zemel, Zhun Deng

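Affiliations * University of North Carolina at Chapel Hill; Columbia University; Ben-Gurion University of the Negev

AI summary Addresses reducing uncertainty about group-level properties under a limited budget by proposing an adaptive group elicitation framework that combines LLM-based expected information gain with heterogeneous graph neural network propagation to select questions and respondents jointly, consistently improving population-level response prediction on three real-world datasets.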

Comments Published as a conference paper at ICML 2026

详情
AI中文摘要

从调查和其他集体评估中征询信息以减少关于潜在群体属性的不确定性,需要在实际成本和缺失数据下分配有限的提问努力。尽管大型语言模型支持自然语言中的自适应多轮交互,但大多数现有征询方法优化了在固定受访者池中询问什么,并且在响应部分或不完整时不会调整受访者选择或利用群体结构。为解决这一差距,我们研究了自适应群体征询,这是一个多轮设置,其中智能体在明确的查询和参与预算下自适应地选择问题和受访者。我们提出了一个理论基础的框架,该框架结合了(i)基于LLM的期望信息增益目标,用于评分候选问题,以及(ii)异构图神经网络传播,该传播聚合观察到的响应和参与者属性,以插补缺失响应并指导每轮受访者选择。这种闭环过程查询一个小的、信息丰富的个体子集,同时通过结构化相似性推断群体级别的响应。在三个真实世界意见数据集上,我们的方法在预算受限的情况下持续提高了群体级别响应预测,包括在10%受访者预算下CES上相对提升超过12%。

英文摘要

Eliciting information to reduce uncertainty about latent group-level properties from surveys and other collective assessments requires allocating limited questioning effort under real costs and missing data. Although large language models enable adaptive, multi-turn interactions in natural language, most existing elicitation methods optimize what to ask with a fixed respondent pool, and do not adapt respondent selection or leverage population structure when responses are partial or incomplete. To address this gap, we study adaptive group elicitation, a multi-round setting where an agent adaptively selects both questions and respondents under explicit query and participation budgets. We propose a theoretically grounded framework that combines (i) an LLM-based expected information gain objective for scoring candidate questions with (ii) heterogeneous graph neural network propagation that aggregates observed responses and participant attributes to impute missing responses and guide per-round respondent selection. This closed-loop procedure queries a small, informative subset of individuals while inferring population-level responses via structured similarity. Across three real-world opinion datasets, our method consistently improves population-level response prediction under constrained budgets, including a >12% relative gain on CES at a 10% respondent budget.
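
Expected information gain, the quantity used to score candidate questions, can be sketched for a discrete toy case. The code below assumes a small table of latent group states and answer likelihoods supplied by hand; in the paper these quantities come from an LLM, and `expected_information_gain` with its arguments is a hypothetical interface.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihood):
    """prior[s] = P(state s); likelihood[s][a] = P(answer a | state s)."""
    n_states = len(prior)
    n_answers = len(likelihood[0])
    gain = entropy(prior)
    for a in range(n_answers):
        p_answer = sum(prior[s] * likelihood[s][a] for s in range(n_states))
        if p_answer == 0:
            continue  # an impossible answer cannot change the posterior
        posterior = [prior[s] * likelihood[s][a] / p_answer for s in range(n_states)]
        gain -= p_answer * entropy(posterior)
    return gain


prior = [0.5, 0.5]
decisive_question = [[1.0, 0.0], [0.0, 1.0]]
useless_question = [[0.5, 0.5], [0.5, 0.5]]
eig_decisive = expected_information_gain(prior, decisive_question)
eig_useless = expected_information_gain(prior, useless_question)
```

A question whose answer pins down the state earns one full bit here, and a question answered identically in every state earns nothing, so ranking candidates by this score favors the former.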

2602.11908 2026-06-03 cs.AI cs.CL cs.LG 版本更新

When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

LLM何时应降低具体性?面向可靠长文本生成的选择性抽象

Shani Goren, Ido Galil, Ran El-Yaniv

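Affiliations * Technion; NVIDIA

AI summary Addresses the loss of valuable information when LLMs abstain at low confidence in long-form generation by proposing the Selective Abstraction framework, which replaces uncertain content with atom-level abstractions and improves accuracy and reliability while preserving meaning.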

详情
AI中文摘要

LLM被广泛使用,但仍容易出现事实错误,这削弱了用户信任并限制了在高风险场景中的采用。缓解这一风险的一种方法是为模型配备不确定性估计机制,在置信度低时弃权。然而,这种二元的“全有或全无”方法在长文本场景中过于严格,常常丢弃有价值的信息。我们引入了选择性抽象(SA),这是一个框架,使LLM能够通过选择性地降低不确定内容的细节来用具体性换取可靠性。我们首先通过选择性风险和覆盖率的视角形式化SA。然后,我们提出原子级选择性抽象,这是一种声明级别的实例化,将响应分解为原子声明(简短、自包含的陈述,每个表达一个单一事实),并用更高置信度、更低具体性的抽象替换不确定的原子。为了评估这一框架,我们开发了一个新颖的端到端流水线用于开放式生成,将风险实例化为事实正确性,并使用信息论度量保留信息来衡量覆盖率。在FactScore和LongFact-Objects基准测试上的六个开源模型中,原子级SA始终优于现有基线,在风险-覆盖率曲线下面积(AURC)上比声明移除方法提升高达27.73%,表明降低具体性可以在保留大部分原始含义的同时提升准确性和可靠性。

英文摘要

LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher-confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of the original meaning.
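
Selective risk and coverage can be made concrete with a small sketch of the claim-removal baseline. It assumes each atomic claim carries a confidence score and a correctness label, and it measures coverage as the fraction of claims kept, which is a simplification: the paper measures coverage with an information-theoretic quantity and replaces claims with abstractions instead of dropping them. The function name and data are hypothetical.

```python
def risk_coverage_point(confidences, correct, threshold):
    """Keep claims with confidence >= threshold; return (coverage, risk)."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    # Risk is the error rate among the kept claims (zero if nothing is kept).
    risk = 0.0 if not kept else sum(1 for ok in kept if not ok) / len(kept)
    return coverage, risk


confidences = [0.95, 0.90, 0.60, 0.40, 0.20]
correct = [True, True, True, False, False]

strict = risk_coverage_point(confidences, correct, threshold=0.5)
lenient = risk_coverage_point(confidences, correct, threshold=0.0)
```

Sweeping the threshold traces out a risk-coverage curve. Abstraction aims to sit above the removal baseline on that curve, because a vaguer claim still retains part of the information that removal discards.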

2602.10352 2026-06-03 cs.CL cs.AI cs.LG 版本更新

Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

从可解释性工件中学习自我解释:在向量-标签对上训练轻量级适配器

Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena

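Affiliations * University of Washington

AI summary Trains lightweight adapters (a scalar affine adapter needing only d_model+1 parameters) on interpretability artifacts while keeping the language model fully frozen, achieving reliable self-interpretation across tasks and model families and clearly outperforming untrained baselines on sparse autoencoder feature labeling, topic identification, and bridge-entity decoding in multi-hop reasoning.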

Comments 26 pages, 18 tables, 17 figures. Code and data at https://github.com/agencyenterprise/selfie-adapters

详情
AI中文摘要

自我解释方法促使语言模型描述其内部状态,但由于超参数敏感性而仍然不可靠。我们表明,在可解释性工件上训练轻量级适配器,同时保持语言模型完全冻结,可以在任务和模型族中产生可靠的自我解释。一个仅需$d_\text{model}+1$个参数的标量仿射适配器就足够了:训练后的适配器生成稀疏自编码器特征标签,其性能优于训练标签本身(在70B规模下,生成评分为70% vs 50%),以94%的召回率@1识别主题(未训练基线为1%),并在多跳推理中解码既不在提示中也不在响应中出现的桥接实体,从而无需思维链即可揭示隐式推理。仅学习到的偏置向量就占了改进的85%,更简单的适配器比更具表达力的替代方案具有更好的泛化能力。通过提示描述控制模型知识,我们发现从7B到72B参数,自我解释的提升超过了能力提升。我们的结果表明,自我解释随着规模扩大而改善,且无需修改被解释的模型。

英文摘要

Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (70% vs 50% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
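
The parameter count $d_\text{model}+1$ is consistent with an adapter made of one scalar scale and one bias vector. The sketch below assumes that reading; the class name, initialization, and the way the adapted vector would be fed back to the model are hypothetical and not taken from the released code.

```python
class ScalarAffineAdapter:
    """Maps a hidden vector v to scale * v + bias (one scalar plus one vector)."""

    def __init__(self, d_model):
        self.scale = 1.0              # single learnable scalar
        self.bias = [0.0] * d_model   # learnable bias vector of size d_model

    def num_parameters(self):
        return len(self.bias) + 1

    def __call__(self, vector):
        return [self.scale * x + b for x, b in zip(vector, self.bias)]


adapter = ScalarAffineAdapter(d_model=4)
adapter.scale = 2.0
adapter.bias = [1.0, 0.0, -1.0, 0.5]
adapted = adapter([1.0, 2.0, 3.0, 4.0])
```

Under this reading, the finding that the bias alone accounts for most of the improvement means the adapter mainly shifts every input vector by a fixed learned offset.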

2602.06960 2026-06-03 cs.CL cs.AI 版本更新

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

InftyThink+:通过强化学习实现高效且有效的无限时域推理

Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen

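Affiliations * Tsinghua University

AI summary Proposes the InftyThink+ framework, which uses reinforcement learning to optimize when to summarize, what to preserve, and how to resume in iterative reasoning, improving AIME24 accuracy by 21% on DeepSeek-R1-Distill-Qwen-1.5B and reducing inference latency.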

Comments ICML 2026: https://openreview.net/forum?id=tyul8kXaJU Project Page: https://zju-real.github.io/InftyThink-Plus Code: https://github.com/ZJU-REAL/InftyThink-Plus Models: https://huggingface.co/collections/yanyc/inftythink

详情
AI中文摘要

大型推理模型通过扩展推理时的思维链取得了强劲性能,但这种范式存在二次成本、上下文长度限制以及因中间丢失效应导致的推理退化问题。迭代推理通过定期总结中间思考缓解了这些问题,但现有方法依赖监督学习或固定启发式,未能优化何时总结、保留什么以及如何恢复推理。我们提出InftyThink+,一个端到端强化学习框架,它优化整个迭代推理轨迹,基于模型控制的迭代边界和显式总结。InftyThink+采用两阶段训练方案:监督冷启动后接轨迹级强化学习,使模型学习策略性总结和继续决策。在DeepSeek-R1-Distill-Qwen-1.5B上的实验表明,InftyThink+在AIME24上准确率提升21%,显著优于传统长思维链强化学习,同时在分布外基准上泛化能力更强。此外,InftyThink+大幅降低推理延迟并加速强化学习训练,展示了更强的推理效率与性能提升。

英文摘要

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.

2602.07842 2026-06-03 cs.CL 版本更新

Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

评估和校准LLM在多个正确答案问题上的置信度

Yuhan Wang, Shiyu Ni, Zhikai Ding, Zihang Zhan, Yuanzi Li, Keping Bi

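Affiliations * State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tsinghua University; Renmin University of China

AI summary Addresses the breakdown of existing confidence calibration methods on questions with multiple correct answers by proposing Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses and achieves state-of-the-art calibration under mixed-answer settings.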

详情
AI中文摘要

置信度校准对于使大型语言模型(LLM)可靠至关重要,然而现有的无训练方法主要在单答案问答场景下研究。本文表明,这些方法在存在多个有效答案时会失效,因为同等正确响应之间的分歧导致置信度系统性低估。为了系统研究这一现象,我们引入了MACE基准,包含跨越六个领域的12,000个事实性问题,每个问题有不同数量的正确答案。在15种代表性校准方法和四个LLM系列(7B-72B)上的实验表明,虽然准确率随答案基数增加而提高,但估计置信度持续下降,导致对于混合答案数量的问题出现严重校准偏差。为解决此问题,我们提出语义置信度聚合(SCA),该方法聚合多个高概率采样响应的置信度。SCA在混合答案设置下实现了最先进的校准性能,同时在单答案问题上保持强校准能力。

英文摘要

Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
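
A toy example shows why top-answer confidence drops when several answers are correct, and how aggregating over likely answers counteracts it. The sketch assumes sampled answers are compared by exact string match (a stand-in for semantic grouping) and uses a hypothetical `min_share` cutoff to decide which answers count as high-probability. It is not the SCA algorithm itself.

```python
from collections import Counter


def answer_frequencies(samples):
    """Empirical share of each distinct sampled answer."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: n / total for answer, n in counts.items()}


def top1_confidence(samples):
    """Confidence as the share of the single most frequent answer."""
    return max(answer_frequencies(samples).values())


def aggregated_confidence(samples, min_share=0.2):
    """Sum the shares of every answer that is sampled often enough."""
    freqs = answer_frequencies(samples)
    return sum(share for share in freqs.values() if share >= min_share)


# Ten samples for a question with several valid answers, e.g. "Name a city in France".
samples = ["Paris", "Lyon", "Paris", "Marseille", "Lyon",
           "Paris", "Lyon", "Marseille", "Nice", "Paris"]
single = top1_confidence(samples)
pooled = aggregated_confidence(samples)
```

Here the most frequent answer holds well under half of the samples even though most samples are valid answers, so a top-1 estimate understates confidence. Pooling the frequent answers recovers most of that mass.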

2602.07639 2026-06-03 cs.CL 版本更新

Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization

让导师角色为LLMs发声:通过偏好优化从对话中学习引导向量

Jaewook Lee, Alexander Scarlatos, Simon Woodhead, Andrew Lan

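Affiliations * University of Massachusetts Amherst; Eedi

AI summary Proposes training steering vectors with preference optimization to extract tutor persona information from human tutor-student dialogues, controlling LLM behavior to realize diverse tutoring styles.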

Comments Accepted to ACL 2026 BEA Workshop

详情
AI中文摘要

随着大语言模型(LLMs)作为一类强大的生成式人工智能(AI)的出现,它们在辅导中的应用日益突出。先前基于LLM的辅导工作通常学习单一的辅导策略,未能捕捉辅导风格的多样性。在现实世界的导师-学生互动中,教学意图通过适应性教学策略实现,导师根据学习者的需求调整支架式教学、指导性、反馈和情感支持的级别。这些差异都会影响对话动态和学生参与度。在本文中,我们探讨如何利用嵌入在人类导师-学生对话中的导师角色来引导LLM行为,而不依赖于显式提示指令。我们使用偏好优化训练一个引导向量:一个激活空间方向,用于引导模型响应朝向特定的导师角色。我们发现,这个引导向量捕捉了跨对话上下文的导师特定变化,提高了与真实导师话语的语义对齐,并增加了基于偏好的评估,同时很大程度上保留了词汇相似性。对学习到的缩放系数的进一步分析揭示了跨导师的可解释结构,对应于辅导行为的一致差异。这些结果表明,激活引导提供了一种有效且可解释的方式,利用直接从人类对话数据中获得的信号来控制LLM中导师特定的变化。

英文摘要

With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor-student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners' needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We train a steering vector using preference optimization: an activation-space direction that guides model responses toward specific tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned scaling coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.

2511.16275 2026-06-03 cs.CL cs.AI 版本更新

SeSE: Black-Box Uncertainty Quantification for Large Language Models Based on Structural Information Theory

SeSE: 基于结构信息理论的大语言模型黑盒不确定性量化

Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu, Jinzhi Liao, Philip S. Yu

Affiliations * School of Cyber Science and Technology, Beihang University; School of Computer Science and Engineering, Beihang University; Didi Chuxing; Laboratory for Big Data and Decision, National University of Defense Technology; Department of Computer Science, University of Illinois Chicago

AI summary Proposes the SeSE framework, which performs black-box uncertainty quantification for LLMs by building an optimal hierarchical abstraction of the semantic space and computing its structural entropy; it theoretically generalizes semantic entropy and outperforms existing methods on long-form generation.

Comments Accepted by UAI 2026

详情
AI中文摘要

可靠的不确定性量化(UQ)对于在安全关键场景中部署大语言模型(LLMs)至关重要,因为它使模型能够在不确定时避免回应,从而避免产生幻觉,即看似合理但事实错误的回应。然而,尽管语义UQ方法取得了先进性能,它们忽略了可能实现更精确不确定性估计的潜在语义结构信息。在本文中,我们提出了语义结构熵(SeSE),一个适用于开源和闭源LLMs的原则性黑盒UQ框架。为了揭示语义空间的内在结构,SeSE通过具有最小结构熵的编码树构建其最优层次抽象。因此,该编码树的结构熵量化了最优压缩后LLM语义空间内的固有不确定性。此外,与主要关注简单短文本生成的现有方法不同,我们将SeSE扩展到为长文本输出提供可解释的、细粒度的不确定性估计。我们从理论上证明SeSE推广了语义熵(LLM中UQ的金标准),并通过24个模型-数据集组合的实验证明其优于强基线的性能。

英文摘要

Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinations, i.e., plausible yet factually incorrect responses. However, while semantic UQ methods have achieved advanced performance, they overlook latent semantic structural information that could enable more precise uncertainty estimates. In this paper, we propose Semantic Structural Entropy (SeSE), a principled black-box UQ framework applicable to both open- and closed-source LLMs. To reveal the intrinsic structure of the semantic space, SeSE constructs its optimal hierarchical abstraction through an encoding tree with minimal structural entropy. The structural entropy of this encoding tree thus quantifies the inherent uncertainty within LLM semantic space after optimal compression. Additionally, unlike existing methods that primarily focus on simple short-form generation, we extend SeSE to provide interpretable, granular uncertainty estimation for long-form outputs. We theoretically prove that SeSE generalizes semantic entropy, the gold standard for UQ in LLMs, and empirically demonstrate its superior performance over strong baselines across 24 model-dataset combinations.

2507.10419 2026-06-03 cs.LG cs.AI cs.CL stat.ML 版本更新

Multiple Choice Learning of Low-Rank Adapters for Language Modeling

低秩适配器的多选学习用于语言建模

Victor Letzelter, Hugo Malard, Mathieu Fontaine, Gaël Richard, Slim Essid, Andrei Bursuc, Patrick Pérez

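Affiliations * Institut National de la Recherche Scientifique (INRS)

AI summary Proposes the LoRA-MCL training scheme, which extends next-token prediction in language models with Multiple Choice Learning and Low-Rank Adaptation to decode diverse, plausible sentence continuations at inference time.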

Comments ICML 2026

详情
AI中文摘要

我们提出LoRA-MCL,一种训练方案,通过一种旨在推理时解码多样、合理的句子延续的方法,扩展语言模型中的下一词预测。传统语言建模是一个本质上不适定的问题:给定一个上下文,多个未来可能同样合理。我们的方法利用多选学习(MCL)和胜者全得损失,通过低秩适配有效处理歧义。我们提供了将MCL应用于语言建模的理论解释,假设数据来自混合分布。我们使用马尔可夫链混合来说明所提出的方法。然后,我们通过音频和视觉字幕以及机器翻译的实验证明,我们的方法在生成输出中实现了高多样性和相关性。我们发布了将LoRA-MCL应用于广泛语言模型的代码。

英文摘要

We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the winner-takes-all loss to efficiently handle ambiguity through Low-Rank Adaptation. We provide a theoretical interpretation of applying MCL to language modeling, assuming the data is generated from a mixture of distributions. We illustrate the proposed approach using mixtures of Markov chains. We then demonstrate with experiments on audio and visual captioning, as well as machine translation, that our method achieves high diversity and relevance in generated outputs. We release the code for applying LoRA-MCL to a wide range of language models.
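
The winner-takes-all loss at the core of Multiple Choice Learning is simple to state in code. The sketch below assumes the per-hypothesis losses have already been computed (in LoRA-MCL these would presumably be sequence losses from different low-rank adapters on a shared base model); the function names and numbers are hypothetical.

```python
def winner_takes_all_loss(hypothesis_losses):
    """Return (loss, winner index): only the best hypothesis is penalized."""
    winner = min(range(len(hypothesis_losses)), key=hypothesis_losses.__getitem__)
    return hypothesis_losses[winner], winner


def batch_wta(loss_table):
    """loss_table[i][k] = loss of hypothesis k on example i."""
    results = [winner_takes_all_loss(row) for row in loss_table]
    mean_loss = sum(loss for loss, _ in results) / len(results)
    winners = [k for _, k in results]
    return mean_loss, winners


loss_table = [
    [2.0, 0.5],  # hypothesis 1 explains this example best
    [0.3, 1.7],  # hypothesis 0 explains this example best
    [1.0, 0.8],
]
mean_loss, winners = batch_wta(loss_table)
```

Because each example only updates its winning hypothesis, different hypotheses drift toward different modes of the data, which is the mechanism behind diverse continuations at inference time.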

2602.03681 2026-06-03 cs.CL cs.LG 版本更新

Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models

神经注意力搜索线性:迈向自适应令牌级混合注意力模型

Difan Deng, Andreas Bentzen Winje, Lukas Fehring, Marius Lindauer

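Affiliations * University of Copenhagen

AI summary Proposes the NAtS-L framework, which adaptively chooses linear attention or softmax attention for different tokens within the same layer to balance efficiency and expressivity.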

Comments 21 pages, 12 figures

详情
AI中文摘要

softmax变换器的二次计算复杂度已成为长上下文场景的瓶颈。相比之下,线性注意力模型系列为更高效的序列模型提供了有希望的方向。这些线性注意力模型将过去的KV值压缩成单个隐藏状态,从而在训练和推理期间有效降低复杂度。然而,它们的表达能力仍然受限于隐藏状态的大小。先前的工作提出交错使用softmax和线性注意力层,以在保持表达能力的同时降低计算复杂度。然而,这些模型的效率仍然受限于其softmax注意力层。在本文中,我们提出神经注意力搜索线性(NAtS-L)框架,该框架在同一层内对不同令牌同时应用线性注意力和softmax注意力操作。NAtS-L自动判断一个令牌是否可以由线性注意力模型处理(即仅具有短期影响且可编码为固定大小隐藏状态的令牌),或者是否需要softmax注意力(即包含与长期检索相关信息且需要为未来查询保留的令牌)。通过跨令牌搜索最优的门控DeltaNet和softmax注意力组合,我们展示了NAtS-L提供了一种强大而高效的令牌级混合架构。

英文摘要

The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or require softmax attention, i.e., tokens that contain information related to long-term retrieval and need to be preserved for future queries. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.

2512.00956 2026-06-03 cs.LG cs.CL 版本更新

WUSH: Near-Optimal Adaptive Transforms for LLM Quantization

WUSH: 面向LLM量化的近最优自适应变换

Jiale Chen, Vage Egiazarian, Roberto L. Castro, Torsten Hoefler, Dan Alistarh

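Affiliations * University of Tartu

AI summary Proposes WUSH, a non-orthogonal transform that combines a Hadamard backbone with a data-dependent second-moment component, giving a closed-form optimal solution for joint weight-activation quantization under standard RTN AbsMax-scaled block quantizers, improving low-bit quantization accuracy and supporting an efficient GPU implementation.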

Comments Published as a conference paper at the 43rd International Conference on Machine Learning (ICML 2026): https://openreview.net/forum?id=ZsECxUkbKB

详情
AI中文摘要

量化LLM权重和激活是实现高效部署的标准方法,但少数极端异常值会拉伸动态范围并放大低比特量化误差。先前的基于变换的缓解方法(例如Hadamard旋转)是固定的且与数据无关,其量化最优性尚不明确。我们推导了在标准RTN AbsMax缩放块量化器下,用于联合权重-激活量化的闭式最优线性块变换,涵盖整数和浮点格式。由此产生的构造WUSH将Hadamard骨干与数据依赖的二阶矩分量相结合,形成一种非正交变换,在温和假设下对FP和INT量化器证明是近最优的,同时支持高效的融合GPU实现。实验上,WUSH在最强Hadamard基线(例如,在Llama-3.1-8B-Instruct的MXFP4上,RTN平均提升+2.8个点,GPTQ提升+0.7个点)上改善了W4A4精度,同时通过FP4 MatMul实现了高达BF16的5.8倍每层吞吐量。源代码可在https://github.com/IST-DASLab/WUSH获取。

英文摘要

Quantizing LLM weights and activations is a standard approach for efficient deployment, but a few extreme outliers can stretch the dynamic range and amplify low-bit quantization errors. Prior transform-based mitigations (e.g., Hadamard rotations) are fixed and data-agnostic, and their optimality for quantization has remained unclear. We derive closed-form optimal linear blockwise transforms for joint weight-activation quantization under standard RTN AbsMax-scaled block quantizers, covering both integer and floating-point formats. The resulting construction, WUSH, combines a Hadamard backbone with a data-dependent second-moment component to form a non-orthogonal transform that is provably near-optimal for FP and INT quantizers under mild assumptions while admitting an efficient fused GPU implementation. Empirically, WUSH improves W4A4 accuracy over the strongest Hadamard-based baselines (e.g., on Llama-3.1-8B-Instruct in MXFP4, it gains +2.8 average points with RTN and +0.7 with GPTQ) while delivering up to 5.8$\times$ per-layer throughput over BF16 via FP4 MatMul. Source code is available at https://github.com/IST-DASLab/WUSH.
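
The outlier problem and the Hadamard baseline that WUSH builds on can be shown with a four-element block. The sketch assumes a symmetric 4-bit integer grid and a fixed normalized Hadamard rotation; it does not implement WUSH's data-dependent, non-orthogonal transform, and all names are hypothetical.

```python
def rtn_absmax_quantize(block, bits=4):
    """Round-to-nearest with a per-block scale set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(x) for x in block)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return [c * scale for c in codes]  # dequantized values


def hadamard4(x):
    """Multiply a length-4 vector by the normalized 4x4 Hadamard matrix."""
    h = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
    return [sum(h[i][j] * x[j] for j in range(4)) / 2 for i in range(4)]


def squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


block = [8.0, 0.3, -0.4, 0.2]           # one outlier stretches the range
direct = rtn_absmax_quantize(block)     # small entries collapse to zero
rotated = hadamard4(block)              # rotation spreads the outlier out
restored = hadamard4(rtn_absmax_quantize(rotated))  # this matrix is its own inverse

error_direct = squared_error(block, direct)
error_rotated = squared_error(block, restored)
```

In this example the rotated block quantizes with lower squared error than the raw block, because the outlier no longer dictates a coarse step size for the small entries. The abstract's claim is that a data-dependent transform can do better than such a fixed rotation.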

2601.12247 2026-06-03 cs.CL cs.AI cs.LG 版本更新

Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

规划、验证与填充:扩散语言模型的结构化并行解码方法

Miao Li, Hanyang Jiang, Sikai Cheng, Hengyu Fu, Yuhang Cai, Baihe Huang, Tinghan Ye, Xuanzhou Chen, Pascal Van Hentenryck

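Affiliations * Georgia Institute of Technology; University of California, Berkeley; University of Michigan

AI summary Proposes Plan-Verify-Fill (PVF), which plans a hierarchical skeleton grounded in quantitative validation and uses a verification protocol for structural stopping, reducing the number of function evaluations by up to 65% while maintaining accuracy.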

详情
AI中文摘要

扩散语言模型(DLM)为文本生成提供了一种有前景的非顺序范式,不同于标准的自回归(AR)方法。然而,当前的解码策略通常采取被动姿态,未能充分利用全局双向上下文来指导全局轨迹。为了解决这个问题,我们提出了Plan-Verify-Fill(PVF),一种无需训练的范式,通过定量验证来锚定规划。PVF通过优先考虑高杠杆语义锚点主动构建分层骨架,并采用验证协议来实现实用的结构化停止,在进一步思考收益递减时停止。在LLaDA-8B-Instruct和Dream-7B-Instruct上的广泛评估表明,与基于置信度的并行解码相比,PVF在基准数据集上将函数评估次数(NFE)减少了高达65%,在不牺牲准确性的情况下实现了卓越的效率。

英文摘要

Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

2601.14569 2026-06-03 cs.CL cs.LG 版本更新

Social Caption: Evaluating Social Understanding in Multimodal Models

Social Caption: 评估多模态模型的社会理解能力

Leena Mathur, Bhaavanaa Thumu, Youssouf Kebe, Louis-Philippe Morency

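Affiliations * School of Computer Science, Carnegie Mellon University

AI summary Proposes SOCIAL CAPTION, a framework grounded in interaction theory that evaluates the social understanding of multimodal LLMs along three dimensions (Social Inference, Holistic Social Analysis, and Directed Social Analysis) and analyzes the factors that influence performance.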

Comments 25 pages, 10 figures

详情
AI中文摘要

社会理解能力对于多模态大语言模型(MLLMs)解读人类社交互动至关重要。我们引入SOCIAL CAPTION,这是一个基于交互理论的框架,用于从三个维度评估MLLMs的社会理解能力:社会推理(SI),即对互动做出准确推断的能力;整体社会分析(HSA),即生成互动全面描述的能力;定向社会分析(DSA),即从互动中生成相关信息的能力。我们分析了影响模型社会理解性能的因素,如规模、架构设计和口语语境。使用MLLM评判员的实验展示了扩展多模态社会理解自动化评估的路径。

英文摘要

Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce SOCIAL CAPTION, a framework grounded in interaction theory to evaluate social understanding abilities of MLLMs along three dimensions: Social Inference (SI), the ability to make accurate inferences about interactions; Holistic Social Analysis (HSA), the ability to generate comprehensive descriptions of interactions; Directed Social Analysis (DSA), the ability to generate relevant information from interactions. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context. Experiments with MLLM judges demonstrate a path towards scaling automated evaluation of multimodal social understanding.

2505.16014 2026-06-03 cs.CL 版本更新

Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains

无排序RAG:用选择替代重排序以应用于敏感领域

Yash Saxena, Ankur Padia, Mandar S Chaudhary, Kalpa Gunaratna, Srinivasan Parthasarathy, Manas Gaur

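Affiliations * University of Washington

AI summary Proposes the METEORA framework, which uses a DPO-tuned LLM to generate retrieval rationales, statistical elbow detection for adaptive cutoffs, and a verifier for filtering, achieving interpretable, efficient, and robust evidence selection in sensitive domains without re-ranking.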

Comments ICML 2026

详情
AI中文摘要

部署在敏感领域的检索增强生成(RAG)系统必须提供可解释的证据选择,并针对数据投毒提供稳健的防护,然而当前方法依赖于不透明的基于相似性的检索,并采用任意的top-k截断,这些方法对其选择不提供任何解释,且容易受到对抗性操纵。METEORA通过三个组件用理由驱动的选择替代重排序:一个DPO微调的LLM,生成明确的检索理由;一个证据块选择引擎(ECSE),利用这些理由结合统计肘部检测进行自适应截断确定;以及一个验证器LLM,使用相同的理由过滤投毒证据。在六个数据集上,METEORA实现了召回率提高13.41%,精确率提高21.05%(无扩展),证据量减少80%,答案准确率提高33.34%,对抗鲁棒性提高4.4倍。人工评估证实了真正的可解释性(置信度3.64/5;86%的真实标签一致性),表明可解释性、效率和鲁棒性是协同而非竞争的目标。代码可在GitHub仓库https://github.com/YashSaxena21/METEORA中获取。

英文摘要

Retrieval-Augmented Generation (RAG) systems deployed in sensitive domains must provide interpretable evidence selection and robust safeguards against data poisoning, yet current approaches rely on opaque similarity-based retrieval with arbitrary top-k cutoffs that offer no explanation for their selections and remain vulnerable to adversarial manipulation. METEORA replaces re-ranking with rationale-driven selection via three components: a DPO-tuned LLM that generates explicit retrieval rationales, an Evidence Chunk Selection Engine (ECSE) that uses those rationales with statistical elbow detection for adaptive cutoff determination, and a Verifier LLM that filters poisoned evidence using the same rationales. Across six datasets, METEORA achieves 13.41% higher recall, 21.05% higher precision (without expansion), an 80% reduction in evidence volume, a 33.34% improvement in answer accuracy, and a 4.4x improvement in adversarial robustness. Human evaluation confirms genuine interpretability (3.64/5 confidence; 86% ground-truth agreement), demonstrating that interpretability, efficiency, and robustness are synergistic rather than competing objectives. The code is available in the GitHub repository https://github.com/YashSaxena21/METEORA
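
An adaptive cutoff replaces a fixed top-k by looking for the point where scores fall off. The abstract does not specify the elbow statistic, so the sketch below assumes the simplest heuristic, cutting at the largest drop between neighboring scores; `elbow_cutoff` and the scores are hypothetical.

```python
def elbow_cutoff(scores):
    """Return how many items to keep: cut at the largest drop between neighbors."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    return drops.index(max(drops)) + 1


chunk_scores = [0.91, 0.88, 0.84, 0.35, 0.30, 0.28]
keep = elbow_cutoff(chunk_scores)
selected = sorted(chunk_scores, reverse=True)[:keep]
```

With these scores the cutoff lands after the third chunk, whereas a fixed top-5 would have admitted two low-scoring chunks. That difference is the kind of reduction in evidence volume the abstract reports.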

2601.11429 2026-06-03 cs.CL cs.AI 版本更新

Relational Linearity is a Predictor of Hallucinations

关系线性是幻觉的预测因子

Yuetian Lu, Yihong Liu, Sebastian Gerstner, Lea Hirlimann, Jonas Rohweder, Hinrich Schütze

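AI summary Using a benchmark of synthetic unknown entities, finds that language models hallucinate more readily on questions about linear relations and that relational linearity correlates strongly with hallucination rate.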

Comments 15 pages, 6 figures, 14 tables

详情
AI中文摘要

幻觉是语言模型(LMs)的一个核心失败模式。我们关注对诸如“格伦·古尔德演奏哪种乐器?”这类问题的幻觉,但针对设计为模型未知的合成实体提问。我们发现,像Gemma-7B-IT这样的LM经常产生幻觉,即它们难以识别幻觉事实不属于其知识。基于线性关系嵌入的思想,我们提出以下假设:(i)由于用于表示它们的抽象方案,LM可以轻松地为线性关系的非存在主体生成合理的对象,这可能导致幻觉。(ii)对于非线性关系,这种生成对象的机制不可用,因此更容易避免幻觉。为了验证这一假设,我们创建了SyntHal,一个针对15种关系的合成未知实体基准。我们发现,在四个指令调优模型中,关系线性度是模型为未知主体生成对象(而非拒绝回答)的强预测因子,相关系数$r \in [.58, .84]$。

英文摘要

Hallucination is a central failure mode of language models (LMs). We focus on hallucinations in response to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities designed to be unknown to the model. We find that LMs like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. Based on the idea of linear relational embeddings, we put forward the following hypothesis. (i) Due to the abstract scheme that is used to represent them, LMs can easily produce plausible objects for non-existing subjects of linear relations, which can lead to hallucinations. (ii) For a nonlinear relation, this mechanism for producing an object is not available and so a hallucination is easier to avoid. To test this hypothesis, we create SyntHal, a synthetic unknown-entity benchmark for 15 relations. We find that across four instruction-tuned models, relational linearity is a strong predictor of models hallucinating an object for an unknown subject vs refusing to give an answer, with correlations $r \in [.58, .84]$.

2512.10999 2026-06-03 cs.CL 版本更新

KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering

KBQA-R1:强化大语言模型用于知识库问答

Xin Sun, Zhongqi Chen, Xing Zheng, Qiang Liu, Shu Wu, Bowen Song, Zilei Wang, Weiqiang Wang, Liang Wang

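Affiliations * University of Science and Technology of China

AI summary Proposes the KBQA-R1 framework, which uses reinforcement learning (GRPO) to treat knowledge base question answering as a multi-turn decision process and introduces Referenced Rejection Sampling (RRS) to address cold start, achieving state-of-the-art performance on several benchmarks.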

Comments ICML 2026

详情
AI中文摘要

知识库问答(KBQA)挑战模型通过生成可执行的逻辑形式来弥合自然语言与严格知识图谱模式之间的差距。虽然大语言模型(LLMs)推动了这一领域的发展,但当前的方法常常陷入两种失败模式:要么生成未经模式验证的幻觉查询,要么表现出僵化的、基于模板的推理,模仿合成轨迹而没有真正理解环境。为了解决这些局限性,我们提出了KBQA-R1框架,该框架通过强化学习将范式从文本模仿转变为交互优化。将KBQA视为一个多轮决策过程,我们的模型学习使用动作列表导航知识库,利用组相对策略优化(GRPO)根据具体的执行反馈而非静态监督来优化其策略。此外,我们引入了参考拒绝采样(RRS),一种数据合成方法,通过严格对齐推理轨迹与真实动作序列来解决冷启动问题。在WebQSP、GrailQA和GraphQuestions上的大量实验表明,KBQA-R1实现了最先进的性能,有效地将LLM推理锚定在可验证的执行中。

英文摘要

Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.

2507.16003 2026-06-03 cs.CL cs.LG 版本更新

Learning without training: The implicit dynamics of in-context learning

无需训练的学习:上下文学习的内在动态

Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo

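Affiliations * Google

AI summary Shows through theoretical analysis and experiments that stacking a self-attention layer with an MLP lets a transformer block implicitly modify the MLP weights according to the context, which may help explain how LLMs learn in context at inference time without weight updates.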

详情
AI中文摘要

大型语言模型(LLMs)最显著的特征之一是其上下文学习能力。即在推理时,即使提示中呈现的模式在训练中未见,LLM也能在无需额外权重更新的情况下学习新模式。这种机制如何实现仍很大程度上未知。本文中,我们展示了自注意力层与MLP的堆叠使得Transformer块能够根据上下文隐式修改MLP层的权重。通过理论分析和实验,我们认为这种简单机制可能有助于解释为什么LLMs展现出超越训练捕获的上下文学习能力。具体而言,我们证明带有上下文的标准前向传播在数学上等价于无上下文但MLP权重通过表示上下文的最小低秩更新进行更新的前向传播。

英文摘要

One of the most striking features of Large Language Models (LLMs) is their ability to learn in-context. Namely, at inference time, an LLM can learn new patterns without any additional weight update when these patterns are presented as examples in the prompt, even if they were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theoretical analysis and experimentation that this simple mechanism may help explain why LLMs demonstrate capabilities of in-context learning, beyond what is captured during training. Specifically, we show that a standard forward pass with context is mathematically equivalent to a forward pass without context but with the MLP weights updated by a minimal low-rank update representing the context.
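
The equivalence can be checked numerically for the first linear map of an MLP. The sketch below constructs one rank-1 update with the stated property, under the assumption that the context only changes the attention output fed into the MLP; the paper's exact formula may differ, and `rank_one_update` and the sample vectors are hypothetical.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rank_one_update(weights, attn_no_ctx, attn_with_ctx):
    """Delta W = (W @ delta) a^T / ||a||^2 with a = attn_no_ctx, delta = attn_with_ctx - a."""
    delta = [c - a for c, a in zip(attn_with_ctx, attn_no_ctx)]
    w_delta = matvec(weights, delta)
    norm_sq = sum(a * a for a in attn_no_ctx)
    return [[wd * a / norm_sq for a in attn_no_ctx] for wd in w_delta]


W = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]   # first MLP weight matrix
a = [1.0, 2.0, 2.0]                        # attention output without context
a_ctx = [2.0, 0.0, 1.0]                    # attention output with context

dW = rank_one_update(W, a, a_ctx)
W_new = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, dW)]

with_context = matvec(W, a_ctx)            # original weights, context present
without_context = matvec(W_new, a)         # updated weights, context absent
```

The two pre-activations agree, so any nonlinearity and later layers applied to them agree as well. The update is an outer product of two vectors, which is what makes it low-rank.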

2512.11213 2026-06-03 cs.AI cs.CL 版本更新

FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

FutureWeaver: 面向模块化协作的多智能体系统的测试时计算规划

Dongwon Jung, Peng Shi, Muhao Chen, Yi Zhang

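Affiliations * University of California, Davis; University of Waterloo; Greenshoe, Inc

AI summary Proposes FutureWeaver, a framework that uses a dual-level planning architecture and self-induced collaboration modules to optimize test-time compute allocation for multi-agent systems under a fixed budget, outperforming baselines across budget settings.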

详情
AI中文摘要

扩展测试时计算已被证明可以在无需额外训练的情况下显著提升大语言模型(LLM)的性能。然而,将这些技术扩展到多智能体系统仍然具有挑战性:现有方法缺乏原则性的机制来分配计算以实现有效协作、扩展协调本身,或在明确的预算约束下优化计算使用。为弥补这一差距,我们提出了FutureWeaver,一个在固定预算下规划和优化多智能体系统中测试时计算分配的框架。它引入了协作模块,形式化为模块化的、可调用的函数,封装了可复用的多智能体工作流,并通过自博弈反思从重复出现的交互模式中自动归纳。基于这些模块,它采用了一种双层次规划架构,联合执行短视动作选择和长远抽象前瞻,以在预算约束下优化推理轨迹。在复杂智能体基准上的实验表明,FutureWeaver在各种预算设置下始终优于基线,验证了其在推理时优化中多智能体协作的有效性。

英文摘要

Scaling test-time computation has been shown to significantly improve large language model (LLM) performance without additional training. However, extending these techniques to multi-agent systems remains challenging: existing approaches lack principled mechanisms for allocating compute to enable effective collaboration, scaling coordination itself, or optimizing compute usage under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. It introduces collaboration modules, formalized as modular, callable functions that encapsulate reusable multi-agent workflows and are automatically induced via self-play reflection from recurring interaction patterns. Building on these modules, it employs a dual-level planning architecture that jointly performs short-horizon action selection and long-horizon abstract lookahead to optimize inference trajectories under budget constraints. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.

2512.00360 2026-06-03 cs.CL 版本更新

CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA

CourseTimeQA: 一个讲座视频基准和一种用于时间戳问答的延迟约束跨模态融合方法

Vsevolod Kovalev, Parteek Kumar

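AI summary For timestamped question answering over educational lecture videos under a single-GPU latency/memory budget, the paper presents the CourseTimeQA benchmark and CrossFusion-RAG, a lightweight latency-constrained cross-modal retriever that combines frozen encoders, shallow query-agnostic cross-attention, and a temporal-consistency regularizer; it reports gains of 0.10 in nDCG@10 and 0.08 in MRR with a median end-to-end latency of about 1.55 s.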

Comments This paper is being withdrawn because an error in our measurement procedure produced incorrect values in our reported retrieval results (Tables I, II, V, and VI, and the corresponding headline figures in the Abstract). Several of our empirical claims depend on these measurements and therefore do not hold as stated

详情
AI中文摘要

我们研究了在单GPU延迟/内存预算下,对教育讲座视频进行时间戳问答。给定自然语言查询,系统检索相关的时间戳片段并合成有依据的答案。我们提出了CourseTimeQA(52.3小时,涵盖六门课程的902个查询)和一种轻量级、延迟约束的跨模态检索器(CrossFusion-RAG),它结合了冻结编码器、一个学习的512->768视觉投影、在ASR和帧上的浅层查询无关交叉注意力(带有时间一致性正则化)以及一个小型交叉注意力重排序器。在CourseTimeQA上,CrossFusion-RAG相比强基线BLIP-2检索器,nDCG@10提升了0.10,MRR提升了0.08,同时在单个A100上实现了约1.55秒的中位端到端延迟。最接近的对比方法(零样本CLIP多帧池化;CLIP+交叉编码器重排序器+MMR;学习的后期融合门控;仅文本混合方法(带交叉编码器重排序及其MMR变体);字幕增强文本检索;非学习的时间平滑)在匹配的硬件和索引下进行了评估。我们报告了在ASR噪声(WER四分位数)下的鲁棒性、时间定位的诊断结果,以及完整的训练/调优细节,以支持可重复的比较。

英文摘要

We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512->768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while achieving approximately 1.55 s median end-to-end latency on a single A100. Closest comparators (zero-shot CLIP multi-frame pooling; CLIP + cross-encoder reranker + MMR; learned late-fusion gating; text-only hybrid with cross-encoder reranking and its MMR variant; caption-augmented text retrieval; non-learned temporal smoothing) are evaluated under matched hardware and indexing. We report robustness across ASR noise (WER quartiles), diagnostics for temporal localization, and full training/tuning details to support reproducible comparison.
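For readers unfamiliar with the two retrieval metrics named above, the sketch below shows how nDCG@k and MRR are conventionally computed from graded relevance lists. It is not the authors' evaluation code. One simplifying assumption: the ideal ranking is built from the retrieved list only, so relevant segments that were never retrieved are ignored.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (rank 0 is best)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same items in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_lists):
    """Mean reciprocal rank of the first relevant result for each query."""
    total = 0.0
    for rels in ranked_lists:
        first_hit = next((i for i, r in enumerate(rels) if r > 0), None)
        total += 0.0 if first_hit is None else 1.0 / (first_hit + 1)
    return total / len(ranked_lists)

# Hypothetical relevance labels for two queries, in ranked order.
example_ndcg = ndcg_at_k([0, 1])
example_mrr = mrr([[0, 1], [1, 0]])
```

Both metrics lie in [0, 1], so an absolute difference such as 0.10 in nDCG@10 is a difference on that scale.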

2511.21731 2026-06-03 cs.CL cs.AI 版本更新

Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition

识别AI语言中的量子结构:人类与人工智能认知进化趋同的证据

Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Suzette Geriente, Roberto Leporini, Massimiliano Sassoli de Bianchi, Sandro Sozzo

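Affiliations * Center Leo Apostel for Interdisciplinary Studies, Vrije Universiteit Brussel (VUB); Department of Economics, University of Bergamo; Department of Humanities and Cultural Heritage (DIUM) and Centre CQSCS, University of Udine

AI summary Cognitive tests on large language models find a significant violation of Bell's inequalities in conceptual combinations and Bose-Einstein statistics in word distributions, suggesting that non-classical, quantum-like structures emerge in the conceptual-linguistic domain for both human and artificial agents and supporting the hypothesis of evolutionary convergence in cognition.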

详情
Journal ref
Entropy 28, 622, 2026
AI中文摘要

我们展示了使用特定大型语言模型(LLMs)作为测试对象进行的概念组合认知测试结果。在第一个测试中,使用ChatGPT和Gemini,我们表明贝尔不等式被显著违背,这表明存在一个概率不满足Kolmogorov公理的“非经典概率模型”。在第二个测试中,同样使用ChatGPT和Gemini,我们在大型文本中的单词分布中识别出“玻色-爱因斯坦统计”的存在,而非直觉预期的“麦克斯韦-玻尔兹曼统计”。有趣的是,这些发现与之前在人类参与者认知测试和大规模语料库信息检索测试中获得的结果相呼应。综合来看,它们指向“概念-语言领域中非经典量子类结构的系统性涌现”,无论认知主体是人类还是人工智能。尽管LLMs因历史原因被归类为神经网络,但我们认为,在神经网络之上构建的向量空间的分布式语义结构中,发生了一种更本质的知识组织形式。正是这种承载意义的结构,促成了通过生物进化缓慢建立的人类认知与语言,与通过自我学习和训练快速涌现的LLM认知与语言之间的进化趋同现象。我们分析了支持上述假设的各种方面和实例。我们还提出了一个统一框架,解释了我们识别出的普遍量子组织意义。

英文摘要

We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell's inequalities are significantly violated, which indicates the presence of a 'non-classical probability model' with probabilities that do not satisfy Kolmogorov's axioms. In the second test, also performed using ChatGPT and Gemini, we identify the presence of 'Bose-Einstein statistics', rather than the intuitively expected 'Maxwell-Boltzmann statistics', in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of non-classical quantum-like structures in conceptual-linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.

2503.07265 2026-06-03 cs.CV cs.AI cs.CL 版本更新

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

WISE: 一种基于世界知识的文本到图像生成语义评估方法

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Fanqing Meng, Kunpeng Ning, Bin Zhu, Li Yuan

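Affiliations * University of Science and Technology of China

AI summary To address the lack of evaluation of complex semantic understanding and world-knowledge integration in text-to-image models, the paper proposes the WISE benchmark, with 1,000 carefully designed prompts across 25 subdomains, and the WiScore metric for knowledge-image alignment; experiments show that current models are significantly limited in integrating world knowledge.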

Comments Accepted to ICML 2026. We have also released an updated version of the benchmark, WISE_Verified. Please refer to https://github.com/PKU-YuanGroup/WISE for the latest version

详情
AI中文摘要

文本到图像(T2I)模型能够生成高质量的艺术创作和视觉内容。然而,现有研究和评估标准主要关注图像真实性和浅层的文本-图像对齐,缺乏对文本到图像生成中复杂语义理解和世界知识整合的全面评估。为解决这一挑战,我们提出了WISE,这是首个专门用于World Knowledge-Informed Semantic Evaluation(世界知识引导的语义评估)的基准。WISE超越了简单的词-像素映射,通过1000个精心设计的提示,涵盖文化常识、时空推理和自然科学等25个子领域,对模型进行挑战。为了克服传统CLIP指标的局限性,我们引入了WiScore,一种用于评估知识-图像对齐的新型定量指标。通过对20个模型(10个专用T2I模型和10个统一多模态模型)在涵盖25个子领域的1000个结构化提示上进行全面测试,我们的发现揭示了它们在图像生成过程中有效整合和应用世界知识的能力存在显著局限,为下一代T2I模型增强知识整合与应用指明了关键路径。代码和数据可在https://github.com/PKU-YuanGroup/WISE获取。

英文摘要

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.

2508.21448 2026-06-03 cs.CL 版本更新

When Models Refuse: Political Steerability and Feature Richness as Measures of Ideological Depth

当模型拒绝时:政治可操控性与特征丰富度作为意识形态深度的度量

Shariar Kabir

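Affiliations * Bangladesh University of Engineering and Technology

AI summary Introduces the concept of ideological depth and uses steerability and feature richness, measured with sparse autoencoders, to study whether large language models' refusals of benign instructions stem from a capability deficit rather than from safety rules.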

详情
AI中文摘要

大型语言模型有时会拒绝遵循良性指令,例如拒绝论证某种政治立场或采用指定角色,这种拒绝通常被视为安全护栏在起作用。我们探究这些拒绝是否可能表明能力缺陷:模型缺乏从指令角度进行推理所需的内部表示。为此,我们引入意识形态深度,该属性包含两个组成部分:(i) 模型遵循政治指令而不失败的能力(可操控性),以及 (ii) 其内部政治表示的特征丰富度,通过稀疏自编码器测量。使用两个广泛使用的开源权重LLM作为候选,我们比较基于提示和激活操控的干预措施,并用公开可用的SAE探测政治特征。我们发现巨大且系统性的差异:在两个意识形态方向上更具可操控性的模型激活了约7.3倍更多的独特政治特征,而另一个模型则以增加拒绝作为回应。从前者模型中因果性地消融一小部分目标政治特征,重现了相同的特征贫乏行为并导致拒绝增加。综合这些结果表明,对良性提示的拒绝可能源于能力缺陷而非固定的安全规则,并且意识形态深度是LLM的一个可测量属性,有助于预测模型何时会拒绝。

英文摘要

Large language models (LLMs) sometimes refuse to follow benign instructions, such as declining to argue a political position or adopt a stated persona, and such refusals are commonly read as safety guardrails at work. We ask whether they can instead signal a capability deficit: a shortage of the internal representations a model needs to reason from the instructed perspective. To investigate, we introduce ideological depth, a property with two components: (i) a model's ability to follow political instructions without failure (steerability), and (ii) the feature richness of its internal political representations, measured with sparse autoencoders (SAEs). Using two widely used open-weight LLMs as candidates, we compare interventions based on prompts and activation steering, and probe political features with publicly available SAEs. We find large, systematic differences: a model that is more steerable in both ideological directions activates ~7.3x more distinct political features, while the other model instead responds with increased refusals. Causally ablating a small, targeted set of political features from the former model reproduces the same feature-poor behavior and drives up refusals. Together, these results indicate that refusals on benign prompts can arise from capability deficits rather than fixed safety rules, and that ideological depth is a measurable property of LLMs that helps predict when a model will refuse.

2511.08247 2026-06-03 cs.CL cs.CY 版本更新

ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech

ParliaBench: 面向大语言模型生成的议会演讲的评估与基准框架

Marios Koniaris, Argyro Tsipi, Panayiotis Tsanakas

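Affiliations * University of Cambridge

AI summary Proposes ParliaBench, a benchmark framework that systematically evaluates the linguistic quality, semantic coherence, and political authenticity of LLM-generated parliamentary speech through a UK Parliament dataset, an evaluation method combining computational metrics with LLM-as-a-judge assessments, and two new embedding-based metrics (Political Spectrum Alignment and Party Alignment); experiments show that fine-tuning significantly improves most metrics and that the new metrics discriminate strongly along political dimensions.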

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 4797-4818, European Language Resources Association (ELRA), Palma, Mallorca, Spain, May 2026
AI中文摘要

议会演讲生成对大型语言模型提出了超越标准文本生成任务的特定挑战。与通用文本生成不同,议会演讲不仅需要语言质量,还需要政治真实性和意识形态一致性。当前语言模型缺乏针对议会上下文的专门训练,现有评估方法侧重于标准NLP指标而非政治真实性。为此,我们提出了ParliaBench,一个用于议会演讲生成的基准。我们构建了一个来自英国议会的演讲数据集,以实现系统性的模型训练。我们引入了一个评估框架,将计算指标与LLM-as-a-judge评估相结合,用于衡量生成质量在三个维度上的表现:语言质量、语义连贯性和政治真实性。我们提出了两种新颖的基于嵌入的指标——政治光谱对齐和政党对齐——以量化意识形态定位。我们微调了五个大型语言模型(LLM),生成了28k篇演讲,并使用我们的框架对其进行了评估,比较了基线和微调模型。结果表明,微调在大多数指标上产生了统计显著的改进,并且我们的新颖指标对政治维度表现出强大的区分能力。

英文摘要

Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics and our novel metrics demonstrate strong discriminative power for political dimensions.

2511.02304 2026-06-03 cs.MA cs.AI cs.CL cs.FL cs.LG 版本更新

Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

自动机条件化协作多智能体强化学习

Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia

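Affiliations * Massachusetts Institute of Technology; Stanford University

AI summary Proposes a framework for automata-conditioned cooperative multi-agent reinforcement learning that uses automata to decompose team objectives into sub-tasks and learns task-conditioned decentralized policies, enabling optimal task assignment and multi-step coordination.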

详情
AI中文摘要

我们研究在集中训练、分散执行下,针对协作性时间目标的多任务、多智能体策略学习。在此设置中,使用自动机表示分配给智能体的任务,能够将团队级目标分解为更简单、更小的子任务。然而,现有方法样本效率低下,且局限于单任务情况,需要为每个新任务重新训练策略。在这项工作中,我们提出了自动机条件化协作多智能体强化学习(ACC-MARL),一个学习任务条件化分散团队策略的框架。我们识别了ACC-MARL可行性的挑战,提出了解决方案,并证明了我们的方法是最优的。我们进一步展示了学习到的价值函数可用于在测试时最优地分配任务。实验表明,智能体之间涌现出任务感知的多步协调,例如按下按钮开门、扶住门以及短路任务。

英文摘要

We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks assigned to agents enables breaking down a team-level objective into simpler, smaller sub-tasks. However, existing approaches remain sample-inefficient and are limited to the single-task case, requiring retraining policies for each new task. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify challenges to the feasibility of ACC-MARL, propose solutions, and prove that our approach is optimal. We further show that learned value functions can be used to assign tasks optimally at test time. Experiments demonstrate emergent task-aware, multi-step coordination among agents, such as pressing a button to unlock a door, holding the door, and short-circuiting tasks.

2510.16282 2026-06-03 cs.CL 版本更新

Instant Personalized Large Language Model Adaptation via Hypernetwork

通过超网络实现即时个性化大型语言模型自适应

Zhaoxuan Tan, Zixuan Zhang, Haoyang Wen, Zheng Li, Rongzhi Zhang, Pei Chen, Fengran Mo, Zheyuan Liu, Qingkai Zeng, Qingyu Yin, Meng Jiang

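Affiliations * University of Notre Dame; Amazon.com Inc; Université de Montréal

AI summary Proposes Profile-to-PEFT, a framework in which a hypernetwork maps a user's encoded profile directly to adapter parameters, enabling instant personalization without per-user training and outperforming existing methods at lower computational cost.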

Comments accepted to ACL 2026

详情
AI中文摘要

个性化大型语言模型(LLM)利用用户档案或历史记录来定制符合个人偏好的内容。然而,现有的参数高效微调(PEFT)方法,例如“每用户一个PEFT”(OPPU)范式,需要为每个用户训练单独的适配器,这使得它们在计算上昂贵且不适用于实时更新。我们引入了Profile-to-PEFT,一个可扩展的框架,它采用端到端训练的超网络,将用户编码档案直接映射到一组完整的适配器参数(例如LoRA),从而消除了部署时的每用户训练。这种设计实现了即时自适应、对未见用户的泛化以及保护隐私的本地部署。实验结果表明,我们的方法在部署时使用显著更少的计算资源,同时优于基于提示的个性化和OPPU。该框架对分布外用户表现出强大的泛化能力,并在不同的用户活动水平和不同的嵌入骨干下保持鲁棒性。所提出的Profile-to-PEFT框架实现了高效、可扩展且自适应的LLM个性化,适用于大规模应用。

英文摘要

Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the "One-PEFT-Per-User" (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user's encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.
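The core idea, a hypernetwork that emits adapter parameters from a profile embedding, can be illustrated with a toy. The sketch below assumes a single linear hypernetwork and one LoRA pair; `ToyHyperNetwork`, `lora_forward`, and every dimension are hypothetical choices for this example and do not describe the paper's architecture or training.

```python
import random

def linear(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

class ToyHyperNetwork:
    """Maps a profile embedding to LoRA factors A (rank x d) and B (d x rank)."""

    def __init__(self, profile_dim, d_model, rank, seed=0):
        rng = random.Random(seed)
        self.d_model, self.rank = d_model, rank
        n_out = 2 * d_model * rank  # entries of A followed by entries of B
        self.proj = [[rng.gauss(0, 0.1) for _ in range(profile_dim)]
                     for _ in range(n_out)]

    def __call__(self, profile):
        flat = linear(self.proj, profile)
        split = self.rank * self.d_model
        A = [flat[i * self.d_model:(i + 1) * self.d_model]
             for i in range(self.rank)]
        B = [flat[split + i * self.rank:split + (i + 1) * self.rank]
             for i in range(self.d_model)]
        return A, B

def lora_forward(W, A, B, x, scale=1.0):
    """Compute (W + scale * B A) x without materializing B A."""
    base = linear(W, x)
    low_rank = linear(B, linear(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]

hyper = ToyHyperNetwork(profile_dim=8, d_model=4, rank=2)
A, B = hyper([0.5] * 8)  # hypothetical encoded user profile
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = lora_forward(W, A, B, [1.0, 2.0, 3.0, 4.0])
```

Producing the adapter is a single forward pass through the hypernetwork, which is why no per-user gradient steps are needed at deployment in this kind of design.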

2510.09711 2026-06-03 cs.CL cs.AI 版本更新

ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models

ReaLM:残差量化桥接知识图谱嵌入与大型语言模型

Wenbin Guo, Xin Wang, Jiaoyan Chen, Lingbing Guo, Zhao Li, Zirui Chen

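Affiliations * Tianjin University; The University of Manchester

AI summary Proposes ReaLM, a framework that uses residual vector quantization to discretize knowledge graph embeddings into learnable tokens added to the LLM vocabulary and combines them with ontology-guided constraints to align structured knowledge with language models, achieving state-of-the-art performance on knowledge graph completion.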

详情
AI中文摘要

大型语言模型(LLM)最近成为知识图谱补全(KGC)的强大范式,提供了超越传统基于嵌入方法的强大推理和泛化能力。然而,现有的基于LLM的方法通常难以充分利用结构化语义表示,因为预训练KG模型的连续嵌入空间与LLM的离散标记空间根本不对齐。这种差异阻碍了有效的语义转移并限制了它们的性能。为了解决这一挑战,我们提出了ReaLM,一种新颖且有效的框架,通过残差向量量化的机制弥合了KG嵌入和LLM标记化之间的差距。ReaLM将预训练的KG嵌入离散化为紧凑的代码序列,并将它们作为可学习标记集成到LLM词汇表中,从而实现符号知识和上下文知识的无缝融合。此外,我们引入了本体引导的类约束以强制语义一致性,基于类级别的兼容性细化实体预测。在两个广泛使用的基准数据集上进行的大量实验表明,ReaLM实现了最先进的性能,证实了其在将结构化知识与大规模语言模型对齐方面的有效性。

英文摘要

Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
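Residual vector quantization, the mechanism named above, turns a continuous vector into a short sequence of codebook indices by quantizing the leftover residual at each level. The sketch below uses two tiny hand-written codebooks and a hypothetical token naming scheme (`<kg_level_index>`); in practice the codebooks are learned, and none of these values come from the paper.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Quantize vec level by level, passing the residual to the next level."""
    residual, codes = list(vec), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],      # coarse level
    [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],  # fine level
]
codes, residual = rvq_encode([0.9, 0.3], codebooks)
# Hypothetical names for the discrete tokens added to an LLM vocabulary.
tokens = [f"<kg_{level}_{idx}>" for level, idx in enumerate(codes)]
reconstruction = rvq_decode(codes, codebooks)
```

For the embedding `[0.9, 0.3]` the codes are `[0, 1]` and the reconstruction is `[1.0, 0.25]`; each extra level shrinks the residual, so the code sequence length trades off against fidelity.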

2510.08977 2026-06-03 cs.LG cs.CL 版本更新

Breaking the Self-Confirming Loop: Diagnosing and Mitigating Systemic Reward Bias in Self-Rewarding RL

打破自我确认循环:诊断与缓解自奖励强化学习中的系统性奖励偏差

Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

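Affiliations * University of Science and Technology of China

AI summary By quantifying feedback-loop bias and proposing reinforcement learning with ensembled rewards (RLER), the paper diagnoses and mitigates the systemic reward bias that confidence coupling causes in self-rewarding reinforcement learning, improving performance and stability.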

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)高效扩展了大语言模型(LLMs)的推理能力,但受限于稀缺的标注数据。基于内在奖励的强化学习(RLIR)通过自奖励提供了一种可扩展的替代方案,但常面临不稳定和性能较差的问题。我们将这一差距归因于置信度耦合的自奖励中的系统性偏差:模型倾向于过度奖励高置信度的错误,形成自我确认循环。我们通过三个指标量化这种反馈回路偏差:奖励噪声幅度(rho_noise)、策略-奖励耦合(rho_selfbias)和过度/不足奖励偏斜(rho_symbias)。我们的分析显示了一种复合效应,其中强耦合放大了置信度条件误差,并导致向过度奖励的漂移,从而引发不稳定和较低的性能上限。为缓解这一问题,我们提出集成奖励强化学习(RLER),该方法通过自适应奖励插值和分歧感知的轨迹选择聚合多样化的模型,以减少耦合并抑制过度奖励漂移。大量实验表明,RLER相比最佳RLIR基线提升了6.2%,且与RLVR的差距在3.6%以内,同时在未标注样本上表现出稳定的扩展性。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) efficiently scales the reasoning ability of large language models (LLMs) but is bottlenecked by scarce labeled data. Reinforcement learning with intrinsic rewards (RLIR) offers a scalable alternative via self-rewarding, yet often suffers from instability and inferior performance. We trace this gap to a systemic bias in confidence-coupled self-rewarding: the model tends to over-reward high-confidence mistakes, forming a self-confirming loop. We quantify this feedback-loop bias with three metrics: reward noise magnitude (rho_noise), policy-reward coupling (rho_selfbias), and over-/under-reward skew (rho_symbias). Our analyses show a compounding effect where strong coupling amplifies confidence-conditioned errors and drives a drift toward over-reward, leading to instability and a lower performance ceiling. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models with adaptive reward interpolation and disagreement-aware rollout selection to reduce coupling and suppress over-reward drift. Extensive experiments show that RLER improves by 6.2% over the best RLIR baseline and is within 3.6% of RLVR, while exhibiting stable scaling on unlabeled samples.

2509.22854 2026-06-03 cs.CL 版本更新

Train Once, Reuse Everywhere: Generalizable Implicit In-Context Learning by Routing Attention

一次训练,随处重用:通过路由注意力实现可泛化的隐式上下文学习

Jiaqian Li, Yanshu Li, Ligong Han, Ruixiang Tang, Wenya Wang

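Affiliations * University of Science and Technology of China

AI summary Proposes In-Context Routing (ICR), which captures generalizable in-context learning patterns at the attention-logit level and modulates attention logits with a learnable input-conditioned router, giving an efficient train-once-and-reuse framework that outperforms existing implicit ICL methods on 12 datasets and generalizes well.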

Comments ICML 2026 Camera-ready

详情
AI中文摘要

隐式上下文学习(ICL)作为一种新兴的有前景范式,在大语言模型(LLMs)的表示空间中模拟ICL行为,旨在以零样本成本获得少样本性能。然而,现有方法主要依赖于将偏移向量注入残差流,这些向量通常从标注示例或任务特定对齐中构建。这种设计未能充分利用ICL背后的结构机制,且泛化能力有限。为了解决这个问题,我们提出了In-Context Routing (ICR),一种新颖的隐式ICL方法,在注意力logits层面捕获和利用可泛化的ICL模式。它提取ICL过程中出现的可重用结构方向,并采用可学习的输入条件路由器相应地调制注意力logits,从而实现高效的一次训练多次重用框架。我们在涵盖不同领域和多个LLM的12个真实世界数据集上评估了ICR。结果表明,ICR一致优于需要任务特定检索或训练的现有隐式ICL方法,同时在它们难以处理的域外任务上展现出稳健的泛化能力。这些发现将ICR定位为推动ICL实际价值边界的方案。代码可在https://github.com/Lijiaqian1/In-Context-Routing.git获取。

英文摘要

Implicit in-context learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of large language models (LLMs), aiming to attain few-shot performance at zero-shot cost. However, existing approaches largely rely on injecting shift vectors into residual flows, which are typically constructed from labeled demonstrations or task-specific alignment. Such designs fall short of utilizing the structural mechanisms underlying ICL and suffer from limited generalizability. To address this, we propose In-Context Routing (ICR), a novel implicit ICL method that captures and utilizes generalizable ICL patterns at the attention logits level. It extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly, enabling an efficient train-once-and-reuse framework. We evaluate ICR on 12 real-world datasets spanning diverse domains and multiple LLMs. The results show that ICR consistently outperforms existing implicit ICL methods that require task-specific retrieval or training, while demonstrating robust generalization to out-of-domain tasks where they struggle. These findings position ICR to push the boundary of the practical value of ICL. The code is available at https://github.com/Lijiaqian1/In-Context-Routing.git.

2508.03098 2026-06-03 cs.CL 版本更新

Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

隐私感知解码:缓解检索增强生成中大语言模型的隐私泄露

Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu

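Affiliations * Emory University; Illinois Institute of Technology

AI summary Proposes PAD, a lightweight inference-time defense that injects calibrated Gaussian noise during decoding and combines confidence-based screening, sensitivity estimation, and context-aware noise calibration to balance privacy protection and generation quality in RAG systems, tracking cumulative privacy loss with Rényi differential privacy.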

详情
AI中文摘要

检索增强生成(RAG)通过将输出条件化于外部知识源来增强大语言模型(LLM)的事实准确性。然而,当检索涉及私人或敏感数据时,RAG系统容易受到提取攻击,从而通过生成的响应泄露机密信息。我们提出隐私感知解码(PAD),一种轻量级的推理时防御方法,在生成过程中自适应地将校准的高斯噪声注入到token logits中。PAD集成了基于置信度的筛选以选择性保护高风险token、高效的敏感度估计以最小化不必要的噪声,以及上下文感知的噪声校准以平衡隐私与生成质量。Rényi差分隐私(RDP)会计机制严格跟踪累积隐私损失,从而为敏感输出提供明确的每响应$(\varepsilon, δ)$-DP保证。与需要重新训练或语料库级过滤的先前方法不同,PAD是模型无关的,并且完全在解码时运行,计算开销极小。在三个真实世界数据集上的实验表明,PAD在保持响应实用性的同时显著减少了私有信息泄露,优于现有的基于检索和后处理的防御方法。我们的工作通过解码策略在缓解RAG中的隐私风险方面迈出了重要一步,为敏感领域中的通用和可扩展隐私解决方案铺平了道路。我们的代码可在https://github.com/wang2226/PAD获取。

英文摘要

Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A Rényi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available at https://github.com/wang2226/PAD.
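The decoding-time step described above can be pictured with a small toy: Gaussian noise is added to the logits only at steps that a screening rule flags. The sketch assumes, purely for illustration, that a high top-1 probability marks a step as high-risk; the threshold, the fixed `sigma`, and the function names are hypothetical. PAD's sensitivity estimation, noise calibration, and RDP accounting are not reproduced, so this code carries no privacy guarantee.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noisy_logits(logits, sigma, confidence_threshold, rng):
    """Add N(0, sigma^2) noise to every logit when the step is flagged."""
    if max(softmax(logits)) < confidence_threshold:
        return list(logits), False  # screening: step left untouched
    return [l + rng.gauss(0.0, sigma) for l in logits], True

rng = random.Random(0)
flat_step = [0.1, 0.0, -0.1]    # low confidence, not flagged
peaked_step = [8.0, 0.0, -1.0]  # high confidence, flagged and perturbed
flat_out, flat_noised = noisy_logits(flat_step, 1.0, 0.9, rng)
peaked_out, peaked_noised = noisy_logits(peaked_step, 1.0, 0.9, rng)
```

Skipping unflagged steps is what keeps the added noise, and therefore the loss in generation quality, confined to a subset of tokens.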

2507.19684 2026-06-03 cs.LG cs.AI cs.CL cs.CV 版本更新

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

CoMPAS3D: 一个用于交互动作的数据集和基准

Bermet Burkanova, Yasaman Etesam, Payam Jome Yazdian, Trinity Evans, Chuxuan Zhang, Zoe Stanley, Paige Tuttösí, Angelica Lim

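Affiliations * School of Computing Science, Simon Fraser University

AI summary Proposes the CoMPAS3D dataset and evaluation framework, which use objective metrics such as move legibility and proficiency appropriateness to address the lack of social-context evaluation in interactive motion generation.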

Comments https://rosielab.github.io/compas3d

详情
AI中文摘要

社交互动型人形机器人必须通过身体与人类互动,实时适应伙伴的动作、意图和能力。这需要模型不仅理解身体如何移动,还要理解在共享社交背景下动作的含义。然而,交互式动作生成的评估框架并未衡量生成的动作是否在共享动作词汇中可读,也不评估其是否适合伙伴的熟练水平。这一差距有两个原因:现有框架依赖运动学指标(如FID和节拍对齐),无法衡量上述特性;现有数据集缺乏动作标注和熟练度变化。萨尔萨舞作为评估领域很合适:即兴、双人、由动作词汇和评判标准(涵盖时机、音乐性、技巧、难度、配合和原创性)指导。我们提出CoMPAS3D,一个即兴双人萨尔萨舞的动作捕捉数据集,附带评估框架,涵盖运动学质量、两个客观指标(动作可读性和熟练度适当性)以及六个基于竞赛的主观维度。数据集包含18名舞者(涵盖初级、中级和高级水平)的3小时即兴表演,超过2800个专家标注片段,涵盖动作类型、错误和风格元素。我们定义了三个基准:动作分类(类似于转录)、熟练度估计(流利度评估)和跟随者生成(对话响应)。微调的视觉语言模型在应用于真实动作序列的客观指标上表现强劲。应用于Duolando和InterGen时,这些指标揭示了运动学指标遗漏的失败。人工评估确认了生成动作与真实动作之间的差距。CoMPAS3D、标注、基准代码和基线结果公开可用。

英文摘要

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.

2506.02018 2026-06-03 cs.CL 版本更新

Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data

增强释义类型生成:基于人工排序数据的DPO和RLHF评估影响

Christopher Lee Lübbers

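AI summary Using a human-ranked paraphrase-type dataset and Direct Preference Optimization (DPO) to align model outputs with human judgments, the study raises paraphrase-type generation accuracy by 3 percentage points and human preference ratings by 7 percentage points, and creates a new annotated dataset to support more rigorous evaluation.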

Comments 21 pages, 11 figures. Master's thesis, University of Goettingen, December 2024. Code: https://github.com/cluebbers/dpo-rlhf-paraphrase-types. Models: https://huggingface.co/collections/cluebbers/enhancing-paraphrase-type-generation-673ca8d75dfe2ce962a48ac0

详情
AI中文摘要

释义通过重新表达含义来增强文本简化、机器翻译和问答等应用。特定的释义类型有助于精确的语义分析和鲁棒的语言模型。然而,现有的释义类型生成方法由于依赖自动评估指标和有限的人工标注训练数据,常常与人类偏好不一致,掩盖了语义保真度和语言转换的关键方面。本研究通过利用人工排序的释义类型数据集,并整合直接偏好优化(DPO)使模型输出直接与人类判断对齐,填补了这一空白。基于DPO的训练将释义类型生成准确率比监督基线提高了3个百分点,并将人类偏好评分提高了7个百分点。新创建的人工标注数据集支持更严格的未来评估。此外,一个释义类型检测模型在增删、同极性替换和标点变化上的F1分数分别达到0.91、0.78和0.70。这些发现表明,偏好数据和DPO训练能产生更可靠、语义更准确的释义,从而改进摘要生成和更鲁棒的问答等下游应用。PTD模型超越了自动评估指标,为评估释义质量提供了更可靠的框架,推动释义类型研究向更丰富、与用户对齐的语言生成发展,并为基于人类中心标准的未来评估奠定了更坚实的基础。

英文摘要

Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and linguistic transformations. This study addresses this gap by leveraging a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training increases paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluations. Additionally, a paraphrase-type detection model achieves F1 scores of 0.91 for addition/deletion, 0.78 for same polarity substitution, and 0.70 for punctuation changes. These findings demonstrate that preference data and DPO training produce more reliable, semantically accurate paraphrases, enabling downstream applications such as improved summarization and more robust question-answering. The PTD model surpasses automated metrics and provides a more reliable framework for evaluating paraphrase quality, advancing paraphrase-type research toward richer, user-aligned language generation and establishing a stronger foundation for future evaluations grounded in human-centric criteria.
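For context on the training objective, the sketch below writes out the standard per-pair DPO loss from sequence log-probabilities under the policy and a frozen reference model. It is a generic illustration, not the thesis code; the argument names and the default `beta=0.1` are assumptions of this example.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Hypothetical log-probabilities for one human-ranked paraphrase pair.
loss_at_start = dpo_loss(-2.0, -2.0, -2.0, -2.0)     # policy equals reference
loss_after_shift = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy favors the chosen text
```

The loss starts at log 2 when the policy matches the reference and falls as the policy assigns relatively more probability to the human-preferred paraphrase.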

2310.10322 2026-06-03 cs.CL Version update

Evaluating the Reversal Curse in Model Editing

Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Quan Liu, Cong Liu, Jia-Chen Gu

Affiliations * National Engineering Research Center of Speech and Language Information Processing; University of Science and Technology of China; iFLYTEK Research; University of California, Los Angeles

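AI summary This paper studies bidirectional language model editing, proposes a reverse-generalization metric, and builds the BAKE benchmark. It finds that most editing methods show systematic deficiencies under reverse-direction evaluation, and it analyzes the causes of the reversal curse and strategies for mitigating it.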

Comments Accepted by TMLR

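AI Chinese abstract (translated into English)

Large language models (LLMs) are prone to hallucinating unintended text because of false or outdated knowledge. Because retraining LLMs is resource intensive, model editing has drawn growing attention. Despite the emergence of benchmarks and methods, existing unidirectional editing and evaluation paradigms have not explored the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation of whether edited LLMs can recall the edited knowledge in both directions. We introduce a reverse-generalization metric and construct a benchmark named BAKE (Bidirectional Assessment for Knowledge Editing) to evaluate whether an edited model can recall the edited knowledge in the reverse direction of the edit. We run extensive experiments with a range of editing methods and LLMs. The results show that although most editing methods accurately recall edited facts along the direction of modification, they exhibit substantial systematic deficiencies when evaluated in the reverse direction. To investigate the root causes of the reversal curse and explore potential mitigation strategies, we carry out a detailed analysis from three perspectives. Our findings show that although in-context learning (ICL) can mitigate the reversal curse to some extent, it lacks continuity, is limited by input length, and may introduce hallucinations. Combining the strengths of ICL and other editing methods is therefore a promising direction for developing new editing paradigms.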

English abstract

Large language models (LLMs) are prone to hallucinate unintended text due to false or outdated knowledge. Since retraining LLMs is resource intensive, there has been a growing interest in model editing. Despite the emergence of benchmarks and approaches, existing unidirectional editing and evaluation paradigms have failed to explore the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation to assess if edited LLMs can recall the editing knowledge bidirectionally. A metric of reverse generalization is introduced and a benchmark dubbed Bidirectional Assessment for Knowledge Editing (BAKE) is constructed to evaluate if post-edited models can recall the edited knowledge in the reverse direction of editing. We conduct extensive experiments using a variety of editing methods and LLMs. The results show that while most editing methods are able to accurately recall editing facts along the modification direction, they exhibit substantial systematic deficiencies when evaluated in the reverse direction. To further investigate the underlying causes of the reversal curse and to explore potential strategies for mitigation, a detailed analysis is conducted from three perspectives. Our findings reveal that although In-Context Learning (ICL) can mitigate the reversal curse to a certain extent, it lacks continuity, is limited by the input length, and may introduce hallucinations. Therefore, combining the advantages of ICL and other editing methods is a promising direction for developing new editing paradigms.

2303.15619 2026-06-03 cs.CL cs.AI Version update

Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models

Muhammed Shahir Abdurrahman, Hashem Elezabi, Bruce Changlong Xu

Affiliations * Department of Computer Science, Stanford University

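AI summary This paper proposes Typhoon, an adaptive masking strategy based on task-loss gradients, compares it with random masking and whole-word masking on GLUE tasks, and finds no significant advantage under rigorous evaluation.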

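AI Chinese abstract (translated into English)

In masked language modeling (MLM), the choice of which tokens to mask is a central but under-studied design decision. Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance. We treat masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate, online, how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate matches a target budget under a token-independence approximation. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks (MRPC and CoLA), across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) and five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is accounted for, no masking strategy reliably outperforms the others on these tasks: on MRPC, the gap between Typhoon and the best baseline stays within 0.004 F1, none of the twelve paired Typhoon comparisons reaches significance, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We regard this as a cautionary, reproducibility-focused result: gradient-based task-adaptive masking is competitive but, at this scale, not clearly better than resource-free random masking. We also describe a clean modern reimplementation to support follow-up work.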

English abstract

The choice of which tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance. We study masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate, online, how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate matches a target budget, under a token-independence approximation. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks, MRPC and CoLA, across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) and five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is accounted for, no masking strategy is reliably better than the others on these tasks: on MRPC the gap between Typhoon and the best baseline stays within 0.004 F1, across all twelve Typhoon comparisons no paired test reaches significance, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We read this as a cautionary, reproducibility-focused result: gradient-based task-adaptive masking is competitive but not clearly better than resource-free random masking at this scale. We also describe a clean modern reimplementation to support follow-up work.
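
The two bookkeeping steps described in this abstract, an exponential moving average of per-token-type saliency and a calibration of those scores to a target masking budget, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `decay` value, and the specific calibration rule `p_i = min(1, c * s_i)` (with `c` found by bisection so that the mean probability equals the budget) are all hypothetical choices, and the saliency values are taken as given rather than computed from gradients.

```python
def ema_update(saliency, observed, decay=0.9):
    """Fold newly observed per-token-type saliency into a running average.

    `saliency` maps token type -> running score; `observed` holds this step's
    scores (assumed to be non-negative gradient magnitudes). A token type seen
    for the first time starts at its observed value.
    """
    for token_type, value in observed.items():
        previous = saliency.get(token_type, value)
        saliency[token_type] = decay * previous + (1.0 - decay) * value
    return saliency


def calibrate(scores, budget, iterations=60):
    """Turn per-position scores into masking probabilities with mean `budget`.

    Assumed rule: p_i = min(1, c * s_i), with the scale c found by bisection.
    Treating positions as independent, the expected masking rate is the mean p_i.
    """
    if not scores or not 0.0 < budget < 1.0:
        raise ValueError("need at least one score and a budget in (0, 1)")
    # Floor the scores so every position can still be masked; this also makes
    # an all-zero input fall back to uniform masking at the budget rate.
    floored = [max(score, 1e-12) for score in scores]

    def mean_rate(scale):
        return sum(min(1.0, scale * s) for s in floored) / len(floored)

    low, high = 0.0, 1.0
    while mean_rate(high) < budget:  # widen the bracket until it covers the budget
        high *= 2.0
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if mean_rate(mid) < budget:
            low = mid
        else:
            high = mid
    return [min(1.0, high * s) for s in floored]


# Hypothetical running saliency for the four token positions of one input.
running = ema_update({}, {"the": 0.1, "bank": 0.4, "river": 0.5, "flooded": 1.0})
probabilities = calibrate(list(running.values()), budget=0.15)
print(probabilities)
```

Clipping at 1 keeps each value a valid probability, and re-solving for the scale after clipping keeps the expected masking rate on budget even when a few token types dominate the saliency scores.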