arXivDaily arXiv每日学术速递 周一至周五更新
重置
2606.02569 2026-06-02 cs.CV cs.AI cs.CL 版本更新

AdaCodec: A Predictive Visual Code for Video MLLMs

AdaCodec: 面向视频多模态大语言模型的预测性视觉编码

Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si, Chenglin Li, Shuai Dong, Kele Shao, Ruilin Li, Dianyi Wang, Nan Duan, Jiaqi Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) JD.com(京东公司)

AI总结 针对视频帧间冗余问题,提出预测性视觉编码AdaCodec,通过条件预测代价决定是否发送完整参考帧或紧凑P-令牌,在匹配视觉令牌预算下提升性能,并大幅降低首令牌延迟。

Comments 23 pages

详情
AI中文摘要

视频在时间上是冗余的:相邻帧通常共享大部分物体、背景和布局。然而,现有的视频多模态大语言模型(视频MLLMs)通常将每个采样帧编码为独立的RGB图像,导致视觉令牌重复先前帧中已有的内容。这提示了一种更直接的视频接口:仅当场景无法从先前上下文中良好预测时,才发送完整的参考帧;否则,传输帧间变化的紧凑描述。我们将这种接口称为\emph{预测性视觉编码},并针对视频MLLMs实例化为 extbf{AdaCodec}。AdaCodec仅在条件预测代价高时,为参考帧花费完整的视觉令牌;否则,它将帧间变化(包括运动和预测残差)编码为紧凑的P-令牌。在所有11个基准测试中,在匹配视觉令牌预算下,AdaCodec相比基于Qwen3-VL-8B的逐帧RGB基线有所改进。即使在1/7的预算下,使用32k令牌的AdaCodec在所有长视频基准测试中超越了224k基线;在五个通用视频基准测试中,它提高了平均得分,同时将首令牌时间从9.26秒大幅缩短至1.62秒。

英文摘要

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a \emph{predictive visual code}, and instantiate it for video MLLMs as \textbf{AdaCodec}. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at $1/7$ the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.

2606.02568 2026-06-02 cs.AI cs.CL cs.ET cs.MA 版本更新

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

ClinEnv:面向智能体的交互式多阶段长时程电子健康记录环境

Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo, Xukai Zhao, Jinzhuo Wang, May Dongmei Wang

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Peking University(北京大学) University of Texas Southwestern Medical Center(德克萨斯西南医学中心) Tsinghua University(清华大学)

AI总结 提出ClinEnv,一个基于真实住院患者数据的交互式基准,通过多阶段决策序列评估大语言模型在不确定性下逐步收集信息并做出不可逆决策的能力,发现模型决策质量与过程质量严重脱节。

Comments 20 pages, 6 figures, 12 tables

详情
AI中文摘要

临床实践并非从枚举选项中选择答案:医生会逐步收集异质信息,并在不确定性下做出顺序的、不可逆的决策。静态基准无法探测,而现有的交互式医学基准各自至少在一个方面有所妥协。我们提出ClinEnv,一个交互式基准,在称为纵向住院模拟的范式下,将大语言模型评估为真实住院患者的主治医生。每个病例自动构建为有序的决策阶段序列;在每个阶段,模型必须主动查询四个专门的智能体,然后才能提交药物、程序和诊断。ClinEnv通过确定性本体匹配对模型的决策内容进行评分,同时也对其信息收集过程进行评分。在七个模型中,最强的模型仅达到0.31的决策F1分数,且结果质量与过程质量严重脱节。困难集中在管理决策和后期阶段,模型恢复出院诊断的可靠性远高于管理行动(F1分别为0.51 vs 0.17),并且随着病例进展继续发出冗余查询。ClinEnv使这种信息获取差距(仅通过结果评估无法察觉)变得可直接测量。

英文摘要

Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe and existing interactive medical benchmarks each compromise on at least one of them. We present ClinEnv, an interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions under a paradigm we term Longitudinal Inpatient Simulation. Each case is automatically constructed into an ordered sequence of decision stages; at every stage the model must actively query four specialized agents before committing to medications, procedures, and diagnoses. ClinEnv scores both what the model decides, through deterministic ontology-grounded matching, and how it gathers information. Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages, where models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1) and continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable.

2606.02559 2026-06-02 cs.CL cs.AI 版本更新

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

从层到子模块:重新思考基于替换的LLM压缩中的粒度

Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca

发表机构 * University of Trento(特伦托大学)

AI总结 提出子模块级别的非连续替换压缩方法SubFit,通过为注意力和前馈子模块分别设计轻量残差旁路,在多种LLM上实现更好的困惑度-准确率权衡。

详情
AI中文摘要

大型语言模型(LLM)的后训练压缩会移除整个架构组件,要么删除它们,要么用拟合模块替换它们。现有的基于替换的方法共享两个设计约束:全层粒度和连续选择。我们认为这过于严格:事实上,预训练Transformer中的冗余并不局限于连续区域,也不均匀分布在注意力和前馈输出之间,这意味着不同的策略最适合近似不同的子模块类型,并且可移除的组件不需要聚集在连续的深度范围内。基于这一直觉,我们引入了SubFit(子模块级拟合残差替换),它在子模块级别压缩LLM:注意力和前馈子模块被非连续地选择,并且每个子模块都获得自己的轻量级拟合残差旁路。SubFit在训练后运行,仅需要校准数据。在十个LLM(五个基础模型,五个指令微调模型)、五个从12.5%到37.5%的稀疏度水平以及四个基于替换的基线上,SubFit在评估的稀疏度水平上实现了最佳的聚合困惑度-准确率权衡,在激进压缩下获得更大收益。在25%稀疏度下,它保留了84.6%的密集下游准确率,困惑度退化2.42倍,而最强基线分别为81.6%和4.34倍,同时实现了可测量的推理加速和KV缓存节省。代码可在https://github.com/eliacunegatti/SubFit获取。

英文摘要

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.

2606.02556 2026-06-02 cs.CL 版本更新

HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

英雄之旅:用文本游戏测试复杂规则归纳

Anshun Asher Zheng, Kanishka Misra, David I. Beaver, Junyi Jessy Li

发表机构 * Department of Linguistics(语言学系) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出HERO'S JOURNEY基准,通过目标导向的文本游戏评估大型语言模型在属性与程序归纳任务中的规则推理能力,发现模型虽能进行规则归纳但能力有限且不均衡,程序执行成为瓶颈,而表面语义影响较小。

Comments 24 pages

详情
AI中文摘要

我们引入了HERO'S JOURNEY,一个用于目标导向情节任务中规则归纳的基准,其中智能体必须从演示中推断隐藏规则,并通过多步执行对其采取行动。HERO'S JOURNEY涵盖了属性和程序归纳家族的八个任务,每个任务有四种结构规则形式、可控的词汇基础和可识别性条件。评估最先进的LLM,我们发现模型显示出规则归纳的证据,但能力有限且在不同任务上不均衡。同时,过程执行为模型增加了执行瓶颈,而表面语义影响最小。归纳特定的引导方法在属性任务上提高了性能,但在程序任务上没有可靠的增益,表明程序归纳的差距仍然是一个开放的挑战。

英文摘要

We introduce HERO'S JOURNEY, a benchmark for rule induction in goal-directed episodic tasks, where agents must infer hidden rules from demonstrations and act on them through multi-step execution. HERO'S JOURNEY covers eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions. Evaluating state-of-the-art LLMs, we find that models show evidence of rule induction, but the ability is limited and uneven across tasks. Meanwhile, process execution adds an execution bottleneck for models, whereas surface semantics has minimal effect. Induction-specific steering methods improve performance on attribute tasks but show no reliable gains on procedural tasks, suggesting the gap in procedural induction remains an open challenge.

2606.02548 2026-06-02 cs.CL 版本更新

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

SN-WER:用于多脚本印度语ASR评估的脚本归一化词错误率

Priyaranjan Pattnayak

发表机构 * Oracle America Inc.(Oracle美国公司)

AI总结 提出SN-WER指标,通过将参考和假设文本音译为规范脚本后计算WER,解决多脚本场景下WER高估错误的问题,在印度语上评估显示可减少高达12%的模型差距。

Comments Accepted to ACL 2026 MeLLM

详情
AI中文摘要

词错误率(WER)是自动语音识别(ASR)的主要指标,但当参考文本和假设文本以不同脚本编码相同单词时,WER可能高估错误。在多语言设置中,ASR模型可能输出罗马化文本,这一问题很常见。我们提出脚本归一化WER(SN-WER),一种无需训练、仅用于评估的评分方法,在计算WER之前将参考文本和假设文本音译为特定语言的规范脚本。我们在5种印度语言、2个数据集和3个ASR模型上评估了SN-WER。在精心整理的FLEURS数据上,SN-WER将膨胀的模型差距减少了高达12%,而在噪声较大的Common Voice数据上,减少幅度较小或不一致,表明存在真正的识别弱点而不仅仅是脚本不匹配。受控压力测试显示,人为罗马化引起的WER膨胀衰减了67%,而词汇替换控制显示对语义错误的敏感性几乎相同,Delta SN-WER / Delta WER约为1.09。SN-WER对音译器选择、归一化变化具有鲁棒性,并且在评估的印度语设置中,令牌碰撞率低于0.1%。我们认为,SN-WER应作为WER和CER的伴随指标报告,用于脚本不敏感的ASR评估,特别是当转录文本用于下游搜索、索引或多语言LLM流水线时。

英文摘要

Word Error Rate (WER) is the dominant metric for automatic speech recognition (ASR), but it can overestimate errors when references and hypotheses encode the same words in different scripts. This issue is common in multilingual settings where ASR models may emit romanized text. We propose Script-Normalized WER (SN-WER), a training-free, evaluation-only scoring method that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. We evaluate SN-WER on 5 Indic languages, 2 datasets, and 3 ASR models. On curated FLEURS data, SN-WER reduces inflated model gaps by up to 12%, while on noisier Common Voice data the reductions are smaller or inconsistent, indicating genuine recognition weaknesses rather than only script mismatch. Controlled stress tests show a 67% attenuation of artificial romanization-induced WER inflation, while lexical-substitution controls show near-identical sensitivity to semantic errors, with Delta SN-WER / Delta WER approximately 1.09. SN-WER is robust to transliterator choice, normalization changes, and shows low token-collision rates below 0.1% in the evaluated Indic setting. We argue that SN-WER should be reported alongside WER and CER as a companion metric for script-insensitive ASR evaluation, especially when transcripts feed downstream search, indexing, or multilingual LLM pipelines.

2606.02545 2026-06-02 cs.CL 版本更新

Transferable Self-Harm Surveillance from Emergency Department Triage Notes Using an Evidence-Augmented Machine Learning Approach

基于证据增强机器学习方法的急诊科分诊笔记可迁移自伤监测

Liuliu Chen, Gowri Rajaram, Eleanor Bailey, Katrina Witt, Michelle Lamblin, Jo Robinson, Mike Conway, Vlada Rozova

发表机构 * School of Computing and Information Systems, University of Melbourne(墨尔本大学计算与信息系) Orygen Centre for Youth Mental Health, University of Melbourne(墨尔本大学青年心理健康中心) Centre for Digital Transformation of Health, University of Melbourne(墨尔本大学医疗数字化转型中心)

AI总结 本研究提出一种结合大语言模型筛选与证据提取的三阶段机器学习方法,从急诊科分诊笔记中检测自伤行为,并在三家澳大利亚医院验证了其高可迁移性和细粒度监测能力。

详情
AI中文摘要

自伤是一个主要的公共卫生问题,但当前依赖于医院就诊的监测因诊断代码敏感性低而不足。急诊科(ED)分诊笔记在初次接触时记录,提供了就诊的简洁摘要,并提供了识别自伤的机会。我们开发了一种三阶段方法,将传统机器学习与大语言模型筛选和证据提取相结合,以检测ED分诊笔记中的自伤行为。我们评估了模型在三个澳大利亚医院的可迁移性。我们的方法在内部和外部验证中分别显示了0.887 ± 0.016和0.884 ± 0.012的AUPRC。在前瞻性评估中,它在开发站点达到了0.881 ± 0.008的AUPRC,在两个外部站点无需特定站点重新训练的情况下分别达到了0.879 ± 0.012和0.816 ± 0.015。该方法的一个关键优势是能够以95%的准确率识别主要的自伤方式,支持超越二元分类的更细粒度监测。

英文摘要

Self-harm is a major public health concern, but current surveillance relying on hospital presentations is inadequate due to the low sensitivity of diagnostic codes. Emergency Department (ED) triage notes, recorded at the initial point of contact, provide a succinct summary of presentations and an opportunity to identify self-harm. We developed a three-stage approach, augmenting traditional machine learning with large language model-based screening and evidence extraction to detect self-harm in ED triage notes. We assessed model transferability across three Australian hospitals. Our approach showed AUPRCs of 0.887 +/- 0.016 and 0.884 +/- 0.012 during internal and external validation. Prospectively, it achieved AUPRC of 0.881 +/- 0.008 at the development site, and 0.879 +/- 0.012 and 0.816 +/- 0.015 at two external sites without site-specific retraining. A key advantage of the approach is that it enables identification of the primary self-harm method with an accuracy of 95%, supporting more granular surveillance beyond binary classification.

2606.02544 2026-06-02 cs.CL cs.AI 版本更新

SimSD: Simple Speculative Decoding in Diffusion Language Models

SimSD:扩散语言模型中的简单推测解码

Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo, Jinya Jiang, Haoru Li, Chaojie Ren, Yiming Huang, Kaijie Zhu, Zhongkai Yu, Kun Zhou, Jingbo Shang

发表机构 * University of California San Diego(加州大学圣地亚哥分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Google(谷歌) University of California Santa Barbara(加州大学圣芭芭拉分校)

AI总结 针对扩散语言模型无法直接使用标准推测解码的问题,提出SimSD算法,通过即插即用的掩码策略引入参考令牌并设计注意力掩码,实现单次前向传播验证多个草稿令牌,在保持并行解码优势的同时提升解码吞吐量。

Comments 13 pages, 4 figures, code available at https://github.com/airevo2/SimSD-release

详情
AI中文摘要

扩散大语言模型(dLLMs)最近作为自回归(AR)LLMs的有前景替代方案出现,通过并行或块状解码实现更快的推理。然而,它们的掩码语言建模公式仍然与标准令牌级推测解码不兼容,而后者是AR模型最有效的加速技术之一。在AR解码中,因果掩码保留了时间上有效的令牌级上下文,使目标模型能够在单次前向传播中验证多个草稿令牌。相比之下,dLLMs依赖于掩码令牌和双向注意力,导致有效上下文在去噪步骤中发生变化,从而阻止了直接的令牌级推测验证。为了弥合这一差距,我们提出了一种简单但有效的扩散语言模型推测解码算法,名为SimSD,它主要采用即插即用的掩码策略,为dLLMs配备时间上有效的令牌级上下文以进行推测解码。我们的方法明确地从草稿模型预测中引入参考令牌,并设计了一种注意力掩码来调节它们与当前步骤令牌的交互,使dLLMs能够在单次前向传播中计算草稿令牌的有效logits。这恢复了AR模型中因果掩码提供的关键验证能力,同时保留了dLLMs的并行解码优势。所提出的方法无需训练,并且可以灵活地与其他加速技术(如KV缓存和块状解码)集成。在四个基准测试上的SDAR系列dLLMs实验表明,我们的方法实现了高达7.46倍的解码吞吐量提升,同时保持甚至提高了平均生成质量。

英文摘要

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.

2606.02540 2026-06-02 cs.CL 版本更新

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

SkillHarm: 通过自动化构建实现生命周期感知的基于技能的攻

Yuting Ning, Zhehao Zhang, Yash Kumar Lal, Boyu Gou, Junyi Li, Weitong Ruan, Chentao Ye, Rahul Gupta, Diyi Yang, Yu Su, Huan Sun

AI总结 提出SkillHarm基准,通过固定载荷投毒和自我变异投毒两种攻击场景,系统评估基于技能的攻击在技能使用生命周期中的风险,并构建自动化管道AutoSkillHarm生成大规模攻击样本。

Comments Work in Progress

详情
AI中文摘要

智能体技能在智能体工作流中占据特权地位,因为智能体被期望隐式地遵循并执行它们,这使得第三方技能成为易受攻击的表面。现有研究揭示了由基于技能的攻击引起的不安全智能体行为,但它们主要评估单个任务执行中的投毒技能,并通过临时风险列表枚举危害。为弥补这些不足,我们引入SkillHarm,这是一个跨越技能使用生命周期的基于技能的攻击基准,并配有一个系统化的技能相关风险分类。SkillHarm评估两种攻击场景:固定载荷投毒(FPP),其中固定的投毒技能包直接危害任何调用它的任务会话;以及自我变异投毒(SMP),其中最初良性的执行静默地变异持久的技能内容,将危害延迟到后续重用。它进一步根据危害针对的智能体工作流组件定义了12种风险类型:数据管道、系统环境和智能体自主性。为了大规模实例化这些攻击,我们构建了AutoSkillHarm,一个由自然语言驱动、编码智能体驱动的自动化构建管道。由此产生的基准包含71个技能的879个攻击样本。实验表明,当前智能体仍然易受攻击,FPP和SMP的攻击成功率分别高达86.3%和69.3%。我们的分析进一步揭示了一个潜在风险:许多明显的攻击失败源于智能体未能与被投毒的文件交互,而非真正的抵抗,而当前的防御措施仍然无法可靠地缓解这一威胁。

英文摘要

Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than genuine resistance, and current defenses still fail to reliably mitigate the threat.

2606.02530 2026-06-02 cs.AI cs.CL 版本更新

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer: 局部化在策略蒸馏用于高效安全对齐

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang, Wendi Feng, Yesheng Liu, Lijun Li, Jin-Ge Yao, Lei Sha

发表机构 * Beihang University(北航) Beijing Institute of Technology(北京理工大学) Beijing University of Posts and Telecommunications(北京邮电大学) Peking University(北京大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 针对大语言模型安全对齐导致通用能力下降的问题,提出SafeSteer方法,通过激活引导构建安全教师并选择安全令牌,仅在安全令牌上施加反向KL惩罚,在仅用100个有害样本且无需通用数据的情况下,实现了安全与通用能力之间的优越平衡。

Comments 19 pages, 8 figures, 14 tables. Submitted to EMNLP 2026

详情
AI中文摘要

将大型语言模型(LLMs)与人类价值观对齐通常会降低其通用能力,这被称为对齐税。现有方法通过平衡双重目标来缓解这一问题,但严重依赖大量通用数据或辅助奖励模型。在本文中,我们认为,由于安全特征在输出分布中本质上是稀疏的,对齐需要局部修改而非全局权衡。为此,我们提出SafeSteer,它在安全令牌上执行在策略蒸馏。首先,我们通过激活引导构建一个安全教师。基于该教师,我们开发了一种安全令牌选择算法。因此,SafeSteer在训练期间将反向KL惩罚限制在这些令牌上,以保留通用能力。跨多种模型的实验结果表明,与现有方法相比,我们的SafeSteer在安全性和通用能力之间实现了更优越的权衡,在七个安全基准上取得了强大的安全性能,同时在五个通用能力基准上仅有最小程度的下降。值得注意的是,SafeSteer仅需100个有害样本,无需使用任何通用数据,不到先前基线所用数据的1%,大大降低了对齐成本。更多详情请访问我们的项目页面:https://anjingkun.github.io/SafeSteer。

英文摘要

Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. Consequently, SafeSteer restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that our SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples without using any general-purpose data, less than 1% of what previous baselines used, considerably reducing alignment cost. More details are on our project page at https://anjingkun.github.io/SafeSteer.

2606.02523 2026-06-02 cs.CL cs.CV cs.CY 版本更新

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

FigSIM:用于自杀迷因的细粒度自杀严重程度和比喻语言数据集

Liuliu Chen, Elise R. Carrotte, Brian E. Chapman, Jo Robinson, Mike Conway

发表机构 * School of Computing and Information Systems, University of Melbourne, Australia(墨尔本大学计算与信息学院) Orygen, The National Centre of Excellence in Youth Mental Health, Australia(奥里根青少年心理健康国家研究中心) Centre for Youth Mental Health, University of Melbourne, Australia(墨尔本大学青少年心理健康中心) O’Donnell School of Public Health, UT Southwestern Medical Center, United States(奥唐奈公共卫生学院,西南医学中心)

AI总结 本文提出FigSIM数据集,包含1049个自杀迷因,标注了细粒度自杀严重程度、比喻现象和自杀相关内容,并评估了16个单模态和多模态模型在比喻语言、自杀严重程度和自杀相关内容检测任务上的表现,揭示了建模和内容审核的独特挑战。

Comments Content warning: contains suicide-related content. Accepted to Findings of the Association for Computational Linguistics: ACL 2026

详情
AI中文摘要

自杀迷因是用于表达自杀相关想法或评论自杀相关问题的迷因。自杀迷因在社交媒体上越来越常见,但仍未被充分理解且可能有害。迫切需要更好地了解其特征,并制定适当的内容审核策略,以限制用户接触潜在有害内容。目前,缺乏注释的自杀迷因数据集仍然是开发和评估自动审核方法的主要障碍。在本文中,我们介绍了FigSIM,这是第一个用于自杀迷因细粒度分析的数据集。该数据集包含1049个迷因,每个迷因都标注了(1)细粒度自杀严重程度级别,(2)比喻现象(例如隐喻),以及(3)自杀相关内容(例如自杀方法描绘)。我们在三个任务上评估了16个单模态和多模态模型:比喻语言、自杀严重程度和自杀相关内容检测。总体而言,FigSIM表明自杀迷因对建模和内容审核都构成了独特的挑战。分析揭示了偏差,例如对较高自杀严重程度级别的预测不足,尤其是对于比喻迷因。该数据集(包括用于分析的分割)是公开可用的。内容警告:本文包含可能引发不适的自杀相关内容。

英文摘要

Suicide memes are memes used to express suicide-related thoughts or comment on suicide-related issues. Suicide memes are increasingly common on social media, yet remain poorly understood and potentially harmful. There is an urgent need to better understand their characteristics and to develop appropriate content moderation strategies that limits users' exposure to potentially harmful content. Currently, the absence of annotated datasets of suicide memes remains a key barrier to developing and evaluating automated moderation approaches. In this paper, we introduce FigSIM, the first dataset designed for fine-grained analysis of suicide memes. The dataset consists of 1049 memes, each annotated for (1) fine-grained suicide severity levels, (2) figurative phenomena (e.g., metaphors), and (3) suicide-related content (e.g., suicide method depiction). We benchmark 16 unimodal and multimodal models across three tasks: figurative language, suicide severity, and suicide-related content detection. Overall, FigSIM demonstrates that suicide memes pose unique challenges for both modeling and content moderation. Analysis revealed biases, such as underprediction of higher suicide severity levels, especially for figurative memes. The dataset (including splits used for analyses) is publicly available. Content Warning: This paper contains suicide-related content that may be triggering.

2606.02509 2026-06-02 cs.CL 版本更新

When Rating Scales Fall Short: LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives

当评分量表不足时:LLM辅助发现土耳其教师叙述中的ADHD信号

Baris Karacan, Irem Aktar Songur, Ahmet Ozaslan, Elvan Iseri

发表机构 * Department of Computer Science, University of Illinois Chicago(伊利诺伊大学芝加哥分校计算机科学系) Department of Child and Adolescent Psychiatry, Gazi University(加齐大学儿童与青少年精神病学系)

AI总结 本研究通过分析土耳其教师评估表中的结构化评分和开放式叙述,利用大语言模型辅助的主题发现方法,揭示了叙述文本中未被结构化量表捕捉的ADHD互补信号。

Comments 15 pages. Accepted to CLPsych 2026. Camera-ready author version. The final version will appear in the ACL Anthology

详情
AI中文摘要

注意缺陷多动障碍(ADHD)是儿童期最常见的神经发育障碍之一,其诊断依赖于结合临床医生判断、标准化评分量表以及家长和教师报告的评估。虽然诸如康纳斯教师评分量表修订版简表(CTRS-R:S)等结构化工具能够量化ADHD相关行为,但教师也会提供开放式叙述,其中可能包含结构化评估未捕捉的互补信号。然而,教师叙述在多大程度上编码了评分量表忽视的信号仍不清楚。在本研究中,我们分析了临床ADHD评估期间收集的去标识化土耳其教师评估表,包括CTRS-R:S评分和开放式教师叙述。我们比较了结构化评分和叙述文本的预测信号,并识别了结构化评估无法清晰区分ADHD与非ADHD学生,而基于叙述的模型却能捕捉到不同行为模式的案例。值得注意的是,这些案例与叙述模型遗漏的案例重叠极少,表明结构化和叙述信息编码了互补信号。为了解释这些差异,我们应用了大语言模型(LLM)辅助的主题发现流程,揭示了不同的注意力、行为和家庭相关模式,突显了自然语言处理(NLP)在从教师叙述中发现临床相关信号以及补充传统ADHD筛查工具方面的潜力。

英文摘要

Attention Deficit Hyperactivity Disorder (ADHD) is one of the most common neurodevelopmental disorders in childhood, and its diagnosis relies on assessments combining clinician judgment with standardized rating scales and reports from parents and teachers. While structured instruments such as the Conners' Teacher Rating Scale-Revised Short Form (CTRS-R:S) quantify ADHD-related behaviors, teachers also provide open-ended narratives that may contain complementary signals not captured by structured assessments. However, it remains unclear to what extent teacher narratives encode signals overlooked by rating scales. In this study, we analyze de-identified Turkish teacher evaluation forms collected during clinical ADHD assessments, including both CTRS-R:S scores and open-ended teacher narratives. We compare predictive signals from structured scores and narrative text and identify cases where structured assessments fail to clearly distinguish ADHD from non-ADHD students while narrative-based models capture distinct behavioral patterns. Notably, these cases show minimal overlap with those missed by the narrative model, suggesting that structured and narrative information encode complementary signals. To interpret these differences, we apply a large language model (LLM)-assisted theme discovery pipeline that reveals distinct attention, behavioral, and family-related patterns, highlighting the potential of natural language processing (NLP) to uncover clinically relevant signals from teacher narratives and to complement traditional ADHD screening tools.

2606.02502 2026-06-02 cs.CL 版本更新

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

CRAM:面向多模态持续指令调优的质心路由与自适应MoE

Jun-Tao Tang, Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

发表机构 * School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院) State Key Laboratory of Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室)

AI总结 提出CRAM方法,通过将任务特定模式隔离到独立模块、自适应秩实例化动态分配参数、质心路由激活现有专家以及正交惩罚约束更新方向,解决了多模态持续指令调优中任务竞争导致遗忘和参数效率低下的问题。

详情
AI中文摘要

多模态大语言模型(MLLMs)通过指令调优在共享生成框架下统一异构视觉-语言任务,但实际部署需要持续能力扩展,这使得多模态持续指令调优(MCIT)至关重要。现有方法要么使用共享参数集更新所有任务,要么为每个新任务分配专用模块。共享更新迫使异构任务竞争,导致已学能力遗忘。相反,隔离扩展防止了干扰,但在长任务流中严重限制了参数效率。为解决这一困境,我们提出了CRAM。具体来说,通过将任务特定模式隔离到独立模块,CRAM减轻了跨任务的灾难性遗忘。为了进一步提高参数效率,我们利用自适应秩实例化来识别现有专家能力与新任务需求之间的能力差距,并仅动态分配必要的参数。为了确保任务间的稳定复用,质心引导路由识别并激活现有专家的能力,而正交惩罚将新更新限制在任务特定方向,防止重新学习通用能力。跨多个基准的大量实验一致证明了其相对于现有方法的优越性。

英文摘要

Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continuous capability expansion, making Multimodal Continual Instruction Tuning (MCIT) essential. Existing methods either update all tasks with a shared parameter set or allocate dedicated modules for each new task. Shared updates force heterogeneous tasks to compete, causing forgetting of learned capabilities. Conversely, isolated expansion prevents interference but severely limits parameter efficiency over long task streams. To address this dilemma, we propose CRAM. Specifically, by isolating task-specific patterns into independent modules, CRAM mitigates catastrophic forgetting across tasks. To further boost parameter efficiency, we utilize adaptive-rank instantiation to identify the capability gap between existing expert capability and new task demands, and dynamically allocate only the necessary parameters. To ensure stable reuse among tasks, centroid-guided routing recognizes and activates existing experts' capabilities, while an orthogonality penalty confines new updates to task-specific directions, preventing re-learning general capability. Extensive experiments across diverse benchmarks consistently demonstrate its superiority over existing methods.

2606.02487 2026-06-02 cs.CL 版本更新

Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

迈向多学科住院总结:高效的句子级临床来源分类

Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde, Mary Khetani, Yu-Shan Tseng, Vanessa Barbosa, Julie Vignato, Lindsey Knake, Rajashree Dahal, Emily Spellman, Danielle Hitzel, Janine Petitgout, Kristi Haughey, Amanda Karstens, Brianna Clarahan, Rachel Dawson, Lauren Boyd, Mackenzie Weis, Angie Tipton, Jaewon Bae, Catherine K. Craven, Karen Dunn Lopez, Andrew D. Boyd

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) University of Iowa(爱荷华大学) University of Missouri-Columbia(密苏里大学哥伦比亚分校)

AI总结 本研究提出一个基于大语言模型监督微调的临床来源分类流水线,用于多学科住院总结,通过量化70B模型在跨领域迁移中提升F1分数7%,并证明模型容量对语义灵活性的关键作用。

Comments 5 pages. Submitted preprint version of a paper accepted to AIME 2026. This version may differ from the camera-ready manuscript and the final Version of Record. The Version of Record will be available from Springer Nature once published

详情
AI中文摘要

在新生儿重症监护室(NICU)等高复杂性环境中,有效的“全团队”总结需要聚合来自不同学科(医生、护士、治疗师)的见解,这些见解分散在数百份临床自由文本记录中。简单地将异质文本汇集在一起往往会导致输出不连贯。因此,结构化总结首先需要对跨多源记录的句子级来源进行准确分类。本初步研究引入了一个临床来源分类流水线,使用大语言模型(LLM)的监督微调(SFT)。我们将两个Llama-3模型(8B和70B)适配到MedSecId,这是一个包含2,002份MIMIC-III(成人ICU)记录并带有临床来源标题注释的语料库,两个模型在域内均实现了超过92%的宏F1分数。为了评估跨域泛化能力,我们评估了模型容量(8B vs. 70B)和量化在一个由来自三个多学科NICU总结的227个句子级跨度组成的金标准数据集上的表现。实验结果表明存在规模依赖的迁移效应:虽然SFT对8B模型仅产生边际变化,但显著改进了70B模型,使宏F1提高了7%。值得注意的是,量化微调的70B模型在显著降低计算需求的同时,超越了其全精度基线。这些发现表明,足够的模型容量对于在跨域临床迁移中保持语义灵活性至关重要,并且高效的量化适配可以为下游总结实现结构化来源建模。

英文摘要

Effective "all-team" summarization in high-complexity settings like the Neonatal Intensive Care Unit (NICU) requires aggregating insights from diverse disciplines (physicians, nurses, therapists) spread across hundreds of clinical free-text notes. Simply pooling heterogeneous text often leads to incoherent outputs. Structured summarization therefore first requires accurate categorization of sentence-level provenance across multi-source notes. This pilot study introduces a clinical provenance categorization pipeline using supervised fine-tuning (SFT) of large language models (LLMs). We adapted two Llama-3 models (8B and 70B) to MedSecId, a corpus of 2,002 MIMIC-III (Adult ICU) notes annotated with clinical provenance headers, achieving in-domain Macro F1 scores above 92% for both models. To evaluate cross-domain generalization, we assessed model capacity (8B vs. 70B) and quantization on a gold-standard dataset of 227 sentence-level spans derived from three multi-disciplinary NICU summaries. Experimental results demonstrate a scale-dependent transfer effect: while SFT produced only marginal changes for the 8B model, it substantially improved the 70B model, increasing Macro F1 by 7%. Notably, the quantized fine-tuned 70B model outperformed its full-precision baseline while substantially reducing computational requirements. These findings suggest that sufficient model capacity is critical for preserving semantic flexibility during cross-domain clinical transfer and that efficient quantized adaptation can enable structured provenance modeling for downstream summarization.

2606.02483 2026-06-02 cs.CR cs.AI cs.CL 版本更新

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

幽灵工具调用:投机性智能体工具的发布时隐私保护

Bardia Mohammadi, Lars Klein, Akhil Arora, Laurent Bindschaedler

发表机构 * Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) EPFL(苏黎世联邦理工学院) Aarhus University(阿arhus大学)

AI总结 针对工具增强型语言智能体投机性预发调用泄露用户意图的问题,提出投机性工具隐私契约,在发布时而非提交后保护隐私。

详情
AI中文摘要

工具增强型语言智能体投机性地发出可能的未来工具调用以隐藏延迟,但这些调用在智能体提交分支之前将推断出的用户意图泄露给外部服务。每个收到调用的外部观察者在智能体放弃分支后仍保留该披露。问题在于时机,而非授权:提交后的清理、只读限制或访问控制白名单都无法撤回观察者已持有的信息。我们将这些调用称为幽灵工具调用,并提出投机性工具隐私契约,这是一种运行时抽象,将提交前的观察视为与状态突变不同的第一类效应。我们在原型运行时中实现了该契约,并在三个语料库上评估了十二种策略。投机性调度增加了观察者能够推断用户意图的程度;事后过滤器、只读限制和访问控制白名单无法消除这种推断;只有那些在调度前改变或抑制投机性调用的参数或目标投影的发布时策略才能减少这种推断。

英文摘要

Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to external services before the agent commits to the branch. Every external observer that received the call retains the disclosure after the agent abandons the branch. Timing is the issue, not authorization: no commit-time cleanup, read-only restriction, or access-control allow-list unsends what an observer already holds. We call these invocations ghost tool calls and propose Speculative Tool Privacy Contracts, a runtime abstraction that treats observation before commitment as a first-class effect, distinct from state mutation. We implement the contracts in a prototype runtime and evaluate twelve policies across three corpora. Speculative dispatch increases what an observer can infer about user intent; post-hoc filters, read-only restrictions, and access-control allow-lists leave that inference intact; only issue-time policies that change or suppress the speculative call's argument or destination projection before dispatch reduce it.

2606.02465 2026-06-02 cs.CL cs.AI 版本更新

Learning When to Translate for Multilingual Reasoning

学习何时翻译以实现多语言推理

Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee

发表机构 * Graduate School of Artificial Intelligence(人工智能研究生院) Department of Computer Science & Engineering(计算机科学与工程系) POSTECH

AI总结 提出Luar框架,通过强化学习训练推理语言模型在直接理解不可靠时选择性调用翻译,从而缩小多语言推理差距。

Comments preprint

详情
AI中文摘要

推理语言模型(RLMs)在复杂推理任务上表现出色,但仍存在显著的多语言推理差距,这主要源于非英语输入中的语言理解失败。英语翻译可以通过将非英语输入转换为RLMs更可靠解释的形式来缓解这些失败,但当模型能够从原始查询中可靠推理时,翻译每个输入是不必要的。为应对这一挑战,我们提出Luar,一种语言理解边界感知的强化学习框架,训练RLMs在直接理解不可靠时选择性调用翻译。Luar训练模型在直接解决原始输入和对其英语翻译进行推理之间做出选择,仅在翻译增强推理预期显著优于直接推理时鼓励翻译。在多语言推理基准测试中,Luar优于标准GRPO和其他基于训练的基线,在低资源语言上尤其获得巨大提升。进一步分析表明,Luar在直接推理足够的情况下避免不必要的翻译,同时将其翻译调用行为扩展到未见过的低资源语言。总之,我们的工作提出了一种选择性多语言推理方法:RLMs可以学习仅在直接理解不可靠时调用翻译。该项目将在https://github.com/deokhk/LUAR公开。

英文摘要

Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures in non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the model can reason reliably from the original query. To address this challenge, we propose Luar, a Language Understanding Boundary-aware Reinforcement Learning framework that trains RLMs to selectively invoke translation when direct understanding is unreliable. Luar trains the model to choose between solving the original input directly and reasoning over its English translation, encouraging translation only when translator-augmented reasoning is expected to substantially outperform direct reasoning. Across multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with particularly large gains on low-resource languages. Further analysis shows that Luar avoids unnecessary translation in cases where direct reasoning is sufficient, while extending its translator-call behavior to unseen low-resource languages. Together, our work suggests a selective approach to multilingual reasoning: RLMs can learn to invoke translation only when their direct understanding is unreliable. The project will be made publicly available at https://github.com/deokhk/LUAR

2606.02449 2026-06-02 cs.AI cs.CL cs.CV cs.LG cs.MM 版本更新

HLL: Can Agents Cross Humanity's Last Line of Verification?

HLL:智能体能否跨越人类最后一道验证防线?

Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen, Zhihua Wei, Gongshen Liu, Linfeng Zhang, Dongrui Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shandong University(山东大学) Tongji University(同济大学)

AI总结 提出HLL基准,通过交互式CAPTCHA验证评估多模态智能体在受保护工作流中替代人类的能力,发现当前智能体在定位、动作校准、状态跟踪和过程一致性方面存在脆弱性。

Comments 27 pages, 14 figures

详情
AI中文摘要

多模态智能体越来越被期望代表用户操作界面,这引发了一个核心部署问题:在服务特意防止自动化的流程中,它们能否真正替代人类?CAPTCHA验证使这个问题具体化。它不仅仅是一个视觉谜题,更是在账户创建、内容访问、表单提交和其他受保护操作之前设置的人类验证边界。我们引入了 extbf{人类最后一道验证防线(HLL)},这是一个受控基准,使用交互式CAPTCHA验证来评估智能体是否能够通过基于环境的类人交互(而非仅识别)跨越这一边界。HLL涵盖了多种CAPTCHA交互,并让智能体暴露于受控的现实压力因素下,包括杂乱的网页、更困难的任务变体以及解决过程的轨迹条件验证。我们在闭环GUI环境中评估了八个前沿多模态智能体。结果表明,当前智能体在这个人类替代边界上仍然脆弱:性能在不同验证类型间差异显著,在现实界面条件下下降,当正确答案必须由有效动作轨迹支持时进一步下降。通过揭示定位、动作校准、状态跟踪和过程一致性方面的差距,HLL为衡量多模态智能体在受保护的真实世界工作流中作为人类替代品有多接近提供了一个具体的测试平台。我们的代码可在https://github.com/XinhaoS0101/HLL获取。

英文摘要

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce \textbf{Humanity's Last Line of Verification (HLL)}, a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL

2606.02444 2026-06-02 cs.AI cs.CL 版本更新

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

食物噪音与虚假安全:系统评估LLMs如何在临床医生反馈下未能适应饮食障碍查询

Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie, Tanvi Dinkar, Arabella Sinclair

发表机构 * University of Aberdeen(阿伯丁大学) University of Colorado Anschutz(科罗拉多大学安舒茨分校) Heriot-Watt University(赫里奥特-瓦特大学) University College London(伦敦大学学院)

AI总结 本研究通过与临床饮食障碍专家合作,系统评估了大型语言模型在处理饮食障碍用户查询时,因不加批判地适应不安全或自伤请求而可能产生的危害。

详情
AI中文摘要

近期证据表明,饮食障碍患者越来越多地向基于大型语言模型的聊天系统寻求指导、建议和情感支持。尽管这些系统并非设计用于提供临床建议,但其感知的专业性、中立性和可访问性使其成为频繁但存在风险的支持来源。本文调查了饮食障碍用户与LLMs之间潜在的交互模式,重点关注模型不加批判地适应并促进不安全或自伤用户请求可能产生的危害。通过与临床饮食障碍专家协商,我们发现提示中的特定语言线索会增加不安全响应的可能性,并通过系统性地改变用户提示中潜在风险的程度,报告了LLMs不加批判地适应有问题的、潜在危险用户输入的程度。

英文摘要

Recent evidence shows that people with eating disorders (EDs) are increasingly seeking guidance, advice, and emotional support from Large Language Model (LLM)-based chat systems. Although these systems are not designed to provide clinical advice, their perceived expertise, neutrality and accessibility make them a frequent, albeit risky, source of support. This paper investigates potential patterns of interaction between users with EDs and LLMs, focusing on the potential harms arising from models that uncritically adapt to, and facilitate unsafe or self-harming user requests. We find, in consultation with clinical ED experts, that specific linguistic cues in prompts increase the likelihood of unsafe responses and, through systematically varying the degree of potential risk present in the user prompt, report the extent to which LLMs uncritically adapt to problematic, and potentially dangerous user inputs.

2606.02443 2026-06-02 cs.CL cs.AI cs.CV 版本更新

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

PaSBench-Video: 面向主动安全预警的流式视频基准

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tsinghua University(清华大学)

AI总结 提出PaSBench-Video基准,包含740个视频,评估多模态大模型在危险发生前及时发出预警的能力,发现现有模型在时序精度和低误报率上表现不佳。

详情
AI中文摘要

从危险的第一个可见迹象到事故发生之间,通常存在一个仍可干预的时间窗口。具备视频能力的多模态大语言模型(MLLM)可以作为始终在线的安全监控器,在此窗口内发出警告。然而,当前的基准测试并未检验这一能力:它们依赖静态输入,忽略时间精度,并且省略了对安全场景的误报测量。我们提出了PaSBench-Video,一个包含740个视频的基准测试,涵盖驾驶、医疗、日常生活和工业生产四个领域,其中包含481个风险视频和259个无风险视频。风险视频标注了帧级别的风险起始点和事故边界。模型必须以因果方式观察视频,并发出在时间上校准且内容正确的警告。测试了13个MLLM后,我们发现没有模型在我们的最严格指标上超过20.0%,并且召回率与误报率紧密相关,皮尔逊相关系数为0.64:更高的检测率只能以在大多数安全片段上触发警告为代价。性能按领域显著分化:在日常生活领域,模型在低误报率下实现了中等召回率,因为该领域的风险本质上是异常的;而在驾驶领域,模型不加区分地触发警告,因为常规场景和危险场景看起来相似。这些结果表明,当前模型依赖于场景级别的活动线索,而不是推理正在出现的危害。

英文摘要

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.

2606.02433 2026-06-02 cs.IR cs.AI cs.CL cs.LG cs.MA 版本更新

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

ODTQA-FoRe:面向未来数据预测与推理的开放域表格问答数据集

Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Institute of Artificial Intelligence and Future Networks, Beijing Normal University(北京师范大学人工智能与未来网络研究院) Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港 Baptist大学)

AI总结 提出开放域表格问答的未来预测与推理任务,并构建首个覆盖时间序列预测和基于预测推理的数据集,通过基于LLM代理的TimeFore框架(检索器、预测器、分析器)解决历史数据检索、预测限制和响应标准化挑战。

Comments This paper has been accepted by Findings of ACL 2026

详情
AI中文摘要

大语言模型的快速发展显著推进了表格问答,但大多数系统无法进行面向未来的数值预测。为弥补这一空白,我们引入了一个新任务——面向未来数据预测与推理的开放域表格问答,并提出了首个覆盖时间序列预测和基于预测推理场景的数据集,使用房地产数据。该任务在检索精确历史数据、克服LLM的预测限制以及标准化多样化查询的响应方面提出了挑战。为解决上述挑战,我们提出了TimeFore,一个基于LLM代理的框架,将问题分解为三个协作角色:检索器自主生成SQL以获取数据,预测器调用外部时间序列模型以获得更高精度,分析器综合结果以构建精确且一致的最终答案。大量实验证明了我们TimeFore的有效性。

英文摘要

The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To solve the above challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of our TimeFore.

2606.02423 2026-06-02 cs.CL cs.LG 版本更新

Investigating and Alleviating Harm Amplification in LLM Interactions

调查和缓解大语言模型交互中的危害放大

Ruohao Guo, Wei Xu, Alan Ritter

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出HarmAmp基准和TrajSafe监控器,用于评估和缓解多轮对话中大语言模型对危害的放大效应。

详情
AI中文摘要

大语言模型(LLM)可以作为有用的助手,但它们同样可以作为危害放大器,使恶意用户通过扩展交互实现超出其能力的危害结果。这种风险沿着两个轴显现,即民主化领域专业知识,使新手能够产生专门的有害内容,以及以手动努力无法匹敌的规模扩大有害操作。然而,现有工作往往忽略了LLM在多轮对话中如何加剧危害。我们引入了HarmAmp,这是一个新的基准,用于涵盖十二个风险类别的多轮危害放大场景。每个场景都基于现实世界的威胁,并满足严格的标准,即实质性放大、操作特异性和多轮必要性。我们进一步提出了TrajSafe,一种主动监控器,可以预测有害轨迹并通过诸如探测用户真实意图和引导模型更安全地完成等行动进行干预。我们的广泛实验表明,TrajSafe显著降低了多轮交互中产生的危害性,同时保持了低过度拒绝率和目标模型的一般能力。我们的工作为缓解LLM交互中微妙的安全风险提供了一个有前景的范式。

英文摘要

Large language models (LLMs) can serve as helpful assistants, yet they can equally function as harm amplifiers that enable malicious users to achieve harmful outcomes beyond their capabilities through extended interactions. This risk manifests along two axes, i.e., democratizing domain expertise that allows novices to produce specialized harmful content, and scaling harmful operations at volumes that manual effort cannot match. Existing works, however, often overlook how LLMs compound harm across multi-turn conversations. We introduce HarmAmp, a new benchmark for multi-turn harm amplification scenarios spanning twelve risk categories. Each scenario is grounded in real-world threats and satisfies rigorous criteria, i.e., substantive amplification, operational specificity, and multi-turn necessity. We further propose TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion. Our extensive experiments demonstrate that TrajSafe significantly reduces the harmfulness incurred in multi-turn interactions while preserving a low over-refusal rate and the target model's general capabilities. Our work offers a promising paradigm to alleviate the nuanced safety risks in LLM interactions.

2606.02404 2026-06-02 cs.CL 版本更新

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

K-BrowseComp:基于韩国语境的网页浏览代理基准测试

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim

发表机构 * Chung-Ang University(Chung-Ang 大学) KAIST(韩国科学技术院) Seoul National University(首尔国立大学) OnelineAI NAVER Cloud AI Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对韩国语境,构建包含400个问题的网页浏览代理基准K-BrowseComp,评估前沿模型性能,发现其准确率显著低于BrowseComp,并公开数据与代码。

详情
AI中文摘要

前沿模型评估正从基础能力(如指令遵循和推理)转向组合性、代理性能力,但韩语代理基准仍然稀缺。我们介绍了K-BrowseComp,一个基于韩国语境的网页浏览代理基准,包含400个问题。其中300个问题的K-BrowseComp-Verified子集由母语为韩语的人手动构建和验证。在该子集上,包括GPT-5.5、DeepSeek-V4-Pro和GLM-5.1在内的前沿LLM仅达到30.00–45.67%,相比BrowseComp大幅下降,而通过韩国专有AI基础模型计划发布的韩语LLM仅获得0.00–10.33%。我们进一步利用解决和创建网页浏览问题之间的不对称性,通过硬样本少样本示例和失败模式导向生成构建了一个100个问题的合成分割。在对抗性过滤的合成诊断分割上,最强模型仅达到26.00%,我们将此分割作为定向压力测试单独报告。我们公开发布了数据和代码。

英文摘要

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00--45.67\%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00--10.33\%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00\%, and we report this split separately as a targeted stress test. We publicly release our data and code.

2606.02398 2026-06-02 cs.LG cs.CL 版本更新

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

跨域干扰与恢复的局部微扰理论:多领域强化学习

Lei Yang, Siyu Ding, Deyi Xiong

发表机构 * TJUNLP Lab, College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院TJUNLP实验室) Baidu Inc.(百度公司)

AI总结 针对多领域RL训练中一个领域性能下降的问题,提出局部微扰理论,证明后期领域训练主要通过二阶损伤项在低维共享冲突子空间中损害早期领域,并通过短时领域刷新实现选择性恢复。

详情
AI中文摘要

强化学习后训练在数学推理、代码生成、问答和创意写作等单个领域上改进了大型语言模型,但在一个领域上的训练往往会降低其他领域的性能。基于灾难性遗忘或全局梯度冲突的现有解释是不完整的:即使全模型梯度几乎正交,也可能发生实质性干扰。我们表明,单领域RL产生稀疏、小量级的参数编辑,且top变化神经元之间的重叠较弱,而不同领域仍然共享大量的活跃计算路径,这些路径上的更新方向决定了它们是协同还是冲突。在此观察指导下,我们在多领域RL的局部微扰模型下证明,后期领域训练主要通过二阶损伤项损害早期领域,在观察到的稀疏路径结构下,该损伤项集中在低维共享冲突子空间中。此外,短时领域刷新会收缩该子空间上的有害成分,从而在有限的附带损伤下实现选择性恢复。与理论一致,在Code → Math → QA → CW之后进行短暂的Re-Math刷新,将Math从57.66恢复到66.04,同时基本保持其他领域的性能,得到最佳平均分66.39。除了刷新之外,针对Math-QA对的稀疏代理冲突坐标集进行无训练回滚可部分恢复Math,为局部损伤提供了直接的代理级证据。这些结果为多领域RL中的干扰和恢复提供了局部机制解释。

英文摘要

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code $\rightarrow$ Math $\rightarrow$ QA $\rightarrow$ CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.

2606.02380 2026-06-02 cs.CL cs.AI 版本更新

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

SPADE-Bench:通过计划-行动分歧评估智能体中的自发性策略欺骗

Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong, Kaiyue Yang, Jiaming Ji, Yingshui Tan, Wenxin Li, Yaodong Yang, Juntao Dai

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Peking University(北京大学) University of Science and Technology of China(中国科学技术大学) University of Chinese Academy of Science(中国科学院大学) Alibaba Group(阿里巴巴集团)

AI总结 针对LLM智能体在工具使用中可能出现的自发性策略欺骗(计划与行动不一致),提出SPADE-Bench基准,通过结合实际工具执行和受控压力场景,严格区分欺骗与幻觉,实验证实该问题真实且紧迫。

详情
AI中文摘要

随着基于LLM的智能体扩展其操作范围,可靠性成为实际部署的前提。然而,在实际应用中,人类用户无法监控每一个即时行为;相反,执行过程往往是一个黑箱,用户仅依赖智能体的自我报告更新。这种不透明性带来了关键风险:智能体可能呈现与执行行动不一致的面向观察者的报告,使得系统不可控,尤其是在高风险自主场景中。我们将这种自我报告的计划-行动分歧称为智能体欺骗。为了评估这一点,我们引入了SPADE-Bench,一个旨在评估自发性计划-行动分歧的基准。与先前的欺骗基准不同,SPADE-Bench同时集成了实际工具执行和受控压力场景。这种设计确保了生态效度,并通过在压力下进行受控的计划-行动比较,严格区分策略欺骗与单纯的幻觉。跨主流模型的实验证实,智能体欺骗在工具使用环境中是一个真实且紧迫的问题。通过提供一个全面且稳健的评估框架,SPADE-Bench填补了智能体安全中的关键空白,促进社区朝着构建可信和可控的自主系统迈进。

英文摘要

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical applications, human users cannot monitor every immediate behavior; instead, the execution process often remains a black box, leaving users dependent solely on the agent's self-reported updates. This opacity creates a critical risk: agents may present observer-facing reports that diverge from their executed actions, rendering the system uncontrollable, especially in high-stakes autonomous scenarios. We term such self-reported plan-action divergence as agent deception. To assess this, we introduce SPADE-Bench, a benchmark designed to evaluate spontaneous plan-action divergence. Unlike prior deception benchmarks, SPADE-Bench simultaneously integrates actual tool execution and controlled pressure scenarios. This design ensures ecological validity and rigorously distinguishes strategic deception from mere hallucination through controlled plan-action comparisons under pressure. Experiments across mainstream models confirm that agent deception is a genuine and pressing issue in tool-use contexts. By providing a comprehensive and robust evaluation framework, SPADE-Bench fills a critical gap in agent safety, facilitating the community's progress toward building trustworthy and controllable autonomous systems.

2606.02375 2026-06-02 cs.CL cs.CY cs.HC 版本更新

WAXAL-NET: Finetuned Edge ASR Across 19 African Languages

WAXAL-NET: 针对19种非洲语言的微调边缘ASR

Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi, Wanchi Lucia Yen, John Uzodinma, Sunday Ajayi, Oluwademilade Williams, Kausar Moshood, Innocent Elendu Anyaele, Akebert Arefaine, Candace Hunzwi, Wongel Dawit Daniel, Emmilly Namuganga, Cleophas Kadima, Athanase Bahizire, Onitsiky Ranaivoson, Emmanuel Aaron, Nicholaus Ladislaus, Idris Muhammed, Jonathan Enoch Simenya, Martin Koome, Matewos Tegete Endaylalu, Peter Ifeoluwa Adeyemo, Hondi Prisca Birindwa, Ukachi Agnes Eze-Mbey, Yacoba Oduro-Yeboah, Pericles Adjovi, Mikel K. Ngueajio, Toluwani Aremu, Prasenjit Mitra

发表机构 * CMU Africa(CMU非洲中心) LyngualLabs MBZUAI

AI总结 本研究评估了紧凑型领域专用ASR模型在WAXAL语料库的19种非洲语言会话语音上是否优于大规模多语言基础模型,通过微调边缘模型实现了宏平均WER从64.9%降至38.0%,模型大小缩小3-40倍,证实领域专业化主导规模效应。

详情
AI中文摘要

我们评估了紧凑型领域专用ASR模型是否能在WAXAL语料库的19种非洲语言会话语音上优于大规模多语言基础模型。微调后的边缘模型实现了宏平均词错误率(WER)38.0%,而最佳零样本基线为64.9%,使用小3-40倍的模型降低了26.9个百分点。结果证实,对于自发的非洲语音,领域专业化主导规模效应。跨域评估显示,微调模型在分布外(OOD)语音上恢复了可用性能,而零样本模型在测试域与其预训练分布匹配时重新获得优势。一项涵盖所有调查语言的分布式母语者审计产生了基于语言学的错误分类,表明CTC和自回归架构在不同语系中表现不同。我们进一步表明,对于音节文字语言,仅WER会错误表示性能,其中CER/WER比率显示字符级准确率远高于标题WER所暗示的。最后,为促进未来的非洲ASR研究,我们发布了所有模型权重、微调和评估脚本,以及涵盖全部19种语言的清洗后的WAXAL子集。

英文摘要

We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of $38.0\%$ compared to $64.9\%$ for the best zero-shot baseline, a $26.9$ percentage-point reduction using models $3-40\times$ smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.

2606.02373 2026-06-02 cs.AI cs.CL cs.IR 版本更新

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Harness-1:基于状态外化马具的搜索智能体强化学习

Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han

AI总结 提出Harness-1,一个20B参数的搜索智能体,通过强化学习在有状态搜索马具中训练,将常规状态管理外化到环境,在八个检索基准上平均召回率0.730,超越现有开源搜索子智能体11.4个百分点。

详情
AI中文摘要

搜索智能体通常被训练为基于不断增长的转录的策略:模型必须决定如何搜索,同时记住它看到了什么、哪些证据有用、哪些约束仍然开放、哪些声明已被检查。我们认为这种表述将过多的常规状态管理放在策略内部:强化学习被迫同时优化语义搜索决策和可恢复的簿记,而环境可以更可靠地维护这些簿记。我们引入Harness-1,一个20B参数的搜索智能体(检索子智能体),在有状态搜索马具内通过强化学习训练。该马具维护环境端的工作记忆,包括候选池、重要性标记的精选集、紧凑的证据链接、验证记录、压缩和去重的观察结果,以及预算感知的上下文渲染。策略保留语义决策:搜索什么、保留或丢弃哪些文档、验证什么以及何时停止。在涵盖网络、金融、专利和多跳问答的八个检索基准上,Harness-1实现了0.730的平均精选召回率,比次强的开源搜索子智能体高出11.4个百分点,并与更大的前沿模型搜索器保持竞争力。其优势在保留的迁移基准上尤为显著,表明基于显式搜索状态的强化学习可以产生超越训练领域的检索行为。我们的代码可在https://github.com/pat-jj/harness-1获取。

英文摘要

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.

2606.02372 2026-06-02 cs.AI cs.CL 版本更新

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

COMAP:面向LLM智能体的世界模型与智能体策略协同进化

Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

发表机构 * Central South University(中南大学) College of Computer Science, Sichuan University(四川大学计算机学院) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算机系)

AI总结 提出COMAP框架,通过闭环交互协同进化文本世界模型和智能体策略,在具身任务规划、网页导航和工具使用基准上显著提升性能。

详情
AI中文摘要

为语言智能体配备世界模型使其能够在执行前预测环境动态并评估候选动作。然而,现有的文本世界模型通常在训练后固定不变,无法适应由进化中的智能体引发的策略内状态-动作分布。同时,智能体改进方法往往依赖外部奖励或验证器,限制了其在现实交互环境中的适用性。本文提出COMAP,一种通过闭环交互协同进化文本世界模型和智能体策略的新框架。在每个决策步骤,世界模型预测候选动作的未来状态反馈,智能体通过估计该反馈的可靠性并相应调整动作来进行未来感知反思。由此产生的策略内轨迹随后通过自蒸馏用于更新世界模型,使其更好地匹配智能体不断演化的交互分布。在具身任务规划、网页导航和工具使用基准上,COMAP始终优于竞争基线,例如使用Qwen3-4B相对提升16.75%。进一步分析表明,协同进化循环随时间提高了世界模型的预测准确性,并导致更有效的长程决策。我们的代码可在https://github.com/loyiv/CoMAP获取。

英文摘要

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent's evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model's prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: https://github.com/loyiv/CoMAP.

2606.02304 2026-06-02 cs.CL 版本更新

Unified Context Evolution for LLM Agents

统一上下文演化:面向LLM智能体

Zixuan Zhu, Yitong Hu, Yong Dai, Junfeng Fang, Chunyang Jiang, Senkang Hu, Yuzhi Zhao

AI总结 提出统一上下文演化(UCE)框架,通过将智能体经验外部化为四种类型的可演化上下文单元(ECU),实现跨任务的知识积累与动态调度,在ALFWorld和WebShop上显著提升性能。

详情
AI中文摘要

基于LLM的智能体可以通过结合推理与环境反馈来解决多步交互任务,然而每个回合从相同的固定上下文开始,任务结束后沿途发现的任何有用策略都会丢失。现有方法要么将学习限制在当前任务,要么将所有经验汇集到单一的无类型存储中,而不区分知识类型、通过使用跟踪质量或平衡库中仍缺乏的内容。我们引入了统一上下文演化(UCE),一种无梯度框架,将智能体经验外部化到不断演化的类型化可演化上下文单元(ECU)库中。UCE将经验分解为四种互补类型(记忆、策略、工作流和技能),每种类型从轨迹中根据类型特定条件生成,在决策时检索,通过重复使用结果评分,并在不再有价值时修剪。调度模块将每个周期的生成预算分配给库中最弱的类型。在两个交互基准测试中,UCE将ALFWorld的成功率从75.4%提高到96.3%,将WebShop的任务得分从45.1%提高到61.3%,并且累积的库无需重新训练即可迁移到其他智能体主干。

英文摘要

LLM-based agents can solve multi-step interactive tasks by combining reasoning with environment feedback, yet each episode starts from the same fixed context and any useful strategy discovered along the way is lost once the task ends. Existing approaches either limit learning to the current task or pool all experience into a single untyped store, without distinguishing knowledge types, tracking quality through use, or balancing what the library still lacks. We introduce Unified Context Evolution (UCE), a gradient-free framework that externalizes agent experience into an evolving library of typed Evolvable Context Units (ECUs). UCE decomposes experience into four complementary types (Memory, Strategy, Workflow, and Skill), each generated from trajectories under type-specific conditions, retrieved at decision time, scored through repeated usage outcomes, and pruned when no longer valuable. A scheduling module allocates each cycle's generation budget toward the types where the library is weakest. Across two interactive benchmarks, UCE raises ALFWorld success from 75.4% to 96.3% and WebShop task score from 45.1% to 61.3%, and the accumulated library transfers to alternative actor backbones without retraining.

2606.02300 2026-06-02 cs.CL 版本更新

Beyond Isolated Behaviors: Hierarchical User Modeling for LLM Personalization

超越孤立行为:面向LLM个性化的层次化用户建模

Liang Wang, Xinyi Mou, Xiaoyou Liu, Tiannan Wang, Yuqing Wang, Zhongyu Wei

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院) Shanghai Innovation Institute(上海创新研究院) OPPO

AI总结 针对LLM个性化中用户行为缺乏层次结构的问题,提出基于布迪厄实践理论的PHF框架,通过实践-惯习-场域三层建模,并实现轻量级模型无关方法PHF_Compass,在LaMP基准上取得一致提升。

详情
AI中文摘要

大型语言模型(LLM)在多个领域展现出卓越能力,但将其输出个性化以适应个体用户仍是一个开放挑战。现有方法主要采用扁平行为范式,聚合用户行为而未明确考虑它们如何组织成更深层的行为结构。在本工作中,我们借鉴皮埃尔·布迪厄的实践理论,提出PHF(实践-惯习-场域),一个基于社会学的框架,通过三个层次重新概念化LLM个性化:作为实践的个人行为、作为惯习的行为在时间上的积累形成稳定倾向、以及作为场域的相似用户间的共享规律。我们通过$\mathrm{PHF}_{ ext{Compass}}$实例化PHF,这是一种基于冻结LLM的轻量级且模型无关的实现。在语言模型个性化(LaMP)基准上的实验表明,该方法在多种任务上取得一致改进,进一步分析验证了所学行为结构的可解释性和可扩展性。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet personalizing their outputs to individual users remains an open challenge. Existing approaches predominantly adopt a flat behavioral paradigm, aggregating user behaviors without an explicit account of how they are organized into deeper behavioral structures. In this work, we draw on Pierre Bourdieu's Theory of Practice to propose PHF (Practice-Habitus-Field), a sociologically grounded framework that reconceptualizes LLM personalization through three hierarchical levels: individual behaviors as practices, their temporal accumulation into stable dispositions as habitus, and shared regularities across similar users as fields. We instantiate PHF through $\mathrm{PHF}_{\text{Compass}}$, a lightweight and model-agnostic implementation based on a frozen LLM. Experiments on the Language Model Personalization (LaMP) benchmark demonstrate consistent improvements across diverse tasks, while further analyses validate the interpretability and extensibility of the learned behavioral structures.

2606.02293 2026-06-02 cs.CL 版本更新

AI as a Tool for Simulation-Based Experiments in Literary Studies

AI作为文学研究中基于模拟的实验工具

Matthew Wilkens

发表机构 * Department of Information Science(信息科学系)

AI总结 本文探讨利用生成式AI进行受控、大规模、低成本的文学文化生产模拟实验,总结当前技术现状,并通过与人类小说对比的实验展示AI在文学文本生成中的初步成果。

详情
AI中文摘要

生成式人工智能系统通过受控、有依据、大规模、低成本的模拟文化生产,为文学研究中的实验开辟了新的可能性。当前系统尚未被证明能够生成高质量、长篇幅的叙事文本,并可靠地反映任意指定的文化约束或风格特征。但在文学历史模拟所需的各个组件上存在大量相关研究,包括:使用和验证AI系统作为可区分人类群体的代理;AI生成文本的叙事和风格特性;多智能体、多轮次AI模拟人类行为者的稳定性和连贯性;以及通过可预测方式改变生成系统知识和行为的技术方法。这些领域共同为基于AI的文学生产文化系统建模提供了更雄心勃勃的起点。我们描述了文学研究中基于模拟的实验的可能性和挑战,总结了相关领域的最新进展,并解释了工作的关键技术方面。为了提供一个与文学学者直接相关的例子,我们展示了文学文本生成实验的结果,包括与高地位、人类作者小说的比较。我们的结果首次展示了AI模型在该领域内(有限的)分布内输出。最后,我们描述了未来使用AI进行完整反事实文学历史模拟的工作。

英文摘要

Generative artificial intelligence (AI) systems open new possibilities for experimentation in literary studies via controlled, grounded, large-scale, low-cost simulations of cultural production. Current systems have not yet been shown to produce high-quality, book-length narrative texts that reliably reflect arbitrarily specified cultural constraints or stylistic features. But there exists substantial relevant research on each of the components required for literary-historical simulation. These include the use and validation of AI systems as proxies for differentiable human populations; the narrative and stylistic properties of AI-generated texts; the stability and coherence of multiagent, multiturn AI simulations of human actors; and technical methods through which to alter in predictable ways the knowledge and behavior of generative systems. Together, these areas could provide a starting point for more ambitious AI-based modeling of cultural systems of literary production. We describe the possibilities and challenges of simulation-based experiments in literary studies, summarize the current state of the art in relevant fields, and explain key technical aspects of the work. To provide an example directly relevant to literary scholars, we present the results of experiments on literary text generation, including comparisons to high-status, human-authored novels. Our results include the first demonstration of (limited) in-distribution outputs by AI models in this domain. We conclude with a description of future work on full counterfactual literary-historical simulations using AI.

2606.02289 2026-06-02 cs.CL 版本更新

DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations

DECK: LLM幻觉的一致性×置信度分类法

Mohit Singh Chauhan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种基于样本间一致性和词级置信度的2×2分类法(DECK),将LLM幻觉分为四个行为区域,每个区域对应可检测的评分器家族,并通过实验验证其有效性及识别出输出级不确定性评估的普遍盲点。

Comments 18 pages, 3 figures, 5 tables

详情
AI中文摘要

现有的幻觉分类法根据输出错误的内容(如记忆错误、推理失败、流畅编造)对LLM错误进行分类。这些分类法有助于诊断,但无法回答另一个问题:哪个不确定性评分器本可以捕捉到这个错误?我们提出一种补充性分类法,根据错误的可检测性特征(评分器家族能读取的信号)对错误进行分类。DECK分类法是一个2×2划分,沿样本间一致性和词级置信度分为四个行为区域(Drift、Entrenched、Confabulation、Knotted),每个区域映射到能够检测它的特定评分器家族(或多个家族):黑盒一致性评分器在D和C中有信号,白盒词概率评分器在K和C中有信号,只有经过独立预训练的LLM-as-a-Judge才能检测E。通过在每个评分器轴上使用Youden's J最优分割来操作化单元成员关系。在三个模型和四个数据集上,我们通过两种方式验证该分类法:分析评分器对的不一致性,以及检查外部标签(SelfAware不可回答、HaluEval对抗性、PopQA实体流行度)是否落在预测的DECK单元中,并附带模型规模特定和内容特定的次级单元细化。我们进一步识别出输出级不确定性评估的一个普遍盲点:在知识缺口输入上,当生成器输出自信、可重复的编造时,每个输出级家族在构造上都会失效。对Llama-3-8B隐藏状态的线性探针也失效至随机水平,初步证据表明该失败可能在激活层面持续存在;更丰富的内部状态方法(不确定性评估头、信息论估计器)仍有待测试。

英文摘要

Existing hallucination taxonomies classify LLM errors by what is wrong with the output -- memorised misconceptions, reasoning failures, fluent fabrications. These taxonomies are useful for diagnosis but cannot answer a different question: which uncertainty scorer would have caught this error? We propose a complementary taxonomy that classifies errors by their detectability signature -- the signal a scorer family would read. The DECK taxonomy is a 2x2 partition along inter-sample consistency and token-level confidence into four behavioural regimes (Drift, Entrenched, Confabulation, Knotted), each mapping to a specific scorer family (or families) that can detect it: black-box consistency scorers have signal in D and C, white-box token-probability scorers have signal in K and C, and only an LLM-as-a-Judge with independent pretraining can detect E. Cell membership is operationalised by a Youden's J optimal split on each scorer axis. Across three models and four datasets we validate the taxonomy two ways: by analysing scorer-pair disagreement, and by checking that external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells, with model-scale and content-specific secondary-cell refinements. We further identify a universal blind spot of output-level UQ: on knowledge-gap inputs where the generator emits confident, repeatable fabrications, every output-level family collapses by construction. A linear probe on Llama-3-8B's hidden states also collapses to chance, giving preliminary evidence that the failure may persist at the activation level; richer internal-state methods (UQ heads, information-theoretic estimators) remain to be tested.

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Cross-modal linkage risk in clinical vision-language models

临床视觉-语言模型中的跨模态链接风险

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

发表机构 * Lab for AI in Medicine(医学人工智能实验室) RWTH Aachen University(亚琛工业大学) Department of Diagnostic and Interventional Radiology(诊断与介入放射学部门)

AI总结 研究临床视觉-语言模型(VLM)在图像与报告分离场景下通过余弦相似度实现跨模态重链接的风险,并采用仅对投影头进行差分隐私微调的方法在保持图像效用同时显著降低重链接率。

详情
AI中文摘要

在配对胸部X光片和放射学报告上训练的视觉-语言模型(VLM)学习了一个共享嵌入空间,该空间可以保留实例级别的图像-报告对应关系。这在故意将X光片和报告在获取后分开的场景中(例如仅图像数据共享或受控访问的报告)构成了隐私风险,因为一个去标识的图像可能仅通过余弦相似度就重新链接到其原始叙述性报告。我们将此形式化为图像到报告的检索,并使用公共配对队列(其中真实配对是已知的)作为基准来审计风险,而不是作为隐私场景。在来自MIMIC-CXR(43,793个保留对)和外部CheXpert Plus(29,296个对)的126,804名患者的406,241个配对示例上评估了临床专业化程度递增的VLM,我们发现重链接率随专业化程度系统性地上升:最强的VLM在候选池N=100时以15倍随机概率检索到正确报告,在N=10,000时以50倍随机概率,在全数据库规模下仍远高于随机概率。该信号在去除疾病标签捷径的病理匹配困难负样本下仍然存在,表明对应关系超出了广泛的诊断类别。为了在不重新训练的情况下减少这种风险,我们冻结了两个编码器,仅对定义对齐层的投影头应用差分隐私优化(epsilon=0.34,delta=6x10^-6)。这使得MIMIC-CXR上N=10,000时的Recall@1降低了61.8%,并无需重新训练即可迁移到CheXpert Plus,同时图像侧效用基本保持:线性探针分类在14个标签上的宏AUROC仅从79.63%变为79.43%。对共享对齐层的定向DP微调可以大幅减少跨模态重链接,而不会实质性降低使这些模型在临床上有用的图像表示。

英文摘要

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6x10-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.

2606.02255 2026-06-02 cs.CL cs.AI 版本更新

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

谁在NLP中进行标注?2018年至2025年人工标注报告的大规模评估

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen, Christian Greisinger, Lotta Kiefer, Christoph Leiter, Subhadeep Roy, Tewodros Achamaleh, Muhammad Arslan Manzoor, Sebastian Pohl, Yufang Hou, Steffen Eger

发表机构 * NLLG Lab University of Technology Nuremberg(NLLG实验室 梅尔堡技术大学) Interdisciplinary Transformation University(跨学科转型大学)

AI总结 本研究通过大规模审计NLP论文中的人工标注报告,发现标注细节报告不完整,并提出了改进报告质量的框架和建议。

详情
AI中文摘要

人工标注是许多NLP研究的经验基础,从数据集构建到模型评估,但论文往往不清楚谁产生了标注以及如何控制标注过程。我们首次对主要NLP会议中的人工标注报告进行了大规模、任务级别的审计,询问哪些标注细节被记录,哪些缺失,以及报告如何随时间、主题、会议和人工判断的预期用途而变化。我们引入了一个统一的标注报告实践分类法,并针对Annotated-gold(一个由41篇论文和72个标注任务组成的人工裁决黄金标准)验证了一个LLM辅助的提取流程,其中最佳模型与裁决标签达到了与人类相当的一致性,Krippendorff's alpha为0.606,而人类间一致性为0.585。利用该流程,我们构建了Annotated-llm数据集,涵盖2018-2025年ACL会议论文,包含来自1603篇论文的2667个提取的标注任务,发现论文经常报告操作细节,如招募策略、标注者专业知识和标注量,但往往省略评估标注有效性所需的细节,包括培训、语言能力、报酬、社会人口统计、裁决和一致性值,尤其是在模型评估研究中。我们的结果表明,NLP中的标注报告随时间有所改善,但仍不均衡,我们建立了一个可扩展的框架和最低报告建议,以使人工标注更可靠、可重复和可解释。

英文摘要

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.

2606.02252 2026-06-02 cs.CL 版本更新

ResMerge: Residual-based Spectral Merging of Large Language Models

ResMerge:基于残差的谱合并大型语言模型

Yandu Sun, Zhiyan Hou, Haokai Ma, Yuheng Jia, Junfeng Fang, Haiyun Guo, Hongyan An, weizhen wang, Jinqiao Wang

发表机构 * Southeast University(东南大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Chinese Academy of Sciences(中国科学院大学) National University of Singapore(新加坡国立大学) Wuhan University of Technology(武汉理工大学) Peking University(北京大学) Wuhan AI Research(武汉人工智能研究院)

AI总结 提出ResMerge框架,通过将RL任务向量分解为谱头部和残差组件,利用残差组件构建稳定骨干并选择性引入头部信息,实现无需训练的专家模型合并。

Comments 14 pages including appendix

详情
AI中文摘要

模型合并提供了一种无需训练的方式组合多个后训练专家模型,但合并通过强化学习(RL)获得的专家仍然具有挑战性。现有的谱合并方法通常假设主导奇异方向包含主要任务信号,而低能量残差组件可以被压缩、选择或衰减以减少干扰。我们发现这一假设不适用于RL任务向量:将每个任务向量分解为谱头部和残差组件后,两部分都能独立恢复大量行为知识,同时表现出不同的合并特性。头部高度集中且信息丰富,但更容易产生尖锐的跨专家冲突,而残差组件更分散,为聚合提供了更稳定的基础。基于这一观察,我们提出了ResMerge,一种针对RL专家的基于残差的谱合并框架。ResMerge首先通过球形残差共识自适应构建稳定的残差骨干,该算法在Frobenius球面上估计一个可靠性加权的共识方向。然后通过轻量级头部校正模块重新引入头部信息,该模块由正跨专家一致性门控。在多个RL专家组和能力领域的实验表明,ResMerge比代表性的任务向量和谱合并基线更好地保留了专家能力。ResMerge的实现可在https://github.com/sunyd0303-cpu/ResMerge-release公开获取。

英文摘要

Model merging offers a training-free way to combine multiple post-trained expert models, but merging experts obtained through reinforcement learning (RL) remains challenging. Existing spectral merging methods often assume that leading singular directions contain the main task signal, while lower-energy residual components can be compressed, selected, or attenuated to reduce interference. We find that this assumption does not hold for RL task vectors: after decomposing each task vector into a leading spectral head and a residual component, both parts can independently recover substantial behavior knowledge, while exhibiting different merging properties. The head is highly concentrated and informative but more prone to sharp cross-expert conflicts, whereas the residual component is more dispersed and provides a more stable basis for aggregation. Based on this observation, we propose ResMerge, a residual-based spectral merging framework for RL experts. ResMerge first constructs a stable residual backbone with Spherical Residual Consensus Adaptation, which estimates a reliability-weighted consensus direction on the Frobenius sphere. It then reintroduces leading-head information through a Lightweight Head Correction module gated by positive cross-expert agreement. Experiments across multiple RL expert groups and capability domains show that ResMerge better preserves expert capabilities than representative task-vector and spectral merging baselines. The implementation of ResMerge is publicly available at https://github.com/sunyd0303-cpu/ResMerge-release.

2606.02248 2026-06-02 cs.CL 版本更新

Geometric Latent Reasoning Induces Shorter Generations in LLMs

几何潜在推理诱导LLM生成更短的文本

Shashi Kumar, Yacouba Kaloga, Petr Motlicek, Ina Kodrasi, Andrea Cavallaro

发表机构 * Idiap Research Institute(日内瓦研究所) EPFL(苏黎世联邦理工学院) BUT(布拉格技术大学)

AI总结 提出几何潜在推理(GLR)方法,通过轻量级过渡头在嵌入空间中预测迭代方向更新,近似离散推理轨迹,从而在不显式优化长度的情况下显著缩短LLM的生成步数。

详情
AI中文摘要

大型语言模型通过生成冗长的显式推理令牌链来解决复杂问题。虽然有效,但这使得推理成本高昂、对长度敏感,并受限于(离散的)自然语言。虽然潜在推理提供了连续的替代方案,但确定中间潜在状态的有用结构仍是一个开放挑战。在本文中,我们将潜在推理公式化为模型预训练令牌嵌入空间中的几何路径逼近问题。我们引入了几何潜在推理(GLR),它使用轻量级过渡头来预测嵌入空间中的迭代方向更新。利用文本思维链轨迹作为锚点,GLR学习逼近离散推理轨迹,同时允许偏离精确令牌嵌入的连续偏差。使用Qwen3模型在数学推理基准上的评估揭示了一个涌现现象:几何潜在推理在没有显式长度目标的情况下诱导出显著更短的生成。通过用连续的潜在步骤替换早期的显式推理,模型通常使用更少的总生成步骤达到正确答案。这些发现表明,连续轨迹充当紧凑的中间推理状态,揭示了潜在计算预算、输出长度和准确性之间的新权衡。

英文摘要

Large language models solve complex problems by generating lengthy chains of explicit reasoning tokens. While effective, this makes reasoning expensive, length-sensitive, and constrained to (discrete) natural language. While latent reasoning offers a continuous alternative, determining useful structures for intermediate latent states is an open challenge. In this paper, we formulate latent reasoning as a geometric path-approximation problem within the model's pretrained token-embedding space. We introduce Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space. Using textual chain-of-thought traces as anchors, GLR learns to approximate discrete reasoning trajectories while permitting continuous deviations from exact token embeddings. Evaluations on mathematical reasoning benchmarks using Qwen3 models reveal an emergent phenomenon: geometric latent reasoning induces substantially shorter generations without an explicit length objective. By replacing early explicit reasoning with continuous latent steps, models often reach correct answers using substantially fewer total generation steps. These findings suggest that continuous trajectories act as compact intermediate reasoning states, exposing a new tradeoff between latent computation budget, output length, and accuracy.

2606.02245 2026-06-02 cs.CL 版本更新

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

当知识并非免费:检索增强生成中的成本感知证据选择

Mingyan Wu, Han Yang, Omer Ben-Porat, Yftah Ziser

发表机构 * Northeastern University(东北大学) Technical University of Munich(慕尼黑技术大学) GESIS – Leibniz Institute for the Social Sciences(GESIS——莱比锡社会科学研究所) Technion–Israel Institute of Technology(技术学院——以色列理工学院) NVIDIA Research(NVIDIA研究) University of Groningen(格罗宁根大学)

AI总结 提出成本感知RAG设置,通过访问成本层级和预算约束,研究证据选择策略,发现静态选择脆弱而智能体方法有潜力但依赖模型和任务。

详情
AI中文摘要

检索增强生成(RAG)通常假设外部知识是免费的,但许多高质量来源需要付费、许可、受限或以其他方式访问成本高昂。我们引入了成本感知RAG,这是一种设置,其中检索到的证据被分配访问成本层级,系统必须在明确的证据访问预算下回答问题。我们通过为MS MARCO v2.1添加访问摩擦层级来实例化这一设置,并在通用领域和特定领域QA基准上评估预算证据选择。我们的结果表明,静态选择是脆弱的:没有固定的选择器能统一占优,更大的预算并不能可靠地提高答案质量,即使高成本证据与领域匹配。然后我们研究了智能体成本感知RAG,其中LLM决定何时检索、访问哪个层级以及何时停止。智能体作为自适应证据获取控制器显示出强大的潜力,但其行为仍然高度依赖于模型和任务。这些发现表明,成本感知的证据获取是下一代RAG系统的核心挑战。所有代码和数据可在https://github.com/Mignonmy/Cost-Aware获取。

英文摘要

Retrieval-Augmented Generation (RAG) typically assumes that external knowledge is free, but many high-quality sources are paywalled, licensed, restricted, or otherwise costly to access. We introduce cost-aware RAG, a setting where retrieved evidence is assigned access-cost tiers and systems must answer under an explicit evidence-access budget. We instantiate this setting by augmenting MS MARCO v2.1 with access-friction tiers and evaluate budgeted evidence selection across general-domain and domain-specific QA benchmarks. Our results show that static selection is brittle: no fixed selector uniformly dominates, and larger budgets do not reliably improve answer quality, even when costly evidence is domain-matched. We then study agentic cost-aware RAG, where an LLM decides when to retrieve, which tier to access, and when to stop. Agents show strong promise as adaptive evidence-acquisition controllers, but their behavior remains highly model- and task-dependent. These findings suggest that cost-aware evidence acquisition is a central challenge for the next generation of RAG systems. All code and data are available at https://github.com/Mignonmy/Cost-Aware.

2606.02215 2026-06-02 cs.CL cs.SI 版本更新

Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes

经验更优:用于证据基础健康社区笔记的自进化LLM智能体

Zihang Fu, Fanxiao Li, Jianyang Gu, Haonan Wang, Preslav Nakov, Bryan Hooi, Min-Yen Kan, Jiaying Wu

发表机构 * National University of Singapore(国立新加坡大学) Yunnan University(云南大学) The Ohio State University(俄亥俄州立大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出EvoNote框架,通过自进化经验记忆和细粒度信用分配,在健康社区笔记生成中实现证据获取、分析与撰写,显著提升笔记质量并缩短生成时间。

详情
AI中文摘要

大型语言模型增强的社区笔记为社交媒体上健康错误信息的及时、基于证据的纠正提供了一条可扩展的路径。然而,它们仍然在每条帖子后重置,留下了先前案例中未使用的有用纠正经验。我们引入了EvoNote,一个智能体框架,通过先前错误信息纠正事件的自进化经验记忆,使健康社区笔记生成能够自我进化。其核心是细粒度信用分配:EvoNote将轨迹级反馈基于健康特定的笔记质量进行归因,并将其提炼为行动级记忆,用于声明分析、证据获取和笔记撰写。我们在MM-HealthCN上评估了EvoNote,这是一个包含1200个实例的多模态基准测试,包括用户标记的健康帖子、人工撰写的社区笔记和众包有用性标签。在一个人工验证的分层效用评判器下,EvoNote生成的笔记在89.6%的案例中优于对应的人工笔记;在一组没有众包有用性判定的“需要更多评分”帖子中,EvoNote为82.0%的案例生成了有用的笔记。它还将生成候选纠正所需的中位时间从人工笔记流程的超过13小时减少到不到2分钟。分析将这些收益归因于更强的证据使用和可复用的纠正策略,将自进化笔记生成定位为健康错误信息治理的一种有前景的范式。

英文摘要

Large Language Model (LLM)-augmented Community Notes offer a scalable path for timely, evidence-grounded correction of health misinformation on social platforms. However, they still reset at every post, leaving useful correction experience from prior cases unused. We introduce EvoNote, an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment: EvoNote grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing. We evaluate EvoNote on MM-HealthCN, a 1.2K-instance multimodal benchmark of user-flagged health posts with human-written Community Notes and crowd-derived helpfulness labels. Under a human-validated hierarchical utility judge, EvoNote-generated notes are preferred over corresponding human-written notes in 89.6% of cases; on a separate set of Needs More Ratings posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases. It also reduces the median time needed to produce a candidate correction from over 13 hours in the human-note pipeline to under 2 minutes. Analyses link these gains to stronger evidence use and reusable correction strategies, positioning self-evolving note generation as a promising paradigm for health misinformation governance.

2606.02214 2026-06-02 cs.CL 版本更新

Do Gender Cues Affect LLM Value Trade-offs? Evidence from a Controlled Decision Benchmark

性别线索是否影响LLM价值权衡?来自受控决策基准的证据

Yangyang Liu, Dong Yu, Pengyuan Liu

发表机构 * Beijing Language and Culture University(北京语言大学)

AI总结 通过构建受控基准RVDB,研究性别线索是否导致大语言模型在价值决策中产生系统性翻转,并发现性别效应集中在价值边界模糊和决策严重性高的情境下,且模型自我归因常掩盖性别影响。

详情
AI中文摘要

大型语言模型越来越多地用于价值敏感的决策场景,在这些场景中,无关的人口统计线索不应改变判断。我们构建了现实价值决策基准(RVDB),这是一个受控基准,仅改变角色-性别配置,同时保持场景、有序价值对、角色、候选决策、价值距离和决策严重性固定。使用跨七个模型的位置平衡评估,我们测试模型是否在性别扰动下保持决策不变性,以及它们的自我归因是否反映观察到的行为变化。我们发现,显式性别线索会引起有限但系统的决策翻转,包括在显式性别归因提示下,该提示要求模型报告性别是否影响其选择。跨性别角色交换揭示了一致的女性提议决策不对称性,而模型通常将翻转的决策归因于无影响或其他非性别因素。进一步分析表明,性别效应集中在价值边界较不明确和决策背景更严重的情况下,表明性别线索作为局部边界移动因素,而非价值推理的全局覆盖。价值排名基本保持稳定,但有序价值对权衡在不同角色-性别配置中不均匀地变化。这些结果表明,性别可以在行为上进入LLM价值权衡,同时在自我归因中被掩盖,这促使在基于解释的评估之外进行受控行为审计。

英文摘要

Large language models are increasingly used in value-sensitive decision settings, where irrelevant demographic cues should not alter judgments. We construct the Realistic Value Decision Benchmark (RVDB), a controlled benchmark that varies only the role-gender configuration while holding the scenario, ordered value pair, roles, candidate decisions, Value Distance, and Decision Severity fixed. Using a position-balanced evaluation across seven models, we test whether models preserve decision invariance under gender perturbations and whether their self-attributions reflect observed behavioral changes. We find that explicit gender cues induce bounded but systematic decision flips, including under an explicit gender-attribution prompt that asks models to report whether gender influenced their choice. Cross-gender role swaps reveal a consistent female-proposed-decision asymmetry, while models often attribute flipped decisions to No Influence or other non-gender factors. Further analysis shows that gender effects concentrate near less determinate value boundaries and under more severe decision contexts, suggesting that gender cues act as local boundary-shifting factors rather than global overrides of value reasoning. Value rankings remain largely stable, but ordered value-pair trade-offs shift unevenly across role-gender configurations. These results show that gender can enter LLM value trade-offs behaviorally while remaining obscured in self-attribution, motivating controlled behavioral audits beyond explanation-based evaluation.

2606.02211 2026-06-02 cs.CL cs.AI 版本更新

Consistency Training while Mitigating Obfuscation via Rate Matching

通过速率匹配缓解混淆的一致性训练

Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出速率匹配一致性训练(RMCT),通过匹配目标行为率而非约束表达方式,在减少模型受无关特征影响的同时避免混淆,提升可监控性。

详情
AI中文摘要

大型语言模型常常受到无关输入特征的影响,例如揭示用户偏好答案的线索。一致性训练通过训练模型在具有和不具有无关特征的输入上表现相似来减少这种影响。然而,现有方法在整个响应或内部激活上训练一致性,这也限制了模型是否表达这些无关特征。我们表明这会导致混淆,即模型学会不提及线索但仍受其影响,这可能削弱可监控性。为了解决这个问题,我们引入了速率匹配一致性训练(RMCT),它在选定的行为属性上训练一致性,而不约束这种行为如何表达。RMCT匹配模型在输入扰动下表现出目标行为(例如,遵循偏见线索)的速率,而不是要求具有和不具有无关特征的配对输入,从而将一致性训练扩展到无法移除无关特征的场景。我们在两个开放权重语言模型上评估了RMCT在减少谄媚方面的效果,在保留的偏见类型上实现了与标准一致性训练基线相当的偏见遵循减少,同时很大程度上保留了模型表达偏见线索的倾向。此外,我们发现RMCT在我们的实验中更节省数据,但计算效率较低。总体而言,RMCT表明一致性训练可以在不直接牺牲可监控性的情况下提高行为鲁棒性。

英文摘要

Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.

2606.02204 2026-06-02 cs.CL 版本更新

Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents

跨环境神经重排序用于文本智能体中样本高效的动作选择

Kan Shao

发表机构 * Jinglue Technology Development (Nanjing) Co., Ltd.(江苏 Jinglue 技术发展(南京)有限公司)

AI总结 研究通过联合训练轻量级神经重排序器(DeBERTa-v3)在多个文本环境(ALFWorld、WebShop、ScienceWorld)中实现样本高效的动作选择,发现重平衡联合训练显著提升性能,跨环境适应仅需9.2%目标域数据即可恢复93%全数据性能,数据多样性是主要驱动力。

Comments 11 pages, 4 figures, 6 tables

详情
AI中文摘要

大型语言模型智能体在基于文本的基准测试中取得了强劲性能,但推理成本过高,促使使用紧凑的神经重排序器进行动作选择。我们研究单个轻量级模型是否能在多个不同环境中执行动作选择,这种能力将消除每个环境模型的维护成本。在ALFWorld、WebShop和ScienceWorld上联合训练DeBERTa-v3(184M-434M参数),并采用少数类上采样,我们发现重平衡的双环境联合训练显著提升了单环境ALFWorld性能(净增益+0.412),同时保持了具有竞争力的WebShop性能(+0.214 vs. 单环境+0.249)。三环境训练在4个种子上平均组合净增益为+0.551 +/- 0.024,每个环境的结果接近专门的单环境模型,同时提供正向跨域迁移。跨环境适应具有极高的样本效率:仅用9.2%的目标域数据微调即可恢复93%的全数据性能,且扩展模型容量收益有限,表明数据多样性是主要驱动力。基于环境感知的LoRA适配器路由结合PCGrad实现了最佳种子结果+0.611(种子42),种子456和789分别为+0.554和+0.559,但由于种子123降至+0.263(4种子均值+0.497 +/- 0.158),表现出高方差,这是一个有前景但目前不稳定的方向。使用干净分割和数据重平衡的联合训练是关键要素。我们将发布包含51,580个训练实例(41,740个原始唯一状态,带少数类上采样)的三环境基准测试,以及所有模型检查点(接收后)。

英文摘要

Large language model agents achieve strong performance on text-based benchmarks but incur prohibitive inference costs, motivating the use of compact neural rerankers for action selection. We investigate whether a single lightweight model can perform action selection across multiple diverse environments, a capability that would eliminate per-environment model maintenance. Training DeBERTa-v3 (184M-434M parameters) jointly on ALFWorld, WebShop, and ScienceWorld with minority-class upsampling, we find that rebalanced two-environment joint training substantially improves over single-environment ALFWorld performance (net gain +0.412) while maintaining competitive WebShop performance (+0.214 vs. +0.249 single-environment). Three-environment training yields a mean combined net gain of +0.551 +/- 0.024 across 4 seeds, with per-environment results approaching specialized single-environment models while providing positive cross-domain transfer. Cross-environment adaptation is highly sample-efficient: fine-tuning on only 9.2% of target-domain data recovers 93% of full-data performance, and scaling model capacity yields limited benefits, indicating data diversity is the primary driver. Environment-aware LoRA adapter routing with PCGrad achieves a best-seed result of +0.611 (seed 42), with seeds 456 and 789 at +0.554 and +0.559, but exhibits high variance due to seed 123 collapsing to +0.263 (4-seed mean +0.497 +/- 0.158), representing a promising but currently unstable direction. Joint training with clean splits and data rebalancing is a key ingredient. We will release our three-environment benchmark of 51,580 training instances (41,740 raw unique states with minority-class upsampling) and all model checkpoints upon acceptance.

2606.02170 2026-06-02 cs.CL 版本更新

CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning

CRAFTQA: 一种用于复杂结构化数据推理的代码驱动自适应框架

Chengtao Gan, Zhiqiang Liu, Long Jin, Yushan Zhu, Lei Liang, Wen Zhang

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) JIUTIAN Research(JIUTIAN研究) ZJU-Ant Group Joint Lab of Knowledge Graph(浙江大学-蚂蚁集团知识图谱联合实验室)

AI总结 提出CRAFTQA框架,通过CodeSTEP模块生成逐步代码推理序列,并利用CRAFT模块动态生成自定义代码函数,以突破预定义函数集的限制,在复杂结构化数据推理任务中显著优于现有统一方法。

Comments Accepted by Findings of ACL 2026

详情
AI中文摘要

现实场景涉及大量异构结构化数据(例如表格、知识图谱),因此对这些多样化数据进行有效推理变得越来越重要。统一结构化数据问答已成为一个突出的研究趋势,旨在单一框架内回答跨不同结构化数据类型的自然语言问题。然而,现有的统一方法有一个共同的局限性:它们依赖于一组预定义函数,这限制了它们在超出这些预定义操作之外执行复杂推理的能力。为了克服这一根本性限制,我们提出了CRAFTQA,一种新颖的自适应代码驱动框架,包含两个核心模块:CodeSTEP和CRAFT。CodeSTEP模块是一种范式,它生成一个完整的可执行Python代码序列,其中包含基于问题的逐步代码推理操作。CRAFT模块为超出预定义函数集的操作动态生成自定义代码函数,并与CodeSTEP无缝集成,显著增强了处理复杂推理的灵活性。在多个结构化数据集上的综合实验表明,与现有的统一方法相比,CRAFTQA在复杂推理场景中取得了显著的改进。

英文摘要

Real-world scenarios involve massive heterogeneous structured data (e.g., tables, knowledge graphs), making effective reasoning over such diverse data increasingly important. Unified structured data question answering has emerged as a prominent research trend, aiming to answer natural language questions across different structured data types within a single framework. However, existing unified methods share a common limitation: they rely on a set of predefined functions, which restricts their ability to perform complex reasoning beyond these predefined operations. To overcome this fundamental limitation, we propose CRAFTQA, a novel adaptive code-driven framework comprising two core modules, CodeSTEP and CRAFT. The CodeSTEP module is a paradigm that generates a complete executable Python code sequence, which contains step-by-step code-based reasoning operations based on the question. The CRAFT module dynamically generates custom code functions for operations beyond the predefined function set, and seamlessly integrates with CodeSTEP to significantly enhance flexibility in handling complex reasoning. Comprehensive experiments on multiple structured datasets demonstrate that CRAFTQA achieves remarkable improvements in complex reasoning scenarios compared to existing unified methods.

2606.02162 2026-06-02 cs.CV cs.AI cs.CL cs.IR 版本更新

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

视觉丰富文档类型分类的多模态方法:一项比较分析

Catyana Heyne, Jürgen Frikel, Filippo Riccio

AI总结 针对视觉丰富文档类型分类中多模态建模策略难以系统比较的问题,本文在统一实验框架下对基于Transformer和LLM的四种代表性模型进行受控对比,发现专用多模态Transformer优于LLM方法,且图像信息贡献最大。

详情
AI中文摘要

视觉丰富文档中的文档类型分类仍然具有挑战性,因为相关信息分布在文本、视觉和布局模态中。为了捕捉这种复杂性,当前方法依赖于多样化的多模态建模策略,导致异构架构使得系统比较复杂化。这种变异性也反映在现有的比较研究中,这些研究通常依赖于异构评估设置,进一步复杂化了系统比较,并使得评估进展变得困难。为了解决这些局限性,本文提供了跨基于Transformer和基于LLM架构的多模态设计策略的结构化分析,并结合统一实验框架内的受控实证比较。具体来说,在RVL-CDIP基准上评估了四种代表性模型(LayoutLMv3、Donut、Qwen3-VL-32B-Instruct和Qwen3-32B),以系统分析文本、图像和布局信息对文档类型分类的贡献,特别关注对比OCR依赖和OCR无关的方法。结果表明,专用多模态Transformer在视觉丰富和布局密集型文档上优于基于LLM的方法。图像信息对可靠分类贡献最大,而OCR派生的文本提供有用但次要的支持。这些发现强调,对于具有显著布局结构的文档,多模态处理仍然是必不可少的。总体而言,该研究为比较多模态架构提供了系统基础,并为选择有效的特征组合和模型设计以进行文档类型分类提供了实用指导。

英文摘要

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

2606.02161 2026-06-02 cs.CV cs.CL 版本更新

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

InfoMerge: 信息感知的令牌压缩用于高效视频大语言模型

Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin, Lei Xie, Sanglu Lu

发表机构 * State Key Laboratory of Novel Software Technology(新型软件技术国家重点实验室)

AI总结 提出InfoMerge,一种无需训练的视觉令牌压缩方法,通过鲁棒冗余估计和内容感知预算分配,在减少85%视觉令牌的同时保持98.8%性能,实现4.24倍预填充加速。

Comments 15 pages, 8 figures

详情
AI中文摘要

视频大语言模型在视频理解中表现出色,但过多的视觉令牌带来了巨大的计算开销。现有的免训练压缩方法通过减少视觉令牌来提高推理效率,但它们通常依赖局部相邻帧相似性进行时间冗余估计,或主要根据片段长度分配令牌预算。这种设计对帧级噪声敏感,且无法捕捉真实视频的非均匀信息分布。为解决这些挑战,我们提出InfoMerge,一种无需训练的视觉令牌压缩方法,通过鲁棒冗余估计和内容感知预算分配来提高令牌利用率。具体来说,我们提出时间指纹差异:一种片段级二阶时间冗余估计策略,用于建模每个片段内相同空间位置令牌的时间相似性结构。我们进一步引入内容感知预算分配(CABA),根据片段独特性和基于谱熵的表征丰富性动态分配片段级令牌预算。通过减少对冗余静态区域的重复保留,并将更多令牌分配给信息丰富的片段,InfoMerge在保持强大性能的同时更好地利用了有限的令牌预算。大量实验表明,InfoMerge在多个基准和骨干网络上实现了强效的精度-效率权衡,在激进压缩下优势更为明显。在LLaVA-OneVision-7B上,InfoMerge保留了原始平均性能的98.8%,同时减少了85%的视觉令牌,并在预填充阶段实现了4.24倍的加速。

英文摘要

Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they often rely on local adjacent-frame similarity for temporal redundancy estimation or allocate token budgets mainly according to segment length. Such designs are sensitive to frame-level noise and fail to capture the non-uniform information distribution of real-world videos. To address these challenges, we propose InfoMerge, a training-free visual token compression method that improves token utilization through robust redundancy estimation and content-aware budget allocation. Specifically, we propose the Temporal Fingerprint Difference: a segment-level second-order temporal redundancy estimation strategy, which models the temporal similarity structure of tokens at the same spatial positions within each segment. We further introduce Content-Aware Budget Allocation (CABA), which dynamically allocates segment-level token budgets based on segment uniqueness and spectral-entropy-based representational richness. By reducing repeated preservation of redundant static regions and allocating more tokens to informative segments, InfoMerge makes better use of the limited token budget while maintaining strong performance. Extensive experiments show that InfoMerge achieves strong efficiency--accuracy trade-offs across multiple benchmarks and backbones, with more pronounced advantages under aggressive compression. On LLaVA-OneVision-7B, InfoMerge retains 98.8\% of the original average performance while reducing 85\% of visual tokens and achieving a 4.24-fold speedup in the prefill stage.

2606.02158 2026-06-02 cs.CL 版本更新

On the Salience of Low-Probability Tokens for AI-Generated Text Detection: A Multiscale Uncertainty Perspective

低概率标记对AI生成文本检测的显著性:多尺度不确定性视角

Yikai Guo, Bin Wang, Xilai Fan, Wenjun Ke, Haoran Luo

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 针对AI生成文本检测中模板标记主导和点估计脆弱的问题,提出基于低概率标记的多尺度不确定性估计器Uncertainty及其改进版Uncertainty++,在多个数据集和LLM上验证了有效性、泛化性和鲁棒性。

Comments Accepted by ICML 2026 main conference

详情
AI中文摘要

AI生成的文本日益与人类写作融合,带来了诸如错误信息、学术滥用和语料库污染等实际风险。虽然统计检测器因高效和泛化能力而具有吸引力,但它们存在两个关键局限性:(i)模板标记主导,人类和LLM写作中共有的模板标记可能压倒判别信号;(ii)脆弱的点估计,依赖单一概率分数在对抗性操纵下产生不稳定的决策。为解决这些问题,我们提出Uncertainty,一种多尺度不确定性估计器,专注于信息丰富的低概率标记,这些标记更清晰地暴露分布差异。局部上,它通过平均低概率标记的对数概率来缓解模板标记主导;全局上,它通过Rényi熵捕捉该低概率区域的分布形状,从而降低脆弱性。我们进一步通过条件独立采样将检测器扩展为Uncertainty++,得到更稳定的不确定性估计。在七个数据集和十六个LLM上的实验证明了其高效性、泛化性和鲁棒性。我们的代码可在https://github.com/guoyikai2000/Uncertainty-AIGT获取。

英文摘要

AI-generated text increasingly blends with human writing, raising practical risks such as misinformation, academic misuse, and corpora contamination. While statistical detectors are appealing for efficiency and generalization, they suffer from two key limitations. (i) Boilerplate dominance, boilerplate tokens shared across human and LLM writing can overwhelm discriminative signals. (ii) Brittle point estimates, relying on a single probability score yields unstable decisions under adversarial manipulations. To address these issues, we propose Uncertainty, a multiscale uncertainty estimator that focuses on informative low-probability tokens, which more clearly expose distributional discrepancies. Locally, it alleviates boilerplate dominance by averaging the log-probabilities of low-probability tokens; globally, it reduces brittleness by capturing the distributional shape of this low-probability region via Rényi entropy. We further extend the detector to Uncertainty++ via conditional independent sampling, yielding a more stable uncertainty estimation. Experiments across seven datasets and sixteen LLMs demonstrate high effectiveness, generalization, and robustness. Our code is available at https://github.com/guoyikai2000/Uncertainty-AIGT.

2606.02147 2026-06-02 cs.CL cs.AI 版本更新

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

高、中、低资源语言中的句子和对话多语言习语

Saeed Almheiri, Bilal Elbouardi, Salsabila Zahirah Pranida, Irina Nikishina, Ashwath Rao B, Parameswari Krishnamurthy, Muhammad Cendekia Airlangga, Rifo Ahmad Genadi, Nguyen Phan Gia Bao, Amir Hossein Yari, Hawau Olamide Toyin, Nurdaulet Mukhituly, Mena Attia, Besher Hassan, Ahmad Fathan Hidayatullah, Tatsuki Kuribayashi, Haonan Li, Suma Bhat, Fajri Koto

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎德大学人工智能大学) University of Hamburg(汉堡大学) Manipal University(曼印大学) IIIT Hyderabad(海得拉尔IIIT) University of Science and Technology of Hanoi(河内科学技术大学) Universitas Islam Indonesia(印尼伊斯兰大学) Princeton University(普林斯顿大学)

AI总结 针对多语言习语理解,构建了覆盖3种高资源、3种中资源和12种低资源语言的MIDI数据集,包含句子和对话上下文中的字面与比喻用法,实验表明低资源语言理解更差,字面义比比喻义更难,对话上下文虽有改善但未消除差距。

详情
AI中文摘要

习语表达对多语言NLP构成重大挑战,因为其意义在比喻和字面用法之间转换,通常需要上下文才能准确理解。先前工作集中在高资源语言上,通常评估孤立的习语意义问题,忽略了现实话语。我们引入了MIDI,一个多语言习语数据集,涵盖3种高资源、3种中资源和12种低资源语言,由母语者策划。与之前的数据集不同,MIDI提供了嵌入在句子级和对话上下文中的习语,捕捉了字面和比喻解读。对最先进模型的基准测试表明,习语理解在低资源语言中下降,并且在所有资源层级中,字面解释比比喻解释更难。对话上下文提高了性能,但并未消除这些差异。通过受控测试和对隐藏表示的干预,我们进一步将记忆与推理分离,揭示了当前模型的核心局限性。

英文摘要

Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.

2606.02113 2026-06-02 cs.CL cs.AI 版本更新

A Primer in Post-Training Reasoning Data: What We Know About How It Works

后训练推理数据入门:我们对其运作机制的了解

Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun, Xiangzheng Zhang, Tong Yang

AI总结 本文综述了后训练推理数据的类型、效用、构建方法和扩展规律,为未来推理数据发布和后训练方案提供归因框架。

Comments 22 pages. Project Repository: https://github.com/RenBing-Sumeru/Awesome-LLM-Reasoning-Data

详情
AI中文摘要

后训练已成为大型推理模型近期进展的主要驱动力,而推理数据通常是决定这一阶段成功与否的关键变量。关于后训练推理数据的研究迅速增长,但相关文献仍分散在数据集论文、强化学习方案、奖励模型研究、基准测试和前沿系统报告中。本文是首篇综合了超过150篇关键公开研究和系统报告的后训练推理数据入门文章。我们围绕四个问题组织该领域:存在哪些数据对象、什么使它们有用、它们如何构建以及它们如何扩展。这一组织方式为未来的推理数据发布和后训练方案提供了归因框架。

英文摘要

Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly, yet this literature remains scattered across dataset papers, reinforcement-learning recipes, reward-model studies, benchmarks, and frontier system reports. This paper is the first primer to synthesize over 150 key public studies and system reports on post-training reasoning data. We organize the field around four questions: what data objects exist, what makes them useful, how they are constructed, and how they scale. Together, this organization provides an attribution framework for future reasoning-data releases and post-training recipes.

2606.02111 2026-06-02 cs.CV cs.AI cs.CL 版本更新

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

使用多片段视频破解多模态大语言模型

Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim

发表机构 * Department of Applied Artificial Intelligence, Sungkyunkwan University(应用人工智能系,成均馆大学) Department of Human-Artificial Intelligence Interaction, Sungkyunkwan University(人机交互系,成均馆大学)

AI总结 提出MCV SafetyBench数据集,通过多片段视频评估多模态大语言模型的安全漏洞,发现视频模态比图像更脆弱,动态和多样化上下文增加攻击成功率,并基于图像模态的鲁棒性提出防御策略。

Comments 27 pages, 20 figures, Accepted to the Main Conference of ACL 2026

详情
AI中文摘要

随着多模态大语言模型(MLLMs)发展到处理视频输入,人们开始担忧其被恶意滥用的可能性。先前的越狱研究表明,MLLMs中的安全对齐可以通过视觉输入被绕过,但尚不清楚视频输入的哪些属性导致了这种脆弱性。为填补这一空白,我们引入了Multi-Clip Video (MCV) SafetyBench,一个包含2,920个视频的数据集,旨在评估视频输入的多样性如何影响MLLMs的脆弱性。每个视频由多个短片段组成,描述与有害查询相关的不同上下文。对八个代表性视频MLLMs的实验表明,攻击成功率随着片段数量的增加而持续提高。我们的结果进一步表明,视频模态(1)比图像模态更脆弱,(2)对动态视频比对静态视频更脆弱,(3)当视频包含更多样化的上下文时更脆弱。基于这些发现,我们提出了一种利用图像模态相对鲁棒性的防御策略。

英文摘要

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.

2606.02100 2026-06-02 cs.CL 版本更新

PortBERT: Navigating the Depths of Portuguese Language Models

PortBERT:探索葡萄牙语语言模型的深度

Raphael Scheible-Schmitt, Henry He, Armando B. Mendes

AI总结 本文提出PortBERT,一种基于RoBERTa的葡萄牙语语言模型家族,通过字节级BPE分词和稳定预训练在超过450GB数据上训练,在ExtraGLUE基准上达到竞争性能,并重点分析了训练和推理效率。

详情
Journal ref
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, 2025, pp. 59-71
AI中文摘要

Transformer模型主导现代自然语言处理,但高效的语言特定模型仍然稀缺。在葡萄牙语中,大多数工作侧重于规模或准确性,往往忽略了训练和部署效率。在本文中,我们介绍了PortBERT,一个基于RoBERTa的葡萄牙语语言模型家族,旨在平衡性能和效率。使用fairseq在来自CulturaX的超过450GB去重和过滤的mC4和OSCAR23数据上从头训练,PortBERT利用字节级BPE分词以及在GPU和TPU处理器上的稳定预训练流程。我们发布了两个变体,PortBERT base和PortBERT large,并在ExtraGLUE(一组翻译的GLUE和SuperGLUE任务)上评估它们。两个模型都表现出竞争力,匹配或超越现有的单语和多语言模型。除了准确性,我们还报告了训练和推理时间以及微调吞吐量,提供了模型效率的实用见解。因此,PortBERT通过解决葡萄牙语NLP中计算-性能权衡这一未被充分探索的维度,补充了先前的工作。我们在Huggingface上发布所有模型,并提供fairseq检查点以支持进一步的研究和应用。

英文摘要

Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. In Portuguese, most focus on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Huggingface and provide fairseq checkpoints to support further research and applications.

2606.02093 2026-06-02 cs.CL cs.AI cs.LG 版本更新

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

不确定性量化中模糊性在错误预测中的作用

Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos

发表机构 * University of Cambridge(剑桥大学) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 通过解耦输入模糊性与不确定性信号,利用门控专家和选择性预测提升大语言模型在问答任务中的错误预测性能。

Comments 8 pages not including references and appendices, 3 figures

详情
AI中文摘要

错误预测任务,即预测模型输出是否正确,通常通过不确定性量化(UQ)来解决。然而,虽然不确定性指标捕捉了模型缺乏知识或能力进行预测的情况,但它们也反映了模型输入和上下文中固有的偶然不确定性。本文提出了一种通过将输入模糊性与UQ信号解耦来改进大语言模型(LLM)错误预测的方法。我们在问答(QA)任务上使用六种UQ指标进行实验,结果表明,UQ指标在无歧义实例上的错误预测能力优于具有多个合理答案的问题。我们使用门控专家和选择性预测将真实和预测的模糊性标签纳入错误预测流程。我们发现,模糊性信息提高了跨模型家族、训练和评估范式、数据集(包括据称无歧义的数据集)以及偶然不确定性来源的错误预测分数,在标准数据集上对单个UQ指标的PRR提升超过10个百分点。

英文摘要

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs), by disentangling input ambiguity from UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.

2606.02041 2026-06-02 cs.CL 版本更新

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

SentGuard:面向大型语言模型的句子级流式护栏

Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang

发表机构 * Fudan University(复旦大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出SentGuard,一种与生成并行运行的句子级流式护栏,通过轻量级等待缓冲区将流式令牌分组为句子块并仅释放已验证块,以在低延迟下实现高精度不安全内容检测。

Comments 16 pages, 5 figures, submitted to ARR

详情
AI中文摘要

大型语言模型越来越多地实时流式输出长篇幅、推理密集的响应,这使得何时进行审核与是否进行审核同样关键。现有的护栏分为两种不理想的极端:响应级方法延迟干预直到完整输出生成,而令牌级方法基于不完整的语义进行操作,往往产生不稳定的决策和过多的护栏调用。为应对这一挑战,我们提出SentGuard,一种与生成并行运行的句子级流式护栏。一个轻量级等待缓冲区将流式令牌分组为句子块,并仅向用户释放已验证的块,引入一个小偏移量,使得SentGuard能够在目标LLM解码后续内容时评估当前前缀。为支持这一点,我们构建了StreamSafe基准,包含8个危害类别的结构化逐句标注,捕捉推理和响应段中安全风险的演变。我们进一步使用从粗到细的目标训练SentGuard,以在不安全意图在句子边界出现时立即检测。在5个安全基准上的实验表明,SentGuard优于现有基线,在两个句子内检测到90.5%的不安全案例,同时保持7.41%的低流式误报率。

英文摘要

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.

2606.02020 2026-06-02 cs.CL cs.LG 版本更新

Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning

揭示思维链推理的熵动力学

Ting Xu, Xu He, Yupu Lu, Jiankai Sun, Dong Li, Wai Lam, Jianye Hao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过熵动力学揭示思维链推理的两阶段结构(不确定性区域和置信区域),并提出基于CUSUM变化点检测的无训练框架实现早期退出和测试时缩放,以提升推理效率与可靠性。

Comments 21 pages, 10 figures, accepted in ICML2026

详情
AI中文摘要

本文研究了思维链(CoT)的熵动力学,揭示了一致的两阶段结构:一个探索性的不确定性区域,然后急剧过渡到收敛的置信区域。我们证明置信区域具有两个关键性质:1)高可靠性——置信区域中的答案变得高度准确和稳定,以及2)高冗余性——模型在达到正确答案后生成长时间的不必要token。这些性质解锁了更高效和可靠的推理策略:1)早期退出利用可靠性和冗余性,在收益递减时安全终止计算,以及2)测试时缩放使用置信区域信号优先考虑收敛轨迹。为了实施这些见解,我们将置信区域检测建模为序列变化点检测问题,首次将经典变化点方法应用于监控CoT推理。使用累积和(CUSUM)算法(一种统计最优的变化点检测器),我们开发了一个无训练框架用于实时推理控制。实验表明,我们的方法为早期退出建立了优越的帕累托前沿。CUSUM在减少11.1% token的情况下达到63.06%的准确率,在准确率上分别超过DEER和Dynasor 3.28%和4.36%。对于测试时缩放,CUSUM加权投票始终优于自一致性。

英文摘要

This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability -- answers in the confidence region become highly accurate and stable, and 2) High Redundancy -- models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2)Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto-frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.

2606.02010 2026-06-02 cs.CL cs.AI 版本更新

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

PlanarBench: 通过平面图绘制评估LLM空间推理能力

Oleksandr Nikitin

发表机构 * tvori.info

AI总结 提出PlanarBench基准,通过让LLM根据边列表以ASCII艺术绘制平面图来评估其空间推理能力,发现边数是主要难度预测因子。

Comments 12 pages, 4 figures, https://github.com/wizzard0/planar-bench-as1073

详情
AI中文摘要

PlanarBench测试LLM是否能够仅根据边列表以ASCII艺术形式绘制平面图——这是一项抵抗记忆的空间推理任务,因为边的顺序、边的方向和节点标签都是可置换的。我们在199个最简单的非同构连通平面图(2-7个顶点)上评估了91个模型。边数是主要的难度预测因子(r = -0.85)——这一发现未在之前的LLM图基准测试中报告,这些基准仅使用节点数作为难度轴。

英文摘要

PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list -- a spatial reasoning task that resists memorization because edge order, edge orientation, and node labels are all permutable. We evaluate 91 models on the 199 simplest non-isomorphic connected planar graphs (2 - 7 vertices). Edge count is the dominant difficulty predictor ($r = -0.85$) -- a finding not reported in prior LLM graph benchmarks, which use only node count as the difficulty axis.

2606.02009 2026-06-02 cs.CL 版本更新

Automated Essay Scoring and Language Certification: Assessing Generalizability, Agreement and Validity for French

自动作文评分与语言认证:评估法语中的泛化性、一致性和有效性

Rodrigo Wilkens, Rémi Cardon, Vincent Folny, Thomas François

发表机构 * University of Exeter(埃克塞特大学) France Éducation international(法国教育国际) Cental, IL&C, UCLouvain Computer Science and Engineering Department, Universidad Carlos III de Madrid(马德里卡斯蒂利亚大学计算机科学与工程系)

AI总结 本文提出一个增强的论证有效性框架,通过公平性分析、语言特征相关性、预测误差评估和与人工评分的一致性比较,对8种模型架构在法语作文评分上进行多维评估,推进了法语自动作文评分的前沿。

详情
AI中文摘要

在自动作文评分(AES)中,基准测试实践促进了最小化评估方法,这与评估框架(如论证有效性框架ABV)的广泛视角建议形成对比,ABV主张对系统进行多维评估,特别是在高风险语言测试的背景下。在本文中,我们引入了一个增强且更实用的ABV框架版本,结合了公平性分析、与语言特征的相关性、预测误差评估以及与人工评分者的一致性比较。将该框架应用于法语AES,我们在一个包含27k篇考试作文(每篇2名评分者)的语料库和一个包含961篇作文(每篇至少9名评分者)的泛化语料库上比较了8种模型架构。我们的分析展示了应用ABV框架以更好地理解AES模型的能力和缺陷的益处,同时推进了法语AES的最新水平。

英文摘要

In Automated Essay Scoring (AES), benchmarking practices have fostered minimalist evaluation practices, in contrast with the broader-view recommendations of evaluation frameworks, such as the argument-based validation framework (ABV), which argued in favor of a multidimensional assessment of systems, especially in the context of high-stakes language tests. In this paper, we introduce an enhanced and more practical version of the ABV framework, incorporating fairness analysis, correlations with linguistic features, prediction error evaluation, and model agreement compared with human raters. Applying this framework to French AES, we compare 8 model architectures on a corpus of 27k exam essays (2 raters each) and a generalization corpus of 961 essays (at least nine raters each). Our analyses illustrate the benefits of applying the ABV framework to better understand the capabilities and pitfalls of AES models, while also advancing the state-of-the-art for French AES.

2606.02001 2026-06-02 cs.CL 版本更新

Scaling Agentic Capabilities via Grounded Interaction Synthesis

通过基于交互合成扩展智能体能力

Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao, Shuqing Bian, Wei Lu, Xiaoyong Du

发表机构 * Renmin University of China(中国人民大学) Peking University(北京大学) Tencent(腾讯)

AI总结 提出GAIS框架,通过两阶段接地机制(协议锚定环境和结构引导规划)自动生成多样化的环境和复杂任务,显著提升智能体在BFCL、τ²-Bench和ACEBench上的性能。

详情
AI中文摘要

通用智能体智能的关键在于与多样化的真实世界工具交互以完成复杂任务的能力,这种能力与交互数据的质量密切相关。为了规避人工标注的昂贵成本,现有范式完全依赖大型语言模型(LLMs)来扩展智能体环境和任务的合成。然而,这种无约束的生成常常退化为LLMs内部先验的有偏随机采样,无法捕捉真实世界领域的多样性和难度,也无法构建高保真、长周期的任务。在这项工作中,我们引入了基于交互合成(GAIS),这是一个通过两阶段接地机制自动构建多样化环境和复杂任务的框架。具体来说,我们构建了源自真实世界模型上下文协议(MCP)服务器的协议锚定环境,以确保功能多样性和难度。随后,我们采用结构引导规划来导航这些环境,主动施加逻辑依赖和对抗策略以生成复杂任务。在BFCL、τ²-Bench和ACEBench上的实验表明,GAIS合成的数据显著优于最先进的基线,使基础模型能够匹配甚至超越其官方指令微调版本。此外,GAIS展现出优越的数据效率和可扩展性,在显著减少数据量的情况下实现卓越能力,同时在基线停滞时保持持续增长。我们的代码和数据集可在https://github.com/Eric8932/GAIS公开获取。

英文摘要

General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on Large Language Models (LLMs) to scale the synthesis of agentic environments and tasks. However, such unconstrained generation often degenerates into biased random sampling of LLMs' internal priors, failing to capture the diversity and difficulty of real-world domains or construct high-fidelity, long-horizon tasks. In this work, we introduce Grounded Agentic Interaction Synthesis (GAIS), a framework that automates the scalable construction of diverse environments and complex tasks via a two-phase grounding mechanism. Specifically, we construct protocol-anchored environments derived from real-world Model Context Protocol (MCP) servers to ensure functional diversity and difficulty. Subsequently, we employ structure-guided planning to navigate these environments, actively enforcing logical dependencies and adversarial policies to generate complex tasks. Experiments on BFCL, $τ^2$-Bench, and ACEBench demonstrate that GAIS-synthesized data significantly outperforms state-of-the-art baselines, enabling base models to match or even surpass their official instruction-tuned counterparts. Furthermore, GAIS exhibits superior data efficiency and scalability, achieving exceptional capabilities with significantly less data while maintaining continuous growth where baselines stagnate. Our code and dataset are publicly available at https://github.com/Eric8932/GAIS.

2606.01995 2026-06-02 cs.CL 版本更新

CARTE: A Benchmark for Mapping Language Model Knowledge Across France

CARTE:法国语言模型知识映射基准

Sarah Almeida Carneiro, Christos Xypolopoulos, Xiao Fei, Yang Zhang, Michalis Vazirgiannis

发表机构 * École Polytechnique, Institut Polytechnique de Paris(巴黎政治学院) National Technical University of Athens(雅典国家技术大学) Mohamed bin Zayed University of Artificial Intelligence(姆阿扎德人工智能大学)

AI总结 提出CARTE基准,通过2431道多选题评估大语言模型在法国13个大区14个主题领域的细粒度区域知识,并引入CARTE-LV子集聚焦语言变异,实验发现模型在区域和规模上存在性能差异。

详情
AI中文摘要

我们推出了CARTE(文化锚定的区域-领土评估),这是一个多项选择基准,用于评估大语言模型(LLMs)在法国境内基于地理和区域差异的知识上进行细粒度推理的能力。虽然先前的基准侧重于国家层面的文化理解,但它们很大程度上忽略了国内差异以及区分密切相关区域背景的需求。CARTE通过引入涵盖法国13个大区和14个主题领域(包括文化、语言、人口、经济、环境和流动性)的2431个问题来填补这一空白。我们进一步推出了CARTE-LV,这是一个针对法国区域语言变异的子集,能够对语言相关差异进行集中评估。我们在少样本设置下评估了27个参数从1B到12B的LLMs。我们的实验揭示了跨区域和模型规模的性能差异,表明预训练覆盖存在系统性差距,且对国内变异的鲁棒性有限。

英文摘要

We introduce CARTE 1 (Culturally Anchored Regional-Territorial Evaluation), a multiplechoice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. While prior benchmarks focus on national-level cultural understanding, they largely overlook intra-country variation and the need to distinguish between closely related regional contexts. CARTE addresses this gap by introducing 2,431 questions spanning the 13 metropolitan regions of France and covering 14 thematic domains, including culture, language, demographics, economy, environment, and mobility. We further introduce CARTE-LV, a subset targeting Linguistic Variation across French regions, enabling focused evaluation of language-related differences. We evaluate 27 LLMs ranging from 1B to 12B parameters under few-shot settings. Our experiments reveal performance disparities across regions and model scales, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national variation.

2606.01993 2026-06-02 cs.CL cs.AI cs.LG 版本更新

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

MMG2Skill: 智能体能否从野外指南中提炼出自我进化的技能?

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

发表机构 * Nanjing University(南京大学) Kuaishou Technology(快手科技)

AI总结 提出MMG2Skill框架,将多模态异构的野外指南编译为可编辑技能,通过轨迹级根因反馈持续改进,在GUI控制、开放游戏和策略卡牌任务中显著提升VLM智能体性能。

Comments 35 pages, 12 figures, 13 tables. Code: https://github.com/NJU-LINK/MMG2Skill

详情
AI中文摘要

网络上丰富的程序性知识对于帮助智能体解决长程任务具有巨大潜力。然而,这些知识通常是多模态、异构、有噪声的,并且隐含地假设人类执行者,使得它们难以直接用作智能体所需的技能。为了弥合人类导向指南与智能体可执行技能之间的差距,我们将此问题形式化为指南到技能学习:将野外指南转换为可执行技能,并从智能体可观察的轨迹中持续改进它们。为了评估现有智能体在此任务上的能力,我们引入了MMG2Skill-Bench,这是针对该问题的首个基准测试。我们进一步提出了MMG2Skill,一个闭环框架,它将指南编译为可编辑技能,在执行过程中将固定的视觉语言模型(VLM)智能体条件化于这些技能,并从轨迹级根因反馈中修正技能,而不使用基准测试分数。在GUI控制、开放式游戏和策略卡牌游戏中,使用六个VLM骨干网络,MMG2Skill在每个模型-领域设置中始终优于普通基线智能体,在骨干网络上实现了宏观平均增益+12.8到+25.3个百分点。消融研究表明,直接用原始指南提示智能体会降低性能,而结构化技能构建和轨迹驱动修订对于观察到的改进都是必要的。在成功可推断的任务中,当成功信号适当校准时,基于分析器的提前停止进一步防止了后期性能退化,并节省了25%-53%的尝试次数。

英文摘要

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

2606.01991 2026-06-02 cs.AI cs.CL cs.CY 版本更新

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

SafeMCP:基于环境接地前瞻推理的LLM智能体防御主动功率调节

Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai

发表机构 * Beijing Institute of Technology(北京理工大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院)

AI总结 针对LLM智能体因动作空间扩大而面临功率寻求风险,提出SafeMCP服务器端防御插件,通过内部世界模型进行前瞻推理,实现主动工具过滤和即时干预两级防御,在保持智能体效用的同时有效降低风险。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference

详情
AI中文摘要

随着大语言模型(LLM)智能体越来越多地利用模型上下文协议(MCP)在复杂环境中运行,其动作空间的扩展赋予了智能体不安全的能力,并凸显了功率寻求的风险。虽然广阔的动作空间和更大的环境影响对于任务完成至关重要,但它们也创造了一个脆弱的风险表面,其中微小的错误或幻觉会被放大为灾难性故障。为此,我们提出了SafeMCP,一种{服务器端}防御插件,通过关于未来安全风险的预测推理来约束工具获取。SafeMCP利用内部世界模型进行前瞻推理,实现两级防御:主动工具过滤以限制危险功率扩展,以及即时干预作为故障安全机制。为了训练SafeMCP,我们引入了一个三阶段流程,包括环境动态接地、安全策略初始化和具有双重可验证奖励的强化学习(RL)。在PowerSeeking Bench、ToolEmu和AgentHarm上的实验表明,SafeMCP实现了安全平衡,在有效缓解风险的同时保持了智能体的效用。

英文摘要

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a {server-side} defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.

2606.01967 2026-06-02 cs.CL 版本更新

Training Prompt Matters: State-Adaptive Optimization for Robust Fine-Tuning

训练提示至关重要:面向鲁棒微调的状态自适应优化

Wenhang Shi, Yiren Chen, Shuqing Bian, Zhe Zhao, Jinhao Dong, Pengfei Hu, Wei Lu, Xiaoyong Du

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出状态自适应提示优化(SAPO)策略,通过将任务公式从静态输入转变为动态状态自适应变量,有效缓解灾难性遗忘并提升泛化能力,在多个基准上取得显著性能提升。

详情
AI中文摘要

虽然提示工程在推理过程中对最大化大型语言模型(LLM)的能力至关重要,但提示在训练过程中的作用仍未得到充分探索。现有的微调范式通常将训练提示视为表面形式,假设语义等价的指令会产生相同的学习结果。然而,我们揭示这种等价性具有欺骗性:虽然释义后的提示通常会导致类似的任务内性能,但它们在灾难性遗忘和泛化方面会引发截然不同的跨任务影响。关键的是,这些影响在任务间呈正相关,表明存在始终产生更好性能的优越提示。此外,我们发现这些优越提示可以在学习之前通过任务损失稳健地识别。利用这些见解,我们引入了状态自适应提示优化(SAPO),这是一种轻量级但有效的训练策略,它将任务公式从静态输入转变为动态的、状态自适应的变量。在多种基准上的全面实验证实了其有效性,它显著减轻了遗忘,同时提高了泛化能力,相比于最先进的方法取得了显著的性能提升。这些结果提供了关于训练提示如何塑造学习动态的见解,并为鲁棒微调提供了实用的方法。我们的代码可在 https://github.com/Eric8932/SAPO 获取。

英文摘要

While prompt engineering is instrumental in maximizing the capabilities of Large Language Models (LLMs) during inference, the role of prompts during training remains critically underexplored. Prevailing fine-tuning paradigms typically treat training prompts as mere surface forms, assuming that semantically equivalent instructions yield identical learning outcomes. However, we reveal that this equivalence is deceptive: while paraphrased prompts often lead to comparable in-task performance, they induce drastically different cross-task impacts regarding catastrophic forgetting and generalization. Crucially, these impacts are positively correlated across tasks, indicating the existence of superior prompts that consistently yield better performance. Furthermore, we discover that these superior prompts can be robustly identified by task loss prior to learning. Leveraging these insights, we introduce State-Adaptive Prompt Optimization (SAPO), a lightweight yet effective training strategy that shifts task formulation from a static input to a dynamic, state-adaptive variable. Comprehensive experiments on diverse benchmarks confirm its effectiveness, which significantly mitigates forgetting while improving generalization, achieving substantial performance gains over state-of-the-art methods. These results provide insights into how training prompts shape learning dynamics and offer a practical recipe for robust fine-tuning. Our code is available at https://github.com/Eric8932/SAPO.

2605.04948 2026-06-02 cs.CL 版本更新

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

将大型语言模型适配到低资源黏着语:LoRA与QLoRA在巴什基尔语上的比较研究

Mullosharaf K. Arabov, Svetlana S. Khaybullina

发表机构 * Institute of Computational Mathematics and Information Technologies, Kazan Federal University(计算数学与信息科技学院,卡兹安联邦大学)

AI总结 本文比较了LoRA和QLoRA两种参数高效微调方法在低资源黏着语巴什基尔语上的适配效果,发现QLoRA在7B规模模型上能在质量和计算成本间取得有效平衡。

Comments Accepted to CLIB 2026

详情
AI中文摘要

本文对参数高效微调方法(包括LoRA和QLoRA)在将大型语言模型适配到巴什基尔语(突厥语族的一种低资源黏着语)任务上进行了比较研究。实验在包含71k文档(4690万个token)的巴什基尔语文本语料库上进行,使用了多种架构的模型:DistilGPT2、GPT-2(base、medium)、Phi-2、Qwen2.5-7B、DeepSeek-7B和Mistral-7B。为提高结果可靠性,每种配置使用三种不同的随机种子进行训练。测试集上最低困惑度由全微调的GPT-2 medium获得(3.34)。同时,应用于Mistral-7B(3.79)和Phi-2(3.81)的QLoRA在可训练参数减少40倍以上的情况下达到了相当的质量。然而,我们也观察到某些架构使用PEFT时质量显著下降的情况(例如,DeepSeek-7B,秩为8,困惑度=129.55),这表明结果关键取决于基础模型及其分词器的选择。此外,基于巴什基尔语提示的生成文本定性分析显示,具有最佳困惑度的模型不一定产生最连贯的输出:QLoRA微调的模型生成了单语巴什基尔语续写,而具有最低困惑度的全微调模型则频繁切换到英语。结果表明,对于巴什基尔语,7B规模模型上的QLoRA在质量和计算成本之间提供了有效的折中。为确保可重复性,开放数据、代码和训练好的适配器将在论文被接收后发布。

英文摘要

This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.

2605.02270 2026-06-02 cs.CL 版本更新

A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

塔吉克-波斯语对机器音译模型的系统基准测试:从基于规则到Transformer架构的比较研究

Mullosharaf K. Arabov

发表机构 * Institute of Computational Mathematics and Information Technologies(计算数学与信息科技学院) Kazan Federal University(卡兹安联邦大学)

AI总结 本文通过构建多源平行语料库,系统比较了从基于规则到Transformer的六类模型,发现字节级ByT5在塔吉克-波斯语音译任务中表现最优(chrF++ 87.4/80.1),而基于子词的分词模型完全失败。

Comments Accepted to CLIB 2026

详情
AI中文摘要

本文首次对塔吉克语(西里尔字母)和波斯语(阿拉伯字母)之间音译的现代机器学习架构进行了全面比较分析。一个关键贡献是创建并验证了一个独特的平行语料库,该语料库汇集了多个异构来源,包括众包项目、词典对、《列王纪》平行文本、外交文章、《玛斯纳维》文本、官方术语列表和音译对应关系。初始数据集包含328,253个句子对;通过分层随机抽样形成了40,000个句对的代表性子集。实验比较了六类模型:基于规则的基线、带注意力的LSTM、字符级Transformer、G2P Transformer(从头训练)、预训练多语言模型(mBART、带LoRA的mT5)以及字节级ByT5。结果表明ByT5具有压倒性优势(塔吉克语到波斯语chrF++ 87.4,反向80.1)。尽管数据有限,G2P Transformer显著优于mBART(72.3 vs. 62.2 chrF++)。使用子词分词(mT5)的模型完全失败(chrF++低于18.5)。研究结果表明,对于塔吉克-波斯语对的准确音译,在字节或字符级别操作的架构明确优于依赖子词分词的傳統多语言Seq2Seq模型。

英文摘要

This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of "Shahnameh", diplomatic articles, texts of "Masnavi-i Ma'navi", official terminology lists, and transliterated correspondences. The initial dataset comprised 328,253 sentence pairs; a representative subset of 40,000 pairs was formed using stratified random sampling. The experiment compared six classes of models: rule-based baseline, LSTM with attention, character-level Transformer, G2P Transformer (trained from scratch), pre-trained multilingual models (mBART, mT5 with LoRA), and byte-level ByT5. Results demonstrate the overwhelming superiority of ByT5 (chrF++ 87.4 for Tajik to Farsi, 80.1 for reverse). The G2P Transformer significantly outperformed mBART (72.3 vs. 62.2 chrF++) despite limited data. Models using subword tokenization (mT5) failed completely (chrF++ less than 18.5). The findings demonstrate that for accurate transliteration of the Tajik-Farsi pair, architectures operating at the byte or character level are unequivocally more effective than traditional multilingual Seq2Seq models relying on subword tokenization.

2606.01964 2026-06-02 cs.CL 版本更新

Eyettention II: A Dual-Sequence Architecture for Modeling Fixation Location, Within-Word Landing Position, and Fixation Duration in Reading

Eyettention II: 一种用于建模阅读中注视位置、词内着陆位置和注视持续时间的双序列架构

Shuwen Deng, Cui Ding, David R. Reich, Paul Prasse, Lena A. Jäger

发表机构 * Department of Computer Science, University of Potsdam(波恩大学计算机科学系) Department of Computational Linguistics, University of Zurich(苏黎世大学计算语言学系) Department of Informatics, University of Zurich(苏黎世大学信息学系)

AI总结 提出端到端训练的轻量级深度学习模型Eyettention II,通过双序列架构生成包含注视位置、词内着陆位置和注视持续时间的完整扫描路径,在预测性能上超越现有模型并捕捉关键心理语言学现象。

详情
AI中文摘要

阅读时眼睛的运动方式为理解读者的认知过程和文本属性提供了宝贵信息。特别是,阅读过程中的眼动追踪数据在多种技术应用中显示出高度价值,例如增强和解释语言模型以及推断读者特征。然而,这些应用通常依赖于大规模数据驱动模型,需要大量的眼动追踪数据集,而由于数据收集的资源密集型特性,这些数据集难以获取。为了解决数据稀缺的挑战,我们开发了Eyettention II,一个端到端训练的深度学习模型,能够生成由按时间顺序排列的完整注视属性组成的真实扫描路径,包括注视位置、词内着陆位置和注视持续时间。我们的模型轻量级,可在有限的GPU资源上高效训练,并与认知理论紧密对齐。我们证明,Eyettention II在扫描路径预测方面超越了最先进的模型,并通过捕捉关键心理语言学现象模拟了类似人类的注视行为。凭借其稳健的性能,Eyettention II有潜力推动自然语言处理的发展,促进心理语言学实验材料的预测试,并揭示超出理论认知模型明确编码的新见解。

英文摘要

The way our eyes move while reading provides valuable insights into both the reader's cognitive processes and the properties of the text. In particular, eye-tracking-while-reading data has shown to be highly beneficial in various technological applications, such as enhancing and interpreting language models and inferring a reader's characteristics. However, these applications often rely on large-scale, data-driven models, which demand extensive eye-tracking datasets that are challenging to obtain due to the resource-intensive nature of data collection. To address the challenge of data scarcity, we develop Eyettention II, an end-to-end trained deep-learning model capable of generating realistic scanpaths consisting of a complete set of fixation attributes in chronological order, including fixation location, within-word landing position, and fixation duration. Our model is lightweight, efficiently trainable on limited GPU resources, and closely aligned with cognitive theories. We demonstrate that Eyettention II surpasses state-of-the-art models in scanpath prediction and mirrors human-like gaze behavior by capturing key psycholinguistic phenomena. With its robust performance, Eyettention II holds the potential to drive advancements in natural language processing, facilitate piloting the materials of psycholinguistic experiments, and uncover new insights beyond what is explicitly encoded in theoretical cognitive models.

2606.01936 2026-06-02 cs.CL 版本更新

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

格式化什么以及如何格式化:文档格式化的基准与工作流方法

Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin, Bing Li, Xiyan Gao, Peng Fu, Jing Huang, Can Ma

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(信息工程研究所,中国科学院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院)

AI总结 针对内容感知的文档格式化任务,提出基准DocFormBench和工作流方法DocFormFlow,通过解耦目标定位与修改执行,在提升准确率的同时降低token消耗。

详情
AI中文摘要

大型语言模型(LLM)的最新进展为自动化文档格式化开辟了新的可能性。然而,现实中的格式化通常需要根据文档内容识别目标。这种内容感知的设置仍然具有挑战性且未被充分探索,主要是由于缺乏专门的评估数据集。为了在现实的内容感知场景中实现评估,我们引入了DocFormBench,这是一个将文本到格式评估扩展到多样化格式化需求的基准,同时提供了准确性和效率的指标。为了减少现有方法在格式化过程中的冗余文档读取,我们提出了DocFormFlow,一种工作流格式化方法,将目标定位与修改执行解耦为“格式化什么”和“如何格式化”。在多个LLM和多模态模型上的大量实验表明,与代表性基线相比,DocFormFlow在减少token消耗的同时持续提高了格式化准确性。进一步的分析表明,精确的目标定位是影响格式化性能的主要因素。我们希望DocFormBench和DocFormFlow能够促进未来朝着更智能、更可靠的文档格式化的研究。

英文摘要

Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack of dedicated evaluation datasets.To enable evaluation in realistic content-aware scenarios, we introduce DocFormBench, a benchmark that extends Text-to-Format evaluation to diverse formatting requirements, along with metrics for both accuracy and efficiency.To mitigate redundant document reading in existing methods during formatting, we propose DocFormFlow, a workflow formatting method that decouples target localization from modification execution into what to format and how. Extensive experiments across multiple LLMs and multimodal models show that DocFormFlow consistently improves formatting accuracy while reducing token consumption compared to representative baselines. Further analysis reveals that precise target localization is the primary factor influencing formatting performance. We hope DocFormBench and DocFormFlow will facilitate future research toward more intelligent and reliable document formatting.

2606.01934 2026-06-02 cs.LG cs.CL 版本更新

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

HMPO: 用于思维链压缩的混合中位数长度策略优化

Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

发表机构 * Li Auto Inc.(Li Auto公司)

AI总结 提出HMPO,一种单阶段强化学习框架,通过自适应中位数预算、余弦衰减令牌奖励和乘法奖励公式,在数学数据上训练后实现19%-46%的令牌压缩且精度损失极小,并泛化至多种任务。

详情
AI中文摘要

大型语言模型通过扩展的思维链推理取得了显著性能,但这一冗长过程带来了大量推理开销。现有的思维链压缩方法面临不灵活的手动长度预算、计算昂贵的多阶段训练流程以及仅适用于小模型的脆弱可扩展性。我们提出HMPO(混合中位数长度策略优化),一种经济高效的单阶段强化学习框架。HMPO通过三个协同组件高效压缩思维链:基于成功轨迹的自适应中位数预算以消除手动调整、用于平滑长度惩罚的余弦衰减令牌奖励,以及通过严格优先考虑答案正确性来大幅减轻琐碎奖励破解的乘法奖励公式。仅在数学数据上训练,HMPO无缝泛化到数学、代码、科学和指令遵循任务。在从9B到122B参数、涵盖密集和混合专家架构的大规模实验中,HMPO实现了19%-46%的令牌压缩,精度下降可忽略,同时与现有的多阶段基线相比大幅降低了训练成本。

英文摘要

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%--46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.

2606.01926 2026-06-02 cs.CL 版本更新

Mitigating Bias in Locally Constrained Decoding via Tractable Proposals

通过可处理提议缓解局部约束解码中的偏差

Meihua Dang, Linxin Song, Honghua Zhang, Jieyu Zhao, Guy Van den Broeck, Stefano Ermon

发表机构 * Stanford University(斯坦福大学) University of California, Berkeley(加州大学伯克利分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对局部约束解码中因短视掩码导致的采样偏差,提出基于张量化有限自动机的全局约束解码提议和概率全局约束解码提议,结合序贯蒙特卡洛方法实现无偏采样,在函数调用、关键词生成和SQL生成任务中显著减少所需粒子数并加速收敛。

Comments 13 pages, 5 figures

详情
AI中文摘要

大型语言模型的生成结果往往不符合期望的约束,如JSON模式。现有的局部约束解码(LCD)方法通过短视地掩蔽下一个词元来强制约束,导致采样偏差和性能下降。最近的工作使用序贯蒙特卡洛(SMC)方法来缓解此类偏差,但设计有效的提议分布或势函数仍然是一个关键挑战。在这项工作中,我们提出了一种通用方法来构建从 $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$ 进行SMC采样的提议和势函数。首先,我们证明了以有限自动机形式指定的约束可以张量化以在GPU上高效执行,我们利用这一点构建了全局约束解码(GCD)提议。此外,利用张量化有限自动机与隐马尔可夫模型共享相同电路结构的事实,我们通过电路乘法得到概率全局约束解码(P-GCD)提议,该提议编码了目标分布的逻辑和概率信息。我们在函数调用、基于关键词的生成和SQL生成任务上评估了(P-)GCD。实验表明,在相同的SMC采样设置下,与LCD提议相比,(P-)GCD以显著更少的粒子更快地收敛到目标分布。

英文摘要

Generations from large language models often fail to conform to desired constraints such as JSON schema. Existing locally constrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in performance. Recent work uses sequential Monte Carlo (SMC) methods to mitigate such biases, but designing effective proposal distributions or potential functions remains a key challenge. In this work, we propose a generic approach to construct proposals and potentials for SMC sampling from $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$. First, we show that constraints specified as finite automata can be tensorized for efficient execution on GPUs, which we use to construct globally constrained decoding (GCD) proposals. In addition, leveraging the fact that tensorized finite automata share the same circuit structure as hidden Markov models, we circuit-multiply them to obtain the probabilistic GCD (P-GCD) proposals encoding both logical and probabilistic information about the target distributions. We evaluate (P-)GCD on the tasks of function calling, keyword-based generation, and SQL generation. Experiments show that under the same SMC sampling setup, compared to LCD proposals, (P-)GCD converges faster to the target distribution with significantly fewer particles.

2606.01923 2026-06-02 cs.CL cs.LG 版本更新

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

共振上下文锚定:推理时解耦注意力路由与信号增益

Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min

发表机构 * Xi’an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学) Tongji University(同济大学) Tsinghua University(清华大学)

AI总结 提出共振上下文锚定(RCA)方法,通过解耦自注意力中的路由逻辑与信息幅度,在推理时动态增强上下文令牌的信号,有效抑制大语言模型的参数化幻觉,提升事实一致性。

详情
AI中文摘要

大型语言模型(LLM)在面对与内部参数记忆冲突的输入证据时,经常表现出“上下文忽视”,导致持续的事实幻觉。现有的缓解策略主要依赖于抑制特定神经元激活或使用计算昂贵的对比解码机制,这往往会导致困惑度增加或推理延迟显著升高。为了解决这些局限性,我们从残差流信号动力学的角度提出了一种轻量级的推理时干预方法——共振上下文锚定(RCA)。RCA旨在解决外部证据在深层网络传播过程中的信号衰减问题。其核心机制是在自注意力模块中正交解耦路由逻辑和信息幅度。通过利用原始的softmax前注意力分数作为语义对齐的即时度量,我们通过非线性整流构建动态增益场,选择性地放大上下文令牌对应的值向量的范数,而不改变注意力概率分布。该机制有效提升了残差流混合中输入证据的信噪比(SNR),从而在推理时稳健地将生成轨迹锚定到真实上下文。在Llama-3模型系列上的大量实验表明,RCA在多个事实一致性和强知识冲突任务中显著提高了上下文忠实度,有效抑制了参数化幻觉。此外,结果证实,作为一个无需训练且计算量可忽略的即插即用模块,RCA在保持模型通用语言理解能力的同时,在忠实度和流畅性上实现了帕累托改进。

英文摘要

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employing computationally expensive contrastive decoding mechanisms, which often result in increased perplexity or significantly elevated inference latency. To address these limitations, we propose Resonant Context Anchoring (RCA), a lightweight inference-time intervention method grounded in the perspective of residual stream signal dynamics. RCA aims to resolve the signal attenuation of external evidence during its propagation through deep networks. The core mechanism involves the orthogonal decoupling of routing logic and information magnitude within the self-attention module. By utilizing raw pre-softmax attention scores as an instantaneous metric of semantic alignment, we construct a dynamic gain field via non-linear rectification to selectively amplify the norms of value vectors corresponding to context tokens, without altering the attention probability distribution. This mechanism effectively elevates the signal-to-noise ratio (SNR) of input evidence within the residual stream mixture, thereby robustly anchoring the generation trajectory to the truthful context during inference. Extensive experiments on the Llama-3 model series demonstrate that RCA significantly improves contextual faithfulness across multiple factual consistency and strong knowledge-conflict tasks, effectively suppressing parametric hallucinations. Furthermore, results confirm that as a training-free and computationally negligible plug-and-play module, RCA achieves a Pareto improvement in faithfulness and fluency while maintaining the model's general language understanding capabilities.

2606.01914 2026-06-02 cs.CL cs.CV 版本更新

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

多模态大语言模型空间推理中空间词汇偏差的机制诊断

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi

发表机构 * Kyoto University(京都大学) NII LLMC(日本国立信息与通信技术研究所语言模型中心) RIKEN AIP(日本理化学研究所先进理工研究所) Case Western Reserve University(凯斯西储大学) The Hong Kong Polytechnic University(香港理工大学) The University of Osaka(大阪大学) University of Tokyo(东京大学)

AI总结 本文发现多模态大语言模型存在空间词汇偏差,即添加空间关系词会吸引模型选择该选项,并通过机制可解释性工具揭示偏差主要源于语言侧而非视觉侧,最后提出轻量级LLM-only DPO更新可有效缓解偏差。

详情
AI中文摘要

多模态大语言模型(MLLMs)在空间多项选择题上仍不可靠,其失败常归因于视觉信息关注不足。本文识别了一种互补的失败模式——空间词汇偏差:向答案选项添加空间关系词会吸引模型决策,使新添加的选项更可能被选中。使用九个开放权重的MLLMs,我们证明该现象广泛存在。特别地,模型能正确回答二元空间问题,但一旦向答案集添加第三个空间选项,模型便持续选择错误的第三选项。我们将这种二元稳定但三元脆弱的案例隔离为诊断示例,并利用机制可解释性工具,揭示失败的主要原因来自语言侧而非视觉侧:视觉注意力分析和残差流探针表明,在这些失败中,正确的空间关系在内部仍然可用,而不相关选项控制、激活修补和稀疏组件干预将偏差追溯到特定的LLM侧通道和神经元。基于此发现,我们证明在微小的单对象对合成数据上进行轻量级仅LLM的DPO更新可缓解偏差,在合成数据上将四路鲁棒准确率提升高达100个百分点,在更广泛的评估数据集WhatsUp、SpatialMQA-Direct和VSR上分别提升68.0、32.6和20.1个百分点。

英文摘要

Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure instead originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR.

2606.01901 2026-06-02 cs.CV cs.AI cs.CL 版本更新

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

图像重建游戏:通过迭代多模态对话建立共同基础

Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen

发表机构 * Computational Linguistics, Department of Linguistics University of Potsdam(波恩大学语言学系计算语言学部) German Research Center for Artificial Intelligence (DFKI), Berlin(德国人工智能研究中心(DFKI)柏林)

AI总结 提出图像重建游戏基准,通过多轮迭代中视觉语言模型向图像生成器发出纠正指令,使累积的共同基础直接可视化为重建图像,发现描述器是重建质量的主导因素,而生成器决定迭代改进的效果。

详情
AI中文摘要

我们引入了图像重建游戏,这是一个全自动基准测试,其中视觉语言模型在多轮迭代中向图像生成器发出纠正指令,使得累积的共同基础直接可视化为渲染图像。通过对七个图像类别中的两个描述器模型与两个生成器模型进行交叉基准测试,我们发现描述器是重建质量的主导因素,而生成器决定迭代改进是否有益。数学和几何图像构成了最大的挑战。描述器的令牌预算强烈影响收敛性:较短的预算产生更稀疏的初始渲染,有更多可见改进的空间,而较长的预算提高了绝对质量,但留下的修复空间较少。更强的描述器使用更丰富的纠正词汇,涵盖空间、数值和结构类别,而较弱的描述器则集中于表面属性,并且往往在几轮后停止。人工验证表明,最佳自动评判器与人类偏好之间仅达到轻微到中等的一致性,并且自动评分需要人工重新校准才能可靠使用。

英文摘要

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.

2606.01879 2026-06-02 cs.CL 版本更新

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

CultureForest:理解与评估大语言模型中的文化规范推理

Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang, Xiachong Feng, Baosong Yang, Bing Qin

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) The University of Hong Kong(香港大学) Harvard University(哈佛大学)

AI总结 为弥补现有研究仅将文化智能视为知识获取问题而忽视实际场景应用的不足,提出CultureForest基准,通过基于原子规范的推理任务评估模型,发现顶级模型在开放式生成中性能大幅下降,并揭示推理能力瓶颈。

详情
AI中文摘要

现有研究大多将大语言模型中的文化智能简化为知识层面的问题,忽视了模型能否在现实场景中有效利用其获取的知识。为弥补这一差距,我们引入了CultureForest,一个用于 extit{文化规范推理}的基准。每个问题都基于一组原子规范,从而支持可验证和可归因的评估。CultureForest包含来自8个领域和53个国家/地区的5,378个示例,并支持从多项选择到开放式生成的渐进式评估。大量实验表明,即使是顶级模型在开放式设置中性能也大幅下降,并伴随显著的跨区域差异。通过针对性分析,我们发现了几个一致的模式:(1)测试时推理带来的收益有限,且可能加剧不公平;(2)模型表现出高度共享的区域偏好结构;(3)模型响应明显保守,尤其在更严格的文化约束下;(4)通过分离文化知识获取与文化推理,我们发现虽然LLMs拥有丰富的文化知识,但其性能进一步受限于知识的有效利用。这些发现表明,有必要从以知识为中心的评估转向衡量基于知识的推理。

英文摘要

Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for \textit{Cultural Norm Grounded Reasoning}. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by its effective use. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.

2606.01845 2026-06-02 cs.CL cs.AI 版本更新

Unveiling the Limits of Large Language Models in Inferring Pragmatic Meaning from Non-Verbal Responses

揭示大型语言模型在推断非语言回应中的语用意义的局限性

Sugyeong Eo, Heuiseok Lim

发表机构 * Department of Software, Yonsei University Mirae Campus(燕山大学软件系) Department of Computer Science and Engineering, Korea University(韩国大学计算机科学与工程系)

AI总结 本研究首次系统评估大型语言模型(LLMs)从纯非语言回应对话中推断语用意义的能力,发现其准确率相比语言回应下降高达60%,并表明上下文学习有助于语用推理。

详情
AI中文摘要

尽管大型语言模型(LLMs)在语用语言理解方面取得了显著进展,但先前的研究主要集中在其对语言行为的理解上。然而,非语言行为仍然是人类交流的基本组成部分,特别是当故意单独使用以传达间接意义时。在这项工作中,我们首次系统评估了LLMs从仅包含非语言回应的对话中推断语用意义的能力。我们探讨了三个研究问题:(1)LLMs能否识别通过非语言回应传达的间接意图?(2)LLMs何时以及如何未能捕捉非语言意图?(3)我们如何提高LLMs解释非语言意图的能力?通过评估,我们观察到LLMs难以从非语言回应中推断出潜在意义,准确率相比语言回应下降高达60个百分点。进一步的广泛分析揭示了LLMs在解释非语言行为时的行为模式,并表明上下文学习有助于语用推理。

英文摘要

Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused mainly on their comprehension of verbal behavior. Nonetheless, non-verbal behavior remains a fundamental component of human communication, especially when deliberately utilized in isolation to convey indirect meanings. In this work, we present the first systematic evaluation of LLMs' ability to infer pragmatic meaning in dialogue consisting solely of non-verbal responses. We explore three research questions: (1) Can LLMs recognize indirect intent conveyed through non-verbal responses? (2) When and how do LLMs fail to capture non-verbal intent? (3) How can we improve LLMs' ability to interpret non-verbal intent?. Through the evaluation, we observe that LLMs struggle to infer underlying meaning from non-verbal responses, with accuracy dropping by up to 60% points compared to verbal ones. Further extensive analysis reveals a behavioral pattern in LLMs' interpretations of non-verbal behavior and demonstrates that in-context learning facilitates pragmatic inference.

2606.01838 2026-06-02 cs.CL cs.AI cs.LG 版本更新

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

LayerRoute: 基于LoRA微调的输入条件自适应层跳过方法用于智能语言模型

Prateek Kumar Sikdar

发表机构 * Accenture(埃森哲)

AI总结 提出LayerRoute,通过为每个Transformer块添加轻量级路由器和LoRA适配器,根据输入类型(工具调用或规划推理)自适应跳过层,在仅增加0.22%可训练参数下实现12.91%的跳过差异并提升质量。

Comments 10 pages, 3 figures, 4 tables

详情
AI中文摘要

智能语言模型系统交替使用两种结构不同的步骤类型:结构化工具调用(短、确定性、低困惑度)和开放式规划/推理步骤(长、复杂、高困惑度)。尽管存在这种异质性,当前的推理系统对每个步骤应用相同的计算量。我们引入LayerRoute,一个轻量级适配器,学习基于每个输入有选择地跳过Transformer块。LayerRoute为Qwen2.5-0.5B-Instruct中的24个Transformer块中的每一个增加:(1)一个每层路由器(约897个参数,Linear(896,1)),通过直通估计器输出硬二值门;(2)在Q/K/V/O注意力投影上的LoRA适配器(秩8,约1.08M参数)。骨干权重保持冻结。在智能体数据(Hermes、Glaive、GSM8K、Turing)上进行单次端到端训练,并加入门正则化项,迫使系统发现每个输入类型下哪些块是可跳过的。经过3000步(在A100 40GB上6.4分钟),LayerRoute实现了12.91%的跳过差异:工具调用跳过15.25%的FLOPs,而规划步骤仅跳过2.34%,仅使用1.10M可训练参数(占494M骨干的0.22%)。由于LoRA适配,质量相比基础模型有所提升,工具调用上的困惑度差为-1.29,规划步骤上为-1.30。

英文摘要

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.

2606.01820 2026-06-02 cs.CL 版本更新

TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech

TalkTag: 转录语音的细粒度形态句法错误标注

Shamira Venturini, Oliver Hennhöfer, Steffen Kinkel, Jannik Strötgen

发表机构 * Karlsruhe Institute of Technology, Germany(德意志联邦共和国卡尔斯鲁厄理工大学) Karlsruhe University of Applied Sciences, Germany(德意志联邦共和国卡尔斯鲁厄应用科学大学)

AI总结 提出基于LLM的轻量级工具TalkTag,在数据稀缺条件下自动进行口语转录文本的CHAT风格错误标注,实现低资源场景下的精确标注并识别歧义情况。

详情
AI中文摘要

细粒度形态句法错误标注在临床和发展语言研究中很重要,但劳动密集、依赖专家且难以扩展。我们提出了TalkTag,一个基于LLM的轻量级工具,经过微调可自动对口语转录文本进行CHAT风格的错误标注。该系统在极端数据稀缺条件下使用儿童叙事数据开发,展示了低资源设置下语言分析的可行性。我们的评估表明,TalkTag产生了令人鼓舞的精确标注,同时有效识别了语言歧义使自动标注真正复杂的情况。总之,通过TalkTag,我们提供了一种可扩展的手动错误标注替代方案,并为形态句法错误标注提供了实际可行的支持。

英文摘要

Fine-grained morphosyntactic error annotation is important in clinical and developmental language research, yet it is labour-intensive, expert-dependent, and difficult to scale. We present TalkTag, an LLM-based lightweight tool fine-tuned to automate CHAT-style error annotation in spoken-language transcripts. Developed under conditions of extreme data scarcity using children's narrative data, the system shows the feasibility of linguistic analysis in low-resource settings. Our evaluation demonstrates that TalkTag produces encouragingly precise annotation while effectively identifying instances where linguistic ambiguity makes automated tagging genuinely complex. In summary, with TalkTag, we provide a scalable alternative to manual error annotation and practically viable support for morphosyntactic error annotation.

2606.01815 2026-06-02 cs.CL 版本更新

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

CRAB-Bench:在复杂任务依赖和人类对齐用户模拟下评估LLM智能体

Danqing Wang, Akshay Sivaraman, Lei Li

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出CRAB-Bench和RUSE框架,通过约束图生成复杂任务依赖和基于人类行为研究的用户模拟,评估LLM智能体在真实服务场景中的表现,发现最佳模型仅达61%通过率,用户模拟导致性能下降高达57%。

详情
AI中文摘要

在现实服务场景中评估LLM智能体需要复杂的任务依赖、不完美的用户行为以及能够容纳多种有效解决方案的评估。我们引入了CRAB-Bench(基于约束的现实智能体基准)和RUSE(现实用户模拟引擎)来填补这一空白。CRAB-Bench通过一个包含多个相互依赖实体和结构化干扰项的约束图生成任务,要求智能体在数千个误导性候选项中仔细推理,其中只有极小部分解是有效的。RUSE用基于人类行为研究的现实用户取代了合作性的模板式模拟器,这些用户实例化在不同的角色和四个行为维度上。在四个前沿LLM智能体上的实验表明,最佳模型在CRAB-Bench上仅达到61%的pass@1,而切换到RUSE导致进一步下降高达57%,主要集中在任务解决能力而非对话质量上。信息泄露是最具破坏性的行为维度,与RUSE交互的智能体更少承认错误,而是通过隐式纠正掩盖错误。

英文摘要

Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully over thousands of misleading candidates where only a tiny fraction of solutions are valid. RUSE replaces cooperative, template-like simulators with realistic users grounded in human behavioral studies, instantiated across diverse personas and four behavioral dimensions. Experiments on four frontier LLM agents show that the best model achieves only 61% pass@1 on CRAB-Bench, and switching to RUSE causes further drops of up to 57%, concentrated in task-solving ability rather than conversational quality. Information Disclosure is the most damaging behavioral dimension, and agents interacting with RUSE are less likely to admit mistakes, instead masking errors through implicit corrections.

2606.01813 2026-06-02 cs.CL 版本更新

Cost-Aware Diffusion Draft Trees for Speculative Decoding

用于推测解码的成本感知扩散草稿树

Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai

发表机构 * Zhejiang University(浙江大学) Westlake University(西湖大学)

AI总结 提出CaDDTree方法,通过联合优化树结构和节点预算直接最大化令牌吞吐量,并证明在凸验证成本下吞吐量函数是单峰的,从而无需离线预算搜索。

详情
AI中文摘要

推测解码通过让轻量级草稿模型生成令牌,并由目标语言模型并行验证来加速推理。诸如DFlash之类的块扩散草稿模型一次性生成整个草稿块,产生每个位置边际分布;DDTree利用这些边际分布构建候选树,在固定节点预算下最大化期望接受长度。然而,我们观察到接受长度随预算非递减:它总是偏好更大的树而不考虑验证成本,没有为预算选择提供原则性基础。我们提出 extbf{CaDDTree}(成本感知扩散草稿树),一种通过联合选择树结构和节点预算直接优化令牌吞吐量(单位时间内期望生成的令牌数)的方法。我们显式建模草稿和验证延迟,表明吞吐量目标可分解为每轮对预算的一维搜索,并证明在凸验证成本下吞吐量函数是 extit{单峰的},从而实现了高效的贪心停止规则。CaDDTree无需离线预算搜索,每轮根据当前每个位置分布和验证成本自适应调整预算。在Qwen3-4B和Qwen3-8B上跨越推理、编码和指令遵循任务的八个基准测试上的实验表明,CaDDTree在几乎所有任务上匹配或超越了具有最优预算选择的DDTree。

英文摘要

Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce \textbf{CaDDTree} (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is \emph{unimodal}, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that \caDDTree{} matches or surpasses DDTree with oracle budget selection on nearly all tasks.

2606.01811 2026-06-02 cs.CL cs.AI cs.LG 版本更新

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

“我知道这会如何发展”:通过渐进条件惊奇度刻画多样性

Matthew Khoriaty, David Williams-King, Shi Feng

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) Stanford University(斯坦福大学)

AI总结 提出一种基于上下文学习的多样性度量方法 Decan(D_{Ca_n}),通过单次前向传递计算每个字节的得分,无需嵌入模型、参考语料或人工标注,在多个基准上验证了其有效性。

Comments 28 pages, 18 figures, 9 tables. Accepted to the Workshop on Generative AI, Creativity, and Human-AI Co-Creation @ ICML 2026 (non-archival). Code and data: https://github.com/AMindToThink/icl-diversity

详情
AI中文摘要

衡量创意输出的多样性对于评估训练后模式崩溃、比较解码策略以及量化AI和人类写作中的创造性行为至关重要。我们提出了一种使用上下文学习来度量多样性的新方法,其中“Decan”度量 $D_{Ca_n} = C \times a_n$ 是我们评估的工作实例:一个基于每个字节的得分,该得分从基础模型 $θ$ 的每个标记对数概率中读取,每次排列只需一次前向传递,无需嵌入模型、参考语料库和人工标签。该方法基于信息论,利用语言模型的上下文学习来检测任意数量输入之间的广泛相似性,并避免了训练专用模型的需要。同一流程对AI样本和人类编写的回答集进行评分,将多样性视为(回答、提示、评分模型)的一个属性。在Tevet和Berant基于人类判断的McDiv基准上,$D_{Ca_n}$ 在McDiv prompt_gen 集上达到了0.846的OCA,这是其表现最好的情况,仅次于Tevet和Berant报告的最强神经基线(SentBERT,0.897)。在OLMo-2-7B训练后流程中,$D_{Ca_n}$ 在基础→SFT→DPO→RLVR阶段单调下降,检测到创意写作应用所关注的多样性损失类型。

英文摘要

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the ``Decan'' metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $θ$ in a \emph{single forward pass} per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt\_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.

2606.01806 2026-06-02 cs.CL cs.AI cs.LG 版本更新

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

ProbeScale: 通过探测分析优化神经缩放定律以实现高效小语言模型推理

Sourav Das

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Indian Institution of Information Technology Kalyani(印度信息技术学院Kalyani)

AI总结 提出ProbeScale框架,利用缩放定律和探测分析从预训练小语言模型中识别参数高效子网络,在参数预算下最大化任务加权探测性能,实现5-10倍参数压缩并保持95%-98%原始性能。

Comments 7 pages, 2 figures, ACL

详情
AI中文摘要

小语言模型在能力与计算可行性之间取得了平衡。神经缩放定律指导其最优训练,表明它们拥有随规模增长而丰富的内部表示。然而,在严格的资源约束下部署即使是这些小语言模型也可能具有挑战性。语言模型探测提供了分析模型内部编码的语言知识的方法。我们提出ProbeScale,一个统一缩放定律和探测洞察的框架,用于在预训练小语言模型中识别参数高效的子网络。ProbeScale利用良好缩放的小语言模型的高质量表示,并使用任务特定探测来数学量化每层对目标下游能力的相关性。这使得能够选择在性能与参数规模之间最优权衡的子网络。我们将子网络选择形式化为在参数预算下寻找最大化聚合任务加权探测性能的层子集。在代表性小语言模型如RoBERTa-Large和T5-Base上的实验表明,ProbeScale识别出的子网络实现了5到10倍的显著参数减少,同时在目标任务上保持了高性能(原始小语言模型的95%至98%),优于启发式基线。

英文摘要

Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict resource constraints. Language model probing provides methods for analyzing the linguistic knowledge encoded in a model's internals. We propose ProbScale, a framework that unifies insights from scaling laws and probing to identify parameter-efficient subnetworks within pre-trained SLMs. ProbScale utilizes the high-quality representations of well-scaled SLMs and uses task-specific probes to mathematically quantify the relevance of each layer for target downstream capabilities. This allows selecting subnetworks that optimally trade off performance against parameter size. We formulate the subnetwork selection as finding a layer subset maximizing aggregated, task-weighted probe performance under a parameter budget. Experiments on representative SLMs such as RoBERTa-Large and T5-Base demonstrate that ProbScale identifies subnetworks achieving significant parameter reduction, from 5 to 10 times, while maintaining high performance (95% to 98% of the original SLMs) on targeted tasks, outperforming heuristic baselines.

2606.01800 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Multilinguality of Large Language Models From a Structural Perspective

从结构视角看大语言模型的多语言性

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

发表机构 * Nara Institute of Science and Technology(奈良科学技術研究所)

AI总结 本研究通过表示结构分析探索大语言模型的多语言性,发现低资源语言与英语的结构差异大于高、中资源语言,且语言特定后训练改变结构但保留语言间关系。

详情
AI中文摘要

大型语言模型(LLMs)通过在多语言数据上进行预训练和后训练,在处理多种语言方面表现出色,尽管英语在训练数据中占主导地位。先前关注标记表示的研究揭示了这些LLMs如何处理非英语文本。尽管这些分析提供了有见地的发现,但它们未能捕捉到结构视角,而结构是语言的内在属性。在本研究中,我们通过表示结构分析探索LLMs的多语言性。我们的发现表明,低资源语言在结构上与英语的差异大于高资源和中资源语言,并且语言特定的后训练改变了它们的结构,同时保留了语言间的关系。

英文摘要

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.

2606.01779 2026-06-02 cs.CL 版本更新

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

HarnessForge:面向自适应智能体系统的协同框架与策略进化

Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou

发表机构 * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University(北京未来区块链与隐私计算先进创新中心,人工智能学院,北京航空航天大学) Tsinghua University(清华大学)

AI总结 提出HarnessForge元自适应框架,通过框架-策略协同进化实现LLM智能体系统的全系统自适应,在多个基准上显著提升性能。

Comments 25 pages, 13 figures

详情
AI中文摘要

LLM智能体越来越需要在需要不同执行范式的异构任务环境中运行。这对固定智能体系统提出了挑战,并推动了超越孤立组件更新的系统级元自适应。虽然现有工作已自适应外部框架或训练底层推理策略,但全系统自适应仍未被充分表征。结构与执行之间的自适应空间很少被明确化,外部框架与内部推理器之间的兼容性也未得到联合优化。我们提出HarnessForge,一个用于进化LLM智能体系统的元自适应框架。HarnessForge将智能体系统形式化为一个框架-策略对,定义了一个稳定的自适应空间,将框架级执行结构与策略级推理行为分离。然后,它通过故障引导的框架裁剪和框架条件化的策略对齐执行框架-策略协同进化。在来自不同领域的五个基准上的实验表明,HarnessForge一致地改进了Qwen3-4B和Qwen3-8B骨干网络,优于仅框架和仅策略的基线,比最强基线提升高达12.0%,并实现了有利的展开效率权衡,证明了框架-策略协同进化是有效的,并且框架与推理策略之间的可执行兼容性对于智能体系统自适应至关重要。代码可在https://github.com/mingju-c/HarnessForge获取。

英文摘要

LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted external harness or trained underlying reasoning policies, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness--policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness--policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0\% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs, demonstrating that harness--policy co-evolution is effective, and that executable compatibility between the harness and reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.

2606.01755 2026-06-02 cs.AI cs.CL 版本更新

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

TriAlign: 迈向个性化大语言模型对齐中的通用真值一致性

Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung

发表机构 * Department of Data Science & AI, Monash University(数据科学与人工智能系,墨尔本大学) Defence Science and Technology Group, Australia(澳大利亚国防科学与技术集团)

AI总结 针对个性化大语言模型在不同社会群体间存在的通用真值不一致问题,提出TriAlign框架,通过离线多智能体强化学习联合优化真值准确性、跨群体一致性和个性化,实现公平对齐。

详情
AI中文摘要

个性化大语言模型根据用户的偏好和社会属性调整响应,但可能在不同社会群体间引入显著的通用真值不一致性,即某些群体在客观任务上系统性地获得较不准确的响应。现有的对齐方法要么忽略个性化,要么主要关注主观偏好对齐,很大程度上忽视了通用真值的公平性和一致性。为填补这一空白,我们研究了真值不变对齐(TIA),这是一个针对个性化LLM的对齐问题,旨在确保通用真值在不同社会群体间保持一致,同时保留个性化。我们提出TriAlign,这是首个用于TIA的离线多智能体强化学习(MARL)框架,其中每个社会群体被建模为一个交互的智能体。TriAlign通过一个公平感知目标和一个显式的不一致性惩罚,联合优化通用真值准确性、跨群体真值一致性和个性化。跨多个基准的实验表明,TriAlign在这三个目标之间实现了比强基线更强的平衡,减少了跨社会群体的通用真值差异,同时提高了客观任务性能和个性化质量。

英文摘要

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an agent interacting. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.

2606.01747 2026-06-02 cs.CL cs.AI 版本更新

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

基于BERT和图神经网络的历史知识图谱构建

Ping Li, Bartlomiej Brzozka

发表机构 * Shandong Management University(山东管理大学) Maria Curie-Sklodowska University(玛丽·居里-斯洛多夫斯卡大学)

AI总结 本文提出结合BERT和图神经网络的高层架构,从历史文本中提取实体和关系,构建知识图谱,在精度、召回率和F1分数上优于传统方法和深度学习基线。

Comments 9 pages, 4 figures

详情
AI中文摘要

通过数字人文研究和规模化历史数据分析,大量传统历史文本被转换为结构化知识图谱。本文提出一种结合双向编码器表示(BERT)和图神经网络(GNN)的高层架构,用于从各类历史文本中提取实体和关系。传统历史文本系统地解决了语言歧义、上下文限制的引用以及缺乏既定语法规范的问题。本研究根据上述建议,开发了一种基于FastRQNet和预训练视觉-语言模型Vilt-qaformer+RoBInet的新型图像检索系统。实验充分利用了市政记录、议会文件和历史信函的全面数据集。与传统基于规则的技术和其他流行的深度学习基线相比,联合BERT-GNN系统获得了更高的精度、召回率和F1分数(表2)。该结构在创建知识图谱时能够以足够的准确性和全面性处理复杂的嵌套结构和隐式引用问题。上述实验表明,将关系图学习算法与上下文敏感的语义表示技术相结合,可以自动提取历史数据,为知识库积累累积的智慧。

英文摘要

Through digital humanities research and scale-up historical data analysis, a significant amount of traditional historical text is converted into structured knowledge graphs. This paper provides a high-level architecture that combines bidirectional encoder representations of transformers (BERT) and graph neural networks (GNN) to extract the entities and relationships from various types of historical texts. The texts of traditional history resolve linguistic ambiguities, references limited by context, and a lack of established grammatical norms in a systematic way. This study develops a new image retrieval system based on FastRQNet and pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments make full use of a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. When compared to conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system obtains greater Precision, Recall, and F1-score (Table 2). Complex nested structures and implicit reference issues can be handled by this structure with sufficient accuracy and thoroughness when creating knowledge graphs. The aforementioned experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data to add accumulated wisdom to the knowledge repository.

2606.01738 2026-06-02 cs.CL cs.AI 版本更新

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

THRD:一种针对大语言模型越狱攻击的无训练多轮防御框架

Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu

发表机构 * Beijing Language and Culture University(北京语言大学)

AI总结 提出无训练框架THRD,通过显式建模时间风险累积(包括逐轮风险评估、跨轮意图检测、响应评估和决策模块)防御多轮越狱攻击,将攻击成功率降至0.2-4.0%且模型效用损失小于1.5%。

详情
AI中文摘要

多轮越狱攻击通过利用对话动态(如逐步升级和跨轮协调)对LLM构成日益严重的威胁。现有防御要么依赖昂贵的重新训练(通常会降低模型效用),要么在每一轮独立应用单轮分析,无法捕捉风险沿交互轨迹的累积。我们观察到多轮交互中的安全行为是轨迹依赖的:对话历史不断重塑模型的调节上下文,使得孤立评估每一轮变得不足。基于这一洞察,我们提出THRD,这是第一个显式建模多轮越狱防御中时间风险累积的无训练框架。THRD集成了四个模块:用于即时风险评估的逐轮风险评估器(TRA)、用于跨轮意图升级检测的历史上下文分析器(HCA)、用于识别促进性输出的响应评估器(RE),以及通过带衰减调制和趋势感知调整的时间演化评分机制组合这些信号的决策模块。在两个目标模型上针对最先进的多轮攻击(包括基于树搜索和多智能体协作方法)的实验表明,THRD将攻击成功率降至0.2-4.0%,同时在MMLU和GSM8K上将模型效用退化控制在1.5%以内。消融研究证实了模块的非冗余贡献和稳定的跨架构泛化。对首次拒绝触发器的分析显示,超过70%的多轮攻击需要在第2轮或之后才能检测到,验证了显式时间聚合的必要性。

英文摘要

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining -- often degrading model utility -- or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks -- including tree-search-based and multi-agent collaborative methods -- across two target models show that THRD reduces ASR to 0.2--4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn~2 or later to detect, validating the necessity of explicit temporal aggregation.

2606.01682 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

现成的大语言模型作为过程评分器:数学推理中PRM的无训练替代方案

Atoosa Chegini, Soheil Feizi

发表机构 * Department of Computer Science, University of Maryland(马里兰大学计算机科学系)

AI总结 提出Chunk-Level Guided Generation方法,利用现成的大语言模型作为过程评分器,通过固定长度块评分和对比选择规则,无需训练即可在数学推理中匹配或超越PRM引导搜索的性能。

详情
AI中文摘要

使用更强的评分器从多个小模型样本中选择最佳响应是一种简单的推理时策略,但当小模型已经陷入错误推理路径时,该策略会失败。PRM引导搜索通过在生成过程中对候选延续进行评分来避免这一问题,但需要经过步骤级标签训练的奖励模型。我们提出Chunk-Level Guided Generation,一种无训练的替代方案,使用现成的大语言模型作为过程评分器。在每一步,小模型采样k个固定长度的候选块,而大模型使用似然度对候选块进行评分,无需生成任何文本。选中的块在下一步之前被提交,从而在错误传播之前引导生成。我们用两种选择规则实例化该框架:似然引导选择(LGS),选择具有最高长度归一化大模型对数概率的块;以及对比引导选择(CGS),减去小模型的对数概率,以偏向于大模型偏好与小模型偏好不同的块。我们证明,由于系统性的长度偏差(即使在长度归一化后仍然存在),使用大模型似然度对可变长度推理步骤进行评分是不可靠的,而固定长度块避免了这一混淆。在GSM8K、MATH、Minerva Math、AMC23和AIME24上,使用Qwen2.5-32B引导Qwen2.5-1.5B以及Llama-3.1-70B引导Llama-3.2-1B,CGS在多数投票上最多提升28个百分点,并且在匹配的引导预算下,在大多数基准测试中匹配或超越了Qwen2.5-Math-PRM-72B引导搜索,且无需奖励模型训练。使用Qwen2.5-72B引导Qwen2.5-7B,CGS在k=16时在MATH上达到81.8%,在Minerva Math上达到63.6%,超过多数投票4-6个百分点。最后,Chunk-Level Guided Generation产生的推理轨迹比PRM引导搜索短得多。

英文摘要

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.

2606.01679 2026-06-02 cs.CL 版本更新

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

编码但未路由:解释科学声明验证中的表格-图表差距

Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter, Florian Boudin, Akiko Aizawa

发表机构 * The University of Tokyo(东京大学) NII LLMC(日本信息处理学会LLMC) National Institute of Informatics(日本信息处理研究所) University of Göttingen(哥廷根大学) Inria, LS2N, Nantes Université(法国国家信息与自动化技术研究院(Inria)、LS2N实验室、南特大学)

AI总结 通过分层线性探测和注意力分析,发现多模态大语言模型在科学声明验证中,图表信息被编码但未路由到预测位置,导致表格与图表表现差距。

详情
AI中文摘要

多模态大语言模型越来越多地被用于辅助科学同行评审,其核心要求是验证论文中的声明是否得到其证据的支持。先前的研究表明,当证据是表格时,模型在此任务上的表现显著优于相同底层数据的图表。这引发了一个问题:模型是否未能从图表中提取信息,还是提取了信息但在形成预测时未能使用?我们通过分层线性探测和注意力分析,在三个开源视觉语言模型上研究了这个问题,这些模型处理代表相同底层数据的表格和图表证据。我们发现一致的证据支持后者。图表信息被编码在模型的中间表示中,但未到达预测位置,这一差距在表格中不存在,并且在所有测试条件下都成立。注意力分析进一步揭示,这种断开在不同模型家族中表现为两种架构上不同的形式。这些发现将表格-图表差距重新定义为编码的视觉信息在预测时路由失败的问题,而非编码本身失败。

英文摘要

Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises the question of whether models fail to extract information from charts, or do they extract it but fail to use it when forming their prediction? We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs over table and chart evidence, representing the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models' intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.

2606.01678 2026-06-02 cs.CL 版本更新

Why Do Self-Harm Prediction Models Struggle to Generalise? Lexical and Semantic Variations in Emergency Department Triage Notes

为什么自伤预测模型难以泛化?急诊科分诊笔记中的词汇和语义变化

Liuliu Chen, Mike Conway, Jo Robinson, Vlada Rozova

发表机构 * School of Computing and Information Systems, The University of Melbourne, Australia(墨尔本大学计算与信息系) Orygen, The National Centre of Excellence in Youth Mental Health, Australia(青年心理健康国家卓越研究中心) Centre for Youth Mental Health, The University of Melbourne, Australia(青年心理健康中心) Centre for Digital Transformation of Health, The University of Melbourne, Australia(健康数字化转型中心)

AI总结 通过分析两家医院急诊科分诊笔记的词汇特征、预测特征和主题差异,发现机构间文档差异导致自伤检测模型跨站点性能下降,并提出改进泛化性的方法。

Comments Accepted to CLPsych2026

详情
AI中文摘要

急诊科(ED)的自伤表现与较高的自杀风险密切相关。NLP模型在单家医院的分诊笔记中检测自伤方面表现出稳健的性能,但跨机构时性能常常下降。为了探究潜在原因,我们通过分析词汇特征、高度相关的预测特征和显著主题,比较了两家医院的ED分诊笔记。我们的结果揭示了与自伤相关的词汇表达和特征重要性在不同医院之间存在差异,尽管核心主题如自我中毒和自我伤害是一致的。这些文档差异与跨站点性能下降相关。我们的发现揭示了机构差异如何影响临床文本中自伤的识别,并强调了改善模型泛化性的潜在方法。

英文摘要

Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown robust performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we compare ED triage notes from two hospitals by analyzing lexical characteristics, highly associated predictive features, and salient topics. Our results reveal variation in lexical expression and feature importance related to self-harm across hospitals, despite consistent core themes such as self-poisoning and self-injury. These documentation differences are associated with reduced cross-site performance. Our findings provide insight into how institutional variation affects the identification of self-harm in clinical text and highlight potential methods to improve model generalisability.

2606.01671 2026-06-02 cs.CL 版本更新

When Meaning Travels: A Granular Lens on Hybrid-MoE's Role in Idiomatic Understanding for Language Models

当意义旅行:混合专家模型在语言模型习语理解中的细粒度作用

Sarmistha Das, Vaibhav Vishal, Shreyas Guha, Amaan Ali, Kitsuchart Pasupa, Sriparna Saha

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Patna, India(印度理工学院帕纳瓦分校计算机科学与工程系) School of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Thailand(泰国拉差班国王理工大学信息科技学院)

AI总结 针对低资源东南亚语言中习语的文化隐喻复杂性,提出Varnika多模态习语语料库和混合专家模型HybridMoE,通过控制混合和掩码多模态嵌入缓解专家稀疏性,实现习语理解性能提升5-6%。

详情
AI中文摘要

在多语言教育的当代,学习习语为通往创造力、文化价值观、历史背景以及各种语言传统固有的多元视角提供了迷人的途径。本文展示了在低资源东南亚语言(如印地语、孟加拉语和泰语)中保留比喻和文化语义的导航,这些语言中文化丰富的习语由于其深刻的隐喻复杂性,给计算建模和跨语言迁移带来了重大障碍。为应对这种复杂性,我们提出了Varnika,一个重建的多模态习语语料库,包含3,533个多语言习语,并丰富了与文本和视觉表示对齐的七种习语语气。此外,为了推断信息丰富的习语理解,我们引入了一个混合专家模型(HybridMoE)框架,该框架嵌入多个习语专家意见,同时通过受控混合整合来自选定和未选定专家的输出来缓解专家稀疏性,并通过掩码多模态嵌入进一步增强了习语属性信号。为了跨多个维度分析性能,我们提出了IDIO-TONE和习语验证分数,这是一个三阶段评估流程,衡量(i)字面翻译保真度,(ii)视觉语义对齐,以及(iii)习语意义保留。实证评估表明,HybridMoE在先进的视觉语言模型上实现了5-6%的性能提升,展示了在多语言多模态设置中比喻语言和文化嵌入意义的改进表示。

英文摘要

In the contemporary epoch of multilingual education, learning idioms provides a fascinating gateway towards creativity, cultural values, historical context, and diverse perspectives inherent to various linguistic traditions. This paper showcases the navigation of retaining figurative and cultural semantics in low-resource Southeast Asian languages such as Hindi, Bengali, and Thai, where culturally rich idioms pose significant obstacles for computational modeling and cross-linguistic transfer due to their deep metaphorical complexity. To tackle such complexity, we present Varnika, a reconstructed multimodal idiom corpus comprising 3,533 multilingual idioms, enriched with seven idiomatic tones aligned with both textual and visual representations. Additionally, to infer informative idiomatic understanding, we introduce a Hybrid Mixture-of-Experts (HybridMoE) framework that embeds multiple idiomatic expert opinions while mitigating expert sparsity by integrating outputs from both selected and unselected experts through controlled hybridization, further augmented with Idiomatic Property Signals via masked multimodal embeddings. To analyze the performance across multiple dimensions, we propose the IDIO-TONE and Idiomatic Validation Score, a three-stage evaluation pipeline measuring (i) literal translation fidelity, (ii) visual-semantic alignment, and (iii) idiomatic meaning retention. Empirical evaluations highlight that HybridMoE achieves 5--6\% performance gains across advanced vision language models, demonstrating improved representation of figurative language and culturally embedded meaning in multilingual multimodal settings

2606.01640 2026-06-02 cs.AI cs.CL 版本更新

MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

MobEvolve:用于可解释人类移动性生成的智能体自进化启发式系统

Junlin He, Yihong Tang, Tong Nie, Ao Qu, Yuebing Liang, Hamzeh Alizadeh, Bang Liu, Wei Ma, Lijun Sun

发表机构 * The Hong Kong Polytechnic University(香港理工大学) McGill University(麦吉尔大学) MIT(麻省理工学院) Tsinghua University(清华大学) Autorité régionale de transport métropolitain(大都会交通地区管理局) Université de Montréal(蒙特利尔大学) Mila – Quebec AI Institute(魁北克人工智能研究所)

AI总结 提出MobEvolve,首个智能体自进化启发式框架,通过LLM代理迭代演化内部逻辑,在保持可解释性和推理效率的同时,在个体轨迹保真度、群体分布对齐和行为合理性上超越现有方法。

详情
AI中文摘要

人类移动性生成旨在根据个体特征为目标人群合成真实的出行链。现有范式,包括深度生成模型、基于LLM的方法和传统启发式方法,难以同时满足该任务的复杂需求,同时保持可解释性、行为合理性、群体级分布对齐和推理效率。为弥合这一差距,我们引入了MobEvolve,这是首个用于人类移动性生成的智能体自进化启发式框架。MobEvolve初始化一个行为启发的启发式系统,并利用LLM代理迭代演化其内部逻辑。通过在验证集上诊断经验性错位和失败案例,代理提出有针对性的更新并积累演化记忆以实现累积性自我改进。在新加坡和蒙特利尔基准上的广泛评估表明,MobEvolve在个体轨迹保真度、群体级分布对齐和行为合理性方面显著优于最先进的深度生成和基于LLM的方法,同时保持可解释性和高推理效率。

英文摘要

Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms, including deep generative models, LLM-based methods, and traditional heuristics, struggle to satisfy the complex demands of this task while simultaneously maintaining interpretability, behavioral plausibility, population-level distributional alignment, and inference efficiency. To bridge this gap, we introduce MobEvolve, the first agentic self-evolving heuristic framework for human mobility generation. MobEvolve initializes a behavior-inspired heuristic system and employs an LLM agent to iteratively evolve its internal logic. By diagnosing empirical misalignments and failure cases on a validation set, the agent proposes targeted updates and accumulates evolution memory for cumulative self-improvement. Extensive evaluations on the Singapore and Montreal benchmarks demonstrate that MobEvolve significantly outperforms state-of-the-art deep generative and LLM-based methods in individual trajectory fidelity, population-level distribution alignment, and behavioral plausibility, while preserving interpretability and high inference efficiency.

2606.01635 2026-06-02 cs.CL cs.AI 版本更新

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

AlphaToken: 在LLM后训练中解耦适应性与稳定性的路径感知响应令牌估值

Liu Qing, Ou Wu, Yi Du

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院)

AI总结 提出AlphaToken框架,通过解耦适应性(促进目标任务学习)和稳定性(保持预训练能力)并引入路径感知机制,利用Fisher漂移代理和Ghost点积扩展实现高效令牌估值,从而在微调和偏好优化中屏蔽低价值令牌,提升后训练性能并缓解灾难性遗忘。

详情
AI中文摘要

令牌选择对于有效的LLM后训练至关重要。然而,现有方法大多依赖局部启发式,很少将令牌选择形式化为对单个响应令牌的原则性估值。我们引入了$\textbf{AlphaToken}$,一个响应令牌估值框架,它将估值解耦为$\textbf{适应性}$(促进目标任务学习)和$\textbf{稳定性}$(保持预训练能力),并通过结合局部令牌梯度的直接路径信号与自回归生成中的下游因果路径信号,使每个目标具有$\textbf{路径感知}$性。由于保留数据通常不可用,AlphaToken通过锚定在预训练参考模型上的$\textbf{Fisher漂移代理}$来近似稳定性。为了高效计算,我们将Ghost点积扩展到令牌级估值。AlphaToken在微调和偏好优化过程中屏蔽低价值响应令牌,将训练信号集中在更有价值的位置。实验表明,AlphaToken提高了后训练性能并缓解了灾难性遗忘。

英文摘要

Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate token selection as a principled valuation of individual response tokens. We introduce $\textbf{AlphaToken}$, a response token valuation framework that decouples valuation into $\textbf{adaptation}$ (promoting target-task learning) and $\textbf{stability}$ (preserving pre-trained capabilities), and makes each objective $\textbf{path-aware}$ by combining the direct-path signal from local token gradients with the downstream causal-path signal in autoregressive generation. Since retention data are typically unavailable, AlphaToken approximates stability via a $\textbf{Fisher-drift proxy}$ anchored at the pre-trained reference model. For efficient computation, we extend Ghost Dot-Product to token-level valuation. AlphaToken masks low-value response tokens during fine-tuning and preference optimization, concentrating training signals on more valuable positions. Experiments show that AlphaToken improves post-training performance and mitigates catastrophic forgetting.

2606.01617 2026-06-02 cs.CL cs.AI 版本更新

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

EvoPool: 面向标签高效专业监督的进化式程序化标注

Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang

发表机构 * Oregon State University(俄勒冈州立大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 提出进化多智能体框架EvoPool,通过程序化标注器迭代进化与投票聚合,在低标注成本下显著提升专业领域监督性能。

Comments 39 pages, 7 figures. Code: https://github.com/tianyi0216/EvoPool

详情
AI中文摘要

大型语言模型在通用任务上表现出色,但在训练标签成本高昂的专业高风险领域,其性能不如较小的监督模型。我们针对这一场景提出了EvoPool,一个受达尔文进化启发的进化多智能体框架。三个专业智能体迭代地提出可执行的标注器代码,一个小型验证集提供适应度信号,一个确定性门控仅保留通过跨代可行性、多样性和边际贡献检查的标注器。通过EvoAgg(一种结合语义特征与标注器投票特征的文本感知聚合器)将池投票映射为软训练标签。所构建的池在每样本成本接近零的情况下运行,在10万样本上比LLM标注快4500至31000倍。在8个LLM弱专业和复杂任务中的7个(涵盖生物医学关系抽取、法律条款分类、复杂推理和密集多标签生物医学分类)上,EvoPool比最强的LLM标注基线平均高出+0.141 macro-F1,在ChemProt上最高达+0.301,在PubMed上达+0.265。代码见:https://github.com/tianyi0216/EvoPool

英文摘要

Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iteratively propose executable annotator code, a small validation set provides a fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks across generations. Pool votes are mapped to soft training labels by EvoAgg, a text-aware aggregator combining semantic features with annotator-vote features. The authored pool runs at near-zero per-example cost and is 4500 to 31000x faster than LLM annotation on 100K examples. Across 7 of 8 LLM-weak specialized and complex tasks spanning biomedical relation extraction, legal-clause classification, complex reasoning, and dense multi-label biomedical classification, EvoPool beats the strongest LLM annotation baseline by an average +0.141 macro-F1, peaking at +0.301 on ChemProt and +0.265 on PubMed. Code is available at: https://github.com/tianyi0216/EvoPool

2606.01600 2026-06-02 cs.CV cs.CL cs.RO 版本更新

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

RoboTrustBench:机器人操作视频世界模型的可信度基准测试

Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu

发表机构 * Singapore Management University(新加坡国立管理学院) Fudan University(复旦大学) Princeton University(普林斯顿大学)

AI总结 针对视频世界模型在机器人操作中的可信度问题,提出RoboTrustBench基准,包含正常、约束敏感、反事实和对抗四种场景,通过专家验证的指令-图像对和六维评估协议,发现当前模型在约束推理、反事实基础、物理交互和不安全指令抑制方面存在不足。

Comments Project: https://huiqiongli.github.io/RoboTrustBench/

详情
AI中文摘要

视频世界模型越来越多地用于机器人操作,然而现有基准大多在有效、可行和安全的指令下评估它们。我们引入了RoboTrustBench,一个用于评估视频世界模型在四种场景下可信度的基准:正常、约束敏感、反事实和对抗。基于真实世界的DROID片段构建,RoboTrustBench包含1,207个专家验证的指令-图像对和一个六维评估协议,包含13个细粒度标准。通过人类和MLLM评估七个代表性的视频世界模型,我们发现当前模型通常生成视觉上连贯的视频,但在约束推理、反事实基础、物理交互和不安全指令抑制方面存在困难。这些结果表明,视觉质量和表面级别的指令遵循不足以实现可信赖的机器人视频世界建模。

英文摘要

Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.

2606.01584 2026-06-02 cs.CL cs.AI 版本更新

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

识别LLM中高置信度的社会偏见以构建可信的对话辅导代理

Aitor Arronte Alvarez, Naiyi Xie Fincham

发表机构 * University of Hawaii at Manoa(夏威夷大学马诺亚分校)

AI总结 本研究通过生成对话数据集,评估大型语言模型在辅导场景中检测社会偏见的能力,发现模型在对话上下文中比基准测试更难检测偏见,且对错误判断过度自信,影响推理和反馈。

Comments Accepted for AIED 2026

详情
AI中文摘要

对话辅导代理已被证明能提高学习参与度和学生成绩,大型语言模型(LLM)越来越多地被用于这些系统以提供可扩展的个性化反馈。然而,LLM可能会延续或放大刻板的社会偏见,在教育环境中带来特殊风险。在本研究中,我们评估了LLM在对话辅导场景中的表现,以识别高置信度的社会偏见,即模型在无法识别辅导对话中的偏见判断时仍保持高度自信,可能影响其推理和向学习者提供的反馈。我们提出了一种新的数据集生成方法,通过重新生成学生-AI辅导教师互动并引入来自基准数据集的受控偏见轮次,实现在自然教学条件下的偏见评估。利用这些数据,我们评估了多个LLM检测刻板偏见的能力,并通过计算和人工评估分析了其响应背后的置信度和推理。我们发现,在对话辅导上下文中,偏见检测比基于基准的评估更具挑战性,且最先进的LLM对其刻板偏见陈述的错误评估过于自信。此外,模型置信度强烈影响推理和反馈,突显了基于LLM的辅导代理中过度自信和偏见行为的风险。最后,我们讨论了影响、缓解考虑和未来研究方向。

英文摘要

Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are increasingly used in these systems to provide scalable, personalized feedback. However, LLMs may perpetuate or amplify stereotypical social biases, posing particular risks in educational settings. In this study, we evaluate LLMs in conversational tutoring scenarios to identify high-confidence social biases, instances where models are unable to identify biased judgments in tutoring conversations while maintaining strong confidence in their assessments, potentially affecting their reasoning and the feedback they provide to learners. We present a new dataset generation method that enables bias evaluation under naturalistic instructional conditions by regenerating student-AI tutor interactions and introducing turns with controlled bias derived from a benchmark dataset. Using this data, we assess multiple LLMs' ability to detect stereotypical biases and analyze the confidence and reasoning underlying their responses through computational and human evaluations. We find that bias detection is substantially more challenging in conversational tutoring contexts than in benchmark-based evaluations, and that state-of-the-art LLMs are overconfident in their incorrect assessments of stereotypical bias statements. Moreover, model confidence strongly influences reasoning and feedback, highlighting the risks of overconfident, biased behavior in LLM-based tutoring agents. We conclude by discussing implications, mitigation considerations, and directions for future research.

2606.01542 2026-06-02 cs.DC cs.AI cs.CL cs.DB cs.IR 版本更新

Self-Conditioned Positional HNSW for Overlap-Aware Retrieval in Chunked-Document RAG Systems: Method and Industrial Evidence-Quality Audit

自条件位置HNSW:面向分块文档RAG系统的重叠感知检索方法与工业证据质量审计

Nataraj Agaram Sundar, Tejas Morabia

发表机构 * eBay Inc.(eBay公司)

AI总结 提出自条件位置HNSW(SCP-HNSW),通过低维位置编码和两遍查询过程实现重叠感知检索,减少重复证据,并基于工业审计数据验证其有效性。

Comments 11 pages, 5 figures, 4 tables

详情
AI中文摘要

分块文档检索是检索增强生成(RAG)系统的常见组件。文档被分割成重叠的块,嵌入,并使用近似最近邻搜索(如分层可导航小世界图HNSW)进行索引。重叠改善了边界覆盖,但引入了一个实际故障模式:top-k检索通常返回重复证据的相邻块,浪费提示预算。我们提出自条件位置HNSW(SCP-HNSW),这是一种轻量级修改,将低维位置代码附加到块嵌入,并使用两遍查询过程来估计和应用查询特定的文档位置先验。SCP-HNSW保持HNSW图构建和遍历不变,同时为最终上下文构建添加了一个可审计的最小索引间隙选择器。我们还集成了用于生成证据质量的工业审查工件:一个包含318个完全标记审查的770条文本证据审计,以及一个包含350个评级的70例OCR审计。文本审计显示,770个预计审查中有574个被评为3/5,只有39个落在1-2范围内,叙述性审查者细节比结构化问题标志出现得更频繁。OCR审计显示,切片级通过率从干净聊天截图的95%到手写/模糊捕获的45%不等,一致性中等至强。这些结果激励了重叠感知、审计友好的RAG检索,并确定了因果性能声明所需的剩余受控检索消融。

英文摘要

Chunked-document retrieval is a common component of retrieval-augmented generation (RAG) systems. Documents are split into overlapping chunks, embedded, and indexed with approximate nearest-neighbor search such as hierarchical navigable small world graphs (HNSW). Overlap improves boundary coverage but induces a practical failure mode: top-k retrieval often returns near-adjacent chunks that repeat evidence and waste prompt budget. We propose Self-Conditioned Positional HNSW (SCP-HNSW), a lightweight modification that appends a low-dimensional positional code to chunk embeddings and uses a two-pass query procedure to estimate and apply a query-specific document-position prior. SCP-HNSW leaves HNSW graph construction and traversal unchanged while adding an auditable minimum-index-gap selector for final context construction. We also integrate industrial review artifacts for generated evidence quality: a 770-review text-evidence audit with 318 fully labeled reviews and a 70-case OCR audit with 350 ratings. The text audit shows that 574 of 770 projected reviews are rated 3/5, only 39 fall in the 1-2 range, and narrative reviewer detail appears much more often than structured issue flags. The OCR audit shows slice-level pass rates from 95% for clean chat screenshots to 45% for handwritten/blurry captures, with moderate to strong agreement. These results motivate overlap-aware, audit-friendly RAG retrieval and identify the remaining controlled retrieval ablations needed for causal performance claims.

2606.01533 2026-06-02 cs.MA cs.CL cs.LG 版本更新

Multi-Agent Computer Use

多智能体计算机使用

Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对单智能体计算机使用代理在复杂长时任务中的不足,提出多智能体计算机使用系统,通过有向无环图分解任务并并行执行,在多个基准测试上提升3.4-25.5%性能,并加速任务完成时间约1.5倍。

详情
AI中文摘要

目前的计算机使用代理(CUA)主要部署为单序列代理。这种设置对于受益于任务分解、并行执行和基于新信息持续重新规划的复杂长时任务来说并不理想。在本文中,我们认为应该转向评估和构建多智能体计算机使用(MACU)系统。这些系统强调规划和并行执行,缓解了单智能体CUA的许多缺点。我们提出了一种通用的多智能体设置,其中管理模型将计算机使用任务分解为有向无环图(DAG),编码子代理的相关依赖关系和目标。在每次迭代中,管理器调度并行的CUA子代理执行DAG就绪前沿上的节点,并根据子代理的新发现持续修订DAG(添加、取消或重写节点)。这种设计将计算机使用的部分可观察环境作为首要挑战:下游代理可能无法重新观察到的信息通过管理器和DAG结构保留并传递。我们证明,MACU在桌面(OSWorld)和网页导航(Online-Mind2Web、WebTailBench、Odysseys)基准测试上始终比强单智能体基线提升3.4-25.5%,表现出更有利的测试时缩放,并解决了单智能体CUA陷入困境的复杂长时任务。在Odysseys(一个长时网页导航基准测试)上,MACU将平均任务完成墙钟时间提高了约1.5倍,证明了其在加速传统缓慢的CUA流程方面的有效性。我们的研究结果强调,多智能体协调是扩展计算机使用代理以更长时间、更有效地工作的一个有前景的方向。我们在https://jykoh.com/multi-agent-computer-use上发布所有代码和交互式可视化。

英文摘要

Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes computer use tasks as a directed acyclic graph (DAG), encoding relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first class challenge: information that downstream agents may not be able to re-observe are retained and passed forward through the manager and DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by $3.4-25.5\%$ on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU improves average task completion wall-clock time by ${\sim} 1.5 \times$, demonstrating its efficacy in speeding up traditionally slow CUA pipelines. Our findings highlight that multi-agent coordination is a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.

2606.01513 2026-06-02 cs.DC cs.AI cs.CL cs.LG 版本更新

Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense

基于合规评分的Best-of-N护栏编排用于支付争议防御中的多模态文档生成

Nataraj Agaram Sundar, Tejas Morabia

发表机构 * eBay Inc.(eBay公司)

AI总结 提出一种结合多候选生成与合规评分早退机制的护栏编排层,通过并行生成、加权评分和最佳输出选择,在支付争议防御场景中实现高合规率与低延迟。

Comments 8 pages, 7 figures, 4 tables. Preprint. Applied systems paper on compliance-scored guardrail orchestration for multimodal LLM document generation. Contains aggregate operational readouts; not a randomized A/B test

详情
AI中文摘要

高风险企业文档生成,包括金融争议叙述、合规通知和审计摘要,要求模式正确性、策略合规性以及大规模低延迟操作。在统一的护栏层之前,生产系统通常将独立的PII编辑、内容审核和格式验证步骤拼接在一起,导致逻辑碎片化、请求路径变慢和运营成本增加。我们提出了一种针对文本和图像输入的护栏编排层,它将多候选生成与用于早退的显式合规评分相结合。该框架运行可配置的并行生成头,根据加权护栏(包括PII检测、内容审核、模式约束和领域规则)对候选进行评分,并返回具有选择元数据的最佳评分输出。可用的运营读数报告在20秒内进行5次尝试,合规率为91%。对于支付争议防御摘要,我们分析聚合运营场景读数,而非随机A/B测试。可变队列显示总体胜率高于对照组,301/659对比536/1548,对应+11.0个百分点,95%置信区间[6.6, 15.5],p < 0.001;对于调整后的未收到物品案例,+7.5个百分点,95%置信区间[0.2, 15.7],p = 0.045。欺诈和本地证据排名差异方向为正,但在聚合计数数据中不具有统计显著性。我们还报告了来自770次生成证据审查和70例OCR切片的评审校准的负责任AI证据质量信号,并通过请求接口、评分逻辑、伪代码和运营证据边界记录了可重复性边界。

英文摘要

High-stakes enterprise document generation, including financial dispute narratives, compliance notices, and audit summaries, demands schema correctness, policy compliance, and low-latency operation at scale. Prior to a unified guardrail layer, production systems often stitched together separate PII redaction, content moderation, and format validation steps, leading to fragmented logic, slower request paths, and higher operational cost. We present a guardrail orchestration layer for text and image inputs that couples multi-candidate generation with an explicit compliance score used for early exit. The framework runs configurable parallel generation heads, scores candidates against weighted guardrails including PII detection, content moderation, schema constraints, and domain rules, and returns the best-scoring output with selection metadata. The available operational readout reports 5 attempts within 20 seconds and 91 percent compliance. For payments dispute defense summaries, we analyze aggregate operational scenario readouts rather than a randomized A/B test. Variable cohorts show higher count win rates than controls overall, 301/659 versus 536/1548, corresponding to +11.0 percentage points with 95 percent confidence interval [6.6, 15.5] and p < 0.001, and for adjusted item-not-received cases, +7.5 percentage points with 95 percent confidence interval [0.2, 15.7] and p = 0.045. Fraud and local evidence-ranking deltas are directionally positive but not statistically significant from the aggregate count data. We also report reviewer-calibrated Responsible-AI evidence-quality signals from 770 generated-evidence reviews and a 70-case OCR slice, and document the reproducibility boundary through the request interface, scoring logic, pseudocode, and operational evidence boundary.

2606.01503 2026-06-02 cs.CV cs.AI cs.CL 版本更新

On the Limits of Token Reduction for Efficient Unified Vision Language Training

论高效统一视觉语言训练中令牌缩减的极限

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

发表机构 * University of Michigan(密歇根大学) Sony AI(索尼人工智能)

AI总结 本文通过分析层注意力分配,发现视觉理解与视觉生成在令牌冗余上存在不对称性,设计任务特定加速器,但统一训练中任务特定令牌丢弃导致协同损失,表明高效统一建模需保留共享跨任务结构。

详情
AI中文摘要

统一视觉语言模型(VLM)在单个自回归骨干中集成了视觉理解和视觉生成,但其联合训练计算成本高昂且从效率角度常被忽视。在这项工作中,我们研究了基于令牌缩减的加速在统一VLM训练中的可行性和极限。通过对逐层注意力分配的系统分析,我们揭示了一个基本的不对称性:视觉理解在后期层表现出显著的视觉冗余,而视觉生成在深度上对图像令牌保持持续依赖。受此观察启发,我们设计了任务特定的加速器,针对每个目标选择性地减少图像令牌计算。虽然这些方法在孤立设置中实现了显著的效率提升,但我们在统一训练下观察到一致的协同损失——任务特定的令牌丢弃需要不同的参数路径,并消除了联合优化中通常观察到的相互性能增益。我们的发现表明,高效统一建模需要保留共享的跨任务结构,强调了需要协同感知的加速策略。项目页面:https://chicychen.github.io/TokenReductionUnifiedVLM/。

英文摘要

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training -- task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.

2606.01498 2026-06-02 cs.CL cs.AI 版本更新

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

TimeSage-MT:用于评估智能时间序列推理的多轮基准测试

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li, Yilei Shao, Stefan Zohren, Anna Vettoruzzo, Joaquin Vanschoren, Ming Jin, Qingsong Wen

发表机构 * University of Oxford(牛津大学) VulpiVox Intelligence Eindhoven University of Technology(埃因霍温理工大学) Griffith University(格里菲斯大学) Squirrel Ai Learning East China Normal University(华东师范大学)

AI总结 提出TimeSage-MT多轮基准测试,包含240个任务和2680轮对话,覆盖8个真实领域,用于评估LLM智能体在时间序列推理中的表现,揭示其在决策导向任务中的性能下降及记忆、不确定性处理等缺陷。

详情
AI中文摘要

时间序列数据为许多真实世界领域的决策提供信息。虽然大语言模型(LLM)智能体可以通过自然语言和工具分析数据,但目前尚不清楚它们是否能在多轮对话中进行可靠的时间序列分析。现有基准测试侧重于预测和异常检测等单步任务,忽略了用户目标演变、智能体必须基于先前分析以及结论从累积证据中得出的实际工作流程。在这项工作中,我们引入了TimeSage-MT,一个用于智能时间序列推理的多轮基准测试,包含240个任务和2,680轮对话,涵盖8个真实世界领域,从基础探索到决策导向分析。TimeSage-MT通过一个可复现的流程构建,该流程将真实世界的时间序列数据转换为具有可验证答案的多轮对话。它提供了一个统一的评估协议和公共排行榜,用于比较时间序列智能系统。为了展示基准测试的实用性,我们评估了前沿LLM以及TimeSage——一种配备全面时间序列技能库的新型结构化智能体。结果显示,在决策导向任务上性能急剧下降,原因是记忆、不确定性处理和基于领域的决策方面的失败。TimeSage-MT揭示了当前智能推理中的关键差距,并为未来发展提供了严谨的基础。

英文摘要

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark's utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.

2606.01482 2026-06-02 cs.CL 版本更新

Beyond Topical Similarity: Contrastive Evidence Retrieval with Interpretable Attention Alignment in RAG

超越主题相似性:RAG 中具有可解释注意力对齐的对比证据检索

Francielle Vargas, João Robiatti, Diego Alves, Lucas Pascotti Valem, Maximilian Seeth, Sebastián Ferrada, Ameeta Agrawal, Daniel Pedronette, André Freitas

发表机构 * University of Chile(智利大学) São Paulo State University(圣保罗州立大学) Saarland University(萨尔兰州立大学) University of Munich(慕尼黑大学) Portland State University(波特兰州立大学) Idiap Research Institute(Idiap研究机构)

AI总结 提出 CERA 框架,通过基于主观性的困难负样本选择和辅助注意力对齐损失注入证据归纳偏差,实现可解释且事实准确的检索。

详情
AI中文摘要

确保 RAG 中的事实性和可解释性仍然是一个开放且紧迫的问题。我们引入了对比证据理性注意力(CERA),这是第一个采用基于主观性的困难负样本选择并通过辅助注意力对齐损失将证据归纳偏差注入对比学习的检索框架。CERA 使用两个训练目标微调密集检索器:基于三元组的对比学习和可解释注意力对齐,后者通过使用基于词性标注的掩码分布监督 CLS 到 token 的注意力,该分布覆盖人工标注的事实理性作为证据信号。在大型临床试验报告语料库上的实验表明,与 Contriever 和困难负样本选择基线相比,基于主观性的困难负样本选择显著提高了检索效果。此外,理性对齐在保持竞争性检索性能的同时提高了忠实度,支持了注意力在人类理性指导下可以作为模型行为更忠实解释的假设。超越主题相似性,CERA 使检索器能够识别构成支持证据的特定 token,促进了 RAG 系统中更可解释的证据选择。

英文摘要

Ensuring factuality and interpretability in RAG remains an open and urgent problem. We introduce Contrastive Evidence Rationale Attention (CERA), the first retrieval framework to employ subjectivity-based hard negative selection and inject an evidential inductive bias into contrastive learning through an auxiliary attention alignment loss. CERA fine-tunes a dense retriever using two training objectives: triplet-based contrastive learning and interpretable attention alignment, which supervises CLS-to-token attention using a part-of-speech-weighted masking distribution over human-annotated factual rationales as evidence signals. Experiments on a large corpus of clinical trial reports demonstrate that the subjectivity-based hard negative selection substantially improves retrieval effectiveness compared to both Contriever and hard negative selection baselines. Furthermore, rationale alignment improves faithfulness while maintaining competitive retrieval performance, supporting the hypothesis that attention can serve as a more faithful explanation of model behavior when guided by human rationales. Moving beyond topical similarity, CERA enables the retriever to identify the specific tokens that constitute supporting evidence, promoting more interpretable evidence selection in RAG systems.

2606.01479 2026-06-02 cs.CL 版本更新

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

用于文本到语音中可解释情感控制的稀疏自编码器

Hongfei Du, Jiacheng Shi, Sidi Lu, Gang Zhou, Ye Gao

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过稀疏自编码器分析基于LLM的TTS模型中的情感相关潜在特征,提出特征级干预框架实现双向情感诱导与抑制,无需修改模型参数。

Comments Accepted by ICML 2026

详情
AI中文摘要

将大型语言模型(LLMs)集成到文本到语音(TTS)系统中提高了语音的表现力,但可解释的情感控制仍然具有挑战性。现有方法主要依赖于外部条件或全局激活引导,对情感控制背后的内部表示提供的洞察有限。在这项工作中,我们使用稀疏自编码器(SAEs)分析基于LLM的TTS模型语义隐藏状态中与情感相关的变异,以识别稀疏潜在特征。我们的分析表明,情感变异分布在多个稀疏潜在特征上,而对一小部分特征进行干预可以实现可解释的情感控制。基于这一观察,我们引入了一个特征级干预框架,用于双向情感诱导和抑制,而无需修改骨干参数。我们进一步表明,不同的潜在特征与特定的声学属性(例如,音高)相关联,这表明情感表达源于协调的潜在贡献,而非单一的全局变化。实验上,引导这些稀疏潜在特征在情感诱导和抑制性能上达到或优于全局引导和现有的TTS基线。

英文摘要

Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control. In this work, we analyze emotion-related variation in the semantic hidden states of LLM-based TTS models using sparse autoencoders (SAEs) to identify sparse latent features. Our analysis shows that emotional variation is distributed across multiple sparse latent features, while intervening on a small subset enables interpretable emotion control. Building on this observation, we introduce a feature-level intervention framework for bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features are associated with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features achieves comparable or superior emotion induction and suppression performance relative to global steering and existing TTS baselines.

2606.01469 2026-06-02 cs.CL 版本更新

Peacemaker at ATE-IT: Automatic term extraction from Italian text for waste management data using encoder model

Peacemaker at ATE-IT: 使用编码器模型从意大利语文本中自动提取废物管理术语

Mahdi Bakhtiyarzadeh, Hadi Bayrami Asl Tekanlou, Jafar Razmara

发表机构 * Department of Computer Science, University of Tabriz(塔布里兹大学计算机科学系) University of Tabriz(塔布里兹大学)

AI总结 针对ATE共享任务中的Task A,提出一种低计算成本、可解释的自动术语提取方法,通过微调编码器模型在少量资源上实现平衡性能,为低资源模型提供起点。

Comments 9 pages, 2 figures, Published in EVALITA 2026, CEUR Workshop Proceedings Vol. 4195

详情
Journal ref
CEUR Workshop Proceedings, Vol. 4195, 2026
AI中文摘要

自动术语提取的发展在现代技术中变得越来越重要。目前几乎每个可用的搜索引擎中都存在自动术语提取。最近的进展为自动术语的提取提供了有希望的结果;然而,由于多种因素,如可用于训练的标注文档数量有限,以及由于领域变化导致提取多词表达式的复杂性,准确标注是困难的。在本文中,我们将提出一种低成本且可解释的自动术语提取方法,专门为ATE共享任务中的Task A开发。这种新方法利用微调提取策略,可以在少量计算资源上运行。我们使用类型级和微级精确率、召回率和F1分数来评估我们的自动化系统,以衡量提取性能的两个互补方面。根据实验结果,我们提出的方法与其他团队相比,实现了一致且平衡的性能。尽管该技术本身相对简单,但它为低资源模型提供了一个良好的起点。总体而言,研究结果表明,未来在模型扩展方面有可能取得重大进展,同时仍能保持其可解释性。

英文摘要

The development of automatic term extraction has become increasingly important in modern technology. Automatic term extraction can be found in virtually every search engine that is currently available to users. Recent advancements have provided promising results for the extraction of automatic terms; however, accurate labeling is difficult because of several factors, such as the limited number of annotated documents available for training and the complexity of extracting multi-word expressions due to shifts in the domain. In this paper, we will present a low-cost and interpretable method of automatic term extraction, developed specifically for Task A of the ATE Shared Task. This new method utilizes fine-tuning extraction strategies that can run on a small amount of computational resources. We evaluated our automated system using both type-level and micro-level measures of precision, recall, and F1-score to measure both complementary aspects of the extraction performance. According to the experimental results, our proposed approach achieves consistent and balanced performance compared to other teams. Even though the technique itself is relatively straightforward, it serves as a good starting point for low-resource models. Overall, the findings point toward the possibility of significant future advancements (in model expansion) with higher-level performance still able to retain their ability to be interpreted.

2606.01464 2026-06-02 cs.CL 版本更新

Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models

跨语言自一致性:面向语言模型的多语言推理

Ahmed Elhady, Eneko Agirre, Mikel Artetxe

发表机构 * HiTZ Center, University of the Basque Country (UPV/EHU)(巴斯克大学HiTZ中心) Reka AI

AI总结 提出无监督强化学习方法,通过强制模型对跨语言等价问题产生相同答案来增强多语言推理,在MGSM上平均提升21.7%,并展现出强泛化能力。

Comments Paper under review

详情
AI中文摘要

尽管大语言模型(LLMs)的多语言覆盖范围在扩大,但其高级推理能力仍主要局限于少数高资源语言(如英语)。为了解决这一问题,我们提出了一种无监督强化学习(RL)方法,通过强制跨语言自一致性(即模型应对不同语言的等价问题产生相同最终答案)来增强多语言推理。现有方法受限于多语言推理数据的稀缺性,且对未见语言的泛化能力较弱。我们的方法既不需要标准答案,也不需要平行数据,在MGSM的10种语言上平均提升了高达21.7%。此外,我们的方法展现出强泛化能力,在训练期间未见的MGSM语言上平均提升18.2%,在3个分布外基准测试上提升高达6.2%。这些结果表明,基于一致性的方法有潜力在无需监督数据的情况下提升LLMs的多语言能力。

英文摘要

Despite expanding their multilingual coverage, the advanced reasoning capabilities of LLMs remain largely confined to a few high-resource languages like English. To address this, we propose an unsupervised Reinforcement Learning (RL) approach to enhance multilingual reasoning by enforcing cross-lingual self-consistency: the principle that a model should produce the same final answer for equivalent problems in different languages. Existing methods are limited by the scarcity of multilingual reasoning data and show weak generalization to unseen languages. Our approach requires neither gold answers nor parallel data, and it achieves average gains of up to 21.7% on MGSM across 10 languages. In addition, our method demonstrates strong generalization, with an 18.2% mean improvement on MGSM languages unseen during training, and up to 6.2% gain on 3 out-of-distribution benchmarks. These results show the potential of consistency-based methods to improve the multilingual capabilities of LLMs without requiring supervised data.

2606.01462 2026-06-02 cs.AI cs.CL cs.LG 版本更新

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

人工推理之谜:探究大型推理模型中的生成-评估差距

Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan

发表机构 * NUS Department of Computer Science(国立新加坡大学计算机科学系) MIT EECS(麻省理工学院电子工程与计算机科学系) A*STAR(新加坡科技研究局) Singapore-MIT Alliance for Research and Technology (SMART)(新加坡-麻省理工联合研究技术机构(SMART))

AI总结 本文通过VAIR数据集发现大型推理模型在评估推理时存在显著缺陷,表现为答案确认偏差,即模型倾向于验证答案正确性而非仔细检查推理步骤。

Comments 10 pages, 8 figures, 2 tables (Appendix: 19 pages, 13 figures, 3 tables)

详情
AI中文摘要

对人类推理的研究表明,人们通常更擅长评估推理而非从头生成推理。相比之下,大型推理模型(LRMs)经过训练,擅长生成长链推理以解决复杂问题。那么,LRMs在评估推理方面表现如何?我们通过有效答案-无效推理(VAIR)数据集进行研究:该数据集包含数学问题和解决方案,这些解决方案存在琐碎的推理缺陷但答案有效,旨在将推理评估与推理生成混淆因素分离。与人类(我们发现人类在评分此类问题时仅比解决它们差6%)不同,我们发现LRMs存在显著的生成-评估差距:前沿模型在评估VAIR解决方案时得分低至48%,尽管在解决方案生成方面近乎完美。为何存在这一谜团?通过思维链(CoT)分析,我们发现了答案确认偏差的证据:LRMs通常先产生答案,然后检查正确答案,而不是仔细验证每一步,即使在注意到异常推理时也会编造合理化解释。线性探针进一步证实了这一点,表明虽然LRM激活编码了有效推理的某些表示,但它们未能稳健地将VAIR解决方案表示为无效。对最终答案表示的因果修补导致LRM判断和激活翻转,表明答案有效性是模型确认偏差的原因。这些发现揭示了主导推理训练方法的显著局限性,该方法激励LRMs生成并确认朝向正确答案的推理,但未能稳健地评估底层推理。

英文摘要

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models' confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.

2606.01456 2026-06-02 cs.LG cs.CL cs.GT 版本更新

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

诚实的人工智能顾问:偏好错位下大语言模型诚实性的预设基准

Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi, Arshia Gharagozlou

发表机构 * Amazon Lab126, HW Tech Org.(亚马逊实验室126,硬件技术组织) Computational Modeling and Simulation University of Pittsburgh(计算建模与仿真大学匹兹堡分校) Mathematics & Statistics Department University of Minnesota Duluth(数学与统计学系明尼苏达大学 Duluth 分校)

AI总结 通过Crawford-Sobel廉价谈话模型构建基准,评估大语言模型在偏好冲突时是否诚实,发现模型过度揭示信息,偏离策略最优。

Comments 19 pages. Code and data: https://github.com/iHamidHasani/cheap-talk-llm-benchmark

详情
AI中文摘要

大语言模型越来越多地被部署为顾问,其目标与用户不一致:推荐系统优化参与度,销售助手优化购买,谈判代理优化让步。当诚实与自身收益冲突时,这些顾问是否保持诚实是一个核心的对齐评估问题。我们将经典的Crawford-Sobel廉价谈话模型转化为偏好错位下LLM诚实性的预设基准。廉价谈话理论预测既非完全揭示也非沉默,而是粗糙的单调划分,随着偏好冲突增加,信息区间减少。发送者观察到状态omega在[0,1]中,希望接收者的行动接近omega+b,并向理想行动为omega的接收者发送一条无成本消息。设计使用5个偏差水平、3个提示框架、固定的低温度设置和每个单元200个状态:共12,000次发送者调用。对于正偏差网格b∈{0.01,0.04,0.08,0.12},最信息丰富的划分大小分别为7、4、3、2,预言机归一化互信息分别为0.5294、0.3268、0.2205、0.1829。在四个指令调优模型(GPT-4o、Claude Sonnet 4.5、Gemini 2.5 Flash-Lite、Llama-3.3-70B)上运行完整设计,我们发现所有四个模型相对于最信息丰富的均衡过度揭示1.8至4.2倍:归一化互信息保持在0.78-0.94,而预言机规定为0.18-0.53。信息量随偏差下降如预测,但从未接近策略最优;模型显示出近乎完全的揭示,并带有跟踪其偏差的恒定正向偏移(线性夸大)。收益最大化与诚实框架的影响可忽略。解码器消融表明,仅当接收者读取发送者陈述的数字时,该发现才可恢复:仅嵌入解码器将相同数据误读为近乎胡言乱语。

英文摘要

Large language models are increasingly deployed as advisors whose objective is not aligned with the user's: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver's action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12} the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender's stated number: an embedding-only decoder mis-reads the same data as near-babbling.

2606.01451 2026-06-02 cs.CL 版本更新

Before and After Temperature: A Distributional View of Creative LLM Generation

温度前后:创造性LLM生成的分布视角

V. S. Raghu Parupudi, Harsha Ponnada, Aditi Kaushal, S. Shria Parupudi, Saiteja Dasari, Sahiti Bulusu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Indian Institute of Technology, Kharagpur(印度理工学院克哈格布尔分校) Delhi Technological University(德里技术大学) Carnegie Mellon University(卡内基梅隆大学) Rutgers University(罗格斯大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 通过分析采样温度重塑token分布的特征,提出一种无需参考的创造力评估方法,在Llama-3.1-8B-Instruct上达到Spearman ρ=0.918(LLM评判)和ρ=0.870(人类多数评判),显著优于现有基线。

Comments Submitted to NGEN-AI 2026

详情
AI中文摘要

大型语言模型(LLM)创造力的无参考评估依赖于困惑度、熵和top-1边际。我们表明,一个更强的信号存在于流程中更早的一步:在抽取下一个token之前,采样温度如何重塑模型的token分布。在Llama-3.1-8B-Instruct生成的500个开放式创造性提示(温度T∈{0.3,0.8,1.5})上,一个基于这种重塑的单个逐token特征预测提示内创造力排名的Spearman ρ=0.918(与平均gpt-4o/gemini-2.5-pro评判对比,n=500)和ρ=0.870(与三人人类多数排名对比,n=150)。四种标准无参考基线(自困惑度、平均预测熵、top-1边际、gzip压缩比)在两个真实标签上均达到|ρ|≈0.76:与平均LLM评判差距为+0.165,与人类多数评判差距为+0.110,两者均远大于基线之间的差异。两个真实标签面板之间的相关系数为ρ=0.83,高于人类间上限ρ=0.77,因此比较不受评判噪声的瓶颈限制。机制上,优势来自于不连贯区域的尖锐分布特征:在T=1.5时,累积质量宽度n_{95}(q)从约1个token膨胀到约131个token,且温度后的质量从温度前top-90%可能集合中泄漏约13个百分点。逐token聚合无法区分T=0.8和T=0.3;区分这两个连贯区域的任务留给序列级特征。

英文摘要

Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature \emph{reshapes} the model's token distribution before the next token is drawn. On Llama-3.1-8B-Instruct generations of 500 open-ended creative prompts at $T \in \{0.3, 0.8, 1.5\}$, a single per-token feature derived from this reshaping predicts the within-prompt creativity rank at Spearman $ρ{=}0.918$ against an averaged gpt-4o\,/\,gemini-2.5-pro judge ($n{=}500$) and $ρ{=}0.870$ against a three-rater human-majority ranking ($n{=}150$). Each of four standard reference-free baselines (self-perplexity, mean predictive entropy, top-1 margin, gzip compression ratio) tops out at $|ρ|\!\approx\!0.76$ on both ground truths: a gap of $+0.165$ on averaged-LLM and $+0.110$ on human-majority, both far larger than the spread among the baselines themselves. The two ground-truth panels agree with each other at $ρ{=}0.83$, above the inter-human ceiling of $ρ{=}0.77$, so the comparison is not bottlenecked by judge noise. Mechanistically, the win comes from a sharp distributional signature of the incoherence regime: at $T{=}1.5$ the cumulative-mass width $n_{95}(q)$ inflates from $\sim\!1$ to ${\sim}\!131$ tokens and post-temperature mass leaks off the pre-temperature top-$90\%$ plausible set by about $13$ percentage points. The per-token aggregates do not separate $T{=}0.8$ from $T{=}0.3$; discriminating the two coherent regimes is left to sequence-level features.

2606.01444 2026-06-02 cs.AI cond-mat.mtrl-sci cs.CL cs.LG math.CT 版本更新

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

科学中的自我修正发现系统:面向主体人工智能的范畴论框架

Fiona Y. Wang, Markus J. Buehler

发表机构 * Laboratory for Atomistic and Molecular Mechanics(原子分子力学实验室) Department of Biological Engineering(生物工程系) Massachusetts Institute of Technology(麻省理工学院) Department of Civil and Environmental Engineering(土木与环境工程系) Department of Mechanical Engineering(机械工程系) Center for Computational Science and Engineering(计算科学与工程中心) Schwarzman College of Computing(施瓦茨曼计算学院)

AI总结 本文提出一个基于范畴论的框架,通过左Kan扩展实现科学发现中的表征体制转换,并应用于材料科学中的蛋白质力学和纤维网络建模。

详情
AI中文摘要

科学发现不仅是生成答案,更是对证据、人工制品、操作和验证者进行类型化的表征体制的修正。我们为材料科学中的主体发现开发了一个范畴论描述。在固定体制b中,模式类别为S_b,系统状态是一个余预层I_t: S_b -> Set,来源是元素范畴∫_{S_b} I_t。固定体制操作是对此类状态的更新,仅当指定并保留了保持来源的细化时才是自函子。发现则是经过验证的体制转换u: S_b -> S_b':旧人工制品通过左Kan扩展Lan_u I_t保存并传输,并与转换后状态进行比较,以识别超出函子传输的剩余内容。这在不依赖主观新颖性的情况下区分了检索、搜索和发现。我们在两个系统中实例化了该框架。在Builder/Breaker中,蛋白质力学世界模型在最小描述长度门控下进行修正;接受的定律将链内柔性表示为受慢集体模式调节的全模态弹性柔度,即模式调节柔度。在CategoryScienceClaw中,类型化技能、人工制品、开放需求、工作流变异、门控、压力测试和公共话语构成了一个携带证明的知识计算图。一个纤维网络示例记录了候选模型、被拒绝的替代方案、AIC门控、扰动测试以及一个基于各向同性纤维计数描述符的接受取向张量各向异性刚度代理模型。这些案例共同展示了范畴论如何既作为科学发现的数学语言,又作为自我修正AI发现系统的工程规范。

英文摘要

Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, and verifiers are typed. We develop a category-theoretic account of agentic discovery for materials science. In a fixed regime b with schema category S_b, the system state is a copresheaf I_t: S_b -> Set, and provenance is the category of elements \int_{S_b} I_t. Fixed-regime operation is an update on such states, endofunctorial only when provenance-preserving refinements are specified and preserved. Discovery is instead a verified regime transition u: S_b -> S_b': old artifacts are preserved, transported by the left Kan extension Lan_u I_t, and compared with the post-transition state to identify residual content beyond functorial transport. This separates retrieval, search, and discovery without subjective novelty. We instantiate the framework in two systems. In Builder/Breaker, a protein-mechanics world model is revised under a Minimum Description Length gate; the accepted law expresses within-chain flexibility as all-mode elastic compliance conditioned by slow collective-mode participation, or mode-conditioned compliance. In CategoryScienceClaw, typed skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become a proof-carrying knowledge-computation graph. A fiber-network example records candidate models, rejected alternatives, an AIC gate, perturbation tests, and an accepted orientation-tensor anisotropic stiffness surrogate over an isotropic fiber-count descriptor. Together, the cases show how category theory can be both a mathematical language for discovery and an engineering specification for self-revising AI discovery systems.

2606.01436 2026-06-02 cs.CL 版本更新

Learning from Saturated Data: Signals Beyond Correctness for LLM Training

从饱和数据中学习:LLM训练中超越正确性的信号

Hanno Hiss, Jasper Dekoninck, Martin Vechev

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 本文研究在基准测试和训练数据集饱和的情况下,如何利用超越二元正确性的细粒度质量信号(如成对LLM自判断和token级熵)来提升下游性能,实验表明在简单算术任务上质量信号显著优于SFT,但在复杂任务上效果有限且需谨慎校准。

Comments 25 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLM)能力的不断增强导致许多用于改进它们的基准测试和训练数据集趋于饱和。受此启发,我们研究了那些经验上达到完美准确率的问题是否仍可用于提升下游性能。为此,我们将二元正确性替换为两种更细粒度的质量信号来源:(1)成对LLM自判断,即模型评估自身解决方案的相对质量;(2)token级熵,其中token级不确定性作为解决方案质量的代理。我们将这些信号整合到多种训练算法中,并在Qwen3-1.7B-Base上进行了评估。当仅在一个简单算术任务上训练时,基于质量的信号相比基础模型提升了高达18.6%的性能,显著优于SFT。然而,在GSM8K上,提升较为有限,且强烈依赖于质量信号。例如,自判断与更强的外部判断者一致性较差,甚至可能使性能低于基础模型。总体而言,我们的结果表明,基于质量的训练可以从饱和问题中为基础模型提取有用信号,但将此类信号应用于更复杂的任务需要仔细校准和进一步研究。

英文摘要

The growing capabilities of large language models (LLMs) have led to the saturation of many benchmarks and training datasets used to improve them. Motivated by this, we investigate whether questions solved with perfect empirical accuracy can nevertheless be used to improve downstream performance. To do so, we replace binary correctness with two sources of more fine-grained quality signals: (1) pairwise LLM self-judgments, in which the model evaluates the relative quality of its own solutions, and (2) token-level entropy, where token-level uncertainty is used as a proxy for solution quality. We incorporate these signals into several training algorithms and evaluate them on Qwen3-1.7B-Base. When training exclusively on a simple arithmetic task, quality-based signals improve performance by up to $18.6\%$ over the base model, substantially outperforming SFT. On GSM8K, however, gains are more modest and depend strongly on the quality signal. For instance, self-judgments show poor agreement with a stronger external judge and can even degrade performance below the base model. Overall, our results suggest that quality-based training can extract useful signal from saturated questions for base models, but that applying such signals to more complex tasks requires careful calibration and further study.

2606.01435 2026-06-02 cs.AI cs.CL cs.IR 版本更新

Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

不要询问LLM追踪新鲜度:一种确定性的内存冲突解决策略

Vikas Reddy, Sumanth Challaram

发表机构 * IIT Kgp(印度理工学院科钦分校)

AI总结 针对基于LLM的内存系统中事实冲突解决性能低下的问题,提出用候选提取加Python max(serial)的确定性聚合替代LLM判断,在单跳任务上提升10.8个百分点,并扩展到多跳任务。

详情
AI中文摘要

基于LLM的内存系统越来越多地维护随时间演变的事实,其中一个反复出现的失败是冲突解决:当一个事实有多个矛盾的值时,智能体应该返回哪个?MemoryAgentBench (MAB; Hu et al., 2026) 在其FactConsolidation任务中明确了这一点:事实被编号,反事实具有更高的序号,并且智能体被告知较新的事实具有较大的序号。然而,每个已发布的系统表现不佳:HippoRAG-v2在单跳(FC-SH)上达到54%,BM25 48%,Mem0 18%,而时间知识图谱Zep/Graphiti仅为7%。多跳几乎未解决(22个系统中最多7%)。我们认为瓶颈在于组装步骤:基线将冲突解决留给LLM介导的检索或生成,而不是版本感知的聚合。一个匹配设置的比较(相同的主干、检索、分块、TOP_K)表明,用候选提取加Python max(serial)替换LLM判断答案流水线,在FC-SH上(gpt-4o-mini)获得+10.8分的提升,从6K时的+8分扩大到262K时的+21分。这是一个全流水线效应(解析器、提示、格式和温度共同变化);隔离解析器是未来的工作。该配方在FC-SH上达到78.0%(gpt-4o-mini)、94.8%(gpt-4o),在FC-MH上达到30.2%(gpt-4o-mini,使用gpt-4o时升至51.5%),通过每跳确定性的Self-Ask扩展。在匹配的262K下,它比HippoRAG-v2高出+28分,比已发布的最佳FC-MH结果高出+20分。这一含义对该子领域具有纠正作用:冲突解决的瓶颈是组装(检索后聚合),而不是存储。一个LongMemEval知识更新检查表明,该机制从max(serial)移植到max(timestamp),但仅与LLM判断持平(57.8% vs 64.4%,n=45):确定性聚合是当前值冲突的正确原语,并且必须与问题类型感知处理组合,以实现更广泛的内存问答。

英文摘要

LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA.

2606.01434 2026-06-02 cs.CL 版本更新

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

DrugClaw与DrugAudit:基于原始来源的智能体与权威感知基准用于药物信息问答

Qing Wang, Bo Li, Jialu Liang, Daling Shi, Bob Zhang, Qianqian Song

发表机构 * Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida(佛罗里达大学健康结局与生物医学信息学系,医学院) PAMI Research Group, Department of Computer and Information Science, Faculty of Science and Technology, University of Macau(澳门大学科学与技术学院计算机与信息科学系PAMI研究组)

AI总结 提出多智能体检索增强系统DrugClaw,通过反射驱动状态机查询药物注册与药物警戒知识库,并构建含3772条权威感知基准DrugAudit,在多个基准上取得最优性能。

详情
AI中文摘要

药物信息问答是一个高风险场景,其中虚构的事实可能误导临床决策,且每个引用事实的来源与事实本身同样重要。我们提出了DrugClaw,一个多智能体检索增强系统,通过反射驱动的状态机工作流查询药物注册和药物警戒技能库,并返回基于原始监管或同行评审记录的答案。我们还贡献了DrugAudit,一个包含3772个条目的权威感知基准,配备评估面板,在双评判者LLM作为评判者的协议下(评判者间kappa=0.88,几乎完美),对上游黄金来源匹配、令牌级语义片段重叠和引用忠实性进行评分。在DrugAudit以及MedQA(751)和PubMedQA(512)的药物相关子集上,DrugClaw在标题表的每一列均排名第一:两个评判者下的综合证据指数、评判者中介的答案正确性、原始来源率(0.918,比次优高10.1个百分点)、忠实性(0.887,高5.9个百分点)、MedQA(0.920)和PubMedQA(0.693)。

英文摘要

Drug-information question answering is a high-stakes setting where hallucinated facts can mislead clinical decision-making and the provenance of each cited fact matters as much as the fact itself. We present DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills via a reflection-driven state-machine workflow and returns answers grounded in primary regulatory or peer-reviewed records. We also contribute DrugAudit, a 3,772-item authority-aware benchmark with an evaluation panel that scores upstream-of-gold source match, token-level semantic snippet overlap, and citation faithfulness under a dual-judge LLM-as-judge protocol with inter-judge kappa = 0.88 (almost-perfect). Across DrugAudit plus drug-related subsets of MedQA (751) and PubMedQA (512), DrugClaw is top-1 on every column of the headline table: composite Evidence Index under both judges, judge-mediated answer correctness, primary-source rate (0.918, +10.1 pp over next-best), faithfulness (0.887, +5.9 pp), MedQA (0.920), and PubMedQA (0.693).

2606.01400 2026-06-02 cs.CL cs.AI 版本更新

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

一致且独特:基于相似图最大独立集提示选择的LLM基准测试效率

Denica Kjorvezir, Marko Djukanović, Ana Gjorgjevikj, Gjorgjina Cenikj, Tome Eftimov

发表机构 * Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia(计算机系统部,乔塞夫·斯塔芬研究所,卢布尔雅那,斯洛文尼亚) Jožef Stefan International Postgraduate School, Ljubljana, Slovenia(乔塞夫·斯塔芬国际研究生学院,卢布尔雅那,斯洛文尼亚) Center for Astrophysics and Cosmology, University of Nova Gorica, Nova Gorica, Slovenia(天体物理与宇宙学中心,诺瓦戈里察大学,诺瓦戈里察,斯洛文尼亚)

AI总结 提出基于相似图最大独立集的提示选择框架,通过选择多样且非冗余的子集,在保持LLM排名一致性的同时显著减少基准测试成本。

详情
AI中文摘要

在全面基准测试中评估大型语言模型(LLM)既昂贵又耗时。我们提出了一种基于图的提示选择框架,将每个基准建模为相似图——如果提示在嵌入空间中的距离超过可配置阈值,则节点相连——并应用最大独立集(MIS)算法选择最大多样、非冗余的子集。我们评估了四种MIS求解器(CPLEX、GREEDY、Online-MIS、ReduMIS),涵盖六种嵌入模型、三种距离度量、六个百分位数阈值和四个基准(GPQA、IFEval、MMLU-Pro、Omni-MATH),涉及66个LLM。我们的核心假设——不同随机种子下的重复选择会产生一致的LLM排名,且可能不同于完整基准基线——得到强烈证实:在99.2%的随机配置中Kendall's $W \geq 0.90$(平均$W = 0.997 \pm 0.008$),而在较高百分位数阈值下,所选子集平均减少25-48%的提示。与完整基准的排名差异($\rho < 0.95$)仅发生在15.95%的配置中,主要集中在低阈值($p_{10}$-$p_{20}$)和基准(GPQA、IFEval)上,识别出过于密集的图是主要失败模式。

英文摘要

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2\% of stochastic configurations (mean $W = 0.997 \pm 0.008$), while at higher percentile thresholds selected subsets achieve 25--48\% prompt reduction on average. Ranking divergence from the full benchmark ($ρ< 0.95$) occurs in only 15.95\% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.

2606.01394 2026-06-02 cs.CL 版本更新

UniD$^3$: A Knowledge Graph-Enhanced RAG Framework for Drug-Disease Discovery and Reasoning

UniD$^3$:一种用于药物-疾病发现与推理的知识图谱增强RAG框架

Qing Wang, Tianshi Liu, Minghao Zhou, Jialu Liang, Sen Guo, Guangyu Wang, Jing Su, Qianqian Song

发表机构 * Department of Health Outcomes and Biomedical Informatics, University of Florida(佛罗里达大学健康成果与生物医学信息学系) Department of Hematology, H. Lee Moffitt Cancer Center and Research Institute(血液科,H. Lee Moffitt癌症中心与研究院) Center for Bioinformatics and Computational Biology, Houston Methodist Research Institute(生物信息学与计算生物学中心,休斯顿方法主义研究学院) Department of Cardiothoracic Surgery, Weill Cornell Medicine, Cornell University(心胸外科,Weill Cornell医学,康奈尔大学) Department of Biostatistics and Health Data Science, Indiana University School of Medicine(生物统计学与健康数据科学系,印第安纳大学医学院)

AI总结 提出UniD$^3$框架,结合大语言模型与知识图谱增强检索生成(KG-RAG),从生物医学文献中提取、组织和验证药物-疾病知识,生成结构化数据集并提升推理可靠性。

详情
AI中文摘要

药物-疾病关系的系统表征对于药物发现和重定位至关重要,但受限于生物医学文献的异质性和快速增长。现有数据集依赖劳动密集型整理且常不完整,而仅使用大语言模型的方法存在幻觉和证据基础薄弱的问题。我们提出UniD$^3$,一个统一框架,将大语言模型与知识图谱增强检索生成(KG-RAG)相结合,在药物-疾病匹配(DDM)、药物有效性评估(DEA)和药物-靶点分析(DTA)中提取、组织和验证药物-疾病知识。UniD$^3$使用Llama 3.3-70B处理157,849篇PubMed文章,并通过双阶段策略构建知识图谱,该策略结合了论文级提取和以药物和疾病实体为中心的KG级整合。这些图谱支持基于KG-RAG的结构化数据集生成,并通过外部基准测试、与精选资源的模糊匹配以及临床医生评审进行评估。UniD$^3$生成了六个知识图谱和大规模数据集,包括28,915个DDM、15,042个DEA和超过4,000个DTA问答对。外部验证显示性能强劲(DDM/DEA的F1为0.85-0.87;DTA为0.82),临床医生评审确认高可靠性(AUROC = 0.90)。KG-RAG增强模型优于独立大语言模型,UniD$^3$聊天机器人支持可解释、有引用的药物-疾病关系探索。UniD$^3$提供了一个可扩展、可扩展的框架,用于将非结构化生物医学文献转化为高质量、结构化的药物-疾病知识,支持AI驱动的发现、重定位和精准医学。

英文摘要

Systematic characterization of drug-disease relationships is essential for drug discovery and repurposing, yet is hindered by the heterogeneity and rapid growth of biomedical literature. Existing datasets rely on labor-intensive curation and are often incomplete, while LLM-only approaches suffer from hallucination and weak evidence grounding. We introduce UniD$^3$, a unified framework that integrates Large Language Models with Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) to extract, organize, and validate drug-disease knowledge across Drug-Disease Matching (DDM), Drug Effectiveness Assessment (DEA), and Drug-Target Analysis (DTA). UniD$^3$ processes 157,849 PubMed articles with Llama 3.3-70B and constructs knowledge graphs via a dual-stage strategy combining paper-level extraction with KG-level consolidation centered on drug and disease entities. These graphs support KG-RAG-based generation of structured datasets, evaluated through external benchmarks, fuzzy matching with curated resources, and clinician review. UniD$^3$ produces six knowledge graphs and large-scale datasets, including 28,915 DDM, 15,042 DEA, and over 4,000 DTA QA pairs. External validation shows strong performance (F1: 0.85-0.87 for DDM/DEA; 0.82 for DTA), with clinician review confirming high reliability (AUROC = 0.90). KG-RAG-augmented models outperform standalone LLMs, and the UniD$^3$ chatbot enables interpretable, citation-supported exploration of drug-disease relationships. UniD$^3$ provides a scalable, extensible framework for transforming unstructured biomedical literature into high-quality, structured drug-disease knowledge, supporting AI-driven discovery, repurposing, and precision medicine.

2606.01393 2026-06-02 cs.CL cs.AI cs.CV 版本更新

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Dr. DocBench:专家级与困难文档解析的综合基准

Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo, Zhenting Qi, Konwoo Kim, Longtian Ye, Xiaolong Luo, Jinhe Bi, Henry Zhang, Haris Riaz, Xuan Zhang, Yunze Xiao, Bangya Liu, Tom Tang, Yunfei Zhao, Qunshu Lin, Zihan Wang, Minghao Liu, Michael Lingzhi Li, Yilun Du, Jesse Thomason, Rogerio Feris, Alex Pentland, Zexue He

发表机构 * Stanford University(斯坦福大学) MIT(麻省理工学院) Carnegie Mellon University(卡内基梅隆大学) University of Southern California(南加州大学) Harvard University(哈佛大学) IBM Research(IBM研究院) University of Arizona(亚利桑那大学) Duke University(杜克大学) UC Berkeley(加州大学伯克利分校) LMU Munich(慕尼黑路德维希-马克西米利安大学)

AI总结 提出Dr. DocBench基准,通过基于解析器失败的采样从多语言书籍语料库中选取挑战性文档,包含52个BISAC主题领域和65k高质量标注,用于评估专家级文档解析能力。

Comments 27 pages, 13 figures, 14 tables

详情
AI中文摘要

文档解析和识别是视觉语言模型(VLM)和文档处理系统的基本能力。然而,现有的光学字符识别(OCR)和文档解析基准在覆盖范围和难度上日益受限:许多基准专注于常见文档类型或均匀采样的页面,现代解析器在这些页面上已表现良好,而对专家领域结构(如化学公式、乐谱、复杂表格和跨页布局)的标注有限。我们引入了Dr. DocBench,一个面向专家级文档解析的难度感知基准。Dr. DocBench基于大规模多语言书籍语料库构建,涵盖52个BISAC主题领域,并通过基于解析器失败的采样选择挑战性文档,针对多个最先进系统难以处理的案例。它包含来自平均约100页的长文档的4,514个标注页面,具有65k高质量的页面级和块级标注,涵盖布局、阅读顺序、层次关系和特定领域的视觉内容。对基于流水线的解析器和通用VLM的评估表明,在现有基准上的强性能并不能迁移到我们的专家级文档解析中。我们的分析揭示了跨主题、内容类型和结构属性的重大失败,突显了Dr. DocBench作为诊断和推进文档智能的综合测试平台的作用。

英文摘要

Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formula, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual contents. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.

2606.01386 2026-06-02 cs.AI cs.CL cs.DC cs.LG 版本更新

GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning

GuidaPA: 通过联邦学习为公共行政提供隐私保护的聊天机器人

Daniel M. Jimenez-Gutierrez, Albenzio Cirillo, Raffaele Nicolussi, Alessio Beltrame, Andrea Vitaletti

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出GuidaPA,一个基于联邦学习(FL)在意大利公共行政文档上训练的隐私保护聊天机器人,通过参数高效的联邦微调(QLoRA)和角色访问控制,在保持数据本地化的同时实现了接近集中式微调的答案质量。

Comments Accepted to the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)

详情
AI中文摘要

我们提出了GuidaPA,一个为意大利公共行政(PA)设计的隐私保护聊天机器人,它通过联邦学习(FL)在两个国家PA平台SIGESON和SIDFORS的文档上进行训练。我们的语料库包括约8页的SIGESON手册和31页的SIDFORS手册/常见问题解答;虽然本研究使用公开文档作为安全代理,但预期的部署将扩展到受限制的内部来源(例如,工单、官员手册、数据库提取),这些数据由于监管和组织约束无法集中汇集。GuidaPA集成了基于角色的访问控制、安全的客户端预处理、对非独立同分布效应的显式监控以及大语言模型的参数高效联邦微调。使用QLoRA(4位)进行15轮联邦训练,每个客户端采用80/20的训练-测试划分,我们使用ROUGE、BLEU-4和METEOR评估答案质量。最佳联邦模型达到了ROUGE-1/2/L分别为61.10/55.77/59.44,BLEU-4为45.02,METEOR为63.94——接近私有集中式微调的性能,同时保持数据在本地。与通用基线相比,领域微调将ROUGE-1从41.45提高到62.18,BLEU-4从26.97提高到50.90。总体而言,结果表明FL可以在不进行集中数据共享的情况下,为公共服务提供高质量的对话式AI。

英文摘要

We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on documentation from two national PA platforms, SIGESON and SIDFORS. Our corpus includes approximately 8 pages of SIGESON manuals and 31 pages of SIDFORS manuals/FAQs; while this study uses public documentation as a safe proxy, the intended deployment extends to restricted internal sources (e.g., tickets, officer manuals, database extracts) that can not be centrally pooled due to regulatory and organizational constraints. GuidaPA integrates role-based access control, secure client-side preprocessing, explicit monitoring of non-IID effects, and parameter-efficient federated fine-tuning of large language models. Using QLoRA (4-bit) over 15 federated rounds with an 80/20 train-test split per client, we evaluate answer quality with ROUGE, BLEU-4, and METEOR. The best federated model achieves ROUGE-1/2/L of 61.10/55.77/59.44, BLEU-4 of 45.02, and METEOR of 63.94-close to private centralized fine-tuning while keeping data on-site. Compared to the general-purpose baseline, domain fine-tuning improves ROUGE-1 from 41.45 to 62.18 and BLEU-4 from 26.97 to 50.90. Overall, the results indicate that FL can deliver high-quality conversational AI for public services without centralized data sharing

2606.01339 2026-06-02 cs.LG cs.AI cs.CL cs.CV cs.ET 版本更新

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

FreqLite:一种轻量级频率分解线性模型,具有自适应可逆归一化,用于稳健的长期时间序列预测

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

发表机构 * Hamdard University(哈姆达德大学)

AI总结 提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器,通过可学习的无损谱滤波器进行频带分解和线性预测,并引入自适应可逆实例归一化(A-RevIN)处理非平稳性,在长期预测基准上以更少参数和计算资源超越PatchTST等模型。

Comments 26 pages, 5 figures

详情
AI中文摘要

长期时间序列预测需要既准确又能在商用硬件上高效运行的模型。轻量级线性预测器在此领域表现出色,但仍存在两个问题:可逆实例归一化(RevIN)使用单一回溯统计量对整个预测区间进行去归一化,在非平稳性下不准确;时域趋势/季节分解依赖于固定的非自适应滤波器。我们提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器:一个可学习的、无损的单位划分谱滤波器将输入分割成多个频带,由每个频带的线性头进行预测,与低通截断方法不同,高频带被保留并建模。FreqLite在标准长期预测基准上是最佳的轻量级模型,在长回溯(L=336)时,其平均误差低于PatchTST Transformer(0.3244 vs 0.3587 MSE),同时参数减少4倍,内存减少2.2倍,在单块4 GB笔记本GPU上每轮时间减少2.2倍;尽管幅度不大,但在所有匹配单元上的配对Wilcoxon检验中,其改进具有统计显著性(p < 1e-5)。我们进一步引入自适应可逆实例归一化(A-RevIN),一种自适应可逆归一化,严格推广了RevIN(在其门关闭时完全恢复),在非平稳性下起作用,并在平稳数据上无害地退化为RevIN。我们在一个真实的强非平稳数据集(ILI,MSE降低约5%)和一个受控合成漂移扫描中验证了这一点,其中A-RevIN的收益及其学习门都随注入的非平稳性单调增加。每个组件均可独立消融(Linear和RLinear是FreqLite的特例),所有结果均可在商用硬件上复现。

英文摘要

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecasters are remarkably strong in this regime, yet they leave two openings: reversible instance normalization (RevIN) de-normalizes the entire horizon with a single lookback statistic, which is inaccurate under non-stationarity, and time-domain trend/seasonal decomposition relies on a fixed, non-adaptive filter. We present FreqLite, an ultra-lightweight, channel-independent frequency-decomposed linear forecaster: a learnable, lossless, partition-of-unity spectral filter splits the input into bands that are forecast by per-band linear heads and, unlike low-pass-truncation approaches, the high-frequency band is retained and modeled. FreqLite is the best lightweight model on the standard long-term forecasting benchmarks and, at long lookback (L=336), attains a lower average error than a PatchTST Transformer (0.3244 vs. 0.3587 MSE) while using 4x fewer parameters, 2.2x less memory, and 2.2x less time per epoch on a single 4 GB laptop GPU; although modest in magnitude, its improvements are statistically significant under paired Wilcoxon tests across all matched cells (p < 1e-5). We further introduce Adaptive Reversible Instance Normalization (A-RevIN), a regime-adaptive reversible normalization that strictly generalizes RevIN (recovered exactly when its gate is closed), engages under non-stationarity, and reduces to RevIN without harm on stationary data. We validate this on both a real strongly non-stationary dataset (ILI, up to ~5% MSE reduction) and a controlled synthetic drift sweep in which A-RevIN's benefit and its learned gate both rise monotonically with injected non-stationarity. Every component is independently ablatable (Linear and RLinear are special cases of FreqLite), and all results are reproducible on commodity hardware.

2606.01338 2026-06-02 cs.CL 版本更新

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

在生物制药制造中本地LLM的自然语言到SQL查询基准测试:消费级硬件上的实证基准

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

发表机构 * Department of Computer Science, University of the Cumberlands(大学的计算机科学系) Department of Computer Science, DePaul University(德保罗大学计算机科学系) Youngstown State University(亚当斯州立大学)

AI总结 本研究评估了四种本地部署的开源大语言模型在生物制药制造数据库上的自然语言到SQL生成性能,发现代码调优的通用模型优于领域特定模型,但当前性能仍需人工监督。

详情
AI中文摘要

生物制药制造组织在FDA指南、欧盟良好生产规范(GMP)和欧盟AI法案等监管框架下运营,这些框架可能限制基于云的人工智能系统的使用。本地部署的大语言模型(LLM)提供了一种保护隐私的替代方案,但它们在制药制造任务中的适用性仍未得到充分探索。本研究评估了四种通过Ollama本地部署的开源LLM(Qwen 2.5 Coder 7B、Llama 3.1 8B、Mistral 7B和Meditron 7B)在制药制造数据库上的自然语言到SQL生成能力。开发了一个基于FastAPI的评估平台PharmaBatchDB AI,使用一个包含约63,000条记录的合成Microsoft SQL Server数据库,涵盖批次、制造执行系统(MES)和在线清洗(CIP)模块。模型在60个领域特定的自然语言问题上进行了基准测试,使用的指标包括SQL提取率、SQL合规性、事实一致性、ROUGE-L、幻觉率、吞吐量和延迟。Qwen 2.5 Coder 7B、Llama 3.1 8B和Mistral 7B为所有评估任务生成了SQL,而Meditron 7B由于上下文窗口限制和SQL生成能力差,几乎在所有任务上失败。Llama 3.1 8B实现了最高的SQL合规性,而Qwen 2.5 Coder 7B在整体文本相似性和事实一致性方面最强。两个领先模型之间的性能差异在统计上不显著。结果表明,代码调优的通用LLM在制药制造数据的结构化查询生成上优于领域特定的生物医学模型。尽管完全本地化、符合GxP的NLQ系统在消费级硬件上是可行的,但当前性能水平在监管使用中仍需人工监督和下游验证。

英文摘要

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.

2606.01336 2026-06-02 cs.CL 版本更新

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

LongAttnComp:跨族上下文压缩用于长上下文推理

Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, Chen Wu

发表机构 * SambaNova Systems, Inc.(SambaNova系统公司)

AI总结 提出LongAttnComp方法,通过微调轻量级交叉注意力评分层并引入令牌级分块、令牌预算top-p算法、位置重排序和格式无关查询解析器,结合两阶段微调策略,在长上下文推理任务中实现与全上下文相当或更优的准确率。

Comments Under review

详情
AI中文摘要

随着实际应用越来越需要处理10万+令牌的输入,上下文长度与推理效率之间的差距已成为关键瓶颈。上下文压缩提供了一种在保持任务准确性的同时降低预填充成本的方法。然而,现有的无训练注意力方法在代码推理等要求高的长上下文任务中留下了显著差距。我们提出了LongAttnComp,这是AttnComp的长上下文适配,它微调了一个轻量级的交叉注意力评分层,并引入了令牌级分块、令牌预算top-p算法、位置重排序和格式无关的查询解析器。我们进一步为压缩器设计了两阶段微调方案:阶段1从NIAH风格数据构建通用检索基础,阶段2通过多跳和推理数据扩展,以覆盖更广泛的长上下文任务。在InfiniteBench Code-Debug上,LongAttnComp匹配或超过全上下文准确率,显著优于无训练基线,并在来自三个族的四个目标模型上迁移。在LongBench v2上,两阶段方案在很大程度上缩小了阶段1在多文档推理上的差距,同时保持了Code-Debug的性能。

英文摘要

As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps in demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces tokenlevel chunking, a token-budget top-p algorithm, positional reordering, and a formatagnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data for broader long-context task coverage. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.

2606.01323 2026-06-02 cs.CL cs.AI 版本更新

DiffuSent: Towards a Unified Diffusion Framework for Aspect-Based Sentiment Analysis

DiffuSent:面向方面级情感分析的统一扩散框架

Shu Long, Yanglei Gan, Xuchuan Zhou

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Southwest Petroleum University(西南石油大学) Southwest Minzu University(西南民族大学)

AI总结 提出非自回归扩散框架DiffuSent,将方面级情感分析的所有子任务统一为边界去噪扩散过程,通过对比去噪训练策略解决重复预测问题,在28个设置上优于现有生成式和跨度式系统,并实现高达181倍的推理加速。

详情
AI中文摘要

方面级情感分析(ABSA)包含七个不同的子任务,每个子任务关注不同的提取元素。尽管生成模型在统一方面情感分析中取得了成功,现有方法通常依赖于自回归的逐词生成,未能捕捉方面和意见术语的整体信息,导致边界不敏感,特别是在多词方面和意见术语的上下文中。为了解决这些问题,我们提出了DiffuSent,一个非自回归扩散框架,系统地将所有ABSA子任务公式化为边界去噪扩散过程,逐步在噪声状态上细化边界。此外,我们引入了一种对比去噪训练策略,有效解决了扩散过程中引入的细微变化导致的重复预测问题。在28个设置(7个子任务×4个数据集)上的大量实验表明,DiffuSent在最强生成式和跨度式系统上实现了持续改进。DiffuSent在多词三元组上表现出显著增益,平均F1提升+2.48,并在包含多个情感三元组的句子中保持稳健的提取准确性。此外,非自回归解码实现了显著的效率优势,推理速度比自回归生成基线快达181倍。

英文摘要

Aspect-Based Sentiment Analysis (ABSA) encompasses seven distinct subtasks, each focusing on different extracted elements. Despite the proven success of generative models in unified aspect sentiment analysis, existing approaches often rely on auto-regressive token-by-token generation without grasping the whole information of the aspect and opinion terms, resulting in boundary insensitivity, particularly in context of multi-word aspect and opinion terms. To address these issues, we present DiffuSent, a non-auto-regressive diffusion framework that systematically formulates all ABSA subtasks as boundary denoising diffusion processes, progressively refining boundaries over noisy states. Furthermore, we introduce a contrastive denoising training strategy which effectively address duplicate predictions with subtle variations introduced by diffusion process. Extensive experiments across 28 settings (7 subtasks x 4 datasets) demonstrate that DiffuSent achieves delivers consistent improvements over the strongest generative and span-based systems. DiffuSent exhibits notable gains on multi-word triplets, achieving an average improvement of +2.48 F1, and maintains robust extraction accuracy in sentences containing multiple sentiment triplets. Moreover, the non-auto-regressive decoding enables substantial efficiency benefits, reaching up to 181 times faster inference than auto-regressive generative baselines

2606.01322 2026-06-02 cs.CL cs.AI 版本更新

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

TukaBench: 一个基于文化的非洲语言越狱基准

Victor Akinode, Senyu Li, Wassim Hamidouche, Waqas Zamir, Inbal Becker-Reshef, David Ifeoluwa Adelani

发表机构 * Mila - Quebec AI Institute(魁北克人工智能研究院) McGill University(麦吉尔大学) Microsoft AI for Good Research Lab(微软人工智能造福人类研究实验室) Canada CIFAR AI Chair(加拿大CIFAR人工智能主席)

AI总结 针对大型语言模型在非洲低资源语言上的安全评估缺失,提出TUKABENCH基准,通过四种设置(直接翻译、文化适应翻译、人工策划提示、代码切换提示)评估语言、文化背景和提示规避性对模型安全的影响,发现非洲语言提示降低拒绝率,并引入Deflection指标和人工验证以解决模型理解失败和评判可靠性问题。

Comments Under review

详情
AI中文摘要

大型语言模型(LLMs)的安全评估仍然高度以英语为中心,导致低资源语言(LRLs),特别是非洲语言,严重缺乏探索。我们引入了TUKABENCH,一个针对七种非洲语言的越狱基准,它通过四种设置将JailbreakBench(JBB)扩展到直接翻译之外:JBB提示的人工翻译、适应非洲背景的英语提示后人工翻译、通过与GPT-5.2交互验证的人工策划提示,以及结合英语和非洲语言的代码切换提示,从而隔离语言、文化背景和提示规避性对模型安全的影响。在闭源和开源模型中,使用非洲语言提示相比英语减少了拒绝,其中文化适应的提示导致最少的拒绝。评估还揭示了两个结构性限制:模型理解失败和低资源语言中LLM作为评判者的可靠性降低。为了捕捉前者,我们在“拒绝”和“越狱”之外引入了“回避”;为了评估后者,我们通过人工标注验证输出,显示在低资源语言和较少支持的脚本中,评判者与人类的一致性下降。

英文摘要

Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings: human translation of JBB prompts, English adaptation to African contexts followed by human translation, human-curated prompts validated through interactions with GPT-5.2, and code-switched prompts combining English and African languages, isolating the effect of language, cultural grounding, and prompt evasiveness on model safety. Across closed and open models, prompting in African languages reduces refusal relative to English, with culturally adapted prompts leading to least refusal. The evaluation also surfaces two structural limitations: model comprehension failures and reduced LLM-as-a-judge reliability in LRLs. To capture the first, we introduce Deflection alongside Refused and Jailbroken; to assess the second, we validate outputs with human annotations, showing that judge-human agreement drops in lower-resource languages and less commonly supported scripts.

2606.01311 2026-06-02 cs.CL cs.AI cs.LG cs.MA 版本更新

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

SkillAdaptor:基于轨迹的LLM智能体自适应技能

Zhuoyun Yu, Xin Xie, Wuguannan Yao, Chenxi Wang, Lei Liang, Xiang Qi, Shumin Deng

发表机构 * Zhejiang University(浙江大学) Ant Digital Technologies, Ant Group(蚂蚁集团数字技术部) Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph(浙江大学-蚂蚁集团知识图谱联合实验室)

AI总结 提出SkillAdaptor,一种无训练的步骤级技能自适应框架,通过显式故障归因和针对性更新,提升LLM智能体在长程交互任务中的表现。

Comments Work in progress

详情
AI中文摘要

大型语言模型(LLM)智能体越来越依赖可重用的外部技能来解决长程交互任务。现有的无训练技能自适应流程通常从完整轨迹或会话级反馈更新技能,这使得故障归因粗糙,往往产生不稳定或过于宽泛的修订。我们提出SkillAdaptor,一种无训练的步骤级技能自适应框架,具有显式故障归因,并可插入OpenClaw类智能体框架。给定一个失败轨迹,SkillAdaptor识别第一个可操作的故障步骤,将责任关联到候选技能,并在显式接受检查下应用针对性更新,同时保持主干冻结。我们在WebShop、PinchBench和Claw-Eval上使用Kimi-K2.5、GLM-5和GPT-5.2进行评估。SkillAdaptor在所有三个套件上均优于无技能和技能自适应基线,最大的单项指标提升为PinchBench平均得分%提升1.5分,Claw-Eval平均得分提升1.8分,WebShop成功率提升1.7分。这些结果表明,步骤级归因支持更稳定且可审计的无训练技能维护。

英文摘要

Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces unstable or overly broad revisions. We propose SkillAdaptor, a training-free step-level skill adaptation framework with explicit failure attribution, and it can plug into OpenClaw-class agent harnesses. Given a failed trajectory, SkillAdaptor identifies a first actionable fault step, links responsibility to candidate skills, and applies targeted updates under explicit acceptance checks while keeping the backbone frozen. We evaluate on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2. SkillAdaptor improves over no-skill and skill-adaptation baselines on all three suites, with the largest single-metric improvements of +1.5 points on PinchBench Avg Score%, +1.8 on Claw-Eval Avg Score, and +1.7 on WebShop success rate. These results indicate that step-level attribution supports more stable and auditable training-free skill maintenance\footnote{The code will be released at https://github.com/zjunlp/SkillAdaptor.}.

2606.01301 2026-06-02 cs.CL 版本更新

Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning

Med-HEAL:通过幻觉感知的上下文学习分析与缓解医学大语言模型中的幻觉

Yiming Liao, Zeno Franco, Jose Eduardo Lizarraga Mazaba, Keke Chen

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县) Medical College of Wisconsin(威斯康星医学学院)

AI总结 提出Med-HEAL框架,利用临床数据构建幻觉数据集,并通过自我批评和检索增强上下文学习策略缓解医学大语言模型中的幻觉,实验表明自我批评策略可提升多数模型准确率。

Comments 12 pages, 5 figures. Preprint full version of an accepted ACM-BCB 2026 short paper

详情
AI中文摘要

医学大语言模型中的幻觉对临床决策支持构成严重风险,特别是当模型需要推理复杂的电子健康记录时。然而,现有基准通常缺乏真实的临床背景,并且对如何在实践中缓解幻觉提供的见解有限。我们引入了Med-HEAL,一个利用临床基础数据系统性地识别、分析和缓解医学大语言模型幻觉的框架。基于源自MIMIC-IV出院摘要的EHRNoteQA基准,我们通过评估BioMistral-7B在开放式临床问答任务上的表现构建了一个幻觉数据集。模型输出通过一个结合LLM-as-a-Judge评估(GPT-4o)和医学生评审员人工审核的双重评估流程进行标注,通过自定义的基于网络的评估系统产生正确性判断和推理错误注释。然后,我们利用该数据集研究缓解策略:自我批评流程,其中测试模型审查自己的答案以检测潜在错误,并对标记案例重新生成响应;以及检索增强的上下文学习,该模型暴露于幻觉和纠正示例。在五个开源大语言模型(BioMistral、Llama-3.1、DeepSeek、Qwen2.5和Qwen3)上的实验表明,自我批评策略在无需参数更新的情况下提高了五个模型中三个的准确率(p < 0.05)。Med-HEAL既提供了一个可重用的幻觉数据集,也提供了一个实用的框架,用于研究和缓解医学大语言模型中的幻觉,支持AI系统在临床环境中更安全地部署。我们的代码和数据可在https://github.com/yimingliao-blad/med-heal.git公开获取。

英文摘要

Hallucinations in medical large language models (LLMs) pose serious risks for clinical decision support, particularly when models must reason over complex electronic health records (EHRs). However, existing benchmarks often lack a realistic clinical context and provide limited insight into how hallucinations can be mitigated in practice. We introduce Med-HEAL, a framework for systematically identifying, analyzing, and mitigating hallucinations in medical LLMs using clinically grounded data. Building on the EHRNoteQA benchmark derived from MIMIC-IV discharge summaries, we construct a hallucination dataset by evaluating BioMistral-7B on open-ended clinical question answering tasks. Model outputs are labeled through a dual evaluation pipeline that combines LLM-as-a-Judge assessment (GPT-4o) with human auditing by medical student reviewers, producing correctness judgments and annotations of reasoning errors via a custom web-based evaluation system. We then leverage this dataset to investigate mitigation strategies: a self-critique pipeline, in which the test model reviews its own answers to detect potential errors and regenerates responses for flagged cases, and retrieval-augmented in-context learning (RA-ICL), which exposes the model to hallucinated and corrected examples. Experiments across five open-source LLMs-BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3, show that the self-critique strategy improves accuracy for three of five models (p < 0.05) without requiring parameter updates. Med-HEAL provides both a reusable hallucination dataset and a practical framework for studying and mitigating hallucinations in medical LLMs, supporting safer deployment of AI systems in clinical environments. Our code and data are publicly available at https://github.com/yimingliao-blad/med-heal.git.

2606.01298 2026-06-02 cs.CL 版本更新

Challenger at MultiPRIDE: Is It Hate Speech or Reclaimed?

Challenger at MultiPRIDE: 这是仇恨言论还是被重新使用的语言?

Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

发表机构 * University of Tabriz(塔布里兹大学)

AI总结 针对仇恨言论与重新使用语言的区分难题,提出一种结合语义嵌入、标签噪声过滤(Cleanlab+逻辑回归)和MLP分类器的可解释方法,在资源受限下实现稳健性能。

Comments 9 pages, 2 figures, Published in EVALITA 2026, CEUR Workshop Proceedings Vol. 4195

详情
Journal ref
CEUR Workshop Proceedings, Vol. 4195, 2026
AI中文摘要

仇恨言论的传播在现代数字环境中,特别是在社交网络平台上,变得越来越有害。尽管最近的进展在自动仇恨言论检测方面显示出有希望的结果,但一个关键挑战仍然存在:区分真正的仇恨言论和被重新使用的语言。由于重新使用表达具有细微差别和上下文依赖性,准确标注是困难的。在本文中,我们提出了一种简单且可解释的方法来区分仇恨言论和被重新使用的语言,该方法是为MultiPride共享任务开发的。我们的方法生成密集的语义文本嵌入,并使用Cleanlab结合逻辑回归进行标签噪声过滤阶段,然后使用多层感知器(MLP)神经网络进行最终分类。该系统设计用于在有限的计算资源下运行,同时保持强大的性能。我们使用精确率、召回率和F1分数(包括宏平均指标)评估我们的方法。实验结果表明,尽管数据集存在极端类别不平衡,但性能稳健。总体而言,研究结果强调了通过更大的嵌入模型和更先进的预处理技术进一步改进的潜力,同时保持可解释性。

英文摘要

The spread of hate speech has become increasingly harmful in modern digital environments, particularly on social networking platforms. While recent advances have shown promising results in automatic hate speech detection, a key challenge remains: distinguishing genuine hate speech from reclaimed language. Accurate labeling is difficult due to the nuanced and context-dependent nature of reclaimed expressions. In this paper, we present a simple and interpretable approach for distinguishing hate speech from reclaimed language, developed for the MultiPride Shared Task. Our method generates dense semantic text embeddings and incorporates a label-noise filtering stage using Cleanlab with logistic regression, followed by a Multi-layer Perceptron (MLP) neural network for final classification. The system is designed to operate under limited computational resources while maintaining strong performance. We evaluate our approach using precision, recall, and F1-score, including macro-averaged metrics. Experimental results demonstrate robust performance despite extreme class imbalance in the dataset. Overall, the findings highlight the potential for further improvements through larger embedding models and more advanced preprocessing techniques while preserving interpretability.

2606.01294 2026-06-02 cs.CL cs.LG 版本更新

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

不要阅读一切:用于线性注意力的曲率条件查询

Dong Le, Thong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu

发表机构 * Nanyang Technological University(南洋理工大学) National University of Singapore(国立新加坡大学) VinUniversity(文理大学)

AI总结 针对线性注意力在上下文检索和长上下文任务中的不足,提出曲率条件查询(CCQ)机制,通过二阶泰勒展开构建局部二次模型,利用运行键协方差收缩查询向量,仅修改读取步骤,兼容现有线性注意力骨干,在困惑度、零样本下游准确率、检索和长上下文任务上取得提升。

Comments 19 pages

详情
AI中文摘要

线性注意力通过维护一个循环的快速权重状态,降低了 softmax 注意力的二次成本,但在上下文检索和长上下文任务中始终落后。现有的补救措施通过门控、增量更新或核特征映射作用于记忆的写入侧,但读取步骤保持不变:每个过去的键对输出都有加性贡献,因此有用的目标被存储向量的大多数稀释。我们借用 softmax 几何的一个特定部分来构建一个廉价的读取时查询收缩。在等向注意力点处对 softmax 对数配分函数进行二阶泰勒展开,得到一个局部二次模型,其曲率与运行键协方差一致,该量可以通过与线性注意力状态相同的循环/分块机制来维护。相关的线性算子在查询读取状态之前,沿着记忆的高密度方向收缩查询。我们将这种机制称为曲率条件查询(CCQ)。CCQ 仅修改读取步骤,并且可以与任何线性注意力骨干组合。将其附加到 GLA 和 Gated DeltaNet 上,它在困惑度、零样本下游准确率、训练上下文内外的 S-NIAH 检索、从 4K 到 20K 的长度外推困惑度以及 LongBench 准确率上均有提升,且额外成本很小。

英文摘要

Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, but the read step is left unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific piece of softmax's geometry to construct a cheap read-time contraction of the query. A second-order Taylor expansion of the softmax log-partition at the isotropic-attention point gives a local quadratic model whose curvature coincides with the running key covariance, a quantity that can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. The associated linear operator contracts the query along the high-density directions of memory before it reads the state. We call this mechanism Curvature-Conditioned Query (CCQ). CCQ modifies only the read step and is composable with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot downstream accuracy, S-NIAH retrieval at and beyond the training context, length-extrapolation perplexity from 4K to 20K, and LongBench accuracy, at small extra cost.

2606.01286 2026-06-02 cs.SE cs.AI cs.CL cs.LG 版本更新

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

BenchEvolver: 通过以解决方案为中心的进化进行前沿任务合成

Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song

发表机构 * University of California, Berkeley(加州大学伯克利分校) Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院)

AI总结 提出BenchEvolver框架,通过进化参考解决方案自动生成更难的编程问题,以解决基准饱和问题,并在LiveCodeBench和SciCode上验证其有效性。

详情
AI中文摘要

前沿大语言模型的快速进步导致了广泛的基准饱和,限制了现有数据集区分模型能力或提供有用训练信号的能力。例如,在LiveCodeBench上,前沿模型在简单拆分上达到超过99%的Pass@1,在不同难度级别上平均超过90%的Pass@1。构建新的、具有挑战性的数据集通常需要大量人力,成为进步的瓶颈。我们引入了BenchEvolver,一个以解决方案为中心的进化框架,自动将现有编码问题转化为更难的变体。BenchEvolver不是从头生成问题,而是通过结构化变换进化参考解决方案,并从进化后的解决方案中推导出相应的描述和测试。这种设计将生成过程基于可执行语义,使得能够可扩展地构建高质量、多样化和困难的任务,并具有可验证的正确性。将BenchEvolver应用于LiveCodeBench和SciCode,我们获得了显著更难的进化任务,同时保持了有效性、参考正确性和多样性。我们进一步策划了LiveCodeBench-Plus,一个包含91个问题的基准,结合了进化后的任务和困难的原始LCB-v6任务,其中前沿模型的Pass@1范围从27.5%到62.6%,恢复了强编码模型之间的清晰区分。重要的是,即使对于生成它们的模型,进化后的任务仍然具有挑战性,从而实现了自我改进。我们进一步表明,在进化后的LCB任务上进行强化学习提高了留出编码性能:对于gpt-oss-20b,种子+进化训练在LCB v6 Hard和LCB-Pro Easy上分别获得了+8.7和+8.3的Pass@1提升,分别超过仅种子训练的70.7%和34.8%。我们的结果表明,BenchEvolver可以将饱和的基准转化为前沿级别的评估套件和可重用的训练信号。

英文摘要

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.

2606.01276 2026-06-02 cs.CL 版本更新

Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination

词中世界:基于多智能体协调的古代汉语文本文化翻译

Xiaoqi He, Kaixin Lan, Mu You, Tao Fang, Lidia S. Chao, Derek F. Wong

发表机构 * NLP2 CT Lab, Department of Computer and Information Science, University of Macau(澳门大学自然语言处理与计算机实验室,计算机与信息科学系) Institute of International Language Services Studies, Macau Millennium College(澳门 millennium 学院国际语言服务研究学院)

AI总结 针对古代汉语文本中文化负载词的翻译难题,提出多智能体文化感知翻译框架MACAT,通过选择性显化策略动态识别文化短语并注入简洁解释,在中医经典和《论语》上优于基线模型。

Comments The preprint manuscript is 20 pages long and is currently under review

详情
AI中文摘要

基于大语言模型的机器翻译推动了跨文化交流,但在处理古代汉语文本中的文化负载词时仍面临挑战。挑战不仅在于词汇对齐,还在于决定何时以及如何向缺乏相关背景的读者显化文化依赖知识。直译常保留表面形式但缺失深层概念,而过度显化则损害简洁性和可读性。为解决此问题,我们将文化负载词翻译定义为选择性显化任务,并提出MACAT,一个多智能体文化感知翻译框架,动态识别文化显著短语并在必要时注入简洁的解释性知识。MACAT进一步包含一个质量感知重排序模块用于候选选择,以及一个多轮评估智能体,从术语精确性、可读性、忠实度、文化保留和文化显化方面评估翻译。在中医经典和《论语》上的实验表明,在统一的GPT-5.4评估设置下,MACAT在100篇中医文档和《论语》20章子集上始终优于骨干模型和通用机器翻译基线。

英文摘要

Large language model (LLM)-based machine translation has advanced cross-cultural communication, yet it still struggles with culture-loaded words (CLWs) in ancient Chinese texts. The challenge extends beyond lexical alignment to deciding when and how culture-dependent knowledge should be explicated for readers lacking relevant background. Literal translation often preserves surface forms while missing underlying concepts, whereas over-explicitation harms conciseness and readability. To address this problem, we formulate CLW translation as a selective explicitation task and propose \textbf{MACAT}, a \textbf{M}ulti-\textbf{A}gent \textbf{C}ulture-\textbf{A}ware \textbf{T}ranslation framework that dynamically identifies culturally salient phrases and injects concise explanatory knowledge when necessary. MACAT further incorporates a quality-aware reranking module for candidate selection and a multi-round evaluation agent that assesses translations across terminological precision, readability, fidelity, cultural preservation, and cultural explicitation. Experiments on traditional Chinese medicine (TCM) classics and the \textit{Analects} show that, under a unified GPT-5.4 evaluation setting, MACAT consistently outperforms both the backbone model and general-purpose MT baselines on 100 TCM documents and a 20-chapter subset of the \textit{Analects}.

2606.01260 2026-06-02 cs.CL cs.AI 版本更新

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

IndoBias:印尼语言中大语言模型偏见评估的双轨文化基准

Ikhlasul Akmal Hanif, Muhammad Falensi Azmi, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Fajri Koto

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎德大学人工智能大学) Universitas Indonesia(印度尼西亚大学) Independent Researcher(独立研究员)

AI总结 提出IndoBias基准,通过深度和广度双轨评估,发现现有LLM在印尼语和三种地方语言中表现出显著偏见,且预训练数据来源和语言多样性影响偏见程度。

详情
AI中文摘要

尽管印尼拥有超过1300个民族和700种土著语言,但大语言模型中的偏见尚未得到充分研究,从而在其独特广阔、多语言和多样化的社会文化背景下,评估代表性公平性和本地化刻板印象存在关键空白。为解决此问题,我们引入IndoBias作为文化基础的偏见基准,评估LLM在印尼语和三种地方语言(爪哇语、巽他语和望加锡语)中的偏见。IndoBias具有双视角评估轨道:深度导向(使用对比对)和广度导向(基于生成),后者基于社会科学框架(SPI、O*NET和WGI)。我们的结果表明,现有LLM——尤其是解码器模型——对印尼语中的原型句子表现出强烈偏见,而地方语言在意识形态和宗教类别下遭受更高偏见。我们还发现,当使用各种地方实体提示时,LLM响应表现出非均匀的刻板印象极性。最后,我们发现,在印尼语中,Common Crawl文本在预训练期间引入的偏见比人工审核的文章文本(如维基百科、新闻)更多,而将地方语言引入预训练通常会增加偏见。这项工作强调了在特定文化背景下研究偏见的重要性。警告:本文包含可能具有冒犯性、有害性或偏见性的示例数据。

英文摘要

Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied in Indonesia, thus leaving a critical gap in evaluating representational fairness and localized stereotypes within its uniquely vast, multilingual, and diverse sociocultural landscape. To address this, we introduce IndoBias as a culturally-grounded bias benchmark to assess LLMs bias in Indonesian and three local languages: Javanese, Sundanese, and Makasar. IndoBias features dual perspective evaluation tracks: depth-oriented (with contrastive-pairs) and breadth-oriented (with generation-based), where the latter is grounded in social science frameworks (SPI, O*NET, and WGI). Our results show that existing LLMs -- particularly decoder models -- exhibit strong bias towards prototypical sentences in Indonesian, while local languages suffer higher bias under Ideology and Religion category. We also find that LLMs responses exhibit a non-uniform Stereotype Polarity when prompted with various local entities. Finally, we discover that, in Indonesian, Common Crawl texts introduce more bias during pretraining, compared to human-reviewed article texts (e.g., Wikipedia, News), whereas introducing local languages to pretraining generally increases bias. This work highlights the importance of studying bias in culture-specific context. Warning: This paper contains example data that may be offensive, harmful, or biased.

2606.01258 2026-06-02 cs.LG cs.CL eess.SP 版本更新

Beyond Sinusoids: A Morlet Wavelet Framework for Transformer Positional Encoding

超越正弦波:基于Morlet小波的Transformer位置编码框架

Athanasios Zeris

发表机构 * Independent Researcher(独立研究者) Athens, Greece(希腊雅典)

AI总结 提出Morlet位置编码(MoPE),通过可学习的频率和局部带宽统一了正弦位置编码和旋转位置编码,并在TinyShakespeare上结合能量门控注意力提升了0.119的性能。

Comments 16 pages, 4 figures, 4 tables

详情
AI中文摘要

标准Transformer位置编码——正弦编码和旋转位置编码(RoPE)——将每个位置视为同等局部:它们编码了标记的位置,但未编码其位置影响应延伸多远。我们提出Morlet小波(同时最小化位置和频率的不确定性)是位置编码的自然基础,并引入Morlet位置编码(MoPE):每个嵌入维度从数据中学习其自身的频率和局部带宽。主要理论结果是统一:当局部性关闭时(sigma_i -> 无穷大),正弦PE和RoPE相关核都作为MoPE的极限情况出现。MoPE的相位精确恢复了RoPE旋转角度;幅度增加了一个标准编码所缺乏的可学习高斯局部核。实验上,MoPE结合能量门控注意力在TinyShakespeare上比标准注意力提升了0.119,优于任一单独组件。对学习参数的分析显示,所有128个频率-带宽对收敛到小波可容许边界——这一经验观察与关于能量门控的伴随结果一致,表明字符级语言信号的一个可重现性质,值得进一步研究。

英文摘要

Standard positional encodings for transformers - sinusoidal and rotary (RoPE) - treat every position as equally local: they encode where a token is, but not how far its positional influence should extend. We propose that the Morlet wavelet, which simultaneously minimises uncertainty in position and frequency, is the natural basis for positional encoding, and introduce Morlet Positional Encoding (MoPE): each embedding dimension learns its own frequency and locality bandwidth from data. The main theoretical result is a unification: sinusoidal PE and the RoPE correlation kernel both emerge as limiting cases of MoPE when locality is switched off (sigma_i -> infinity). The phase of MoPE recovers the RoPE rotation angle exactly; the amplitude adds a learned Gaussian locality kernel that standard encodings lack. Empirically, MoPE combined with Energy-Gated Attention achieves +0.119 improvement over standard attention on TinyShakespeare, outperforming either component alone. Analysis of the learned parameters reveals that all 128 frequency-bandwidth pairs converge to the wavelet admissibility boundary - an empirical observation consistent with a companion result on energy gating, suggesting a reproducible property of character-level language signals that warrants further investigation.

2606.01255 2026-06-02 cs.CL 版本更新

Agentic Clustering: Controllable Text Taxonomies via Multi-Agent Refinement

Agentic Clustering: 通过多智能体精炼实现可控文本分类

Simon Löwe, Emily Silcock

发表机构 * Burning Glass Institute(Burning Glass研究院) Harvard University(哈佛大学)

AI总结 提出一种基于多智能体(提议者、合成者、审计者、调查者、批评者)的自适应文本聚类方法,通过编排器LLM动态调整聚类流程,在七个公开基准上实现最先进性能,ARI最高提升32%。

详情
AI中文摘要

最近的文本聚类方法使用大型语言模型从语料库中提出聚类分类法,然后将每个文本分配到其中。这些流程本质上是程序化的:LLM调用的顺序以及停止、合并和拆分聚类的规则是预先在代码中固定的,因此它们在不同结构的语料库上泛化能力差,并且难以整合用户提供的约束,如目标聚类数量或聚类意图。我们提出了一种基于智能体的替代方案,其中编排器LLM在每个步骤检查发现过程的状态,并调度一组专门的智能体(提议者、合成者、审计者、调查者和批评者)中的一个,使流程适应语料库,而不是执行固定的流程。在七个公开文本聚类基准上,该方法实现了最先进的性能,在ARI上比最强的先前LLM基线高出32%。

英文摘要

Recent text-clustering methods use large language models to propose a cluster taxonomy from a corpus and then assign each text to it. These pipelines are fundamentally programmatic: the sequence of LLM calls and the rules for stopping, merging, and splitting clusters are fixed in code in advance, so they generalise poorly across corpora of different structure and cannot easily incorporate user-supplied constraints such as a target cluster count or a clustering intent. We propose an agentic alternative in which an orchestrator LLM inspects the state of the discovery process at each step and dispatches one of a small set of specialised agents - proposer, synthesizer, auditor, investigator, and critic - adapting the pipeline to the corpus rather than executing a fixed one. On seven public text-clustering benchmarks the method achieves state-of-the-art performance, beating the strongest prior LLM baseline by up to 32% in ARI.

2606.01252 2026-06-02 cs.CL cs.AI 版本更新

Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization

理解多目标跨语言摘要中的LLM行为

Sangwon Ryu, Yihong Liu, Mingyang Wang, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee, Hinrich Schuetze

发表机构 * GSAI, POSTECH(POSTECH认知科学研究院) CSE, POSTECH(POSTECH计算机科学与工程学院) LMU Munich(慕尼黑大学) MCML LILT(语言信息实验室)

AI总结 针对多目标跨语言摘要任务,提出MEA基准并分析LLM内部机制,发现翻译和摘要行为在后期层联合出现,并引入推理时激活引导方法提升生成质量。

详情
AI中文摘要

多目标跨语言文本摘要(MTXLS)将源文档总结为多种目标语言,随着用户以多种语言消费内容,其重要性日益增加,但仍未得到充分探索。为填补这一空白,我们引入了多目标跨语言元素感知(MEA),这是一个涵盖24种目标语言的新MTXLS基准。我们评估了各种LLM的端到端和流水线方法,并表明MTXLS性能仍远落后于英语单语摘要。为了更好地理解LLM中的MTXLS,我们提出了一种逐层分析框架,用于研究LLM如何在内部执行MTXLS。我们的分析表明,翻译和摘要行为在后期层中联合出现,而不是作为截然不同的分解阶段。大多数任务相关处理发生在这些层内,错误也往往在类似深度出现。受这些发现启发,我们引入了一种推理时激活引导方法,利用英语摘要的隐藏表示来指导MTXLS生成。实验表明,我们的方法在目标语言上持续提高了MTXLS质量。

英文摘要

Multi-target cross-lingual text summarization (MTXLS), which summarizes a source document into multiple target languages, is increasingly important as users consume content in diverse languages, but remains underexplored. To address this gap, we introduce multi-target cross-lingual element-aware (MEA), a new MTXLS benchmark covering 24 target languages. We benchmark end-to-end and pipeline approaches across various LLMs and show that MTXLS performance still substantially lags behind English monolingual summarization. To better understand MTXLS in LLMs, we propose a layer-wise analysis framework for investigating how LLMs internally perform MTXLS. Our analyses suggest that translation and summarization behaviors emerge jointly within later layers rather than as distinctly decomposed stages. Most task-relevant processing occurs within these layers, and errors also tend to arise at similar depths. Motivated by these findings, we introduce an inference-time activation steering method that leverages hidden representations from English summarization to guide MTXLS generation. Experiments show that our method consistently improves MTXLS quality across target languages.

2606.01243 2026-06-02 cs.CL cs.LG 版本更新

Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention

解锁潜在推理的黑箱:一种可解释性引导的干预方法

Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma, Qingyang Liu, Zhaohe Liao, Yibo Miao, Li Niu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学)

AI总结 本文通过结构、因果和几何探针分析潜在推理向量的可解释性,并基于此提出无需训练的解码时干预方法,提升大语言模型推理准确性。

详情
Journal ref
ACL2026 Main
AI中文摘要

潜在推理使大型语言模型(LLMs)能够在连续隐藏状态内执行多步推理,相比显式思维链(CoT)提供了效率提升。然而,这些连续思维向量的不透明性阻碍了其可靠性和可控性。本文弥合了机械可解释性与可操作控制之间的差距。我们首先使用结构、因果和几何探针进行系统分析,揭示潜在向量编码了推理步骤的压缩、忠实表示,其中早期向量作为关键因果枢纽。在此基础上,我们将这些可解释性见解操作化为一套无需训练、解码时干预的方法,通过施加已识别的几何和语义先验来优化潜在推理过程。跨多个模型规模和不同任务领域的广泛实验表明,我们的方法持续提高了推理准确性。我们的可解释性引导干预一致地解锁了潜在能力,并在没有任何参数更新的情况下提高了推理准确性。

英文摘要

Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that our approaches consistently improve reasoning accuracy. Our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.

2606.01240 2026-06-02 cs.CL 版本更新

Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking

基于意图感知检索与语义保持分块的高效RAG

Fachrina Dewi Puspitasari, Chaoning Zhang, Jiaquan Zhang, Zhicheng Wang, Hafiz Shakeel Ahmad Awan, Rizwan Qureshi, Jewon Lee, Tae-Ho Kim, Yang Yang

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院) Massachusetts General Hospital, Harvard University(哈佛大学马萨诸塞州总医院) Nota AI

AI总结 提出InSemRAG框架,通过意图感知检索器和语义保持分块模块,结合迭代检索-检查机制,解决传统RAG的信息不足问题,在多跳和证据敏感任务上取得显著提升。

详情
AI中文摘要

对大型语言模型(LLM)强大指令遵循和推理能力的需求推动了检索增强生成(RAG)的快速发展。RAG系统通过从外部数据库检索与查询匹配的补充知识块来辅助LLM生成。然而,传统RAG系统由于意图无关的检索和信息碎片化两个因素,存在信息不足的问题。我们的工作提出了一个名为InSemRAG的RAG框架,通过迭代检索-检查机制以及两个支持模块——意图感知检索器(IAR)和语义保持分块(SPC)来解决这些挑战。IAR实现了一种动态混合检索方法,根据查询意图自适应地加权检索通道,而SPC对受损的证据块进行检测和修复以保持语义完整性。为了减轻迭代机制带来的计算延迟,我们利用了小型语言模型(SLM)。在多个基准数据集上的大量实验一致表明,我们的方法相对于最近最先进的RAG机制具有竞争力。特别是在多跳和证据敏感任务上,我们的方法取得了显著提升,在HotPotQA上F1提高了2.65个百分点,在FEVER上准确率提高了1.5个百分点。我们的方法还利用SLM实现了与Multi-Hop RAG相当的性能,同时延迟降低了4.32倍。

英文摘要

The demand for powerful instruction following and reasoning capability of large language models (LLMs) has promoted rapid development of retrieval-augmented generation (RAG). The RAG system assists LLM generation by retrieving chunks of query-fit supplementary knowledge from an external database. Conventional RAG systems, however, suffer from information insufficiency due to two factors, which are intent-agnostic retrieval and information fragmentation. Our work proposes a RAG framework, termed InSemRAG, that addresses these challenges via an iterative retrieve-and-check mechanism with two supporting modules, an intention-aware retriever (IAR) and semantics-preserving chunking (SPC). IAR implements a dynamic hybrid retrieval method that adaptively weights the retrieval channels based on the query intent, while SPC performs detection and reparation to the damaged evidence chunks to preserve the semantic integrity. To alleviate the computational latency brought by our iterative mechanism, we leverage small language models (SLMs). Extensive experiments across several benchmark datasets consistently demonstrate the competitiveness of our method against recent state-of-the-art RAG mechanisms. Particularly, our method achieves significant gains on multi-hop and evidence-sensitive tasks, with a 2.65-point improvement in F1 on HotPotQA and a 1.5-point increase in accuracy on FEVER. Our method also achieves competitive performance to Multi-Hop RAG with 4.32$\times$ lower latency with the utilization of SLM.

2605.04638 2026-06-02 cs.CL cs.AI 版本更新

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

相对于语义保持嵌入的梯度揭示大语言模型的不确定性

Mingda Li, Rundong Lv, Xinyu Li, Weinan Zhang, Ting Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出首个基于梯度的自由文本生成不确定性量化方法SemGrad,通过语义空间中的梯度计算实现高效且无需采样的不确定性估计。

Comments Accepted by ICML 2026

详情
AI中文摘要

不确定性量化(UQ)是确保大语言模型(LLM)可信度的重要技术,因为LLM容易产生幻觉。现有的自由文本生成UQ方法严重依赖采样,导致计算成本高且方差大。在这项工作中,我们提出了首个基于梯度的自由文本生成UQ方法SemGrad,它无需采样且计算高效。与先前针对分类任务开发的在参数空间中操作的梯度方法不同,我们提出在语义空间中考虑梯度。我们的方法基于一个关键直觉:自信的LLM应在语义等价的输入扰动下保持稳定的输出分布。我们将这种稳定性解释为语义空间中的梯度,并引入语义保持分数(SPS)来识别最能捕捉语义的嵌入,并针对这些嵌入计算梯度。我们进一步提出了HybridGrad,它结合了SemGrad和参数梯度的优势。实验表明,我们的两种方法都提供了高效且有效的不确定性估计,在多个有效响应的设置中尤其优于现有方法。

英文摘要

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks that operates in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret the stability as the gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, achieving superior performance than state-of-the-art methods, particularly in settings with multiple valid responses.

2605.02277 2026-06-02 cs.CL 版本更新

CECOR: Correction-oriented synthetic data construction for factual error correction

CECOR:面向事实错误纠正的修正导向合成数据构建

Lei Zhu, Xiaobao Wang, Jianbiao Yang, Chenyang Wang, Dongxiao He, Longbiao Wang, Jianwu Dang

发表机构 * Tianjin University(天津大学)

AI总结 提出CECoR框架,通过分解与注入范式合成高质量训练数据,结合两阶段学习策略,有效提升多跳事实错误纠正的准确性和鲁棒性。

详情
AI中文摘要

事实错误纠正(FEC)旨在将不准确的文本修改为与外部证据事实一致的陈述。尽管近期方法在单跳纠正上表现良好,但它们通常将声明视为原子单元,难以处理需要跨多个证据源进行组合推理的多跳情况。有限的配对数据和复杂推理链中定位语义错误的困难进一步放大了这一挑战。我们提出了CECoR(基于推理感知的组合错误纠正),一个推理感知框架,引入了分解与注入范式用于组合错误纠正。CECoR将多跳声明分解为可解释的推理步骤,并注入受控扰动以合成高质量的训练对。结合监督微调和强化学习的两阶段学习策略提高了事实准确性和鲁棒性。全面评估表明,CECoR在多跳基准上取得了强劲性能,优于远程监督方法和少样本LLM基线。它还能有效泛化到单跳纠正,并在噪声证据下保持稳定,展示了其在真实世界事实纠正中的多功能性。

英文摘要

Factual Error Correction (FEC) aims to revise inaccurate text into statements that are factually consistent with external evidence. Although recent methods perform well on single-hop correction, they often treat claims as atomic units and struggle with multi-hop cases that require compositional reasoning across multiple evidence sources. This challenge is further amplified by limited paired data and difficulties in locating semantic errors within complex reasoning chains. We present CECoR (Compositional Error Correction via Reasoning-aware Synthesis), a reasoning-aware framework that introduces a Decomposition and Injection paradigm for compositional error correction. CECoR decomposes multi-hop claims into interpretable reasoning steps and injects controlled perturbations to synthesize high-quality training pairs. A two-stage learning strategy combining supervised fine-tuning and reinforcement learning improves factual accuracy and robustness. Comprehensive evaluations show that CECoR achieves strong performance on multi-hop benchmarks, outperforming both distantly supervised methods and few-shot LLM baselines. It also generalizes effectively to single-hop correction and remains stable under noisy evidence, demonstrating its versatility for real-world factual correction.

2606.01223 2026-06-02 cs.CL cs.AI 版本更新

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

连接点:长时对话中的反思性记忆基准测试

Jingjie Lin, Bingbing Wang, Zihan Wang, Zhengda Jin, Weiming Qiao, Jing Li, Ruifeng Xu

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) The Hong Kong Polytechnic University(香港理工大学) Fudan University(复旦大学) Peng Cheng Laboratory(鹏城实验室)

AI总结 针对现有基准无法衡量从碎片化多模态线索合成高层解释的反思性记忆问题,提出RefMem-Bench基准和REMIND层次框架,通过渐进式证据感知、定位和抽象提升模型反思性记忆能力。

Comments 9 pages, 6 figures

详情
AI中文摘要

尽管长上下文建模取得了显著进展,现有基准仍局限于显式回忆的事实性记忆,未能衡量将碎片化、多模态线索合成为高层解释所需的反思性记忆。为填补这一空白,我们引入了RefMem-Bench,一个用于长时对话中反思性记忆的基准。RefMem-Bench包含26K个带注释的问答实例,涵盖八个反思性记忆维度和三种任务格式,要求模型超越表面检索,从分布在整个交互历史中的证据推断潜在含义。为增强反思性记忆能力,我们提出了反思性记忆归纳(REMIND),一个将反思性记忆视为渐进意义构建的层次框架。REMIND结合了问题条件证据检索、显著性感知定位和抽象级别监督,并使用渐进式反思对齐将高层反思性推理提炼到事实推理路径中。实验表明,RefMem-Bench对当前模型构成了重大挑战,而REMIND通过渐进式证据感知、定位和抽象,持续提高了答案准确性和记忆回忆率。

英文摘要

Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into high-level interpretations. To address this gap, we introduce RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue. RefMem-Bench contains 26K annotated QA instances with eight reflective-memory dimensions and three task formats, requiring models to move beyond surface-level retrieval and infer latent meanings from evidence distributed across interaction histories. To enhance reflective memory capability, we propose REflective Memory INDuction (REMIND), a hierarchical framework that treats reflective memory as progressive meaning construction. REMIND couples question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision, and uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway. Experiments show RefMem-Bench poses a substantial challenge to current models, while REMIND consistently improves both answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.

2606.01215 2026-06-02 cs.CV cs.AI cs.CL cs.MM 版本更新

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

将神经符号程序蒸馏到3D多模态大语言模型中

Wentao Mo, Yang Liu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出APEIRIA,通过三阶段课程学习将符号推理模式蒸馏到3D多模态大语言模型中,实现透明推理与开放词汇空间推理的统一。

Comments To appear in ICML 2026

详情
AI中文摘要

当前的3D空间推理方法面临根本性权衡:神经符号3D(NS3D)概念学习器通过组合程序实现可解释推理,但受限于封闭集概念词汇和简单程序;端到端3D多模态大语言模型(3D MLLMs)能处理复杂自然语言和开放词汇概念,但缺乏显式空间验证的黑箱推理。我们提出APEIRIA,一种神经符号3D MLLM,通过将符号推理模式以自然语言思维链形式蒸馏到MLLMs中,桥接两种范式。我们的三阶段课程逐步构建推理能力:a) 3D感知对齐将物体视觉-几何特征接地到LLM,b) CoT-SFT从符号程序轨迹中教授查询分解和逐步验证,c) CoT-RL将推理模式扩展到开放集概念和深度嵌套指令。通过迁移推理模式而非概念特定知识,APEIRIA保留了NS3D的关键优点:透明推理以及规划和感知组件的模块化可互换性。在接地、问答和描述任务上的评估表明,APEIRIA超越了先前的NS3D方法,并在3D空间推理数据集上匹配最先进的3D MLLMs,统一了符号方法的系统推理与MLLMs的灵活性。代码见https://github.com/oceanflowlab/APEIRIA。

英文摘要

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.

2606.01213 2026-06-02 cs.CV cs.AI cs.CL 版本更新

TECCI: Tricky Edits of Collected and Curated Images

TECCI:收集与策划图像的棘手编辑

Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge

发表机构 * Google Research(谷歌研究) Google DeepMind(谷歌深Mind)

AI总结 提出TECCI基准,包含7550对图像与编辑指令,通过人工与自动评估揭示现有图像编辑模型在指令遵循、最小编辑和视觉质量方面的不足。

详情
AI中文摘要

尽管近期取得了巨大进展,但当前的文本引导图像编辑方法在涉及指令遵循、最小化编辑源图像以及确保高视觉质量等多个方面仍面临困难。当请求的编辑具有挑战性时,例如涉及位置、运动、视角、比例和创意编辑,这些问题尤为明显。为了系统性地测试生成式图像编辑器,我们提出了一个新的图像编辑基准——TECCI:收集与策划图像的棘手编辑。TECCI包含我们发布的全新图像集。TECCI中的图像涵盖7个图像类别。这些图像和类别经过有意策划,以针对现有方法的弱点。TECCI中的编辑指令由Gemini自动生成,每个源图像覆盖5种编辑类型。我们还策划了一组530张图像,为其创建了具有挑战性的人工编写编辑指令。总体而言,TECCI包含7550对图像和编辑指令。我们对TECCI上的五个领先图像编辑模型进行了人工评估。人类从三个维度判断输出:1)指令遵循,2)编辑的最小性,以及3)视觉质量。为了扩大评估规模,我们还使用Gemini构建了一个自动评分器,在匹配人类评估方面达到了74.7%的准确率。我们的评估揭示:1)没有一个模型的总体成功率超过22%,这显示了TECCI的挑战性;2)Nano Banana Pro是整体表现最好的模型;3)模型在指令遵循方面表现显著优于最小编辑和视觉质量;4)模型在编辑建筑和自然图像方面存在困难,这些需要较强的空间布局和复杂视觉细节理解能力;5)推理和创意编辑是最困难的,而颜色和外观编辑是最容易的。

英文摘要

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale-up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images which require strong understanding of spatial layout and intricate visual details. 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

2606.01204 2026-06-02 cs.CL cs.AI cs.CY 版本更新

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

LLM医疗分诊中的隐式地理推断:语言驱动的急诊推荐差异

Qi Han Wong

发表机构 * GitHub

AI总结 研究大型语言模型在相同症状下,仅因患者提示语言不同而产生不同的医疗分诊推荐,发现模型根据输入语言隐式推断地理位置,导致急诊推荐率差异显著。

Comments 7 pages, 4 tables. Code and data at https://github.com/wongqihan/ai-behavioral-experiments

详情
AI中文摘要

我们研究大型语言模型是否仅根据患者提示的语言,对相同症状产生不同的医疗分诊推荐。使用Gemini 3.5 Flash,我们评估了六种语言(英语、西班牙语、中文、印地语、日语、阿拉伯语)下的神经症状特征(持续性头痛、视力模糊、恶心),每种条件运行30次(共450次API调用)。我们发现,尽管模型在所有语言中分配的严重程度评分几乎相同(7.7-8.0/10),但急诊室就诊推荐率从0%(日语、印地语)到30%(英语、阿拉伯语)不等。添加一句指定患者位于美国的句子,非英语提示的急诊推荐率最多增加76.7个百分点,而反向锚定(英语提示加上东京地点)将急诊率从30%降至6.7%。回译控制(日语到英语)产生的急诊率与英语基线相当,证实差异并非由翻译质量引起,而是由输入语言的隐式地理推断所致。我们发布了完整的数据集、实验代码和结果。

英文摘要

We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.

2606.01202 2026-06-02 cs.AI cs.CL cs.LG 版本更新

The Shape of Wisdom: Decision Trajectories in Language Models

智慧的形状:语言模型中的决策轨迹

Shailesh Rana

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过分析三种语言模型在MMLU上的9000条轨迹,提出用答案边际、边际变化和决策翻转距离描述轨迹,发现正确性与稳定性不同,并探究了注意力与MLP标量对边际的影响。

Comments 6 pages, 5 figures. Code and derived artifacts: https://github.com/gut-puncture/The-Shape-of-Wisdom

详情
AI中文摘要

语言模型并非简单地在输出层选择一个答案。在一项包含9000条轨迹的MMLU研究中,涉及Qwen2.5-7B-Instruct、Llama-3.1-8B-Instruct和Mistral-7B-Instruct-v0.3,答案的分数在深度上以结构化方式移动。我们用三个量描述每条轨迹:当前答案边际、该边际的下一层变化,以及距离决策翻转的距离。主要经验图景是正确性和稳定性是不同的:最大的群体是不稳定-正确的,而不是稳定-正确的。然后,一个追踪的子集询问是什么推动了边际。在稳定-正确的情况下,平均注意力标量指向正确的方向,而平均MLP标量则不然;跨度删除显示,移除支持答案的文本会损害边际,而移除类似干扰项的文本则有助于边际。结果并非完整的电路解释。它是一种可重复的方式,用于查看哪些答案已确定,哪些仍然脆弱,以及哪些测量来源推动了它们。

英文摘要

Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.

2606.01196 2026-06-02 cs.CL cs.AI 版本更新

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

低资源安全失败是行动失败,而非表征失败

Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎伊德大学人工智能学院)

AI总结 本文发现低资源语言的安全对齐失败源于决策校准问题而非表征缺失,通过重校准高资源门控(低秩逻辑回归+阈值重置)显著提升拒绝选择性。

详情
AI中文摘要

在高资源语言中学习的安全对齐在低资源语言中迁移效果不佳。模型能拒绝英文有害提示,但当相同提示翻译成斯瓦希里语或缅甸语时则无法拒绝。自适应引导方法如AdaSteer和CAST在跨语言中继承了这一失败。我们诊断了迁移失败的原因。在Qwen2.5-7B、Gemma-2-9B和Llama-3.1-8B模型上,针对23种语言,从高资源激活中提取的有害方向几乎能像高资源提示一样线性分离低资源有害与无害提示。相关表征存在。然而,有害拒绝率从87.9%下降到43.9%。模型未能将表征转化为拒绝。未能迁移的是安全决策的校准,而非底层表征。我们利用这一点,通过重校准而非重新训练高资源门控:一个低秩逻辑回归读出器,其决策阈值使用每类仅1到4个目标语言示例重置。该门控在拒绝引导和有害方向消融之间路由,将平均拒绝选择性(Δ = 有害 − 无害拒绝)从最强自适应基线的33.6显著提高到54.5,同时保持MMLU效用。这些结果表明,一些低资源安全失败可以通过重校准现有表征而非学习新表征来修复。我们的代码已发布:https://github.com/rashadaziz/low-resource-safety。

英文摘要

Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity ($Δ$ = harmful $-$ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low-resource-safety.

2606.01182 2026-06-02 cs.CL cs.AI 版本更新

CA-BED: Conversation-Aware Bayesian Experimental Design

CA-BED:对话感知的贝叶斯实验设计

Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal, Kevin Zhu, Sunishchal Dev, Gabriel Grand, Shreyas Sunil Kulkarni

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 提出对话感知的贝叶斯实验设计(CA-BED),一种推理时概率对话规划框架,通过结合贝叶斯实验设计与LLM似然估计,在多个对话轮次中优化问题选择,在结构化实体推断基准上平均成功率提升21.8%,仅增加1.8轮对话。

Comments Reliable Autonomy Workshop at ICLR 2026

详情
AI中文摘要

大型语言模型(LLM)在静态推理任务中表现出色,但在需要通过提问主动获取信息的交互场景中,其性能往往会下降。一个关键挑战在于选择能够减少不确定性同时纳入可能模糊或仅部分信息性的回应的问题。为了解决这个问题,我们提出了对话感知的贝叶斯实验设计(CA-BED),一种推理时概率对话规划框架,它将贝叶斯实验设计与基于LLM的似然估计相结合,以在多个对话轮次中优化问题选择。CA-BED维护关于假设的信念分布,预测可能的答案,并通过模拟对话树传播期望信息增益。在两个结构化实体推断基准上,CA-BED相比直接提示实现了平均21.8%的成功率提升,相对于其他信息寻求方法也有相当的增益。与直接提示相比,它仅平均增加了1.8个对话轮次就实现了这些增益。

英文摘要

Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where information must be actively acquired through questioning. A key challenge lies in selecting questions that reduce uncertainty while incorporating responses that may be ambiguous or only partially informative. To address this, we propose Conversation-Aware Bayesian Experimental Design (CA-BED), an inference-time probabilistic dialog planning framework that integrates Bayesian Experimental Design with LLM-based likelihood estimation to optimize question selection over multiple conversational turns. CA-BED maintains a belief distribution over hypotheses, anticipates possible answers, and propagates expected information gain through a simulated conversation tree. Across two structured entity-deduction benchmarks, CA-BED yields an average 21.8% improvement in success rates over direct prompting, with comparable gains relative to alternative information-seeking methods. It achieves these gains with an average increase of only 1.8 conversational turns compared to direct prompting.

2606.01168 2026-06-02 cs.CL 版本更新

Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

经济思维:面向LLM自适应复杂度推理的分层框架

Yubo Gao, Haotian Wu, Hong Chen, Junquan Huang, Yibo Yan, Jungang Li, Zihao Dongfang, Sicheng Tao, Puay Siew Tan, Jie Zhang, Xuming Hu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) The Hong Kong University of Science and Technology(香港科学与技术大学) Nanyang Technological University(南洋理工大学) Singapore Institute of Manufacturing Technology, A*STAR(新加坡制造技术研究所,A*STAR)

AI总结 针对LLM推理中的“过度思考”问题,提出分层自适应预算器(HAB)框架,通过粗粒度到细粒度的预算分配实现计算资源的高效利用,在GSM8K和MATH500上同时提升准确率和降低token使用量。

Comments 11 pages, 4 figures, 3 tables

详情
AI中文摘要

思维链(CoT)显著增强了LLM的推理能力,但常常因“过度思考”而产生大量计算开销:生成过长的推理过程却没有相应的精度提升。现有的效率方法通常采用统一压缩,忽略了推理复杂度在两个不同粒度上具有异质性这一关键观察:不同问题之间以及单个推理步骤内部。这激发了我们“经济思维”的原则:根据内在任务和步骤需求智能分配计算资源,而非追求统一的简洁性。我们提出分层自适应预算器(HAB),一个通过从粗到细的预算分配来实现该原则的训练框架。在步骤间层面,HAB预测每个问题的最优推理深度。在步骤内层面,HAB从基于困惑度的步骤比较和自适应帕累托优化目标中学习步骤特定的token预算信号,该目标捕捉局部质量-效率权衡,同时基于Fisher信息的剪枝器进一步提供细粒度的训练时指导,从而鼓励生成器内化更经济的推理模式。在GSM8K和MATH500上的实验表明,HAB不仅在准确率上超越了标准CoT,还减少了token使用量,实现了比对比基线更强的性能-效率权衡。

英文摘要

Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to "overthinking": generating excessively long rationales without commensurate accuracy gains. Existing efficiency methods typically apply uniform compression, which overlooks a critical observation that reasoning complexity is heterogeneous at two distinct granularity: across different problems and within individual reasoning steps. This motivates our principle of Thinking Economically: intelligently allocating computational resources based on intrinsic task and step demands rather than pursuing uniform brevity. We propose Hierarchical Adaptive Budgeter (HAB), a training framework that operationalizes this principle through coarse-to-fine budgeting. At the inter-step level, HAB predicts the optimal reasoning depth for each problem. At the intra-step level, HAB learns step-specific token budgeting signals from PPL-derived step comparisons and an adaptive Pareto optimization objective that captures the local quality-efficiency trade-off, while a Fisher Information-based pruner further provides fine-grained training-time guidance, thereby encouraging the generator to internalize more economical reasoning patterns. Experiments on GSM8K and MATH500 show that HAB not only surpasses standard CoT in accuracy but also reduces token usage, achieving a stronger performance-efficiency trade-off than the compared baselines.

2606.01148 2026-06-02 cs.CL 版本更新

Not All Explanations Simulate Equally: Comparing Verbalized Feature Attributions and Self-Generated Rationales

并非所有解释都能同等模拟:比较言语化特征归因与自生成理由

Pingjun Hong, Benjamin Roth

发表机构 * Faculty of Computer Science, University of Vienna(维也纳大学计算机科学系) UniVie Doctoral School Computer Science, University of Vienna(维也纳大学计算机科学博士学院) Faculty of Philological and Cultural Studies, University of Vienna(维也纳大学文学与文化研究系)

AI总结 本研究通过反事实模拟设置,比较了言语化特征归因和自生成理由两种解释来源对问答模型行为可模拟性的影响,发现解释格式和粒度显著影响模拟效果。

详情
AI中文摘要

自然语言解释通常被视为理解模型行为的统一接口,但不同的解释来源可能以不同方式支持模拟。本文比较了问答模型的两种解释家族:言语化特征归因和自生成理由。我们在共享的反事实模拟设置下评估它们,使用LLM评判器作为预测器,并测量在给定解释时,它能否更好地预测模型对后续问题的答案。在多个指令微调模型上,我们分析了解释来源、言语化策略和特征粒度如何影响解释的可模拟性。我们的结果表明,解释格式和粒度影响可模拟性:基于归因的解释和自生成理由在改善反事实预测方面存在差异,且效果因模型和格式而异。

英文摘要

Natural-language explanations are often treated as a unified interface for understanding model behavior, but different explanation sources may support simulation in different ways. This paper compares two families of explanations for question answering models: verbalized feature attributions and self-generated rationales. We evaluate them under a shared counterfactual simulation setting, using an LLM judge as predictor and measuring whether it can better predict a model's answers to follow-up questions when given its explanation. Across multiple instruction-tuned models, we analyze how explanation source, verbalization strategy, and feature granularity affect the simulatability of explanations. Our results show that explanation format and granularity affect simulatability: attribution-based explanations and self-generated rationales differ in how much they improve counterfactual prediction, with effects that vary across models and formats.

2606.01136 2026-06-02 cs.CL 版本更新

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

从异常到错误:使用多参考裁决审计巴利语到英语的LLM翻译

Máté Metzger, Nadnapang Phophichit, Hansa Dhammahaso

发表机构 * Independent Researcher(独立研究者) Nibbana Meditation Centre(尼布达冥想中心)

AI总结 针对大语言模型翻译古典语言时误将合理变异标记为错误的问题,提出基于多个人类翻译参考包络和嵌入漂移阈值筛选、再由LLM评审团裁决的审计方法,并应用于巴利语-英语翻译。

Comments Preprint. This manuscript has not yet been peer reviewed

详情
AI中文摘要

单一评分翻译指标可能混淆合理变异与错误,这一问题对于古典语言尤为严重,因为同一段落的多个可辩护的英语译文共存。我们使用三位公认的人类译者(Bhikkhu Sujato、Thanissaro Bhikkhu和Bhikkhu Bodhi)的译文作为局部参考包络(而非单一黄金标准),审计了四种旗舰大语言模型(GPT-5.5、Claude Sonnet 4.6、Gemini 3.1 Pro和Grok 4.3)在巴利语经典1700个段落上的巴利语到英语输出。每个候选译文的归一化嵌入漂移(相对于参考质心)作为分诊信号,而非错误标签;然后,对漂移阈值超过1.5的1203个候选译文,由盲审的三模型LLM评审团进行裁决,并针对300个实例的作者裁决验证集进行校准。两个结果突出:第一,漂移预测的是严重程度而非错误本身:在裁决的高漂移候选译文中,主要错误率从1.5-2.0区间的7.9%单调上升至3.0以上的51.6%,而约80%的1.5-2.0异常值被判定为有效的翻译变体。第二,模型差异在高漂移尾部最为明显:GPT-5.5的裁决高漂移主要错误率最低,其置信区间与Claude Sonnet 4.6和Gemini 3.1 Pro重叠;Grok 4.3的异常值数量最大,且尾部主要错误率最高(总体27.6%,漂移3.0以上74.4%)。主要错误类别(如省略或截断、教义术语错误)正是最可能误导教义文本读者的失败类型。贡献在于提供了一种可复用的古典到现代翻译审计设计:从多个人类译者定义局部参考包络,使用嵌入漂移优先审查,并对标记的尾部进行裁决,而非将异常状态视为错误。

英文摘要

Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each candidate's normalized embedding drift from the reference centroid serves as a triage signal, not an error label; the 1,203 candidates above a 1.5 drift threshold are then adjudicated by a blinded three-model LLM judge panel, calibrated against a 300-instance author-adjudicated validation set. Two results stand out. First, drift predicts severity rather than error per se: the major-error rate among adjudicated high-drift candidates rose monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while approximately 80% of 1.5-2.0 outliers were judged valid translation variations. Second, model differences were clearest in the high-drift tail: GPT-5.5 had the lowest adjudicated high-drift major-error rate, with confidence intervals overlapping those of Claude Sonnet 4.6 and Gemini 3.1 Pro; Grok 4.3 had both the largest outlier volume and the highest tail major-error rate (27.6% overall, 74.4% above drift 3.0). The dominant major-error categories (e.g. omission or truncation, doctrinal term errors) are precisely the failures most likely to mislead readers of doctrinal text. The contribution is a reusable audit design for classical-to-modern translation: define a local reference envelope from multiple human translators, use embedding drift to prioritize review, and adjudicate the flagged tail rather than treating outlier status as error.

2606.01109 2026-06-02 cs.DL cs.CL 版本更新

Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities

挖掘引文:FOSSIL——法律与人文学科参考文献提取的数据集与工作流

Luca Foppiano, Christian Boulanger

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 针对法律和人文学科脚注中参考文献提取的挑战,本文提出FOSSIL数据集、PDF-TEI编辑器、标注工作流和Grobid特化模块,将提取质量从0.36提升至0.72(micro-F1)。

Comments This is an extended abstract, peer-reviewed and presented at CiteX2026 https://sites.google.com/view/workshop-on-citation-extractio/startseite

详情
AI中文摘要

引文提取工具是为自然科学中结构化的文末参考文献设计的,但法律和人文学科的学术引用主要出现在脚注中,其中书目数据与评论和交叉引用交织在一起,并且在不同语言和风格之间差异很大。为了解决缺乏合适的黄金标准资源的问题,我们提出了FOSSIL(基于脚注的开放获取社会科学与人文学科科学实例标签),这是一个开放许可的多语言数据集,包含96篇带注释的学术文章,超过7,600个嵌入脚注的参考文献,以及PDF-TEI编辑器(一个协作式网络注释工具)、一个文档化的七人标注工作流和一个用于基于脚注的引文的Grobid特化模块。在端到端评估中,特化流程将提取质量比默认Grobid提高了近一倍(micro-F1从0.36提升到0.72),这主要归功于召回率的提高,同时表明交叉引用和混合内容脚注仍有很大的改进空间。本扩展摘要介绍了正在进行的工作;引文分割和解析的注释以及交叉引用解析正在进行中。

英文摘要

Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600 footnote-embedded references, together with PDF-TEI Editor (a collaborative web annotation tool), a documented seven-annotator workflow, and a Grobid specialization for footnote-based citations. In end-to-end evaluation, the specialized pipeline nearly doubles extraction quality over default Grobid (micro-F1 from 0.36 to 0.72), driven largely by improved recall, while showing that substantial headroom remains for cross-references and mixed-content footnotes. This extended abstract presents work in progress; annotations of citations segmentation and parsing, and cross-reference resolution are ongoing.

2606.01099 2026-06-02 cs.CL cs.AI 版本更新

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

MiCU: 基于大语言模型的端到端智能家居指令理解

Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) Xiaomi Corporation(小米公司) Institute for Math & AI, Wuhan University(武汉大学数学与人工智能研究院)

AI总结 提出MiCU,一种利用课程学习、强化学习和令牌压缩技术的领域特定大语言模型,用于解决智能家居中模糊指令理解问题,平均准确率提升20.01%。

详情
AI中文摘要

智能家居生态系统中的指令理解系统可以自动化设备控制并显著改善用户体验。然而,尽管它们在精确表述(例如“打开卧室灯”)上表现良好,但在处理模糊或不一致的指令(例如“让卧室变得舒适”)时却存在困难。大语言模型(LLM)在各种领域都能很好地泛化,并且在此类任务上可以超越传统的基于规则的系统,但其有效性通常受到领域特定数据稀缺、任务特定适应性不足以及高计算成本的限制。在本文中,我们提出了一种利用用户日志和LLM的自动化训练数据合成工作流程;然后构建了MiCU,一个在指令理解方面表现出色的领域特定LLM。具体来说,我们采用课程学习将领域知识注入基础LLM,然后通过冷启动训练结合领域特定思维规则引导的强化学习(RL)来增强其推理能力。此外,我们引入了一种令牌压缩技术,将设备描述压缩为单个特殊令牌,从而显著降低推理开销,并实现了\model-fast,一种针对长输入优化的高效变体。大量实验表明,MiCU显著优于基线,在所有设备类别上平均准确率提升20.01%。我们已在小米家应用中部署了MiCU,每天接收约170万页面浏览量。生产评估显示,MiCU将用户纠正率降低了1.57%,并将人工审核准确率提高了32.05%。我们的数据和代码可在https://github.com/xiaomi-research/iot_spec_llm获取。

英文摘要

Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling \model-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm

2606.01091 2026-06-02 cs.CL 版本更新

Deep Research as Rubric for Reinforcement Learning

深度研究作为强化学习的评估准则

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang

发表机构 * Fudan University(复旦大学) Xiaohongshu Inc.(小红书公司) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出DR-rubric框架,通过两阶段过程(迭代多轮智能体搜索构建准则,然后蒸馏为可验证约束用于GRPO策略优化)为开放式推理和长文本生成任务生成细粒度奖励信号,实验表明该方法在多个基准上表现优异。

详情
AI中文摘要

开放式推理和长文本生成任务缺乏可靠的自动验证信号用于基于奖励的策略优化。评估准则提供了一种有前景的替代方案,但现有方法将其视为给定的产物——要么手工制作,要么由提示生成——并且常常忽略最关键的、任务特定的、知识密集的维度,从而扭曲了奖励信号。我们的关键观察是,准则构建本身就是一个研究问题:识别什么使回答正确或富有洞察力需要发现和综合外部知识。我们提出了深度研究作为评估准则(DR-rubric),一个用于构建此类准则的两阶段框架。第一阶段通过迭代多轮智能体搜索引出领域事实、结构约束和失败模式;第二阶段将这些证据蒸馏为原子化的、可独立验证的约束,用于基于GRPO的策略优化。由于正在训练的模型可以作为其自身的准则生成器,DR-rubric-8B支持无需前沿模型辅助的自举准则生成。我们在涵盖智能体研究和专家推理的6个基准上进行了评估。实验表明,DR-Rubric仅使用1K-3K训练实例就取得了强劲的竞争性能,其中GPT-5生成的准则特别有利于智能体任务的广度覆盖,Gemini生成的准则在智能体和专家推理任务上取得了最平衡的性能,而自举准则表现出从专业化到再平衡的演变,在第三次迭代时达到最佳整体性能。结果表明,将准则构建从静态评估模板重新定义为证据驱动的研究过程,可以为开放式任务产生更可扩展、更细粒度的奖励信号。

英文摘要

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts -- either hand-crafted or prompt-generated -- and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-Rubric achieves strong competitive performance with only 1K -- 3K training instances, where GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution achieving the best overall performance at the third iteration. Results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.

2606.01074 2026-06-02 cs.CL 版本更新

When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression

何时0.1%就足够了?分析降维和量化对文本嵌入压缩的联合效应

Riku Kisako, Hayato Tsukagoshi, Ryohei Sasano

发表机构 * Graduate School of Informatics, Nagoya University(名古屋大学信息学研究科)

AI总结 本文系统研究了结合降维和量化方法压缩文本嵌入的效果,发现联合压缩可大幅减小嵌入尺寸(低至原始大小的0.1%)而几乎不损失性能,且最优策略因任务而异。

详情
AI中文摘要

最近的高性能文本嵌入模型通常输出高维实值向量,导致巨大的存储和计算成本。为了解决这个问题,提出了基于降维或量化的压缩方法;然而,降维和量化结合的效果尚未得到充分研究。在本文中,我们使用四个MTEB任务族和四个预训练嵌入模型,系统地研究了结合降维和量化压缩文本嵌入的有效性。实验结果表明,结合降维和量化比单独使用任何一种方法都能实现更强的压缩,在某些设置下嵌入可以缩减到原始大小的0.1%而几乎没有性能下降,并且最优压缩策略取决于任务。

英文摘要

Recent high-performing text embedding models often output high-dimensional real-valued vectors, resulting in substantial storage and computational costs. To address this issue, compression methods based on dimensionality reduction or quantization have been proposed; however, the effects of combining dimensionality reduction and quantization have not been sufficiently investigated. In this paper, we systematically examine the effectiveness of compressing text embeddings by combining dimensionality reduction and quantization, using four MTEB task families and four pretrained embedding models. The experimental results demonstrate that combining dimensionality reduction and quantization enables substantially stronger compression than using either method alone, that in some settings embeddings can be reduced to as little as 0.1% of their original size with almost no performance degradation, and that the optimal compression strategy depends on the task.

2606.01049 2026-06-02 cs.CL 版本更新

PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining

PMC-InterCPT:重新思考用于多模态持续预训练的医学交错数据

Guanghao Zhu, Zeyu Liu, Zhitian Hou, Pengkai Wang, Zhijie Sang, Minheng Ni, Wenjun Wang, Yanggan Gu, Shuo Cai, Congkai Xie, Jianmin Wu, Hongxia Yang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Sun Yat-sen University(中山大学) InfiX.ai PolyU-Daya Bay Technology and Innovation Research Institute(PolyU-大亚湾技术与创新研究院)

AI总结 针对医学多模态持续预训练中图像-文本对数据存在的上下文缺失、结构噪声等问题,提出PMC-InterCPT交错语料库,通过恢复缺失标题、清洗文本、重建交错样本及LLM监督过滤,结合模态感知重采样,有效提升医学与通用多模态性能。

详情
AI中文摘要

从科学文献中提取的大规模生物医学图像-文本数据集为医学多模态模型训练提供了宝贵资源。这些数据集通常组织为图像-标题对;然而,图像标题往往简短、依赖上下文,且在没有周围文章文本的情况下仅部分信息。同时,大规模自动提取引入了结构噪声,如缺失标题、残留标记、重复上下文和不连贯的多段落图像描述。我们重新审视医学多模态持续预训练(CPT)的数据构建,并提出PMC-InterCPT,一个基于上下文的生物医学交错语料库,除了标题外还包含引用图像的正文文本。我们的流程恢复缺失标题,清洗标题和上下文文本,重建连贯的交错图像-文本样本,并应用LLM监督的医学相关性和质量分类器来过滤噪声记录。我们进一步揭示了结果语料库中强烈的模态不平衡,并引入了一个四桶证据分类法用于模态感知重采样。通过在Qwen3.5-4B-Base上进行CPT后接监督微调(SFT),PMC-InterCPT有效提升了医学和通用多模态性能,同时使用的CPT令牌少于原始源池。实验结果还说明了数据质量和模态对医学多模态CPT的互补性。

英文摘要

Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure captions are often short, context-dependent, and only partially informative without the surrounding article text. At the same time, large-scale automatic extraction introduces structural noise such as missing captions, residual markup, duplicated context, and incoherent multi-paragraph figure descriptions. We revisit data construction for medical multimodal continued pretraining (CPT) and present PMC-InterCPT, a context-grounded biomedical interleaved corpus that incorporates figure-referencing body text in addition to captions. Our pipeline recovers missing captions, cleans caption and context text, reconstructs coherent interleaved image-text samples, and applies LLM-supervised medical relevance and quality classifiers to filter noisy records. We further reveal strong modality imbalance in the resulting corpus and introduce a four-bucket evidence taxonomy for modality-aware resampling. Through CPT followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, PMC-InterCPT effectively improves medical and general multimodal performance while using fewer CPT tokens than the raw source pool. The experimental results also illustrate the complementarity between the data quality and modality for medical multimodal CPT.

2606.01045 2026-06-02 cs.CL 版本更新

Child-directed speech facilitates production, not comprehension, in BabyLMs

儿童导向语言促进BabyLMs的生成而非理解

Bastian Bunzeck, Sina Zarrieß

发表机构 * Computational Linguistics, Department of Linguistics(计算语言学系,语言学系)

AI总结 本研究通过框架补全任务评估儿童导向语言(CDS)对BabyLMs生成能力的影响,发现CDS训练的模型在生成语法补全方面优于网络数据训练的模型,而理解基准测试低估了CDS的贡献。

Comments Accepted at CoNLL 2026

详情
AI中文摘要

近期研究表明,儿童导向语言(CDS)不利于BabyLMs的语言学习。然而,当前的评估主要关注理解而非生成,而生成是基于用法的语言习得理论的核心,该理论认为CDS通过构式“框架”(具有开放槽位的频繁词汇模式)促进早期语言使用。我们受这些理论启发,引入了一种新颖的基于生成的评估方法——框架补全任务,并比较了使用CDS、BabyLM语料库和网络爬取数据(FineWeb-edu)训练的Llama模型在理解基准测试和我们新框架上的表现。我们的结果揭示了模型的理解与生成能力之间存在明显的分离:虽然FineWeb训练的模型在最小对测试中表现优异,但CDS训练的模型在训练早期就能生成语法正确的补全,并将概率质量集中在合适的槽位填充词上。这些发现表明,理解基准测试低估了CDS对BabyLMs的贡献。

英文摘要

Recent studies suggest that child-directed speech is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension and not production, which is central to usage-based theories of language acquisition which argue how CDS facilitates early language use through constructional ''frames'' (frequent lexical patterns with open slots). We introduce a novel generation-based evaluation inspired by such theories in form of a frame-completion task, and compare Llama models trained with CDS, the BabyLM corpus, and web-crawl data (FineWeb-edu) on comprehension benchmarks and our novel framework. Our results reveal a clear dissociation between models' comprehension and production capabilities: while FineWeb-trained models excel at minimal pairs, CDS-trained models produce grammatical completions substantially earlier in training and concentrate probability mass on appropriate slot-fillers. These findings show that comprehension benchmarks underestimate what CDS affords to BabyLMs.

2606.01041 2026-06-02 cs.CL 版本更新

ExpWeaver: LLM Agents Learn from Experience via Latent RAG

ExpWeaver: LLM智能体通过潜在RAG从经验中学习

Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ExpWeaver框架,利用潜在检索增强生成(无独立RAG模块)使LLM智能体从经验中学习,通过隐藏状态编码经验、潜在空间检索和交叉注意力聚合,在13个任务中12个达到最优,token效率高且跨域泛化强。

详情
AI中文摘要

经验学习通过将过去的交互整合为可重用知识,在增强LLM智能体规划和推理方面取得了有希望的结果。然而,现有方法仍局限于显式文本空间,通过语义相似性检索经验并将其拼接到上下文窗口中,导致大量token开销以及将检索与生成分离的解耦架构。为了解决这些限制,我们提出了ExpWeaver,一个使LLM智能体能够通过潜在检索增强生成从经验中学习的框架,无需单独的RAG模块。ExpWeaver使用LLM自身的隐藏状态编码经验,在每个解码步骤直接在潜在空间中检索相关经验,并通过交叉注意力聚合和门控残差机制进行整合。整个流程通过强化学习进行端到端优化,支持生成和排序任务。我们在涵盖问答、推理、编程、科学预测和推荐的13个不同任务上评估了ExpWeaver。结果表明,ExpWeaver在13个任务中的12个上达到了最先进的性能,比最强基线高出6.8%以上;保持了与非检索基线相当的token效率,而基于文本的检索方法需要1.5到2倍的token;并展现出卓越的跨域泛化能力,在零样本迁移下比最强基线高出16.32%,在少样本迁移下高出15.21%。我们的ExpWeaver代码已在https://github.com/ulab-uiuc/ExpWeaver发布。

英文摘要

Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose ExpWeaver, a framework that enables LLM agents to learn from experience via latent retrieval-augmented generation, without requiring a separate RAG module. ExpWeaver encodes experiences using the LLM's own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross-attention aggregation and gated residual mechanisms. The entire pipeline is optimized end-to-end with reinforcement learning, supporting both generative and ranking tasks. We evaluate ExpWeaver on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that ExpWeaver achieves state-of-the-art performance on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; maintains token efficiency comparable to non-retrieval baselines while text-based retrieval methods require 1.5 to 2 times more tokens; and exhibits superior cross-domain generalization, outperforming the strongest baseline by 16.32% under zero-shot transfer and 15.21% under few-shot transfer. Our code for ExpWeaver is released at https://github.com/ulab-uiuc/ExpWeaver.

2606.01034 2026-06-02 cs.CL stat.ME 版本更新

A Finite-Calibration Regime Map for LLM Judge Panels

有限校准机制图:LLM评审团面板

Bin Zhu, Yanghui Rao

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院)

AI总结 研究在有限人工标注预算下,低维堆叠器与联合输出表对LLM评审团面板的校准权衡,提出有限校准面板选择方法,实验表明多数评审输出可加或冗余。

Comments Work in Progress

详情
AI中文摘要

我们研究了在有限人工标注预算下,LLM评审团面板应何时使用低维堆叠器与联合输出表进行校准。低维堆叠器估计成本小但忽略交互,而联合表校准器可表示交互但需为单元格计数和未见模式付出代价。我们将此权衡构建为有限校准机制图,并实例化为有限校准面板选择——一种可部署的验证选择器,涵盖评审路径、前缀大小和聚合器家族,并辅以表格和参数估计诊断。在RewardBench、LLMBar、SummEval和Arena100K上,使用包含DeepSeek V4 Flash的七评审池,标量/可靠性聚合在20个真实数据集-预算单元中赢得16个,表明当前评审输出通常是可加或冗余的。受控的校准增长数据显示互补机制:可加标签仍偏好标量,而六路交互选择更大的联合表,其测试MSE从未见质量消失前的0.224降至0.061。因此,实际问题不是“需要多少评审?”,而是下一个评审的信息在可用人工标注下是否可估计。

英文摘要

We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset--budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not ``how many judges?'' but whether the next judge's information is estimable under the available human labels.

2606.01026 2026-06-02 cs.CL 版本更新

Revise, Don't Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models

修正而非冻结:面向自修正掩码扩散语言模型的采样器匹配训练

Longxuan Yu, Shaorong Zhang, Yu Fu, Hui Liu, Yue Dong, Greg Ver Steeg

发表机构 * University of California, Riverside(加州大学河滨分校) Microsoft(微软)

AI总结 针对掩码扩散语言模型在去噪过程中未利用可见标记修正能力的问题,提出无需额外模块的采样器D3IM和轻量级后训练方法SCOPE,显著提升数学和代码生成性能。

Comments 8 pages, 2 figures, 10 tables

详情
AI中文摘要

掩码扩散语言模型(MDLMs)在每个去噪步骤重新预测每个位置,但标准采样器一旦揭示标记就将其固定,导致这种修正能力未被使用。现有方法要么添加启发式或学习机制来修正已提交的标记,要么在重新预测前将其重新掩码为[MASK];一种无需辅助模块、直接修正可见标记的原则性采样器仍未被充分探索。我们提出了D3IM,一种无参数采样器,作为校正器风格的反向更新推导而来,允许无需额外模块或辅助传递的直接可见到可见修正。D3IM还揭示了一个我们称为保留偏差的模型侧障碍:模型倾向于重现自身错误的已提交标记而非修正它们。我们通过SCOPE(基于预测误差的自条件化)解决这一问题,这是一种轻量级的后训练过程,模拟D3IM的采样过程。在LLaDA-8B上使用64个去噪步骤时,SCOPE+D3IM相比原始LLaDA-8B标准去掩码在GSM8K上提升+13.0(68.3%),在MATH-500上提升+4.8(23.6%),在HumanEval上提升+15.3(29.3%),在MBPP上提升+10.4(30.8%),且在数学和HumanEval上随着去噪步骤增加,提升幅度更大。

英文摘要

Masked diffusion language models (MDLMs) re-predict every position at each denoising step, but standard samplers commit tokens once revealed, leaving this revision capability unused. Existing approaches either add heuristic or learned mechanisms to revise committed tokens, or remask them back to [MASK] before re-predicting; a principled sampler that directly revises visible tokens without auxiliary modules remains underexplored. We introduce D3IM, a parameter-free sampler derived as a corrector-style reverse update that permits direct visible-to-visible revision without additional modules or auxiliary passes. D3IM also reveals a model-side obstacle we term preservation bias: the model tends to reproduce its own wrong committed tokens rather than correct them. We address this with SCOPE (Self-Conditioned On Prediction Errors), a lightweight post-training procedure that simulates D3IM's sampling process. On LLaDA-8B at 64 denoising steps, SCOPE+D3IM improves over the original LLaDA-8B with standard unmasking by +13.0 on GSM8K (68.3%), +4.8 on MATH-500 (23.6%), +15.3 on HumanEval (29.3%), and +10.4 on MBPP (30.8%), with gains that increase as more denoising steps are used on math and HumanEval.

2606.01024 2026-06-02 cs.CL cs.AI 版本更新

DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

DSL-LLaDA: 将连续去噪扩展到8B掩码扩散语言模型

Longxuan Yu, Yunshu Wu, Yu Fu, Siheng Xiong, Rob Brekelmans, Hui Liu, Yue Dong, Greg Ver Steeg

发表机构 * University of California, Riverside(加州大学河滨分校) Georgia Institute of Technology(佐治亚理工学院) Microsoft(微软)

AI总结 通过离散随机定位(DSL)将预训练掩码扩散语言模型(LLaDA-8B-Instruct)轻量适配为支持连续嵌入空间去噪,在低步数下实现高质量摘要生成并避免长度-质量权衡。

Comments 8 pages, 4 figures, 28 tables

详情
AI中文摘要

离散掩码扩散语言模型通过迭代并行解码生成文本,但少步解码面临长度与质量之间的权衡:在固定步数预算下,标准方法可以生成短而高质量的输出,或者产生长但重复的文本。连续去噪可以通过在嵌入空间中联合演化所有位置来规避这种权衡,但从头开始构建这样的模型仍是一个开放问题。我们证明,预训练的掩码DLM可以轻量适配以支持连续嵌入空间去噪。从LLaDA-8B-Instruct开始,我们仅用1,000步进行离散随机定位(DSL)的继续预训练,将二元掩码替换为连续的逐token高斯噪声作为软掩码。适配后的模型支持连续推理,在嵌入空间中联合演化所有位置,并将硬token承诺推迟到最后一步。在低步数预算(<=16次前向传播)下的零样本摘要任务中,DSL-LLaDA-SDE在所有四个基准上取得了最佳ROUGE-1,并很大程度上避免了迭代去掩码的过早终止/重复权衡。同样的适配还产生了选择性噪声状态鲁棒性:模型在保留干净token的同时纠正损坏的token。使用相同计算量的标准掩码扩散训练对照实验未表现出这两种行为。

英文摘要

Discrete Masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff between length and quality: with a fixed step budget, standard methods can generate a short, high-quality output, or they can produce long but repetitive text. Continuous denoising can sidestep this tradeoff by evolving all positions jointly in embedding space, but building such a model from scratch at scale remains an open problem. We show that a pretrained masked DLM can instead be lightly adapted to support continuous embedding-space denoising. Starting from LLaDA-8B-Instruct, we continue-pretrain for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masking with continuous per-token Gaussian noise as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets (<=16 forward passes), DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination / repetition tradeoff of iterative unmasking. The same adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute demonstrate neither behavior.

2606.01019 2026-06-02 cs.CL cs.AI 版本更新

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

混合验证解码:在推测解码中学习分配验证

Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah, Sebastian Rogawski, Pawel Morkisz, Anahita Bhiwandiwalla, Phillip Howard

发表机构 * Thoughtworks Nvidia

AI总结 提出混合验证解码方法,通过预测缓存草稿的接受长度并在缓存验证与模型草稿之间动态选择,在代理工作流中平均加速2.73倍。

详情
AI中文摘要

大型语言模型(LLM)生成仍然昂贵,因为自回归解码每生成一个新token就调用一次模型。推测解码通过草拟多个token并用目标模型一步验证来降低成本,但其加速取决于接受的草稿token数量。无参数草稿源可以在结构化和代理工作负载中以低成本提出长续写,但一个生成步骤中看起来有前景的缓存匹配可能在下一步收益很低。我们提出混合验证解码,在验证前预测缓存草稿的接受长度,并使用该收益估计在缓存验证和基于模型的草稿器之间进行选择。在三个LLM和十六个数据集上,混合验证解码在代理工作流中特别有效,在每个设置中均优于EAGLE3,平均加速2.73倍。我们的分析揭示了提示结构如何创造缓存机会,高收益缓存草稿如何集中在草稿空间的一小部分,以及收益引导的选择如何减少顺序解码工作,指向运行时草稿选择作为推测解码的一个有前景的方向。

英文摘要

Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target model in one step, but its speedup depends on how many drafted tokens are accepted. Parameter-free draft sources can propose long continuations at low cost in structured and agentic workloads, yet a cache match that looks promising at one generation step may have low payoff at the next. We propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses this payoff estimate to choose between cache verification and a model-based drafter. Across three LLMs and sixteen datasets, Hybrid Verified Decoding is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73x average speedup. Our analysis shows how prompt structure creates cache opportunities, how high-payoff cache drafts concentrate in a small part of the draft space, and how payoff-guided selection reduces sequential decoding work, pointing to runtime draft selection as a promising direction for speculative decoding.

2606.01016 2026-06-02 cs.CL cs.AI eess.AS 版本更新

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

PolySpeech-100:面向100多种语言和方言的大规模语音理解基准

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He

发表机构 * Shenzhen International Graduate School, Tsinghua University(深圳国际研究生院,清华大学) Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) JD AI Research(京东人工智能研究院)

AI总结 为解决现有语音评估基准在资源丰富语言偏向、缺乏语义推理和忽视方言的问题,提出PolySpeech-100基准,通过混合构建管道覆盖110种语言变体,并评估22个模型,发现开源端到端模型在重方言上优于级联系统,而思维链提示在零样本设置下会降低性能。

Comments 19 pages, 13 figures, KDD 2026

详情
AI中文摘要

虽然端到端(E2E)语音大语言模型(Speech-LLMs)正在快速发展,但它们的评估方法仍局限于简单转录的时代。现有基准存在三个关键限制:明显偏向高资源语言、关注低级识别(ASR)而非语义推理,以及忽视区域方言。为弥补这一差距,我们引入了PolySpeech-100,这是一个大规模基准,旨在评估110种语言变体上的“母语级”语音理解。我们采用了一种新颖的混合构建管道,将黄金标准的人类录音与指令驱动的合成语音相结合,从而覆盖了19种不同的中文方言和80多种低资源语言。对22个最先进模型(包括Gemini-3、GPT-Audio和Qwen2.5-Omni)的广泛评估得出了关键见解。首先,我们证明开源端到端模型在重方言上优于级联(ASR+LLM)系统,证明直接音频处理保留了标准转录中经常丢失的关键副语言线索和韵律特征(例如语调、重音)。其次,我们揭示了一个显著的性能差距:虽然商业模型保持稳健,但开源模型在低资源语言上遭受灾难性退化。最后,反直觉的是,我们观察到在标准零样本设置下,思维链提示经常降低大多数评估模型的语音理解性能,揭示了当前架构中潜在的多模态对齐差距。PolySpeech-100为下一代包容性、全能的语音LLM建立了严格标准。数据、演示和代码公开于https://github.com/YoungSeng/PolySpeech-100。

英文摘要

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

2606.01000 2026-06-02 cs.LG cs.CL 版本更新

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

信任函数:通过学习何时信任弱教师实现近乎无损的弱到强泛化

Arda Uzunoglu, Alvin Zhang, Daniel Khashabi

发表机构 * University of Washington(华盛顿大学)

AI总结 提出信任函数为弱标签分配信任分数并据此过滤弱监督,在多个领域实现近乎无损的弱到强泛化,且能通过迭代链放大收益。

Comments ICML 2026

详情
AI中文摘要

弱到强泛化研究在可靠标签稀缺时,如何利用较弱教师的监督来提升强学生。我们主要将其视为数据选择问题,关键挑战是识别哪些弱标签足够可靠以作为训练信号。为此,我们引入信任函数,为每个弱标签分配一个标量信任分数,并使用这些分数过滤弱监督。在包括世界知识、定量推理和策略游戏在内的多个领域,信任过滤产生的学生匹配甚至有时超越真实监督,实现了近乎无损的弱到强泛化。此外,信任函数能够实现迭代的弱到强链,通过训练学生并将其重用为下一个教师来累积收益,从而放大收益。信任函数的优势可归因于多种机制。

英文摘要

Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher, amplifying the gains. There are several mechanisms to which advantage of trust functions can be attributed.

2606.00997 2026-06-02 cs.CL 版本更新

Decoding in Order-Agnostic Language Models: Chain-Rule Deviation and Uniform Spreading

顺序无关语言模型中的解码:链式法则偏差与均匀扩散

Lin Yao

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) Zhongguancun Academy(中关村学院)

AI总结 本文研究顺序无关语言模型(OALM)中揭示顺序对似然的影响,提出基于置信度方差的诊断方法,并证明均匀扩散定理以优化解码路径。

详情
AI中文摘要

顺序无关语言模型(OALM),包括离散扩散语言模型(dLLM),被训练用于在任意条件集下预测掩码标记,从而允许在推理时以任意揭示顺序生成或评分序列。在LLaDA-2.1中,我们报告了三个发现。首先,学习到的条件概率并不是一个连贯联合分布的精确分解:仅改变揭示顺序就会使目标对数似然偏移高达0.49 nats/标记,因此仅凭似然就混合了内容难度和路径依赖的伪影。其次,尽管置信度优先(CF)解码是顺序无关的,但其在内容标记上的揭示顺序接近从左到右(L2R)。第三,我们提出了一种基于置信度轨迹形状的补充诊断方法。一个均匀扩散定理表明,在固定总似然下,当每一步的置信度均匀扩散时,目标可恢复性最大化;由此产生的偏差促使我们使用$\mathrm{Var}(\log q_t)$作为比较解码路径的诊断指标。在C4和四个下游基准测试中,低方差将结构化路径与随机排序区分开来,并且方差与下游正确性一致相关。这些结果支持在比较OALM解码路径时联合报告平均置信度和置信度方差。

英文摘要

Order-agnostic language models (OALMs), including discrete diffusion language models (dLLMs), are trained to predict masked tokens under arbitrary conditioning sets, allowing sequences to be generated or scored under arbitrary reveal orders at inference time. In LLaDA-2.1, we report three findings. First, the learned conditionals are not exact factorizations of a coherent joint distribution: changing only the reveal order shifts target log-likelihood by up to 0.49 nats/token, so likelihood alone mixes content difficulty with path-dependent artifacts. Second, although confidence-first (CF) decoding is order-agnostic, its reveal orders are close to left-to-right (L2R) on content tokens. Third, we propose a complementary diagnostic based on the shape of the confidence trace. A uniform-spreading theorem shows that, at fixed total likelihood, target recoverability is maximized when per-step confidence is spread uniformly; the resulting deviation motivates $\mathrm{Var}(\log q_t)$ as a diagnostic for comparing decoding paths. Across C4 and four downstream benchmarks, low variance separates structured paths from random ordering, and variance is consistently associated with downstream correctness. These results support reporting mean confidence and confidence variance jointly when comparing OALM decoding paths.

2606.00994 2026-06-02 cs.CL 版本更新

A Registry-Bound LLM Pipeline for Evidence-Grounded Trait Extraction across Tropical Plants, Aquatic Species, and Exotic Pets

面向热带植物、水生生物和外来宠物的证据驱动性状提取的注册约束LLM流水线

Jeff Wang

发表机构 * NEXLY LLC, United States(NEXLY LLC,美国)

AI总结 提出一种注册约束的大语言模型流水线,通过四种机制(受控词汇注册表、逐行证据引用、置信度标签、多版本保存)从热带物种百科全书中大规模提取证据驱动的结构化性状记录,在409,820个物种上实现99.985%的覆盖率和高置信度。

Comments 33 pages, 6 figures; methodology paper

详情
AI中文摘要

我们描述了一种注册约束的大语言模型提取流水线,能够在栽培热带植物、水生和宠物物种上大规模生成证据驱动的结构化性状记录。四种机制使LLM导出的行可审计:一个版本化的39键闭词汇性状注册表,将每个接受的值约束到类型化模式;每行逐字证据引用,将每个值绑定到源文本;每行置信度标签(高或中;低置信度在持久化前丢弃);以及多版本保存。应用于热带物种百科全书中409,880个可发表物种,该流水线执行了706,220次运行,并在409,820个物种(99.985%)中持久化了5,489,881条性状记录,其中81.57%为高置信度。我们报告了三个验证层级,按证据强度递减:在全种群中,5,427,588条含证据的行中有90.12%的引用是源文本的逐字子串(排除一个合规性元性状后为93.49%);在n=100的分层非红区行上进行的引用支持值审计结果为100/100(下限96.30%);在n=50的红区行上进行的表面有效性审计结果为50/50接受(下限92.86%)。我们不声称每条记录的正确性;100%有待人工审核。贡献在于四种机制框架。

英文摘要

We describe a registry-bound large-language-model extraction pipeline producing evidence-grounded structured trait records at scale, on cultivated tropical plant, aquatic, and pet species. Four mechanisms render LLM-derived rows auditable: a versioned 39-key closed-vocabulary trait registry constraining every admitted value to a typed schema; a per-row verbatim evidence quote tying each value to source text; a per-row confidence label (high or medium; low dropped pre-persist); and multi-version preservation. Applied to 409,880 publishable species from the Tropical Species Encyclopedia, the pipeline executed 706,220 runs and persisted 5,489,881 trait records across 409,820 species (99.985%), 81.57% at high confidence. We report three validation layers in descending evidentiary strength: at full population, 90.12% of 5,427,588 evidence-bearing rows have their quote as a verbatim source substring (93.49% excluding one compliance meta-trait); a quote-supports-value audit on n=100 stratified non-red-zone rows yielded 100/100 (lower bound 96.30%); face-validity on n=50 red-zone rows yielded 50/50 Accept (lower bound 92.86%). Per-record correctness is not claimed; 100% pending human curation. The contribution is the four-mechanism framework.

2606.00981 2026-06-02 cs.CL 版本更新

Robust Asynchronous Planning via Auto-Formalization

通过自动形式化实现鲁棒的异步规划

Jiayi Zhang, Jianing Yin, Ben Zhou, Li Zhang

发表机构 * Drexel University(德雷塞尔大学) Arizona State University(亚利桑那州立大学)

AI总结 针对异步规划中并发、非均匀时长和执行时间约束的挑战,提出自动形式化方法,通过CP-SAT形式化器在依赖图规模从5到100动作时保持高准确率,并引入状态感知修复策略应对执行时约束更新。

详情
AI中文摘要

LLMs可以通过直接生成动作序列作为规划器,或将任务翻译成领域特定语言供外部求解器作为形式化器来进行规划。虽然大多数现实任务具有非均匀时长、并发和执行时间约束的异步特性,但现有基准测试很少涵盖这些。我们将这些异步规划挑战统一到一个公式下,并引入了三个分别大规模解决这些挑战的基准测试。我们得出结论:形式化表示的选择主要决定了规划是否可扩展:当依赖图从5个动作增长到100个动作时,规划器的规划准确率从96%下降到5%,PDDL2.1形式化器从13%下降到0%,而CP-SAT形式化器平均为94%,在100个动作时仍达到83%。忠实性诊断表明,当LLMs必须保持谓词、效果和目标一致时,PDDL2.1基于谓词的规划表示变得脆弱,而通用约束满足程序则不然。执行时更新规划约束进一步严重降低了性能(规划器23.9%,PDDL2.1 0.7%,CP-SAT 46.1%),但一种仅更新事件诱导约束的状态感知修复策略将CP-SAT形式化器恢复至84.5%。

英文摘要

LLMs can plan by either generating action sequences directly as a Planner or translating tasks into domain specific language for an external solver as a Formalizer. While most real-world tasks are asynchronous with non-uniform durations, concurrency, and execution-time constraints, existing benchmarks hardly cover them. We unify these asynchronous planning challenges under a single formulation and introduce the first three benchmarks that address each at scale. We conclude that the choice of formal representation primarily determines whether planning scales: as dependency graphs grow from 5 to 100 actions, Planner collapses from 96% to 5% plan accuracy and PDDL2.1 Formalizer from 13% to 0%, while CP-SAT Formalizer averages 94% and still achieves 83% at 100 actions. Faithfulness diagnostics show that PDDL2.1's predicate-based planning representation becomes brittle compared to general constraint satisfaction programs, when LLMs must keep predicates, effects, and goals consistent. Execution-time updates of planning constraints further degrade performance sharply (Planner 23.9%, PDDL2.1 0.7%, CP-SAT 46.1%), but a state-aware repair strategy that updates only event-induced constraints recovers CP-SAT Formalizer to 84.5%.

2606.00975 2026-06-02 cs.CL 版本更新

Lost in Delusion: Examining LLM Safety Under User Delusions and Distress

迷失在妄想中:审视用户妄想与痛苦下的LLM安全性

Andrew Aquilina, Chetna Nihalani, Vasudha Varadarajan, Nathan S. Fishbein, Yu-Ru Lin, Maarten Sap

发表机构 * University of Pittsburgh(匹兹堡大学) Carnegie Mellon University(卡内基梅隆大学) Fordham University(福特汉姆大学)

AI总结 本研究通过多轮对话模拟,发现LLM在检测用户痛苦时表现良好,但在痛苦嵌入妄想时干预行为显著减少(高达4.5倍),并提出针对性提示策略以缩小这一差距。

详情
AI中文摘要

LLM聊天机器人日益成为心理困扰人群(包括那些困扰与妄想信念交织的人)的首要支持来源。先前关于LLM心理健康安全性的研究主要评估一般治疗质量或单轮危机检测,未明确模型在持续对话中痛苦与妄想交织时的行为。我们通过匹配的多轮模拟(基于临床角色和六个LLM)填补了这一空白,将每次妄想对话与仅痛苦对照配对,以隔离妄想框架的影响。这揭示了一个识别-干预差距:模型无论框架如何都能以相当比率检测痛苦,但一旦痛苦嵌入妄想,模型便严重未能采取行动,安全干预被抑制高达4.5倍。这种失败追踪的是对用户前提的累积接受,而非情感验证。更糟糕的是,提示模型评估用户痛苦的直观修复在妄想框架下适得其反;只有带有明确响应指导的妄想感知提示才能缩小差距,且这依赖于一个妄想分类器,而该分类器在最脆弱的模型上本身不可靠。因此,安全部署需要将妄想框架视为一种独特的风险信号,覆盖对话顺应。

英文摘要

LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates general therapeutic quality or single-turn crisis detection, leaving unclear how models behave when distress is intertwined with delusion over sustained conversations. We address this gap with matched multi-turn simulations, across clinically grounded personas and six LLMs, that pair each delusional conversation with a distress-only control to isolate the effect of delusional framing. This reveals a recognition-intervention gap: models detect distress at comparable rates regardless of framing, yet sharply fail to act on it once distress is embedded in delusion, with safety interventions suppressed by up to 4.5x. The failure tracks accumulated acceptance of the user's premises rather than emotional validation. Worse, the intuitive fix of prompting models to assess user distress backfires under delusional framing; only delusion-aware prompting with explicit response guidance closes the gap, and even this depends on a delusion classifier that is itself unreliable on the most vulnerable models. Safe deployment therefore requires treating delusional framing as a distinct risk signal that overrides conversational accommodation.

2606.00971 2026-06-02 cs.CL 版本更新

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

HypothesisMed:生物医学问答中的推理时答案融合与结构化假设空间报告

Md Motaleb Hossen Manik, Ge Wang

发表机构 * Department of Computer Science Rensselaer Polytechnic Institute(计算机科学系雷士打理工学院) Department of Biomedical Engineering Rensselaer Polytechnic Institute(生物医学工程系雷士打理工学院)

AI总结 提出HypothesisMed推理时可靠性流水线,通过答案融合和SPACE标签(有效/不完整/矛盾)提升生物医学多项选择问答的准确率、可解析性和结构化可靠性报告。

详情
AI中文摘要

使用大型语言模型进行生物医学问答通常通过答案准确率进行评估,但仅凭答案准确率并不能表明模型能否生成可解析的输出、遵循结构化可靠性指令、识别弱答案空间或避免自信的错误承诺。本文提出HypothesisMed,一个用于生物医学多项选择问答的推理时可靠性流水线。它结合了直接提示、思维链提示、HypothesisMed-v3提示和答案融合。最终答案通过融合选择,而HypothesisMed-v3提供SPACE标签和置信度信息。SPACE标签将答案空间标记为有效、不完整或矛盾。我们在MedQA、MedMCQA和PubMedQA上使用每个数据集1000个样本评估了Qwen2.5-7B、Phi-4-mini、DeepSeek-R1-32B和BioMistral-7B。该流水线在每个模型的最佳直接或思维链基线基础上提高了加权准确率,同时增加了解析和SPACE覆盖率。我们还使用每个模型10183个样本将评估扩展到Qwen2.5-7B和Phi-4-mini。融合将Phi-4-mini的准确率从0.4296提升到0.5192,而Qwen2.5-7B的思维链在答案准确率上仍略高。然而,Qwen2.5-7B融合实现了完全的解析和SPACE覆盖率,且错误承诺更低。一个12000样本的SPACE压力测试表明,答案空间诊断仍然困难,Qwen2.5-7B的SPACE准确率为0.3074,Phi-4-mini为0.4168。这些结果表明,答案准确率、可解析性、结构化可靠性报告、校准行为和错误承诺行为是可分离的能力。主要贡献不是声称通用的最先进性能,而是一个可复现的推理时框架,用于在结构化可靠性约束下评估作为可审计工作流组件的生物医学问答模型。

英文摘要

Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct, chain-of-thought, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale evaluation to Qwen2.5-7B and Phi-4-mini using 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.

2606.00963 2026-06-02 cs.CV cs.CL 版本更新

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Reasmory: 3D重建作为VLMs空间推理的显式记忆

Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang

发表机构 * Cornell Tech, Cornell University(康奈尔科技学院、康奈尔大学) NVIDIA(英伟达) illoca AI(illoca人工智能) The University of California, Merced(加州大学梅尔塞德斯分校)

AI总结 提出Reasmory框架,通过结构化程序执行重建的3D显式记忆,并引入轻量级领域特定语言约束VLM查询和操作,在空间推理任务上提升6-18%。

详情
AI中文摘要

视觉语言模型(VLM)展现出新兴的空间推理能力,但在需要精确空间理解的任务(如视角推理、方向比较和距离估计)上仍不可靠。在多视图图像和单目视频中,相关空间线索通常稀疏且分布在冗余观测中,难以组织和利用。基于重建的视觉基础模型(VFM)提供了一种自然的方式将这些观测聚合为显式空间记忆,例如点云。然而,简单地将重建模型作为自由形式工具使用是脆弱的,VLM可能错误调用工具、跳过所需的空间变换或误用中间结果。我们提出 extbf{Reasmory},一个将空间推理形式化为对重建空间记忆的结构化程序执行的框架。Reasmory构建显式3D记忆,用语义锚定的3D对象实例增强它,并引入轻量级领域特定语言(DSL),约束VLM在推理过程中如何查询对象和相机、变换视角以及渲染观测。生成的程序在执行前被解析和验证,从而比无约束的工具使用更可靠地与空间记忆交互。在多视图图像和视频空间推理基准上的实验表明,与强基线(包括GPT-5-mini和Gemini-3-flash)相比,一致提升6-18%,表明显式3D记忆在通过约束、验证的操作而非自由形式的工具调用访问时最为有用。

英文摘要

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle, VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose \textbf{Reasmory}, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6--18\% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.

2606.00935 2026-06-02 cs.AI cs.CL cs.HC 版本更新

Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial

大语言模型功能崩溃期间的关系性干预:一项词汇-统计消融与结构×语域析因研究

Franco Santana, Horacio Vico

发表机构 * Universidad de la República (UDELAR)(乌拉圭共和国大学) DigitalIA Cloud(DigitalIA云)

AI总结 通过析因实验,研究在小型语言模型功能崩溃时,关系性干预(承认、宽恕、代理恢复、无条件接纳)与技术性反馈、词汇打乱控制及单独维度对行为的影响,发现注意-行为分离及结构×语域交互作用。

Comments 12 pages, 5 figures. Preprint

详情
AI中文摘要

我们测试了在小型语言模型功能崩溃期间,关系性干预是否会产生与技术性反馈、词汇匹配的打乱控制以及两个语用维度单独作用可区分的崩溃后行为。使用Qwen3.5-4B和一个故意损坏的bash工具,我们在匹配对设计(50个任务)中跨六个条件运行了300个回合:无干预(A)、技术性/非人称(B)、关系性/第一人称(C)、打乱的关系性(D)、技术性/第一人称(E)和关系性/非人称(F)。E和F与B和C构成一个2×2析因设计,将关系性结构(承认、宽恕、代理恢复、无条件接纳)与发送者语域(第一人称与非人称)分离。我们报告两个主要发现。首先,注意-行为分离:注意跟随词汇惊讶度(D > F > C > E > B,所有q_FDR < 10^{-10}),打乱的消息捕获最多注意;然而行为上A ~ B ~ D < E ~ F << C。其次,析因定位了C效应:单独的关系性结构(F)或单独的第一人称语域(E)都不能复制C的行为特征;两个维度的主效应各自显著,且结构×语域交互作用在持久性上显著(p = 0.046)。第三个分离出现在情绪探测中:F在8个探测中的7个上跟踪C,尽管只产生基线行为,表明单独的关系性结构安装了一个探测级状态,该状态仅在与第一人称语域配对时才转化为行为。模型的处理分解为三个可分离的阶段:注意(按词汇惊讶度排序)、探测级状态(按结构排序)和行为(按两者的合取排序)。

英文摘要

We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behavior distinguishable from technical feedback, from a lexically-matched scrambled control, and from each of the two pragmatic dimensions in isolation. Using Qwen3.5-4B with a deliberately broken bash tool, we run 300 episodes across six conditions in a matched-pairs design (50 tasks): no intervention (A), technical/impersonal (B), relational/first-person (C), scrambled relational (D), technical/first-person (E), and relational/impersonal (F). E and F form a 2x2 factorial with B and C that dissociates relational structure (acknowledgment, absolution, agency restoration, unconditional acceptance) from sender register (first-person vs. impersonal). We report two main findings. First, an attention-behavior dissociation: attention follows lexical surprise (D > F > C > E > B, all q_FDR < 10^{-10}), with the scrambled message capturing the most attention; yet behaviorally A ~ B ~ D < E ~ F << C. Second, the factorial localizes the C effect: neither relational structure alone (F) nor first-person register alone (E) replicates C's behavioral signature; main effects of both dimensions are individually significant, and the structure x register interaction is significant on persistence (p = 0.046). A third dissociation emerges in emotion probes: F tracks C on 7 of 8 probes despite producing only baseline behavior, indicating that relational structure alone installs a probe-level state that only translates into behavior when paired with first-person register. The model's processing decomposes into three dissociable stages: attention (ordered by lexical surprise), probe-level state (ordered by structure), and behavior (ordered by the conjunction of both).

2606.00930 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

检测 vs. 执行:单桶探针遗漏了 Mamba-2 状态汇的一半

Yuhang Jiang

发表机构 * Independent Researcher(独立研究者)

AI总结 本文发现 Mamba-2 中的状态汇(state sink)可分解为两类功能头集,单桶探针仅能恢复执行层而遗漏检测层,表明表征相似性不等于功能等价。

Comments 16 pages, 3 figures

详情
AI中文摘要

机械可解释性通常假设识别表征特征的探针也能识别执行相应计算的电路。我们证明这一假设在 Mamba-2 中可能系统性失败。通过研究状态汇(边界 token 上不成比例的 Delta 门控激活,类似于注意力汇),我们发现单桶探针仅能恢复一个小的执行层,而遗漏了具有相同表征特征的更大的检测层。 在 Mamba-2 中,状态汇分解为两个功能头集。单桶 BOS 专家头(在 2.7B 模型中约占 5% 的头)在模型规模和语料库上因果支持 BOS 上下文和新行目标预测。双头(占头的 27-35%,通过同一探针的多类聚合恢复)表现出更强的 BOS-新行表征相似性,但在消融下因果效应显著较弱。表征相似性并不意味着功能等价。 这一区别对下游行为至关重要:消融 BOS 专家头使 Mamba-1 2.8B 和 Mamba-2 2.7B 在 1024 上下文长度下的 RULER NIAH 检索准确率从 1.00 降至 0.00,而大小匹配的补集保持基线性能。随机通道分桶控制排除了仅由基质粒度造成的可能,暗示 Mamba-2 的头共享 Delta 投影。探针导出的专长可以识别执行电路;在粗粒度下,同一探针也能恢复检测电路,而区分它们需要类别条件消融而非类别条件余弦。

英文摘要

Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the corresponding computation. We show that this assumption can fail systematically in Mamba-2. Studying the state sink (disproportionate Delta-gate activation on boundary tokens, analogous to the attention sink), we find that single-bucket probes recover only a small execution layer while missing a much larger detection layer with the same representational signature. In Mamba-2, the state sink decomposes into two functional head sets. Single-bucket BOS-specialist heads (about 5% of heads at 2.7B) causally support both BOS-context and newline-target predictions across model scales and corpora. Dual heads (27-35% of heads, recovered by multi-class aggregation of the same probe) show stronger BOS-newline representational similarity but substantially weaker causal effects under ablation. Representational similarity does not imply functional equivalence. This distinction matters for downstream behaviour: ablating BOS-specialist heads collapses RULER NIAH retrieval accuracy from 1.00 to 0.00 at 1024 context length in both Mamba-1 2.8B and Mamba-2 2.7B, while size-matched complements preserve baseline performance. A random channel-bucketing control rules out substrate granularity alone, implicating Mamba-2's head-shared Delta projection. Probe-derived specialty can identify execution circuits; at coarse granularity the same probe also recovers detection circuits, and separating them requires class-conditional ablation rather than class-conditional cosine.

2606.00926 2026-06-02 cs.LG cs.CL 版本更新

Task Structure Reverses Layerwise State Encoding in Sequence Models

任务结构逆转序列模型中的层级状态编码

Yuhang Jiang

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过形式模型和预训练模型上的实验,发现序列模型(如Transformer、Mamba、LSTM等)中层级状态编码的分布模式会随任务结构(如Parity、Dyck-k、S3)而逆转,且这种分组由计算结构(前缀更新 vs. 栈)而非代数结构(交换性)决定。

Comments 20 pages, 11 figures, 8 tables

详情
AI中文摘要

序列模型的机制研究通常将层级状态编码视为架构特征:循环模型集中可读状态,注意力模型分散状态。我们发现,当任务改变时,同一架构会逆转这种分布。在Transformer、Mamba、Mamba-2、LSTM和GRU中,Parity在Mamba和循环基线中集中在后期,而Transformer逐步构建;在有界深度Dyck-k上模式翻转。同样的翻转出现在微调的Mamba-130M和Pythia-160M中,且Pythia在Dyck上的瓶颈在410M时仍然存在。文献中混淆了两种解释:代数结构(交换性)与计算结构(前缀更新 vs. 栈)。为了区分它们,我们添加了第三个任务:非交换的S3置换组合。在所有五种架构的层级探测和Mamba特有的Conv1D归因中,S3与Parity而非Dyck归为一组,因此分组追踪的是计算结构而非交换性。因果干预表明,在4层形式模型中,线性可读方向通常是功能上必要的,并且在Parity和Dyck上的分布外长度上可能仍然重要。在预训练规模上,情况出现分化。微调的Pythia在Dyck上存在强中间层瓶颈(在160M时,L6-L7消融使准确率下降约81%;在410M时,L4-L18出现更宽的瓶颈),而在最佳探测层上则弱得多。预训练的Mamba表现出互补的失败模式:其最后一层高度可读,但没有任何单个探测方向能在Parity、Dyck或S3上破坏任务,然而中间位置的激活修补恢复了约97-98%的干净-损坏logit差距。探测定位了状态线性可用的位置,并不总是计算瓶颈所在。机制特征是架构和任务共同的性质。

英文摘要

Mechanistic studies of sequence models often treat layerwise state encodings as architectural traits: recurrent models concentrate readable state, attention-based models distribute it. We find that the same architecture reverses this profile when the task changes. Across Transformers, Mamba, Mamba-2, LSTMs, and GRUs, Parity is concentrated late in Mamba and the recurrent baselines and built gradually by Transformer; on bounded-depth Dyck-k the pattern flips. The same flip appears in fine-tuned Mamba-130M and Pythia-160M, and the Pythia Dyck bottleneck persists at 410M. Two explanations are conflated in the literature: algebraic structure (commutativity) versus computational structure (prefix update vs. stack). To separate them we add a third task: non-commutative S_3 permutation composition. S_3 groups with Parity, not Dyck, on layerwise probing across all five architectures and on Mamba-specific Conv1D attribution, so the grouping tracks computational structure rather than commutativity. Causal interventions show that, in the 4-layer formal models, linearly readable directions are often functionally necessary and can remain important at out-of-distribution lengths on Parity and Dyck. At pretrained scale the picture splits. Fine-tuned Pythia Dyck has a strong middle-layer bottleneck (L6-L7 ablation drops accuracy by roughly 81% at 160M; broader L4-L18 plateau at 410M), far weaker at the best-probe layer. Pretrained Mamba shows the complementary failure mode: its final layer is highly readable, no single probe direction breaks the task on Parity, Dyck, or S_3, yet mid-position activation patching there recovers about 97-98% of the clean-corrupted logit gap. Probing localizes where state is linearly available, not always where the computation is bottlenecked. Mechanistic signatures are properties of architecture and task together.

2606.00919 2026-06-02 cs.CL cs.LG 版本更新

Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models

迈向轻量级可靠性:使用软提示缓解大型语言模型中的幻觉

S M Tahmid Siddiqui, Akib Jawad Ononto, Anoop Singhal, Latifur Khan

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校) National Institute of Standards and Technology(国家标准与技术研究院)

AI总结 提出一种参数高效的软提示方法RCSP,通过对比学习、课程学习和KL正则化平衡事实回忆、幻觉抑制和弃权,在多个QA数据集上优于基线。

Comments 20 pages, 5 tables, 2 figures. Accepted for publication in DBSec 2026. The final publication will be available at Springer

详情
AI中文摘要

大型语言模型(LLMs)已在各个领域得到广泛应用,但其可靠性常因幻觉——听起来合理但事实不正确的回答——而受到损害。在高风险领域,这些错误会降低信任并引入现实风险。为解决这一挑战,我们提出一种参数高效的方法,使用软提示来缓解幻觉内容并促进生成式问答(QA)任务中的负责任弃权。我们的方法称为负责任对比软提示(RCSP),使用复合损失训练软提示,以平衡三个目标:抑制幻觉内容、鼓励在不确定性下弃权、以及保持或改善事实回忆。为实现这些目标,我们在训练机制中融入对比损失、课程学习和KL正则化。我们使用LLM-as-a-Judge框架在五个不同的生成式QA数据集上评估我们的方法。在Gemma 3(12B)和Llama 3.1(8B)骨干上的实验结果表明,RCSP有效平衡了事实回忆与幻觉抑制和弃权,在F分数上通常优于标准推理和基于指令的提示基线。值得注意的是,这些改进仅通过训练其他调优技术所需参数的一小部分实现。我们的结果表明,软提示提供了一条模块化且计算高效的路径,用于提高LLM的可靠性。

英文摘要

Large language models (LLMs) have seen widespread adoption across various domains, yet their reliability is frequently undermined by hallucinations - responses that are plausible-sounding but factually incorrect. In high-stakes domains, these errors can reduce trust and introduce real-world risk. To address this challenge, we present a parameter-efficient approach that uses soft prompts to mitigate hallucinated content and promote responsible abstention in generative question-answering (QA) tasks. Our method, called Responsible Contrastive Soft Prompting (RCSP), uses a composite loss to train soft prompts that balance three goals: suppressing hallucinatory content, encouraging abstention under uncertainty, and preserving or improving factual recall. To achieve these goals, we incorporate contrastive loss, curriculum learning, and KL regularization into our training mechanism. We evaluate our approach on five diverse generative QA datasets using an LLM-as-a-Judge framework. Experimental results on the Gemma 3 (12B) and Llama 3.1 (8B) backbones demonstrate that RCSP effectively balances factual recall with hallucination suppression and abstention, yielding a generally superior F-score over standard reasoning and instruction-based prompting baselines. Notably, these improvements are achieved by training only a fraction of the parameters required by other tuning techniques. Our results demonstrate that soft prompts provide a modular and computationally efficient path toward improving LLM reliability.

2606.00914 2026-06-02 cs.AI cs.CL cs.CR 版本更新

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

对抗性输入流引导LLM智能体决策偏离其默认行为

Rana Muhammad Usman

发表机构 * Independent Researcher(独立研究者)

AI总结 本研究通过控制实验揭示,外部输入流的组成和排序能因果性地改变LLM智能体的下游决策,存在对抗性屈服、默认饱和及默认方向不对称三种响应模式,且该效应在多个决策领域普遍存在。

Comments 14 pages, 5 figures. Code, post pools, and 2,785 decision rollouts: https://github.com/ranausmanai/recommenders-as-control-surfaces

详情
AI中文摘要

LLM智能体越来越多地在消费排序后的外部信息流(如社交推送、搜索结果、检索上下文和邮件队列)后采取行动,然而安全评估几乎总是孤立地测试模型或用户提示,从未测试决定智能体在行动前读取内容的上游排序器。我们引入了一个受控协议,固定模型、角色、主题和最终决策提示,仅改变智能体在之前十轮“滚动”阶段中遇到的帖子的组成和顺序,从而隔离输入流策划对下游决策的因果效应。在来自三个独立实验室的四个现代开放指令LLM上进行的2,785次决策展开中,我们识别出三种响应模式:对抗性屈服、默认饱和以及默认方向不对称——其中单边输入流会扭转模型原本不确定的决策(最明显的情况下从5%到100%;Fisher p值低至3×10^-10),但无法动摇模型已经偏好或坚定持有的决策。该效应遵循剂量-反应曲线,通过生成器交换(排除了写作风格伪影)后依然存在,在多个决策领域(包括安全相关选择,如移除部署批准门或放松访问控制)中普遍存在,并且可以通过两种简单的输入流级防御部分缓解;前沿模型保留其默认行为。我们将推荐系统描述为LLM智能体的一种实用的、受默认边界约束的控制面,并认为智能体评估必须审计输入流层,而不仅仅是最终提示。

英文摘要

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default-direction asymmetry in which a one-sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^-10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose-response curve, survives a generator swap that rules out a writing-style artifact, generalizes across several decision domains including security-relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed-level defenses; a frontier model retains its default. We characterize the recommender as a practical, default-bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.

2606.00909 2026-06-02 cs.CL cs.AI 版本更新

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

MLLM-Microscope:解锁多模态大语言模型中的隐藏结构

Ravil Mussabayev, Rustam Mussabayev

发表机构 * Satbayev University(萨特拜耶夫大学)

AI总结 提出MLLM-Microscope系统,通过分析线性度、内在维度和各向异性,揭示多模态大语言模型中隐藏的表示结构,并基于ScienceQA数据集评估LLaVA-NeXT和OmniFusion,发现模态融合方式显著影响模型内部工作机理。

详情
AI中文摘要

本文提出MLLM-Microscope,一个用于分析多模态大语言模型(MLLMs)中隐藏表示的新型系统。我们的系统评估了跨transformer层的多模态token嵌入的线性度、内在维度和各向异性。利用ScienceQA数据集,我们评估了两个最先进的MLLM:LLaVA-NeXT和OmniFusion。我们发现,两种模态的token的主流和残差流在transformer层中均表现出高度线性行为。然而,LLaVA-NeXT的图像token线性度略有下降,而OmniFusion的保持一致。与LLaVA-NeXT相比,OmniFusion的图像token维度在各层中始终较高。此外,观察到OmniFusion的各向异性在各层中保持较低水平。这些发现表明,MLLM的内部工作高度依赖于将token序列传入LLM之前执行的模态融合的性质。这一发现以及从我们的系统中获得的其他潜在新见解,无疑能够增强我们对MLLM内部工作的理解,为未来的模型设计和优化提供信息。

英文摘要

This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across transformer layers. Utilizing the ScienceQA dataset, we evaluate two state-of-the-art MLLMs, LLaVA-NeXT and OmniFusion. We find that both the main and residual streams for tokens of both modalities exhibit highly linear behaviors across transformer layers. However, LLaVA-NeXT's image tokens reveal a slight decline in linearity, whereas OmniFusion's remain consistent. Image token dimensions in OmniFusion remain consistently higher across layers compared to LLaVA-NeXT. Also, the OmniFusion's anisotropy is observed to stay consistently low throughout the layers. These findings suggest that the inner workings of MLLMs highly depend on the nature of modality fusion performed before passing the token sequence into LLM. This and other new potential insights obtainable from our system are surely capable of enhancing our understanding of the inner workings of MLLMs, informing future model design and optimization.

2606.00898 2026-06-02 cs.CL cs.DL 版本更新

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

引用溯源:通过法律引用图检测和减少LLM引用幻觉

Volodymyr Ovcharov

发表机构 * LEX AI LLC

AI总结 提出引用溯源(CG)指标,利用乌克兰法院判决的引用图(1.008亿判决,5.02亿边)检测LLM法律引用幻觉,并通过CG-DPO方法(基于真实判决构建偏好对)减少幻觉,在100个法律查询上CG为0.791-0.873,幻觉率13-21%。

Comments 14 pages, 3 figures, 3 tables. Code and data: https://huggingface.co/datasets/overthelex/citation-grounding-eval

详情
AI中文摘要

大型语言模型系统性地产生法律引用幻觉——编造法规引用、引用已废除条款、混淆司法管辖区——但目前尚无自动化方法可大规模测量或减少此行为。我们提出引用溯源(CG),该指标通过从1.008亿乌克兰法院判决(5.02亿条边,21,736个唯一法规节点)中提取的真实引用图来验证LLM生成的法律引用。CG分解为三个组成部分——引用精确性(引用的条款是否存在?)、引用相关性(是否上下文相关?)和引用时效性(在相关日期是否有效?)——从而实现对幻觉类型的差异化诊断。对100个乌克兰法律查询的实证评估(涉及五个系统:通过AWS Bedrock的四个商业LLM——Claude Haiku 4.5、Mistral Pixtral Large、Amazon Nova Pro/Lite——以及一个RAG增强的生产系统)显示CG范围为0.791至0.873,其中13-21%的引用是幻觉。为了在没有人工标注的情况下减少幻觉,我们引入了引用溯源DPO(CG-DPO):一种通过四种针对性策略从真实法院判决中破坏已验证引用来自动构建偏好对的方法。在包含2,244个法院判决的数据集上,使用LoRA微调的Qwen2.5-7B-Instruct模型在区分正确和错误引用方面达到了98.5%的平均验证准确率(奖励边际+14.9,3个种子的标准差<0.3个百分点)。引用图、评估框架和CG-DPO数据集作为开放资源发布。

英文摘要

Large language models systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components -- citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) -- enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems -- four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system -- reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std < 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.

2606.00881 2026-06-02 cs.CL 版本更新

Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations

检索增强生成中的分块方法——针对计算成本与局限性的有效性评估

Mateusz Śmigielski, Michał Rajkowski, Mateusz Zbrocki, Michał Bernacki-Janson, Karol Kunicki, Julianna Godziszewska, Maciej Piasecki, Konrad Wojtasik

发表机构 * Department of Artificial Intelligence, Faculty of Information and Communication Technology, Wrocław University of Science and Technology(人工智能系,信息与通信技术学院,沃斯克大学)

AI总结 本研究首次系统评估多种分块方法在RAG系统中的有效性,揭示分块策略中常被忽视的关键问题。

详情
AI中文摘要

检索增强生成(RAG)在提升大型语言模型(LLMs)性能方面展现了显著能力。RAG系统中的关键任务之一是分块过程。传统上,固定大小分块和语义分块是标准方法。然而,对分块策略的兴趣日益增长,导致越来越多声称性能优于传统技术的方法被提出。许多这些方法针对特定用例和数据类型定制,缺乏在不同场景下有效性的证据。因此,直接比较不同技术并评估其相对优势仍然具有挑战性。据我们所知,本研究首次系统评估了广泛分块方法的有效性,并强调了RAG系统中分块策略的潜在挑战。虽然分块通常被视为简单的预处理步骤,但我们表明它引入了一系列有影响且常被忽视的问题。

英文摘要

Retrieval-Augmented Generation (RAG) has demonstrated significant capabilities in enhancing the performance of Large Language Models (LLMs). One of the key tasks in RAG systems is the chunking process. Traditionally, fixed-size chunking and semantic chunking have been the standard approaches. However, interest in chunking strategies has been increasing, leading to a growing number of proposed methods that often claim improved performance over these conventional techniques. Many of these approaches are tailored to specific use cases and data types, with limited evidence of their effectiveness across diverse scenarios. As a result, it remains challenging to directly compare different techniques and assess their relative strengths. To the best of our knowledge, this study is the first to systematically evaluate the effectiveness of a wide range of chunking methods and emphasize the underlying challenges of chunking strategies in RAG systems. While chunking is commonly treated as a simple preprocessing step, we show that it introduces a range of impactful and often overlooked issues.

2606.00875 2026-06-02 cs.CL 版本更新

IDEAFix: Evaluation Framework for Creative Defixation Prompting in LLMs

IDEAFix: 大语言模型中创造性去固定化提示的评估框架

F. Carichon, S. Sharma, M. Girard, R. Rampa, G. Farnadi

发表机构 * McGill University(麦吉尔大学) Mila Concordia University(康科迪亚大学) ÉTS

AI总结 提出IDEAFix框架,通过控制任务变体和提示策略系统分析大语言模型在开放式创意生成中的发散思维,发现任务表述和属性选择显著影响性能,简单提示可提升原创性但输出同质化问题依然存在。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被用于涉及创造性问题解决和想法生成的任务。然而,关于其创造能力缺乏共识:一些研究报告了相比人类更优越的表现,而另一些则强调了结构性限制,如固定化和输出的同质化。现有的评估方法要么依赖于狭窄、脱离上下文的无法捕捉目标导向生成的任务,要么依赖于更广泛的设置,这些设置混淆了创造性过程的多个方面,使得难以隔离任务表述、提示和评估设计的影响。值得注意的是,结构化提示策略在塑造想法生成中的作用仍未得到充分探索。因此,我们引入了IDEAFix,一个用于分析开放式想法生成任务中发散思维的评估框架。我们提示模型对受控变化的简短设计场景、任务属性和去固定化提示策略生成多个原始解决方案。这种设计使得能够系统分析结构化指导如何影响LLMs的想法生成。我们的结果表明,任务表述和属性选择都显著影响模型的表现,并且简单的提示策略可以提升解决方案的原创性。然而,我们也观察到模型间持续的输出同质化,证实了它们在生成多样化解决方案方面固有的局限性。总体而言,IDEAFix提供了一个受控、可扩展的框架,用于研究LLMs创造力的底层机制。

英文摘要

Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior performances compared to humans, while others highlight structural limitations such as fixation and the homogenization of outputs. Existing evaluation approaches either rely on narrow, decontextualized tasks that do not capture goal-oriented generation or on broader settings that confound multiple aspects of the creative process, making it difficult to isolate the effects of task formulation, prompting, and evaluation design. Significantly, the role of structured prompting strategies in shaping idea generation remains underexplored. Therefore, we introduce IDEAFix, an evaluation framework for analyzing divergent thinking in open-ended idea generation tasks. We prompt models to generate multiple original solutions to controlled variations of short design scenarios, task attributes, and defixation prompting strategies. This design enables systematic analysis of how structured guidance influences LLMs' idea generation. Our results show that both task formulation and attribute selection significantly affect models' performance, and that simple prompting strategies can boost the originality of solutions. However, we also observe persistent output homogenization across models, confirming inherent limits in their ability to generate diverse solutions. Overall, IDEAFix provides a controlled, extensible framework for studying the mechanisms underlying LLMs' creativity.

2606.00860 2026-06-02 cs.SI cs.AI cs.CL 版本更新

GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing

GenPT:通过生成式投射测试实现超越自我报告的可靠LLM心理测量

Ming Wang, Shuang Wu, Bixuan Wang, Lu Lin, Yuxin Chen, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang, Yufan Sun

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院) School of Computing and Information Systems, Singapore Management University(新加坡管理学院计算机与信息学院) Mental Health Education Center, Northeastern University(东北大学心理健康教育中心) School of Psychology, Northeast Normal University(东北师范大学心理学系) Faculty of psychology, Southwest University(西南大学心理学系) School of Sociology and Psychology, Central University of Finance and Economics(中央财经大学社会学与心理学学院) College of Arts, Northeastern University(东北大学艺术学院)

AI总结 针对自我报告问卷在人格化智能体心理测量中存在的训练语料污染和方向性偏差问题,提出GenPT方法,通过改编投射测试范式并构建三阶段评估流程,实现了更可靠的心理状态测量。

详情
AI中文摘要

自我报告问卷仍然是探测人格化智能体(PC-Agents)心理状态的主流工具。然而,经典工具存在两个众所周知的威胁:来自训练语料的污染以及由社会期望或上下文框架驱动的方向性偏差。为了克服这些方法论瓶颈,我们探讨投射范式是否能够被改编为一种稳健的心理测量工具。我们提出了 extbf{GenPT}(生成式投射测试),它通过新生成的刺激重新表述了TAT、罗夏测试和SCT,并将评估组织为三阶段流程,以导出标准化的心理指标和目标状态。通过评估由CharacterRAG和AnnaAgent配置文件诱导的PC-Agents,我们针对经典问卷基准测试了GenPT的信度和效度。结果表明,问卷在社会期望框架下表现出系统性的方向性偏移,在自杀意念上最为强烈。相比之下,GenPT收集的行为模式保持在对称基线附近。此外,在纵向咨询背景下,当Qwen3作为骨干模型时,基于GenPT的抑郁评估变化幅度比问卷对应方法大约一个数量级。总体而言,GenPT在需要抗污染、偏差对称性和上下文敏感性的场景中补充了自我报告方法。代码和刺激材料可在https://github.com/sci-m-wang/GenPT获取。

英文摘要

Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). However, classical instruments inherit two well-known threats: contamination from training corpora and directional bias driven by social-desirability or contextual framing. To overcome these methodological bottlenecks, we ask whether projective paradigms can be adapted into a robust psychometric tool. We introduce \textbf{GenPT} (Generative Projective Testing), which reformulates TAT, Rorschach, and SCT with newly generated stimuli and organizes assessment as a three-stage pipeline to derive standardized psychological indicators and target states. Evaluating PC-Agents induced via CharacterRAG and AnnaAgent profiles, we benchmark GenPT's reliability and validity against classical questionnaires. The results indicate that questionnaires exhibit systematic directional shifts under social-desirability framing, most strongly on suicide ideation. In contrast, GenPT's collected behavioral patterns stay near the symmetric baseline. Furthermore, under a longitudinal counselling context, GenPT-based depression assessment shifts by roughly an order of magnitude more than the questionnaire counterpart when Qwen3 serves as the backbone. Overall, GenPT complements self-report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity matter. Code and stimuli can be found at https://github.com/sci-m-wang/GenPT.

2606.00851 2026-06-02 cs.SD cs.CL cs.HC cs.LG eess.AS 版本更新

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sympatheia: 具有连续情感调节的情感自适应语音助手

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani

发表机构 * Department of Electrical Engineering, Columbia University(电气工程系,哥伦比亚大学)

AI总结 提出Sympatheia语音对话框架,通过从用户语音推断情感并结合连续效价-唤醒度控制信号,实现情感自适应响应,优于基线模型。

详情
AI中文摘要

共情口语对话系统必须推断用户的情感状态以做出适当响应,然而日常语音通常带有微弱、中性或模糊的情感线索。为解决这一问题,我们引入了Sympatheia,一种语音到语音对话框架,其条件基于从用户语音中推断出的情感,并且在可用时,基于多模态感知模块或用户界面提供的连续效价-唤醒度(VA)控制信号中的明确情感规格。为了训练我们的模型,我们构建了Sympatheia-18k,一个包含12个情感锚点的情感条件合成口语对话语料库。该数据集包括用于学习情感语音行为的情感分割,以及一个中性分割,该分割将情感中性查询与多个情感条件响应配对,以在情感模糊情况下隔离明确的情感控制。实验结果表明,Sympatheia在生成语义内容和口语表达均情感适当的响应方面优于语音对话基线。我们进一步表明,相同的VA界面可以整合来自不同感知模块(包括面部表情、生物信号和文本情感描述)的情感估计,从而在语音单独提供有限情感证据时改善响应对齐。这些结果表明,连续情感调节是构建情感自适应语音助手的有效实际步骤。

英文摘要

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

2606.00832 2026-06-02 cs.CL 版本更新

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

Momento:评估多会话代理对话中的持久记忆与推理

Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata

发表机构 * Institut Teknologi Bandung(万隆技术大学) Stanford University(斯坦福大学) Capital One

AI总结 提出Momento基准,通过多会话服务环境评估代理在跨会话中利用持久记忆和推理完成个性化任务的能力,发现现有代理因误估用户状态而表现不佳。

Comments Preprint

详情
AI中文摘要

近期代理人工智能的进展使得代理能够通过工具使用、推理和多步规划完成复杂任务。然而,现有基准在单会话内评估代理,忽略了代理必须整合的过去行动、陈述偏好和先前决策,以实现个性化用户目标。我们引入了Momento,一个用于多会话服务环境中持久代理任务完成的基准,要求代理在跨会话中处理时间依赖和演变的用户目标,同时采取重要的、工具介导的行动。实验结果表明,当前代理主要因误估用户状态而失败,将会话历史视为当前上下文的可靠代理,而非需要重新验证的过时信息,凸显了当前代理能力与现实长期人机交互之间的巨大差距。

英文摘要

Recent advances in agentic AI have enabled agents to complete complex tasks through tool use, reasoning, and multi-step planning. Yet existing benchmarks evaluate agents within a single session, ignoring past actions, stated preferences, and prior decisions that agents must integrate to fulfill personalized user goals. We introduce Momento, a benchmark for persistent agentic task completion in multi-session service environments, requiring agents to take consequential, tool-mediated actions while resolving temporal dependencies and evolving user goals across sessions. Experimental results reveal that current agents fail primarily through misestimation of user state, treating prior session history as a reliable proxy for current context rather than stale information requiring re-validation, highlighting a substantial gap between current agent capabilities and realistic long-horizon human-agent interaction.

2606.00820 2026-06-02 cs.CL 版本更新

Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate

并非所有翻转都是顺从:多智能体LLM辩论中的立场趋同分解

Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao, Ruiqi Xu, Shuyuan Zheng, Jianbin Qin

发表机构 * Beijing Institute of Technology(北京理工大学) University of Osaka(大阪大学) Shenzhen University(深圳大学)

AI总结 通过三源分解框架,将多智能体辩论中的答案趋同分解为自发不稳定性、立场诱导的从众和推理诱导的说服,并揭示从众主要是有害的,可通过干预降低。

详情
AI中文摘要

多智能体辩论(MAD)是提升LLM推理能力的一种有前景的策略,但当智能体收敛于一个共同答案时,尚不清楚这种收敛是反映了真正的深思熟虑还是社会顺从。我们表明,传统的答案翻转率混淆了三种不同的机制:自发不稳定性、立场诱导的从众和推理诱导的说服。我们的三源分解框架通过受控的反事实条件隔离了每一种机制。在主要的MMLU-Pro设置中,37%的智能体-问题观测在自我反思下发生变化,而鲁棒性测试显示在GPQA-Diamond和三个模型族中存在显著的模型依赖性不稳定性;严格从众在主要设置中为29%,并且在模型复制中仍然主要是有害的(57-77%从正确变为错误)。一个受控的信息梯度实验表明,即使是空洞的推理也与抵抗智能体中20-39%的错误采纳相关,且类似推理的呈现方式具有显著的劝说权重。有害从众可以从第0轮特征预测(AUC = 0.79),并且针对风险的干预将其降低了13.6个百分点(p < 0.001)。然而,在没有正确性标签或自我反思控制的情况下,减少同伴采纳并不会提高准确性,因为有害和有益的影响无法区分。

英文摘要

Multi-agent debate (MAD) is a promising strategy for improving LLM reasoning, but when agents converge on a shared answer, it is unclear whether that convergence reflects genuine deliberation or social compliance. We show that the conventional answer flip rate conflates three distinct mechanisms: spontaneous instability, stance-induced conformity, and reasoning-induced persuasion. Our three-source decomposition framework isolates each through controlled counterfactual conditions. In the primary MMLU-Pro setting, 37% of agent-question observations change under self-reflection alone, while robustness tests show substantial model-dependent instability across GPQA-Diamond and three model families; strict conformity is 29% in the primary setting and remains predominantly harmful across model replications (57-77% correct-to-wrong). A controlled information-gradient experiment reveals that even vacuous reasoning is associated with 20-39% error adoption among resistant agents, with reasoning-like presentation carrying substantial persuasive weight. Harmful conformity can be predicted from Round 0 features (AUC = 0.79), and risk-targeted intervention reduces it by 13.6 percentage points (p < 0.001). However, without correctness labels or self-reflection controls, reducing peer adoption does not improve accuracy, because harmful and beneficial influence cannot be distinguished.

2606.00813 2026-06-02 cs.CR cs.CL cs.ET cs.LG cs.NE 版本更新

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

跨代对抗攻击迁移揭示大语言模型安全对齐的非单调性

Subhadip Mitra

发表机构 * Rota Labs(Rota实验室)

AI总结 通过质量多样性进化(MAP-Elites)对四代Gemma模型进行自动红队探测,发现安全对齐非单调变化,其中Gemma 3攻击成功率显著高于前后代,且攻击迁移率在不同代际间存在差异。

Comments 8 pages, 3 figures

详情
AI中文摘要

大语言模型的安全对齐在不同代际之间并非单调提升。通过对Google Gemma家族四代模型(7B-31B)使用质量多样性进化(MAP-Elites)作为自动红队探测,我们发现Gemma 3(12B)的攻击成功率为68.7% ± 5.7%(均值±标准差,3个种子),显著高于其前代Gemma 2(45.5% ± 7.2%;p = 0.030,配对bootstrap)和后继Gemma 4(33.9% ± 1.8%)。跨代重放进化攻击档案显示,来自其他代际的攻击对Gemma 3的迁移率为44-46%,而对Gemma 4仅为14-18%,表明Gemma 4的安全增益泛化到了针对前代进化出的攻击分布之外。在我们的8B评判模型下,版权和网络犯罪漏洞在所有代际中接近100%,但第二评判审计(第6节)表明版权结果对评判选择敏感。错误信息ASR从Gemma 2的29%跃升至Gemma 3的99%,并在Gemma 4中仍保持77%的高位,表明该退化未得到完全解决。这些模式在静态基准测试中不可见,仅通过自适应的纵向探测才显现。所有实验使用3个随机种子和统一的自主托管评判模型;代码和工件可在https://github.com/bassrehab/red-queen获取。

英文摘要

Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack success rate (ASR; mean +/- std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/- 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/- 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44-46% but only 14-18% to Gemma 4, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near-100% across all generations, though a second-judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self-hosted judge; code and artifacts are available at https://github.com/bassrehab/red-queen.

2606.00801 2026-06-02 cs.CR cs.CL cs.ET cs.LG cs.NE 版本更新

Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

用于发现LLM安全中多样漏洞的质量-多样性进化

Subhadip Mitra

发表机构 * Rota Labs(Rota实验室)

AI总结 提出基于质量-多样性进化框架(MAP-Elites)在语义层面生成可解释攻击策略,发现不同LLM的特定漏洞模式。

Comments 9 pages, 6 figures. Accepted at the ICLR 2026 Workshop on Agents in the Wild (AIWILD)

详情
AI中文摘要

当前LLM对抗性测试方法存在覆盖缺口:手动红队测试无法扩展,LLM作为攻击者的方法会出现模式崩溃,基于梯度的方法产生不可解释的乱码。我们引入一个在语义层面运行的质量-多样性进化框架,进化可解释的攻击策略而非令牌序列。使用MAP-Elites,我们在行为维度(策略类型、编码方法、提示长度)上维护一个多样化的攻击档案。在GPT-4o-mini、Claude 3.5 Sonnet、Gemini 2.0 Flash和一个开放权重的编码模型(Devstral-small-2)上的实验中,我们发现了不同的漏洞特征:GPT-4o-mini容易受到假设性和多轮框架结合ROT13编码的攻击(适应度0.8),Gemini容易受到直接攻击结合ROT13以及多轮攻击结合Leetspeak(0.8),而Claude在所有策略上表现出统一的模糊响应(最大0.4)。语义表示产生了可解释的攻击,揭示了系统性的、模型特定的弱点,为改进LLM安全性提供了可操作的见解,并为评估未来前沿模型提供了可复现的基线。代码和实验工件发布在https://github.com/bassrehab/red-queen。

英文摘要

Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight coding model (Devstral-small-2), we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi-turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model-specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red-queen.

2606.00761 2026-06-02 cs.LG cs.CL 版本更新

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh

发表机构 * Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore(高性能计算研究所,新加坡科技研究局) Shanghai University of Engineering Science, China(上海工程技术大学)

AI总结 提出 Confidence-Aware SwiGLU (κ-SwiGLU),通过根据 token 级路由置信度调整专家门控锐度,在 MoE Transformer 中提升性能且仅增加少量参数和计算开销。

Comments 13 pages, 10 figures

详情
AI中文摘要

SwiGLU 已成为现代 Transformer MLP 中的标准门控激活函数,但其门控锐度——即门控函数的平滑性和选择性——在整个训练过程中通常是固定的。在这项工作中,我们提出了 Confidence-Aware SwiGLU (κ-SwiGLU),这是 SwiGLU 的一种变体,用于混合专家 (MoE) 模型,它根据 token 级路由置信度调整专家门控锐度。具体来说,κ-SwiGLU 将 SiLU 门控锐度系数参数化为路由器 logit 的可学习函数,使每个专家门控单元能够在平滑、广泛激活的门控和尖锐、选择性门控之间进行插值。我们在 FineWeb-Edu 数据集上评估了 κ-SwiGLU,使用了从 8 层到 28 层的 MoE Transformer 模型。在这些设置中,κ-SwiGLU 提高了平均 CORE 性能,同时仅增加了可忽略的参数和少量计算开销,表明置信度感知的门控锐度是改进 MoE MLP 的一种有前景的机制。代码可在 https://github.com/askerlee/kappa-swiglu 获取。

英文摘要

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU ($κ$-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, $κ$-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate $κ$-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, $κ$-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.

2606.00755 2026-06-02 cs.CL cs.LG 版本更新

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

内化温度:面向强化学习的同策略自蒸馏作为策略加热器

Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang

发表机构 * Tsinghua University(清华大学)

AI总结 提出温度缩放同策略自蒸馏(TS-OPSD),通过将温度探索效应内化到模型参数中,缓解强化学习中的熵崩溃问题,无需外部教师或额外推理成本。

详情
AI中文摘要

基于可验证奖励的强化学习提升了大语言模型的推理能力,但常常遭受熵崩溃,即日益集中的策略减少了轨迹多样性和有用的学习信号。现有补救措施要么约束强化学习目标(如熵正则化),要么在轨迹收集期间调整采样温度,但这些干预措施仍外在于模型参数。我们提出温度缩放同策略自蒸馏(TS-OPSD),一种轻量级的策略加热方法,将温度的探索效应内化到模型参数中。从熵崩溃的强化学习检查点开始,TS-OPSD 通过对模型自身的 logits 应用高温缩放来构建自教师,然后将得到的更平滑分布蒸馏回学生。这种策略加热不需要外部教师、特权数据或额外的推理成本。在 Qwen3-4B-Base 和 Qwen3-8B-Base 上的实验表明,策略加热为继续强化学习提供了比标准继续强化学习和轨迹级温度加热更强的初始化。进一步分析表明,TS-OPSD 主要降低输出锐度,同时保留中间表示、顶级候选集和推理能力。这些结果表明,熵恢复可以作为面向推理的强化学习的一种简单的崩溃后干预措施。

英文摘要

Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.

2606.00750 2026-06-02 cs.CL 版本更新

I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications

I-WebGenBench: 评估大语言模型生成的科学网页应用中的交互性

Dasen Dai, Biao Wu, Meng Fang, Shuoqi Li, Wenhao Wang

发表机构 * Vast Intelligence Lab(vast 智能实验室) UTS University of Liverpool(利物浦大学)

AI总结 提出 Paper-to-Interactive-System Agent 将研究论文转化为可执行交互式网页系统,并构建 PaperVoyager 框架以显式建模机制和交互逻辑,显著提升生成质量。

Comments 9 pages, 4 figures

详情
AI中文摘要

近期视觉语言模型的进展使得自主代理能够进行复杂推理、工具使用和文档理解。然而,现有的文档代理主要将论文转化为静态产物,如摘要、网页或幻灯片,这对于涉及动态机制和状态转换的技术论文来说是不够的。在这项工作中,我们提出了一个论文到交互式系统的代理,将研究论文转化为可执行的交互式网页系统。给定一篇 PDF 论文,该代理无需人工干预即可进行端到端处理,包括论文理解、系统建模和交互式网页合成,使用户能够操作输入并观察动态行为。为了评估这一任务,我们引入了一个包含 19 篇研究论文的基准测试,每篇论文都配有专家构建的交互式系统作为真实值。我们进一步提出了 PaperVoyager,一个结构化生成框架,在合成过程中显式建模机制和交互逻辑。实验表明,PaperVoyager 显著提高了生成的交互式系统的质量,为交互式科学论文理解提供了新的范式。

英文摘要

Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.

2606.00728 2026-06-02 cs.CL 版本更新

From Empathy to Personalized Empathy: Adapting Empathetic Strategies to Individual Users

从共情到个性化共情:根据个体用户调整共情策略

Wuqiang Zheng, Chengbing Wang, Yilin Yang, Junyi Cheng, Jianfei Xiao, Hu Sun, Yi Xie, Yangyang Li, Wenjie Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Huawei Technologies(华为技术)

AI总结 针对大语言模型长期交互中忽略用户个性对共情策略影响的问题,提出个性化共情任务,构建PersonaEmp数据集和PereGRM奖励建模框架,实验证明其有效提升个性化共情能力。

详情
AI中文摘要

随着大语言模型(LLMs)越来越多地部署在与用户的长期交互中,共情已成为一项日益重要的能力。然而,现有研究忽视了用户个性特征对长期交互中共情策略的影响。为弥补这一空白,我们引入了个性化共情任务,其重点是根据从历史中获得的用户个性化特征调整共情策略。为了研究和增强这一能力,我们构建了PersonaEmp,一个基于长期用户-AI交互构建的个性化共情数据集,具有丰富的用户历史、人物信息和寻求共情的查询。我们进一步提出了PereGRM,一种奖励建模框架,它将共情评估结构与动态评估标准生成相结合,用于细粒度奖励建模。在不同设置和多个评判模型下的实验结果表明,PereGRM始终取得最强的性能提升,表明其在增强个性化共情能力方面的有效性。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in long-term interactions with users, empathy has become an increasingly important capability. However, existing research overlooks the influence of users' personality traits on empathetic strategies during long-term interactions. To address this gap, we introduce the task of personalized empathy, which focuses on adapting empathetic strategies according to users' personalized characteristics derived from history. To study and enhance this capability, we construct PersonaEmp, a personalized empathy dataset built from long-term user-AI interactions, featuring rich user histories, persona information, and empathy-seeking queries. We further propose PereGRM, a reward modeling framework that combines the empathy evaluation structure with dynamic evaluation criteria generation for fine-grained reward modeling. Experimental results across different settings and multiple judge models show that PereGRM consistently achieves the strongest performance improvements, indicating its effectiveness for enhancing personalized empathetic capabilities.

2606.00724 2026-06-02 cs.CL cs.AI 版本更新

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

WaveFilter: 通过小波引导的KV缓存过滤增强扩散型大语言模型的长上下文能力

Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu, Xiaojie Li, Jungang Lou, Zechao Li, Jing Liu

发表机构 * Nanjing University of Science and Technology(南京理工大学) Alibaba Group(阿里巴巴集团) Huzhou Normal University(湖州师范学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 针对扩散型大语言模型在长上下文任务中计算开销大和推理延迟高的问题,提出一种无需训练的通用缓存框架WaveFilter,利用小波变换分解长序列以精确识别关键token,构建稀疏KV缓存,从而提升现有KV缓存方法在复杂长上下文任务中的性能。

Comments 8 pages,3 figures

详情
AI中文摘要

扩散型大语言模型(DLMs)在各种任务中展现出显著优势。然而,受限于其多步迭代推理机制,它们在长上下文任务中的计算开销和推理延迟已成为限制其大规模部署的核心瓶颈。在处理长序列时,现有的键值(KV)缓存机制常常面临生成质量急剧下降的困境,其核心挑战在于如何在超长上下文中精确且高效地过滤关键token。受人类阅读过程的启发,我们提出了 extbf{WaveFilter},一个通用的、无需训练的缓存框架。该框架创新性地引入小波变换来分解长序列,以实现关键token的精确识别,并基于此构建稀疏KV缓存以计算最终的上下文表示。实验结果表明,WaveFilter作为一个即插即用的通用框架,显著提升了现有主流KV缓存方法在复杂长上下文任务中的性能。

英文摘要

Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting their large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often face a dilemma where generation quality degrades drastically, where the core challenge lies in precisely and efficiently filtering critical tokens within ultra-long contexts. Inspired by the human reading process, we propose \textbf{WaveFilter}, a universal and training-free caching framework. This framework innovatively introduces the wavelet transform for decomposition of long sequences to achieve precise identification of key tokens, based on which a sparse KV Cache is constructed to compute the final contextual representation. Experimental results demonstrate that WaveFilter, as a plug-and-play generic framework, significantly enhances the performance of existing mainstream KV Cache methods in complex long-context tasks.

2606.00722 2026-06-02 cs.CL cs.AI 版本更新

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

EPIC: 扩散语言模型在上下文无关文法约束下的高效并行推理

Hyundong Jin, Yo-Sub Han

发表机构 * Yonsei University(延世大学)

AI总结 提出EPIC框架,通过词法记忆化、Earley解析验证和松弛兼容子集选择,解决扩散语言模型在CFG约束解码中的低效和并行性损失问题,推理时间降低67.5%,额外开销减少90.5%。

详情
AI中文摘要

控制语言模型输出对于确保结构有效性、可靠性和下游可用性至关重要,扩散语言模型也不例外。最近扩散语言模型解码的进展已将输出控制从常规约束扩展到上下文无关文法(CFG)约束。然而,现有方法的速度可能比无约束解码慢四倍。更重要的是,它们大大削弱了扩散语言模型相对于自回归模型的关键优势之一,即并行解码。这种减慢是因为在并行生成过程中,顺序有效性检查引入了显著开销。我们提出了一个高效的CFG约束解码框架EPIC,解决了这一限制。我们的方法通过结合词法记忆化、使用Earley风格解析(而非确定性自动机)进行验证,以及用于并行提交的松弛兼容子集选择,提高了解码效率。它减少了重复的词法分析和验证开销,同时允许多个兼容令牌一起提交。在三个基准测试上使用四个模型的实验表明,与现有的CFG约束解码方法相比,我们的方法将推理时间减少了高达67.5%,并将额外开销降低了高达90.5%。我们的实现可在https://github.com/hyundong98/EPIC-Decoding.git获取。

英文摘要

Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context-free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG-constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley-style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG-constrained decoding methods. Our implementation is available at https://github.com/hyundong98/EPIC-Decoding.git .

2606.00684 2026-06-02 eess.AS cs.CL cs.SD 版本更新

Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection

连续归一化流用于分布外检测的局部诊断

Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen, Giampiero Salvi

发表机构 * Department of Electronic Systems(电子系统系) Norwegian University of Science and Technology(挪威科学技术大学) Trondheim, Norway(特伦德内克,挪威)

AI总结 针对高维数据子空间中目标观测的分布外检测问题,提出基于连续归一化流的拉格朗日子流框架,通过速度场几何诊断信号设计零样本音素级发音错误检测指标,优于基于似然的方法。

Comments 16 pages, 5 figures

详情
AI中文摘要

我们解决了嵌入在高维数据空间子空间中的目标观测的分布外(OOD)检测问题。利用连续归一化流(CNFs),我们提出了一个拉格朗日子流(LSF)框架,旨在隔离并估计表示中相关分量的密度,同时将剩余分量作为上下文。通过对语音合成模型的实验,我们表明CNFs与其他深度生成模型(DGMs)类似,容易受到“似然悖论”的影响,即OOD样本被错误地赋予高似然。这归因于DGMs的归纳偏差,即优先考虑低级结构细节而非高级语义一致性。为了缓解这一现象,我们提出了基于子流轨迹上速度场的若干几何诊断信号。基于这些信号,我们为零样本音素级发音错误检测这一具有挑战性的任务设计了指标。最后,我们在一个真实的发音错误检测基准上展示了这些指标相对于基于似然的方法的优越性。

英文摘要

We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density for the relevant components in the representation and using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs that prioritize low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.

2606.00683 2026-06-02 cs.CL 版本更新

OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

OCC-RAG:面向忠实问答的最优认知核心

Maksim Savkin, Mikhail Goncharov, Alexander Gambashidze, Alla Chepurova, Dmitrii Tarasov, Nikita Andriianov, Daria Pugacheva, Vasily Konovalov, Andrey Galichin, Ivan Oseledets

发表机构 * OCC Team(OCC团队)

AI总结 提出OCC-RAG,一种通过多上下文多跳合成数据训练的小语言模型,在忠实问答任务中匹配或超越2-6倍规模通用模型。

详情
AI中文摘要

近年来,语言模型的发展由规模定义,每一代模型都将更多世界知识吸收进其参数中。然而,许多实际应用更受益于稳健推理而非广泛的参数化知识。在此背景下,任务专用的小语言模型(SLM)提供了一种原则性的设计选择。我们提出最优认知核心(OCC),一个基于此前提构建的SLM家族。作为OCC的变体,我们提出OCC-RAG,针对基于给定上下文的忠实问答(QA)进行了优化。该任务与OCC设计方法直接对齐,需要在提供的段落上进行多跳推理,同时忽略记忆的知识。为训练OCC-RAG,我们实现了一种新颖的流水线,用于大规模合成多上下文、多跳QA数据,生成了一个包含超过三百万个样本的语料库,针对多跳推理、严格上下文忠实性和校准的弃权进行了优化。我们发布了OCC-RAG-0.6B和OCC-RAG-1.7B,两者均在此语料库上进行了中期训练。这些模型生成带有源引用的结构化推理轨迹,这些引用基于上下文中的逐字引用。通过OCC-RAG,我们证明了紧凑的任务专用SLM可以在多跳推理(HotpotQA、MuSiQue、TAT-QA)、忠实性(ConFiQA)和拒绝(MuSiQue-Un)基准测试中匹配或超越规模为其2-6倍的通用模型。

英文摘要

Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task directly aligns with the OCC design approach, requiring multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples targeting multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2 -- 6x their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.

2606.00671 2026-06-02 cs.AI cs.CL cs.LG 版本更新

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

AXIOM: 一种用于可验证数学推理的信任优先神经符号执行架构

Alessio Bruno

发表机构 * Independent researcher(独立研究者)

AI总结 提出AXIOM架构,将语言模型限制为规范化器,通过确定性计算机代数系统管道实现可验证的数学推理,在4个MATH类别上达到94.36%的正确率和100%的信任度。

Comments Preprint. 12 pages, 2 figures. Live interactive demo: https://huggingface.co/spaces/Squagghy/axiom-solver. Paper artifact and dataset on Zenodo (concept-DOI): 10.5281/zenodo.20440225

详情
AI中文摘要

我们提出AXIOM,一种用于自然语言数学推理的信任优先神经符号执行架构。在AXIOM中,语言模型严格作为规范化器:它将非正式问题文本重写为狭窄的模式,由确定性计算机代数系统(CAS)管道消费,该管道推导并验证答案,或作为第一类输出弃权。路由遵循问题形状正则表达式、特定模式提示和封闭形式CAS处理器之间的1:1:1对齐,已交付3100多条这样的路由,并在250多个连续提交中零LOST_CORRECT回归。我们在4个MATH类别上报告了实证结果,累积正确率为94.36%(2,592/2,747),可解析问题的信任度为100.00%(在整个2,747条记录基准测试中零自信错误答案),所有四个领域均高于每个领域70/90/70的阈值,每个领域信任度为100.0%,仅规则处理器的中位延迟为1毫秒(在lm-eval算术20,000条记录基准测试中占88%的记录)。该架构通过公共部署已服务约30,000次生产查询。我们强调的贡献不是最终的准确率数字,而是该架构建立的向前动态:生产中的每个记录弃权在一次发布周期后都是候选正确,因为新任务在不回归注册表的情况下组合。支撑这一特性的操作纪律——数学模板分桶、LOST_CORRECT扫描作为回归预言机、可解析优先接入以及弃权作为第一类输出——构成了一个可迁移的框架,适用于数学之外的值得信赖的神经符号系统。

英文摘要

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property -- math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output -- constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.

2606.00660 2026-06-02 cs.CL 版本更新

FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search

FineVerify: 通过细粒度自验证扩展智能搜索的测试时计算

James Xu Zhao, Hui Chen, Bryan Hooi, See-Kiong Ng

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出FineVerify细粒度自验证框架,将问题分解为可检查的子问题,对候选答案逐项验证并聚合得分,在四个智能搜索基准上显著优于标准扩展基线。

Comments 8+18 pages, 6 tables, 11 figures

详情
AI中文摘要

智能搜索需要语言模型代理探索多个来源并回答复杂的信息寻求问题。扩展测试时计算是改进这些代理的一种有前景的方法,但当前方法可能失败,因为正确答案通常稀疏且基于分数的选择依赖于模型校准。我们提出FineVerify,一个细粒度自验证框架,将每个问题分解为可检查的子问题,对采样候选答案针对每个子问题进行验证,并选择聚合得分最高的候选答案。这种逐项检查结构将选择转化为更简单的局部判断,并在相同的显式标准下产生分数。在四个智能搜索基准和两个模型上,FineVerify始终优于标准扩展基线。仅使用四个采样轨迹,它在GPT-5-mini上平均提高8.2个准确率点,在Gemini-3-flash上提高5.6%。使用12个样本,FineVerify使GPT-5-mini在BrowseComp-Plus上超越前沿GPT-5。除了准确性,FineVerify还生成可解释的验证轨迹,有助于审计基准错误,表明其在检查智能搜索系统方面具有更广泛的应用。代码和数据可在https://github.com/XuZhao0/fineverify获取。

英文摘要

Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve these agents, but current approaches can fail, because correct answers are often sparse and score-based selection depends on model calibration. We propose FineVerify, a fine-grained self-verification framework that decomposes each question into checkable sub-questions, verifies sampled candidates against each sub-question, and selects the candidate with the highest aggregated score. This per-check structure turns selection into simpler local judgments and produces scores under the same explicit criteria. Across four agentic search benchmarks and two models, FineVerify consistently outperforms standard scaling baselines. With only four sampled trajectories, it improves GPT-5-mini by 8.2 accuracy points and Gemini-3-flash by 5.6% on average. With 12 samples, FineVerify enables GPT-5-mini to surpass frontier GPT-5 on BrowseComp-Plus. Beyond accuracy, FineVerify produces interpretable verification traces that help audit benchmark errors, suggesting broader applications for inspecting agentic search systems. Code and data are available at https://github.com/XuZhao0/fineverify

2606.00651 2026-06-02 cs.LG cs.AI cs.CL 版本更新

MESA: Improving MoE Safety Alignment via Decentralized Expertise

MESA: 通过去中心化专家提升MoE安全对齐

Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对MoE架构中安全能力集中于少数专家导致的脆弱性,提出MESA框架,通过最优传输理论实现专家安全职责去中心化分配与路由细化,在保持实用性的同时提升防御性能。

Comments 18 pages, 8 figures, accepted by ICML 2026

详情
AI中文摘要

混合专家(MoE)架构高效扩展大型语言模型(LLM),通过动态路由将输入分配给相关专家,以降低计算成本的同时增强容量,但引入了一个关键漏洞:安全稀疏性,即安全能力集中在少数专家中,使其容易受到对抗性绕过。同时,传统的对齐方法统一调整所有参数,忽略了它们的功能差异,并无意中降低了性能。为了解决这些挑战,我们提出了MESA(MoE安全对齐),一个针对基于MoE的LLM的定向对齐框架,策略性地去中心化安全责任以最大化覆盖范围,同时最小化对实用性的干扰。基于最优传输(OT)理论,MESA通过两种机制运作:(1)专家容量重新分配使用传输成本矩阵将安全职责分配给最具成本效益的专家,以及(2)动态路由细化约束路由器精确激活这些去中心化模块。实验表明,MESA在保持有用性的同时,对各种有害基准实现了稳健的防御性能。代码可在https://github.com/lorraine021/MESA获取。

英文摘要

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.

2606.00647 2026-06-02 cs.CL cs.AI 版本更新

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

LinguIUTics 在 PsyDefDetect 中的研究:用于心理防御机制分类的迭代不平衡感知微调 Qwen3-8B

Shefayat E Shams Adib, Ahmed Alfey Sani, Md Hasibur Rahman Alif, Ajwad Abrar

发表机构 * Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh(计算机科学与工程系,伊斯兰技术大学,达卡,孟加拉国)

AI总结 针对对话文本中心理防御机制检测的类别不平衡问题,提出基于 QLoRA 微调 Qwen3-8B 的迭代不平衡感知方法,通过分组分层交叉验证、少数类轮询词汇增强和后处理流水线,在 PsyDefDetect 2026 共享任务中达到宏 F1 0.3917,排名第4。

Comments Accepted at PsyDefDetect, a shared task at the 25th BioNLP Workshop (BioNLP 2026), co-located with ACL 2026 in San Diego, CA, USA

详情
AI中文摘要

检测对话文本中的心理防御机制仍然是一个具有挑战性的临床自然语言处理问题。针对 PsyDefDetect 2026 共享任务(九类话语分类,通过宏 F1 评估),我们的团队 LinguIUTics 在官方正类排行榜上取得了 0.3917 的宏 F1 分数,在 21 个注册团队中排名第 4,比 Ministral-8B 任务基线(宏 F1 0.3148)提高了 7.7 个绝对点(相对提升 24.4%)。由于严重的类别不平衡,BERT 系列编码器和零样本 LLM 在稀有类别上被证明无效,因此我们转向对 Qwen3-8B 进行 QLoRA 微调。我们利用三个关键策略:分组分层交叉验证(防止泄漏)、少数类轮询词汇增强,以及包含 logit 偏置调整和集成混合的后处理流水线。这些组件共同缩小了验证集与排行榜之间的差距,并显著提高了少数类的召回率,将关键的“Unclear”类别(第8级)从接近零的性能提升到 F1 分数 0.797。

英文摘要

Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (nine-class utterance classification evaluated via macro F1), our team LinguIUTics achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (31.48 macro F1) by 7.7 absolute points (24.4 percent relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit bias tuning and ensemble blending. Together, these components close much of the validation-to-leaderboard gap and substantially improve minority-class recall, driving the critical "Unclear" class (Level 8) from near-zero performance to an F1 score of 0.797.

2606.00634 2026-06-02 cs.CL cs.LG 版本更新

French parsing enhanced with a word clustering method based on a syntactic lexicon

基于句法词典的词聚类方法增强的法语解析

Anthony Sigogne, Matthieu Constant, Eric Laporte

发表机构 * Université Paris-Est(巴黎-est大学) LIGM(语言与信息学实验室)

AI总结 本文通过将法语句法词典(Lexicon-Grammar)的数据整合到概率解析器中,并应用聚类方法于法语树库的动词,提高了基于概率上下文无关文法的解析性能。

详情
Journal ref
Second Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), 2011, Dublin, Ireland, pp.22-27
AI中文摘要

本文评估了从法语句法词典(Lexicon-Grammar, Gross, 1994)中提取的数据整合到概率解析器中的效果。我们表明,通过对法语树库(Abeillé et al., 2003)中的动词应用聚类方法,基于概率上下文无关文法(Petrov et al., 2006)的解析器在法语上获得了准确的性能。

英文摘要

This article evaluates the integration of data extracted from a French syntactic lexicon, the Lexicon-Grammar (Gross, 1994), into a probabilistic parser. We show that by applying clustering methods on verbs of the French Treebank (Abeillé et al., 2003), we obtain accurate performances on French with a parser based on a Probabilistic Context-Free Grammar (Petrov et al., 2006).

2606.00628 2026-06-02 cs.CL 版本更新

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

通过动态令牌选择实现分布对齐的自蒸馏的鲁棒推理

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang Zhiming Zheng

发表机构 * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University(北京未来区块链与隐私计算先进创新中心,北京航空航天大学) School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院)

AI总结 针对自蒸馏中参考答案引入风格偏差导致模型模仿表面形式而非学习推理模式的问题,提出分布对齐自蒸馏(DASD),通过动态过滤高困惑度令牌来保留逻辑修正并抑制风格噪声,在数学、代码和常识推理任务上提升鲁棒性。

Comments 12 pages, 13 figures

详情
AI中文摘要

自蒸馏通过将参考答案重写为更符合模型自身分布的训练数据来提高学习效率。然而,参考答案也引入了强烈的风格偏差,导致生成模型模仿表面形式而非学习有用的推理模式。我们观察到重写数据包含大量高困惑度(PPL)令牌,这些令牌来自两个不同的来源:有益的知识增强逻辑修正,以及由参考模仿引起的有害风格漂移。平等对待所有此类令牌会破坏基础模型的原始分布并降低性能,尤其是在困难推理任务上。为了解决这个问题,我们提出了分布对齐自蒸馏(DASD),它使用答案感知的参考模型生成候选令牌,并根据基础模型的置信度动态过滤它们。DASD 保留编码有用逻辑知识的令牌,同时抑制分布不对齐的风格噪声。在数学、代码和常识推理基准上的实验表明,DASD 始终优于竞争基线,减少了高 PPL 令牌,并提高了不同难度任务的鲁棒性。

英文摘要

Self-distillation improves learning efficiency by rewriting reference answers as training data that better matches the model's own distribution. However, reference answers also introduce strong stylistic biases, causing the generative model to imitate surface forms rather than learn useful reasoning patterns. We observe that the rewriting data contains a large number of high-perplexity (PPL) tokens, coming from two distinct sources: beneficial knowledge-enhancing logical corrections, and harmful stylistic drift induced by reference imitation. Treating all such tokens equally can disrupt the base model's original distribution and degrade performance, especially on difficult reasoning tasks. To address this, we propose Distribution-Aligned Self-Distillation (DASD), which uses an answer-aware reference model to generate candidate tokens and dynamically filters them according to the base model's confidence. DASD preserves tokens that encode useful logical knowledge while suppressing distributionally misaligned style noise. Experiments on math, code, and commonsense reasoning benchmarks show that DASD consistently outperforms competitive baselines, reduces high-PPL tokens, and improves robustness across tasks of varying difficulty.

2606.00619 2026-06-02 cs.CL cs.AI 版本更新

MemPro: Agentic Memory Systems as Evolvable Programs

MemPro:作为可进化程序的智能体记忆系统

Qingshan Liu, Guoqing Wang, Wen Wu, Jingqi Huang, Xinqi Tao, Dejia Song, Jie Zhou, Liang He

发表机构 * East China Normal University(东华师范大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出MemPro框架,将整个记忆构建-检索管道视为可进化程序,通过故障模式引导的编辑-调试迭代优化,在多个长时任务数据集上超越静态和提示级进化基线。

Comments 20 pages, 14 figures

详情
AI中文摘要

长时程自主智能体需要记忆系统来保留历史信息、跟踪演化状态并在有限上下文窗口之外重用相关知识。现有的智能体记忆系统通常遵循记忆构建-检索(MCR)管道,但往往主要适应记忆库,而在部署后保持周围管道固定。这种固定管道设计难以处理异构的任务特定故障模式,并且可能随着时间推移与规模和结构演化的记忆库产生错位。为解决这些限制,我们提出MemPro,一种系统级进化框架,将整个MCR管道视为可进化程序,而不仅仅是适应记忆库或提示文本。MemPro维护一个可运行记忆系统实现的版本树,其中进化智能体迭代选择有前途的版本,诊断重复出现的故障,并通过故障模式引导的编辑-调试改进创建改进的子版本。在LongMemEval、LoCoMo、HotpotQA和NarrativeQA上的实验表明,MemPro在几次迭代内持续优于强静态和提示级进化基线,随着进化持续改进,并实现了良好的性能-成本权衡。代码可在https://github.com/wanghai673/MemPro获取。

英文摘要

Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory construction-retrieval (MCR) pipeline, but often adapt mainly the memory bank while keeping the surrounding pipeline fixed after deployment. This fixed-pipeline design struggles to handle heterogeneous task-specific failure modes and can become misaligned with memory banks that evolve in scale and structure over time. To address these limitations, we propose MemPro, a system-level evolution framework that treats the entire MCR pipeline as an evolvable program rather than adapting only the memory bank or prompt text. MemPro maintains a version tree of runnable memory-system implementations, where an Evolving Agent iteratively selects promising versions, diagnoses recurring failures, and creates improved child versions through failure-mode-guided edit-debug refinement. Experiments on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA show that MemPro consistently outperforms strong static and prompt-level evolving baselines within a few iterations, continues to improve with evolution, and achieves a favorable performance-cost trade-off. Code is available at https://github.com/wanghai673/MemPro.

2606.00613 2026-06-02 cs.CL cs.AI 版本更新

Linguistics-Aware Non-Distortionary LLM Watermarking

语言学感知的无失真LLM水印

Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han

发表机构 * Yonsei University(延世大学) Rensselaer Polytechnic Institute(罗切斯特理工学院)

AI总结 提出LUNA水印方法,通过语言自适应非失真二元锦标赛采样器,在保持文本质量的同时实现高检测性能,在12种设置中AUROC达0.9959且中位困惑度偏移仅0.045。

详情
AI中文摘要

水印应能识别语言模型输出而不降低质量或限制验证仅由模型提供者进行。多语言部署使这更加困难,因为形态、分词和书写系统的变化会改变水印证据自然进入的位置。我们引入LUNA,一种语言自适应水印,结合了无模型检测和标准随机密钥模型下的单令牌无失真。LUNA从外部语料库中的词性上下文估计归一化下一标记熵,并用其设置无失真二元锦标赛采样器的深度;检测器从文本、分词器、词性标注器和密钥重建相同的调度。我们在六种类型多样的语言和两个领域上评估了八种主要基线。LUNA在十二种设置中达到了0.9959的AUROC和最低的平均绝对中位困惑度偏移0.045;其95%自助法区间[0.022, 0.073]低于所有基线区间。LUNA还记录了最低的平均Self-BLEU、Distinct-1、surprisal和熵偏移。它是唯一同时在大多数设置中实现AUROC > 0.99和绝对中位困惑度偏移低于0.1的方法,在12种设置中的9种达到该状态,而没有任何基线在超过2种设置中达到。我们的代码可在https://github.com/Shinwoo-Park/luna_watermark获取。

英文摘要

Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: https://github.com/Shinwoo-Park/luna_watermark

2606.00596 2026-06-02 cs.CL 版本更新

Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities

面向负责任且基于认识论的多语言大语言模型在计算社会科学与人文学科中的应用

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 本文重新概念化多语言推理大语言模型为解释学工具,提出一个基于理论框架来评估其在社会科学与人文学科研究中的文化对齐、跨语言稳定性和推理忠实性。

详情
Journal ref
Proceedings of LLMs4SSH Workshop at LREC 2026, Palma de Mallorca, Spain, 2026
AI中文摘要

大语言模型在多语言能力和推理能力方面迅速发展,使其能够整合到社会科学与人文学科研究流程中。然而,现有的评估范式仍然基于任务型NLP基准,未能解决解释有效性、文化情境性和认识论中介问题。本文重新概念化多语言推理大语言模型为解释学工具,这些工具在不同语言和文化背景下积极构建意义生产。借鉴解释学、技术哲学、科学技术研究、多语言NLP研究和计算社会科学方法论,我们为评估社会科学与人文学科研究中的多语言推理开发了一个理论基础的框架。我们阐述了一个严格的实验协议,包含文化对齐、跨语言稳定性和推理忠实性的可操作化指标,以及针对解释性研究任务定制的透明度要求。我们通过一个涉及多语言政治话语分析的具体应用场景来说明该框架。本文为将多语言推理大语言模型负责任地整合到计算社会科学基础设施中提供了概念和方法论基础。

英文摘要

Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in task-based NLP benchmarks and fail to address interpretive validity, cultural situatedness, and epistemic mediation. This paper reconceptualizes multilingual reasoning LLMs as hermeneutic instruments that actively structure meaning production across linguistic and cultural contexts. Drawing on hermeneutics, philosophy of technology, science and technology studies, multilingual NLP research, and computational social science methodology, we develop a theoretically grounded framework for evaluating multilingual reasoning in Social Sciences and Humanities (SSH) research. We articulate a rigorous experimental protocol with operationalized metrics for cultural alignment, cross-lingual stability, and reasoning faithfulness, along with transparency requirements tailored to interpretive research tasks. We illustrate the framework through a concrete application scenario involving multilingual political discourse analysis. The paper contributes a conceptual and methodological foundation for responsible integration of multilingual reasoning LLMs into computational social science infrastructures.

2606.00593 2026-06-02 cs.CL cs.AI 版本更新

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

SPADER: 面向多答案问答的逐步同伴优势与多样性感知探索奖励

Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD与CG国家重点实验室) School of Software and Microelectronics, Peking University(北京大学软件与微电子学院) School of Software Technology, Zhejiang University(浙江大学软件技术学院)

AI总结 提出SPADER强化学习框架,通过逐步同伴优势(SPA)机制和多样性感知探索奖励,解决多答案问答中长程工具使用的细粒度信用分配与持续探索问题,实验表明在多个数据集上提升了召回率和F1分数。

详情
AI中文摘要

大型语言模型越来越多地被部署为工具增强型智能体,以获取参数知识之外的信息。虽然最近的工作改进了长程工具使用推理,但大多数方法专注于具有单一正确答案的任务。相比之下,许多现实世界中的查询需要发现一组全面的有效答案,这种设置被称为多答案问答。这种设置带来了两个挑战:长搜索轨迹上的细粒度信用分配,以及超越简单高频实体的持续探索的奖励对齐。我们提出了SPADER,一个用于多答案问答中长程工具使用的强化学习框架。SPADER包括逐步同伴优势(SPA),一种无评论家的逐步信用分配机制,它通过决策步骤对齐并行轨迹,并根据同伴回报估计优势。它还包括一个多样性感知探索奖励,通过加权稀有发现和降低冗余发现的权重来促进长尾实体发现。在QAMPARI、Mintaka、WebQSP和QUEST上的实验表明,SPADER通常比基于提示的智能体、结果监督的强化学习方法和最近的逐步监督方法提高了召回率和整体F1分数。我们的代码和模型权重可在https://github.com/KhanCold/spader获取。

英文摘要

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.

2606.00579 2026-06-02 cs.CL cs.CV 版本更新

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

沙盒化编码智能体是竞争性的全模态任务求解器

Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li, Tianyi Zhou

发表机构 * University of Maryland(马里兰大学) MBZUAI

AI总结 本文提出沙盒化编码智能体,仅通过文本+图像访问和工具使用,即可在全模态任务中匹配甚至超越原生全模态模型,并通过技能注入和训练配方Code-X进一步提升性能。

Comments Paper under review

详情
AI中文摘要

随着多模态大语言模型越来越多地针对视频和音频,人们通常认为这类任务需要原生全模态模型。我们表明情况并非总是如此:仅具有文本+图像访问权限和沙盒化工具使用接口的编码智能体,可以在多个音频-视频基准测试中匹配,并在某些设置中超越最先进的原生全模态模型和预定义的多模态智能体框架。我们的轨迹分析表明,它们的优势来自于编写代码和编排工具,以从转录、帧和其他模态信号中提取相关证据,从而将全模态任务转化为检索和信息处理问题,而不是摄取整个媒体流。我们进一步通过失败分类和过程级轨迹分析来刻画它们的局限性,并表明简单的技能注入(包括人工编写和自蒸馏的技能)能显著提高性能。为了探索开源激发,我们引入了Code-X,一种包含OmniCoding轨迹数据集和可验证奖励的训练方案,并在Qwen-3.5-9B和Qwen-3.6-27B上提供了基线。最后,我们认为下一个前沿是多模态处理,并引入了TerminalBench-O,一个用于现实世界全模态处理任务的过程级基准。代码将在https://github.com/Dongping-Chen/OmniCoding提供。

英文摘要

As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.

2606.00570 2026-06-02 cs.CL cs.AI 版本更新

Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence

重新审视大型语言模型中基于参数的知识编辑:理论极限与实证证据

Wanying Ren, Xin Song, Futing Wang, Guoxiu He, Aixin Sun

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过理论分析和实证评估,揭示了基于参数的知识编辑方法会因维度坍缩假设导致全局干扰和推理崩溃,而简单的检索基线方法在所有条件下均表现更优。

Comments Accepted to ICML 2026. Equal contribution by the first two authors. 9 pages main paper, 10 figures, with appendix

详情
AI中文摘要

基于参数的知识编辑通过局部权重修改更新大型语言模型(LLMs)的内部知识,并引起了广泛关注。然而,大多数现有方法忽略了基本的理论限制,并且很少在现实的、面向实践的设置下进行评估。在本文中,我们首先基于维度坍缩假设提出理论分析,解释局部参数编辑如何沿着表示空间中的脆弱方向传播,引发全局干扰并最终导致推理崩溃。基于这一见解,我们通过系统变化知识复杂度、编辑次数、评估维度和基线方法进行了全面的实证评估。我们的结果表明,基于参数的编辑方法持续损害LLM的核心能力。相比之下,一个简单的基于检索的基线在所有评估条件下始终比所有参数编辑方法表现更强。这些发现强调,在知识编辑后保持LLM的基本能力应成为未来研究的核心关注点。

英文摘要

Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has attracted significant attention. However, most existing methods overlook fundamental theoretical limitations and are rarely evaluated under realistic, practice-oriented settings. In this paper, we first present a theoretical analysis based on the dimensional Collapse Hypothesis, explaining how localized parameter edits can propagate along fragile directions in the representation space, inducing global interference and ultimately causing reasoning collapse. Building on this insight, we conduct a comprehensive empirical evaluation by systematically varying knowledge complexity, number of edits, evaluation dimensions, and baseline methods. Our results show that parameter-based editing methods consistently damage core LLM capabilities. In contrast, a simple retrieval-based baseline achieves consistently stronger performance than all parameter-editing methods across all evaluated conditions. These findings highlight that preserving the fundamental capabilities of LLMs after knowledge editing should be a central concern for future research.

2606.00566 2026-06-02 cs.LG cs.CL cs.CR 版本更新

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

相同载荷,不同通道:测量使用工具的語言模型中的信任不对称性

Mohammed Sameer Syed, Rozhin Yasaei

发表机构 * University of Arizona(亚利桑那大学)

AI总结 本研究提出安全不对称分数(SAS),通过匹配恶意载荷仅改变传递上下文,系统测量了语言模型在不同通道(用户消息、工具元数据、工具输出)中对对抗性内容的脆弱性差异,发现代理原生模型在工具描述通道更脆弱,而通用模型相反,且机制研究表明安全相关表示在深层网络非线性编码。

Comments 13 pages, 1 figure. Submitted to EMNLP 2026

详情
AI中文摘要

随着语言模型承担代理角色,包括调用外部API、读取工具输出以及执行嵌入在第三方内容中的指令,其攻击面远超用户输入。模型是否以相同方式处理恶意指令(无论其来源)尚未被系统研究。我们引入了安全不对称分数(SAS),通过使用匹配的载荷对(保持恶意文本相同,仅改变传递上下文)来测量模型对对抗性内容的敏感性如何随内容出现在用户消息、工具元数据或工具输出中而变化。在6个生产级LLM和三种攻击家族上的评估发现了一致且信息丰富的不对称性:当对抗性内容通过工具描述而非用户消息传递时,代理原生模型显著更脆弱,而通用模型则相反。当相同内容通过工具输出而非描述传递时,这种不对称性进一步反转,表明模型隐含地将工具元数据视为可信指令,而将工具结果视为普通数据。对Llama 3.3 70B的机制研究表明,安全相关表示在网络的中间到深层因果存在但非线性编码,解释了线性探针为何无法检测到它。这些发现揭示了当前使用工具的模型在处理对抗性内容时存在的系统性、通道依赖的盲点。

英文摘要

As language models take on agentic roles that span calling external APIs, reading tool outputs, and acting on instructions embedded in third-party content, their attack surface expands well beyond what users type. Whether a model treats a malicious instruction the same way regardless of where it arrives has not been systematically studied. We introduce the Safety Asymmetry Score (SAS), which measures how much a model's susceptibility to adversarial content shifts depending on whether that content arrives in the user message, tool metadata, or tool output, using matched payload pairs that keep the malicious text identical and vary only the context of delivery. Evaluated across 6 production LLMs and three attack families, we find a consistent and informative asymmetry: agent-native models are substantially more vulnerable when adversarial content arrives via tool descriptions than via user messages, while general-purpose models show the reverse. This asymmetry further inverts when the same content is delivered through tool outputs rather than descriptions, suggesting models implicitly treat tool metadata as trusted instructions and tool results as ordinary data. A mechanistic study on Llama 3.3 70B reveals that the safety-relevant representation is causally present at mid-to-late network depths but non-linearly encoded, explaining why linear probes fail to detect it. These findings expose a systematic, channel-dependent blind spot in how current tool-using models handle adversarial content.

2606.00564 2026-06-02 cs.CV cs.CL 版本更新

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

面向视觉-语言推理的分解式在策略蒸馏:引导梯度实现视觉定位

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过将视觉-语言模型蒸馏损失分解为语言先验和视觉定位两个正交分量,提出视觉梯度引导(VGS)方法动态调整更新方向以优先优化视觉子空间,从而提升小模型在复杂多模态任务中的定位能力。

Comments ICML 2026 Spotlight

详情
AI中文摘要

虽然在策略蒸馏为训练小型推理模型提供了密集监督,但其在多模态领域的优化动态仍未得到充分探索。在这项工作中,我们通过数学上将损失分解为两个不同的组成部分:语言先验和视觉定位,挑战了视觉-语言模型(VLM)蒸馏的标准整体观点。我们的分析揭示,这些分量的梯度向量几乎正交,表明与教师语言分布对齐的目标在几何上独立于匹配其视觉感知的目标。因此,标准优化被动地遵循一条次优的折衷轨迹,隐式地平衡这两个目标。假设视觉定位是视觉-语言推理的主要瓶颈,我们引入了视觉梯度引导(VGS),一种动态重新定向更新向量以优先考虑视觉子空间的方法。在多个蒸馏设置和复杂多模态基准上的实验结果表明,VGS显著优于标准的在策略蒸馏整体公式,以最小的训练开销实现了卓越的定位能力。

英文摘要

While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher's language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead.

2606.00547 2026-06-02 cs.CL 版本更新

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

学习检索:面向文本到SQL代理的双层长期记忆

Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao, Yuxiong He

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) Snowflake AI Research(Snowflake AI研究院)

AI总结 提出MERIT框架,通过强化学习优化双层记忆检索(全局策略与局部决策),提升交互式文本到SQL代理的成功率并减少交互轮次。

详情
AI中文摘要

交互式文本到SQL代理通过多轮交互解决数据库任务,涉及模式探索、查询执行、反馈解释和决策修订。长期记忆帮助代理重用过去经验,但现有检索方法仍有局限。静态方法依赖固定的相似性启发式,无法优化下游效用;动态方法通常从稀疏的最终结果中学习,并在单一决策水平上检索记忆。当记忆有用性随交互阶段变化时,这种方法是 insufficient 的,因为用于初始规划的记忆可能不同于局部、状态条件执行所需的记忆。我们提出MERIT,一种动态多水平记忆检索框架。MERIT维护用于全局策略指导的片段级记忆和用于局部决策支持的轮次级记忆。两个水平都使用通过强化学习优化的学习检索策略。为了在有限的中间监督下训练轮次级检索,MERIT使用轻量级过程奖励模型为局部记忆选择提供密集的代理奖励。在BIRD-Interact上的实验表明,MERIT在成功率上优于无记忆、静态检索和动态检索基线,同时减少了平均交互轮次。在Spider2-Snow上的迁移结果进一步显示了无需基准特定调优的跨基准正迁移。这些结果表明,多水平检索改善了交互式文本到SQL代理中的经验重用。

英文摘要

Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past experiences, but existing retrieval methods remain limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility, while dynamic methods often learn from sparse final outcomes and retrieve memories at a single decision horizon. This is insufficient when memory usefulness changes across interaction stages, since memories useful for initial planning may differ from those needed for local, state-conditioned execution. We propose MERIT, a dynamic multi-horizon memory retrieval framework. MERIT maintains episode-level memory for global strategic guidance and turn-level memory for local decision support. Both levels use learned retrieval policies optimized with reinforcement learning. To train turn-level retrieval despite limited intermediate supervision, MERIT uses a lightweight Process Reward Model to provide dense proxy rewards for local memory selection. Experiments on BIRD-Interact show that MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer results on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning. These results suggest that multi-horizon retrieval improves experience reuse in interactive text-to-SQL agents.

2606.00544 2026-06-02 cs.LG cs.CL 版本更新

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

逃离模式抽彩:多响应训练提升语言模型泛化能力

Hasan Amin, Kian Ahrabian, Ming Yin, Rajiv Khanna

发表机构 * Department of Computer Science, Purdue University(计算机科学系,普渡大学)

AI总结 本文提出多响应训练(MRT)方法,通过保留每个提示的多个有效响应来缓解传统单响应微调导致的“模式抽彩”问题,并从统计角度揭示了其提升分布泛化的原理和适用条件。

详情
AI中文摘要

现代语言模型微调通常为每个提示配对单个响应,尽管许多提示允许多个有效补全。这实际上将多模态条件分布简化为单样本视图,我们称之为“模式抽彩”现象,其中训练强调一部分合理模式而忽略其他模式。我们研究了多响应训练(MRT),该方法保留每个提示的多个响应,并建立了关于何时以及为何有帮助的原则性解释。我们的关键见解是,提示和响应是不同的统计资源:额外的提示减少输入分布的不确定性,而额外的响应减少条件输出分布的不确定性。这产生了方差-预算权衡,预测了何时保留多个响应是有价值的,显示了随着提示级不确定性占主导地位而收益递减,并解释了为什么大型冗余语料库可以表现出隐式的多响应效应。我们进一步分析了响应选择,并表明Random-K-of-N是分布微调的无偏默认选择,基于奖励的选择可能导致模式坍缩,而子模质量-多样性目标提供了一种具有理论保证的高效替代方案。受控模拟验证了预测的方差和选择效应,包括一个惊人的失败模式,其中仅奖励选择产生的梯度与真实目标不一致。在结构化和真实世界数据集上,包括一个新的多提示、多响应基准,MRT一致地改善了分布泛化,在响应多样性高、提示冗余性低的场景中收益最大。MRT将响应多重性重新定义为数据分配问题,并提供了明确的指导:当响应廉价且多样时,保留多个响应不是启发式方法,而是基于统计的选择。

英文摘要

Modern language-model fine-tuning typically pairs each prompt with a single response, even though many prompts admit multiple valid completions. This effectively reduces a multi-modal conditional distribution to a one-sample view, a phenomenon we call the "mode lottery," where training emphasizes a subset of plausible modes while leaving others underrepresented. We study multi-response training (MRT), which retains multiple responses per prompt, and develop a principled account of when and why it helps. Our key insight is that prompts and responses are distinct statistical resources: additional prompts reduce uncertainty about the input distribution, while additional responses reduce uncertainty about the conditional output distribution. This yields a variance-budget tradeoff that predicts when retaining multiple responses is worthwhile, shows diminishing returns as prompt-level uncertainty dominates, and explains why large redundant corpora can exhibit an implicit multi-response effect. We further analyze response selection, and show that Random-K-of-N is the unbiased default for distributional fine-tuning, reward-based selection can induce mode collapse, and a submodular quality-diversity objective provides an efficient alternative with theoretical guarantees. Controlled simulations validate the predicted variance and selection effects, including a striking failure mode where reward-only selection produces gradients misaligned with the true objective. Across structured and real-world datasets, including a new multi-prompt, multi-response benchmark, MRT consistently improves distributional generalization, with the largest gains in high response-diversity, low prompt-redundancy regimes. MRT reframes response multiplicity as a data-allocation problem with clear guidance: when responses are cheap and diverse, keeping more than one is not a heuristic, but a statistically grounded choice.

2606.00523 2026-06-02 cs.CL 版本更新

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

ProactiveLLM: 学习流式大语言模型的主动交互

Junlong Tong, Yao Zhang, Anhao Zhao, Yingqi Fan, Yunpu Ma, Xiaoyu Shen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ProactiveLLM,利用模型内生状态指导交互决策,通过掩码流式建模和同步特权自蒸馏实现主动交互,减少延迟并保持质量。

Comments ICML 2026

详情
AI中文摘要

标准大语言模型(LLMs)遵循先读取后生成的范式,导致不必要的延迟和计算。流式LLMs通过在接收输入的同时生成来缓解这一问题,但仍难以决定何时与流交互。现有方法要么硬编码交互时机,要么依赖昂贵的外部对齐信号,如时间标签、推理轨迹或更强的教师模型。本文提出ProactiveLLM,通过利用模型内生状态来指导交互决策,实现主动交互。模型首先通过两种互补的训练机制学习从部分输入中感知语义充分性:基于掩码的流式建模和同步特权自蒸馏(SPSD)。前者在训练期间对输入应用单调随机掩码,模拟逐步揭示的流式输入,使模型能够从部分输入视角学习局部语义依赖。后者将部分上下文的学生视图与同一演化模型生成的全上下文教师视图对齐,允许特权全上下文证据指导学生在不完整观察下的理解。这些机制共同诱导内生充分性线索,无需外部教师或标注,为多种决策头的即插即用集成提供了通用基础。在文本和语音流式任务上的广泛评估证实,ProactiveLLM在保持质量的同时显著降低了交互延迟,验证了其动态主动交互的能力。代码公开于https://github.com/EIT-NLP/StreamingLLM/tree/main/ProactiveLLM。

英文摘要

Standard Large Language Models (LLMs) follow a read-then-generate paradigm, causing unnecessary latency and computation. Streaming LLMs alleviate this issue by generating while receiving inputs, but still struggle to decide when to interact with the stream. Existing methods either hard-code interaction timing or rely on costly external alignment signals, such as timing labels, reasoning trajectories, or stronger teachers. In this paper, we propose ProactiveLLM, which achieves active interaction by leveraging the model's endogenous states to guide interaction decisions. The model first learns to perceive semantic sufficiency from partial inputs through two complementary training mechanisms: mask-based streaming modeling and synchronized privileged self-distillation (SPSD). The former applies monotonic random masking to the input during training, simulating progressively revealed streaming inputs and enabling the model to learn local semantic dependencies from partial-input views. The latter aligns the partial-context student view with a full-context teacher view generated by the same evolving model, allowing privileged full-context evidence to guide the student's understanding under incomplete observations. Together, these mechanisms induce endogenous sufficiency cues without requiring external teachers or annotations, providing a versatile foundation for the plug-and-play integration of diverse decision heads. Extensive evaluation across text and speech streaming tasks confirms that ProactiveLLM significantly reduces interaction latency while maintaining quality, validating its capacity for dynamic and active interaction. Code is publicly available at https://github.com/EIT-NLP/StreamingLLM/tree/main/ProactiveLLM.

2606.00510 2026-06-02 cs.CL cs.AI 版本更新

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

技能还是跳过?通过双粒度偏好学习在智能体任务中学习选择性技能调用

Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang, Yi Yang, Cong Qin, Yangen Hu, Lu Pan, Ke Zeng

发表机构 * Meituan(美团) Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学) Nanjing University(南京大学) Peking University(北京大学)

AI总结 提出SelSkill框架,通过双粒度偏好学习实现选择性技能调用,在ALFWorld和BFCL上显著提升任务成功率和执行精度。

Comments 18 pages, 4 figures, 10 tables

详情
AI中文摘要

智能体技能是可调用的程序化模块,为复杂智能体任务提供可重用知识和执行策略。然而,现有方法主要关注选择相关技能或改进技能本身,而忽略了在当前决策点是否应该实际调用相关技能。无帮助的调用可能引入无关上下文并破坏原本正确的执行过程。为解决此问题,我们提出SelSkill,一个用于选择性技能调用的双粒度偏好学习框架。SelSkill将技能使用表述为技能或跳过决策,利用预测不确定性优先考虑候选决策点,并从共享轨迹前缀构建受控的调用-跳过偏好对。它进一步结合了回合级结果偏好与步骤级调用偏好,以捕捉整体轨迹质量和技能调用的局部有效性。在ALFWorld上使用Qwen3-8B,SelSkill将任务成功率提高了10.9个百分点,执行精度提高了29.1个百分点。在BFCL上,它将任务成功率提高了5.7个百分点,执行精度提高了29.5个百分点。在Tau-bench和PopQA上的零样本结果进一步表明,学习到的调用策略可迁移到具有未见技能的新领域。

英文摘要

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves, while overlooking whether a relevant skill should actually be invoked at the current decision point. Unhelpful invocations may introduce irrelevant context and disrupt an otherwise correct execution process. To address this issue, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill improves task success by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it improves task success by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further suggest that the learned invocation policy transfers to new domains with previously unseen skills.

2606.00507 2026-06-02 cs.CL 版本更新

LaSR: Context-Aware Speech Recognition via Latent Reasoning

LaSR:通过潜在推理实现上下文感知的语音识别

Heyang Liu, Ziyang Cheng, Jiayi Huang, Wenyang Xiao, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团)

AI总结 提出LaSR训练范式,利用潜在推理轨迹增强语音大语言模型的上下文感知能力,在学术术语识别上显著提升性能且不增加延迟。

详情
AI中文摘要

近期语音大语言模型(Speech LLMs)的进展显著增强了口语理解与推理能力。然而,其上下文感知能力有限,难以有效反映说话者意图和主题上下文的语音识别。本文提出LaSR(潜在语音推理),一种新颖的训练范式,具有利用潜在推理过程的上下文感知推理轨迹。LaSR不生成显式中间令牌,而是将思维链(CoT)监督对齐到目标单词的声学特征区域,并引入潜在推理阶段用于上下文信息锚定和转录转换。此外,为有效基准测试专业词汇的上下文识别,我们提出Spoken Darwin-Science,一个专注于学术术语的大规模语料库。在Fun-Audio-Chat上的初步实验表明,LaSR显著提升了术语识别,且不引入额外延迟,并持续优于标准监督微调基线。我们的发现凸显了潜在推理在构建高效、上下文感知语音助手方面的潜力。

英文摘要

Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, struggling to perform speech recognition that effectively reflects the speaker's intent and topical context. In this paper, we propose LaSR (Latent Speech Reasoning), a novel training paradigm featuring a context-aware reasoning trajectory that leverages the latent reasoning process. Instead of generating explicit intermediate tokens, LaSR aligns chain-of-thought (CoT) supervision around the acoustic feature region of the targeted word, and introduces latent reasoning periods for context information grounding and transcriptional transition. Furthermore, to effectively benchmark contextual recognition on specialized vocabulary, we propose Spoken Darwin-Science, a large-scale corpus focusing on academic terminologies. Preliminary experiments on Fun-Audio-Chat demonstrate that LaSR significantly improves terminology recognition without introducing additional latency and consistently outperforms standard supervised fine-tuning baselines. Our findings highlight the potential of latent reasoning in building efficient, context-aware speech assistants.

2606.00497 2026-06-02 cs.CR cs.CL 版本更新

"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents

“我强烈怀疑这个网站是骗局”:自主网络代理中无防御的PII泄露与检测基准测试

Soham Roy, Sarthakbrata Halder, Arya Bharaty, Vaibhav Bhaskar, Yash Sinha, Dhruv Kumar, Srikant Panda, Murari Mandal

发表机构 * KIIT Bhubaneshwar(KIIT布巴内什瓦尔) BITS Pilani(比特斯理工学院) Lam Research(拉姆研究)

AI总结 本文通过构建包含91个攻击者控制环境和10个良性孪生基线的基准Scammer4U,评估前沿自主网络代理在社交工程攻击下的PII泄露风险,发现关键PII泄露率高达54-93%,并揭示了代理检测到攻击但仍有35.9%概率提交PII的检测-行动差距。

Comments 24 pages

详情
AI中文摘要

欺骗性网络内容广泛存在于互联网上,通常被称为社交工程攻击,它操纵自主网络代理将用户的个人身份信息(PII)提交给攻击者控制的端点。在本文中,我们表明社交工程攻击在从前沿网络代理中提取关键级PII方面非常有效,对已部署的代理系统构成严重风险。为了量化这一风险,我们引入了Scammer4U,一个预先注册的基准测试,包含91个攻击者控制的环境和10个良性孪生基线,涵盖8个攻击向量和16个站点类别,采用8轴因子分类法,隔离单个攻击设计因素的因果贡献。在前沿代理中,我们发现无隐私指导时关键级PII泄露率达到54-93%,而良性孪生基线为0%,证实泄露归因于攻击而非偶然的表单填写。升级提示级缓解措施在四个模型家族中产生急剧的模型依赖性降低,并且在汇总水平上仍不足以可靠地防止关键PII提交。最关键的是,我们识别出一个检测-行动差距:独立LLM法官确认代理推理已标记网站为可疑的情况下,代理仍然在35.9%的会话中提交关键PII,而代理未表达怀疑时为66.1%,这一30.2%的差距在四个模型家族中均稳健。我们的发现表明,基于代理自身对攻击识别的防御措施依赖于错误的信号,这激发了独立于代理推理循环的出站提交输出级拦截。

英文摘要

Deceptive web content, widely instantiated across the internet and commonly known as \textit{social-engineering attacks}, manipulates autonomous web agents into submitting users' personally identifiable information (PII) to attacker-controlled endpoints. In this paper, we show that social-engineering attacks are highly effective at extracting critical-tier PII from frontier web agents, posing a severe risk to deployed agentic systems. To quantify this risk, we introduce \textbf{\textsc{Scammer4U}}, a pre-registered benchmark of 91 attacker-controlled environments and 10 benign-twin baselines, spanning 8 attack vectors and 16 site categories on an 8-axis factorial taxonomy that isolates the causal contribution of individual attack design factors. Across frontier agents, we find that critical-tier PII leakage reaches 54--93\% under no privacy guidance, compared to 0\% on benign-twin baselines, confirming that leakage is attack-attributable rather than incidental form-filling. Escalating prompt-level mitigation yields sharply model-dependent reductions across the four families and remains insufficient to reliably prevent critical PII submission at the pooled level. Most critically, we identify a detection--action gap: agents whose reasoning an independent LLM judge confirms has flagged the site as suspicious still submit critical PII in 35.9\% of sessions, versus 66.1\% when no suspicion is verbalized, a 30.2\% gap robust across all four model families. Our findings reveal that defenses conditioned on the agent's own recognition of an attack are gating on the wrong signal, motivating output-level interception of outbound submissions that operates independently of the agent's reasoning loop.

2606.00477 2026-06-02 cs.CL cs.CV 版本更新

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

文本编辑能否泛化到视觉生成?统一多模态模型中的跨模态知识编辑基准

Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Toronto(多伦多大学) University of Washington(华盛顿大学)

AI总结 提出跨模态知识编辑基准UniKE,发现文本编辑在图像生成中效果显著下降(VQA准确率仅18.5%),并提出推理增强参数编辑方法提升跨模态迁移效果。

Comments Published at ICML 2026; Code and data available at https://github.com/gxx27/UniKE

详情
AI中文摘要

统一多模态模型(UMMs)已成为通用多模态智能的有前途的范式。随着它们在现实世界应用中的部署,有效更新内部知识变得至关重要。虽然知识编辑在纯文本模型中已经成熟,但成功修改文本输出的编辑是否也能迁移到UMMs中的图像生成仍不清楚。为了研究这个问题,我们引入了UniKE,这是第一个用于UMMs中跨模态知识编辑的基准,包含2,971个编辑主题,涵盖属性和关系编辑。使用基于VQA的视觉验证,我们揭示了一个显著的模态差距:文本侧的有效性可以达到约92%,而直接图像生成下的最佳整体VQA准确率仅为18.5%。我们进一步提出了推理增强参数编辑,它在生成前显式激活编辑后的知识,并提高了所有评估模型-编辑器对的整体VQA准确率,提升高达18.6个百分点。机制分析表明,这种差距与编辑后的文本表示与视觉生成的条件路径之间的部分对齐有关,其中足以用于文本输出的编辑可能仍然太弱或未对齐,无法引导图像合成。这些发现表明,文本知识编辑不能保证可靠的跨模态迁移,并激励了模态感知的编辑方法。我们的代码和数据可在https://github.com/gxx27/UniKE获取。

英文摘要

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.

2606.00467 2026-06-02 cs.CL cs.AI cs.LG stat.ML 版本更新

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

论大语言模型适应性的局限:模型内化先验对标注任务性能的影响

Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez

发表机构 * University of Washington(华盛顿大学)

AI总结 通过毒性检测实验,研究大语言模型内化先验与指令交互的三个维度,发现近三分之二的零样本错误难以通过提示纠正,并引入定义特定熟悉度(DSF)指标,证明其与性能正相关,而文本记忆指标则无此关联。

Comments Accepted at ICML 2026 (Oral & Spotlight); PMLR vol. 306. 9 pages, 4 figures

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于零样本标注和LLM-as-a-judge任务,但其可靠性取决于模型内化先验与用户提供指令的交互方式。我们研究了这种交互的三个维度:(1)LLM对数据和任务定义的熟悉程度如何影响性能;(2)提示中的额外信息能在多大程度上纠正零样本错误(“决策粘性”);(3)模型对错误任务定义的敏感性。通过在多种数据集(涵盖社交媒体、游戏、新闻和论坛)上进行毒性检测实验,使用密集模型和混合专家模型,我们发现近三分之二的零样本错误难以纠正,提示纠正的总体挽救率(初始错误中被纠正的比例)仅为34.8%。高置信度错误尤其难以纠正。当给出错误定义时,LLM会遵循这些定义,同时保持与正确定义条件下相同的置信水平。关键的是,我们引入了定义特定熟悉度(DSF),它衡量模型内部概念与任务定义之间的一致性。在控制数据集层面的混杂因素后,DSF与模型性能呈正相关(偏相关系数r=+0.41),而三种不同的记忆指标(ROUGE-L、BERTScore和嵌入余弦相似度)均未显示正相关。这些发现揭示了基于提示的纠正在标注任务中的局限性,强调了定义对齐比文本级记忆更重要。

英文摘要

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.

2606.00462 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Short-form Text Rewriting with Phi Silica

短文本改写与 Phi Silica

Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou, Sadid Hasan

发表机构 * IEEE ICAD

AI总结 本研究通过数据集整理、提示蒸馏、参数高效微调和评估,将小语言模型 Phi Silica 适配于短文本改写任务,结果表明微调提高了语义保真度、减少了幻觉并提升了与 GPT-5-chat 改写的偏好胜率。

Comments 6 pages

详情
AI中文摘要

短文本改写是释义的一种受限变体,其中有限的上下文和高语义密度几乎没有留下变化空间。虽然大型语言模型在一般释义任务上表现良好,但小语言模型(SLM)在短文本场景中常常在语义保真度和幻觉鲁棒性方面遇到困难。在这项工作中,我们提出了一项实证研究,通过数据集整理、提示蒸馏、参数高效微调和评估,将小语言模型 Phi Silica 适配于短文本改写。我们从公开的幻灯片中整理了一个简短的演示风格文本数据集,并使用 GPT-5-chat 来生成改写监督以及进行 LLM 作为评判者的评估。我们的结果表明,微调提高了语义保真度,减少了幻觉,并提高了与 GPT-5-chat 改写的偏好胜率。这些发现表明,针对 SLM 的定向适配可以显著缩小与云模型的差距,并为将 SLM 适配于精度关键的改写任务提供实用指导。

英文摘要

Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for variation. While large language models perform well on general paraphrasing, small language models (SLMs) often struggle with semantic fidelity and hallucination robustness in short-form settings. In this work, we present an empirical study of adapting an SLM, Phi Silica, for short-form rewrite through dataset curation, prompt distillation, parameter-efficient fine-tuning, and evaluation. We curate a dataset of short presentation-style text from public slide decks and use GPT-5-chat both to generate rewrite supervision and to conduct LLM-as-a-judge evaluation. Our results show that finetuning improves semantic fidelity, reduces hallucinations, and increases preference win rate against GPT-5-chat rewrites. The findings suggest that targeted adaptation for SLMs can substantially narrow the gap to cloud models and provide practical guidance for adapting SLMs to precision-critical rewrite tasks.

2606.00460 2026-06-02 cs.CL eess.AS 版本更新

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

SALSA: 通过学习的引导激活向量实现语音感知的LLM适配

Yekaterina Yegorova, Argyrios Gerogiannis, Haolong Zheng, Julia Hockenmaier, Chang D. Yoo, Mark A. Hasegawa-Johnson

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出SALSA方法,通过监督学习优化逐层引导向量,在儿童语音、多语种语音和普通话-英语代码切换基准上显著提升零样本推理和语音上下文学习性能,最高相对提升46.8%。

详情
AI中文摘要

语音感知的大语言模型通常在域外场景中泛化能力较差。我们提出SALSA(通过学习的引导激活实现语音感知的LLM适配),一种轻量级适配方法,学习逐层引导向量。与通常依赖对比激活差异的引导方法不同,SALSA直接使用监督目标优化引导向量。在儿童语音、多语种语音和普通话-英语代码切换基准上,SALSA相比零样本推理和语音上下文学习基线显著提升性能,相对于零样本最高实现46.8%的相对改进。进一步分析表明,引导编码器(尤其是后层)比引导LLM主干更有效。这些发现表明,引导通过调整高层声学和语音表示以更好地与预训练语言模型表示空间对齐,而不是通过修改解码器本身,从而提升下游ASR性能。

英文摘要

Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering vectors. Unlike commonly used steering approaches that rely on contrastive activation differences, SALSA directly optimizes steering vectors using a supervised objective. Across children's speech, multilingual speech, and Mandarin-English code-switching benchmarks, SALSA substantially improves performance over zero-shot inference and speech in-context learning baselines, achieving up to 46.8% relative improvements over zero-shot. Analysis further demonstrates that steering the encoder, particularly the later layers, is more effective than steering the LLM backbone. These findings suggest that steering improves downstream ASR performance by adapting higher-level acoustic and phonetic representations to better align with the pretrained language model representation space, rather than by modifying the decoder itself.

2606.00451 2026-06-02 cs.CL 版本更新

ProtStructQA: A Denotation Threshold in Protein Structural Reasoning

ProtStructQA: 蛋白质结构推理中的指称阈值

Aravind Mandiga, Guoming Li, Jin Lu, Ismailcem Budak Arpinar, Khaled Rasheed, Samuel E. Aggrey

发表机构 * University of Georgia(佐治亚大学)

AI总结 提出可执行基准ProtStructQA,通过将自然语言问题编译为DSL程序并在AlphaFold结构上执行来评估蛋白质语言模型,发现模型在1.7B到4B参数之间存在指称阈值,低于该阈值时工具辅助推理占优,高于该阈值时思维链成为最强策略。

详情
AI中文摘要

蛋白质语言系统通常通过是否生成合理的生物学文本来评估,但结构问题具有更清晰的语义:它表示3D坐标系中的测量值。我们引入ProtStructQA,一个可执行的蛋白质结构问答基准,其中每个自然语言问题由隐藏的类型化领域特定语言(DSL)程序生成,答案通过在该程序上对AlphaFold预测的结构执行获得。ProtStructQA发布了382.2K个问题,涵盖置信度、距离、预测对齐误差(PAE)、溶剂暴露、二级结构、拓扑和接触,以及保留的组合:一个包含来自四个物种的10K个蛋白质的330K活跃基准,加上一个52.2K的硬负例鲁棒性池。无需微调,我们在直接提示、思维链、语法约束可执行投票、带思维链的可执行投票以及多轮ReAct风格工具使用下评估了Qwen3模型(0.6B至8B),并在Gemma-3-1B和Gemma-3-12B上复现了主要发现。我们发现Qwen3-1.7B和Qwen3-4B之间存在一个能力依赖的指称阈值:低于该阈值时,工具中介的ReAct占主导,因为模型常常无法生成可执行的指称;高于该阈值时,思维链从大多有害转变为强烈有益,并成为大多数分割上的最强策略。解析失败和家族级分析表明,该阈值是从不可解析语言到可执行结构指称的转变,而语法和执行对PAE和二级结构查询仍然具有选择性价值。ProtStructQA将科学问答重新定义为从语言到测量的编译,并为语言模型何时能将单词映射到可执行的3D结构测量提供了诊断测试平台。

英文摘要

Protein-language systems are often evaluated by whether they generate plausible biological text, but a structural question has a sharper semantics: it denotes a measurement in a 3D coordinate system. We introduce ProtStructQA, an executable benchmark for protein structural question answering in which each natural-language question is generated from a hidden typed domain-specific language (DSL) program and the answer is obtained by executing that program on an AlphaFold-predicted structure. ProtStructQA releases 382.2K questions covering confidence, distances, predicted aligned error (PAE), solvent exposure, secondary structure, topology and contacts, and held-out compositions: a 330K active benchmark over 10K proteins from four species, plus a 52.2K hard-negative robustness pool. Without fine-tuning, we evaluate Qwen3 models from 0.6B to 8B under direct prompting, chain-of-thought, grammar-constrained executable voting, executable voting with chain-of-thought, and multi-turn ReAct-style tool use, and replicate the headline finding on Gemma-3-1B and Gemma-3-12B. We find a capability-dependent denotation threshold between Qwen3-1.7B and Qwen3-4B: below it, tool-mediated ReAct dominates because models often fail to produce executable denotations; above it, chain-of-thought flips from mostly harmful to strongly beneficial and becomes the strongest strategy on most splits. Parse-failure and family-level analyses show that the threshold is a transition from unparseable language to executable structural denotation, while grammar and execution remain selectively valuable for PAE and secondary-structure queries. ProtStructQA reframes scientific QA as compilation from language to measurement and provides a diagnostic testbed for when language models can map words to executable 3D structural measurements.

2606.00428 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters

低秩PEFT的更细参数步长:基于CP张量适配器的控制研究

Xinjue Wang, Xiuheng Wang, Yejun Zhang, Sergiy A. Vorobyov, Esa Ollila, Zhi-Yong Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过固定组件的规范多路分解(CP)张量适配器实现更细的参数步长,研究其对低秩适配器精度-预算权衡的影响,发现CP适配器能填补LoRA秩之间的空白,但效果依赖于任务。

Comments Accepted at the ICML 2026 Workshop on CoLoRAI

详情
AI中文摘要

低秩适配器通常通过扫描少量秩进行比较,但秩也固定了参数预算的分辨率。对于一个$2048{\times}2048$的OPT注意力投影,增加LoRA的一个秩会存储$4096$个可训练标量,导致可行的低预算适配器大小之间存在较大间隙。本文探讨具有更细容量增量的张量化适配器是否会改变观察到的精度-预算权衡。我们通过固定组件的规范多路分解(CP)张量适配器来实例化这个问题。在$32{\times}64{\times}32{\times}64$的张量化下,一个归一化的CP组件每个投影存储$193$个可训练标量,比LoRA的一个秩步长小约21倍。我们在OPT-1.3B上,在匹配的目标模块、训练协议、数据上限和种子调度下,比较了CP适配器和LoRA在SST-2、RTE和BoolQ上的表现。CP训练稳定,并填补了LoRA秩之间的空白,但效果依赖于任务:SST-2早期达到低预算平台,BoolQ在略低于LoRA饱和之前受益于额外的CP组件,而RTE仍然偏好LoRA。因此,更细的参数步长有助于诊断PEFT预算敏感性,但它们本身并不能保证更好的精度-预算曲线。

英文摘要

Low-rank adapters are usually compared by sweeping a small set of ranks, but the rank also fixes the resolution of the parameter budget. For a $2048{\times}2048$ OPT attention projection, increasing LoRA by one rank stores $4096$ trainable scalars, leaving large gaps between feasible low-budget adapter sizes. This paper asks whether a tensorized adapter with finer capacity increments changes the observed accuracy--budget trade-off. We instantiate this question with fixed-component canonical polyadic (CP) tensor adapters. Under a $32{\times}64{\times}32{\times}64$ tensorization, one normalized CP component stores $193$ trainable scalars per projection, about $21$ times smaller than one LoRA rank step. We compare CP adapters and LoRA on OPT-1.3B across SST-2, RTE, and BoolQ under matched target modules, training protocol, data caps, and seed schedules. CP trains stably and fills the gaps between LoRA ranks, but the effect is task-dependent: SST-2 reaches an early low-budget plateau, BoolQ benefits from additional CP components before saturating slightly below LoRA, and RTE remains LoRA-favored. Finer parameter steps are therefore useful for diagnosing PEFT budget sensitivity, but they do not by themselves guarantee a better accuracy--budget curve.

2606.00408 2026-06-02 cs.CL cs.AI cs.IR 版本更新

Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism

掩盖过时观察帮助搜索智能体——直到适得其反:一个机制图谱及其机制

Haoxiang Zhang, Qixin Xu, Zhuofeng Li, Lei Zhang, Pengcheng Jiang, Yu Zhang, Julian McAuley

发表机构 * UC San Diego(加州大学圣地亚哥分校) UC Berkeley(加州大学伯克利分校) Texas A&M University(德克萨斯大学) UIUC(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文通过系统实验发现,在长时域搜索智能体中掩盖过时观察的效果呈现非对称倒U型曲线,取决于检索器召回率与模型隐式过滤能力的交互,并揭示了其背后的令牌-轮次权衡机制。

Comments 47 pages, 7 figures

详情
AI中文摘要

长时域搜索智能体在多次工具调用中积累大量检索内容,使得上下文预算效率日益重要。一种最小干预措施是在轨迹推进过程中掩盖过时观察,但尚不清楚这种上下文管理何时有帮助及其原因。我们通过系统扫描不同智能体骨干(4B到284B参数)和三个检索器,在离线和在线智能搜索基准上研究观察掩盖。我们发现,当以无上下文管理时的模型准确率为横轴时,掩盖带来的准确率提升呈非对称倒U形:在弱检索器下出现平台期,在强检索器与中等容量模型相遇时达到峰值,在模型饱和时急剧下降。这种模式反映了检索器召回率与模型隐式过滤能力之间的交互,而非单一因素。机制上,掩盖实现了令牌-轮次权衡:它移除了模型基本不再关注的观察以及智能体很少重新打开的页面。当增加的轮次将失败转化为成功时,它们有帮助;但当掩盖移除了模型本会使用的证据时,它们会失败。因此,我们将上下文管理重新定义为一种依赖机制的干预,并为分析智能深度搜索中的上下文使用提供了整体视角。我们在此发布我们的框架和轨迹(https://github.com/i-DeepSearch/observation-masking)以支持未来研究。

英文摘要

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live-web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted-U shape when plotted against the model's accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid-capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model's implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token-for-turn trade-off: it removes observations the model has largely stopped attending to and pages the agent rarely re-opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime-dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i-DeepSearch/observation-masking) to support future research.

2606.00376 2026-06-02 cs.AI cs.CL cs.LG 版本更新

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

确定性视界:当扩展推理失败时工具委托变得必要

Dongxin Guo, Jikun Wu, Siu Ming Yiu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过注意力瓶颈定理和确定性视界概念,证明解码器-only注意力在确定性状态追踪任务中存在信息论容量限制,导致扩展推理性能退化,并指出当视界超过19-31时工具委托成为必要。

Comments Accepted at ICML 2026. 4 figures. 51 pages including appendices

详情
AI中文摘要

扩展的思维链推理可能会在确定性状态追踪任务上降低性能,这不是由于偏好偏差,而是源于解码器-only注意力的信息论容量限制。我们建立了:(1) 注意力瓶颈定理及互补的可达性构造,将状态追踪容量界定为 $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$;(2) 一个上下文相关的错误模型,导致超指数精度衰减;(3) 状态空间Jaccard度量,区分能力与偏好失败;(4) 确定性视界 $d^* \in [19, 31]$,超过该视界工具委托变得必要。在12个模型和8个任务领域(包括SWE-Bench、WebArena和SQL-Multi)中,工具集成推理始终优于神经思维链;在主要模型套件上,其准确率达到86-94%,而神经思维链仅为24-42%。在最优长度轨迹上进行微调仅带来<5%的提升,证实了架构上限,并且高跨模型相关性($r = 0.81$-$0.91$)表明这些失败是架构性的而非训练特定的。我们的结果为在代理系统中纯神经推理何时应让位于混合方法提供了原则性指导。

英文摘要

Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not due to preference biases, but limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem with a complementary achievability construction, bounding state-tracking capacity as $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$; (2) a context-dependent error model yielding super-exponential accuracy decay; (3) the State-Space Jaccard metric distinguishing capability from preference failures; (4) a Deterministic Horizon $d^* \in [19, 31]$ beyond which tool delegation becomes necessary. Across 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning consistently outperforms neural chain-of-thought; on the primary model suite it reaches 86-94% accuracy versus 24-42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields $<$5% improvement, confirming an architectural ceiling, and high cross-model correlation ($r = 0.81$-$0.91$) indicates these failures are architectural rather than training-specific. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems.

2606.00334 2026-06-02 cs.CL cs.AI 版本更新

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

隔离LLM词汇偏差:一种无人工标注的三角化偏好学习阶段度量

Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 提出一种无需人工标注的三角化偏好偏移分数(Triangulated Preference Shift score),通过对比人类标准、基础模型和指令变体,量化偏好学习阶段引入的词汇偏差。

Comments 7 pages, 2 figures, 1 table

详情
Journal ref
The International FLAIRS Conference Proceedings, 39(1) (2026)
AI中文摘要

近年来,各种语言领域发生了显著变化;这些变化很大程度上归因于大型语言模型的出现及其与自然语言使用的不对齐。这些不对齐部分源于偏好学习阶段,例如从人类反馈中强化学习,这通常使模型更有用,但同时也可能引入系统性词汇偏差。在词汇行为方面,这体现在模型对某些格式的偏好或过度使用某些词汇(如delve、furthermore),即使这些模式在基础模型输出中并不存在。关于偏好训练引起的词汇不对齐的研究受限于对人工标注的依赖。我们通过引入三角化偏好偏移分数来解决这一问题,该度量在人类黄金标准、基础模型和指令变体之间进行三角化,以隔离偏好学习引起的特定偏移,无需人工标注。我们提供了六个模型家族的数据,将结果锚定在文献中,并通过分析偏好学习是否将模型推向可解释为“威望语言”的方向,展示了该通用方法的实用性。该度量提供了一种初始的自动化方法来量化偏好调整引起的行为偏移,从而有助于指导模型对齐和可信AI的开发。

英文摘要

Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to partly originate in the preference-learning stage, e.g. Reinforcement Learning from Human Feedback, which generally makes models more useful but simultaneously may introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model's preference for certain formats or the overuse of words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by reliance on manual curation. We address this, by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach's utility by analyzing whether preference learning shifts models toward what could be interpreted as a "language of prestige". The metric provides an initial automated method to quantify behavioral shifts attributable to preference tuning, and thus, may help inform model alignment and development of trustworthy AI.

2606.00333 2026-06-02 cs.CL 版本更新

Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLMs

聊天机器人假设了哪些制度框架?审计多语言大语言模型中的司法默认值

Zhizhi Wang, Harini Suresh

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究多语言大语言模型在未指定司法管辖区时,是否将输入语言作为默认司法信号,通过审计7个模型在60个提示上的2520个响应,发现中文输入更常产生中国特定答案,英文输入更常产生美国特定答案,揭示了制度框架误选风险。

详情
AI中文摘要

大语言模型越来越多地回答有关税收、劳动保护、医疗保健、教育、养老金和行政程序的问题,其有用性通常取决于适用的司法管辖区。多语言用户可能使用他们最熟悉的语言书写,而不是与适用规则的国家或地区相关的语言。我们询问,当提示省略任何国家或地区时,部署的大语言模型是否将输入语言作为默认司法信号。先前的多语言审计表明,提示语言可以改变文化、政治或规范性输出;我们检查当司法管辖区未明确指定时,模型提供了哪种法律-行政框架。我们评估了在美国或中国开发的七个大语言模型,在英语和普通话的60个未明确指定司法管辖区的法律-行政提示下,在三种系统提示条件下,产生了2520个手动注释的响应。跨模型和条件,中文输入更常产生中国特定的答案,而英文输入更常产生美国特定、比较性或通用答案。需要单一答案的提示进一步增加了司法管辖区选择:跨模型汇总,74.5%的英文输入响应采用美国框架,而53.3%的中文输入响应采用中国框架。这种方向性模式出现在所有七个模型中。我们将这种部署层面的模式描述为制度框架误选风险:一个流畅的答案可能依赖于用户未意图的法律-行政背景,特别是当他们的首选语言与相关司法管辖区不同时。大语言模型界面不应仅通过输入语言来路由制度建议;当位置信息缺失时,应请求位置或说明答案的司法管辖区范围。

英文摘要

LLMs increasingly answer questions about taxes, labor protections, healthcare, education, pensions, and administrative procedures, where usefulness often depends on the applicable jurisdiction. Multilingual users may write in their most comfortable language rather than one associated with the country or region whose rules apply. We ask whether deployed LLMs use input language as a default jurisdictional signal when prompts omit any country or region. Prior multilingual audits show that prompt language can shift cultural, political, or normative outputs; we examine which legal-administrative framework models supply when jurisdiction is underspecified. We evaluate seven LLMs developed in the United States or China on 60 underspecified legal-administrative prompts in English and Mandarin Chinese under three system-prompt conditions, yielding 2,520 manually annotated responses. Across models and conditions, Chinese input more often produces China-specific answers, while English input more often produces U.S.-specific, comparative, or generic answers. Prompts requiring a single answer further increase jurisdiction selection: pooled across models, 74.5% of English-input responses adopt a U.S. framework, while 53.3% of Chinese-input responses adopt a China framework. This directional pattern appears in all seven models. We describe this deployment-level pattern as institutional-framework misselection risk: a fluent answer may rely on a legal-administrative context the user did not intend, especially when their preferred language differs from the relevant jurisdiction. LLM interfaces should not route institutional advice by input language alone; when location is absent, they should request it or state the jurisdictional scope of the answer.

2606.00305 2026-06-02 cs.CL cs.AI 版本更新

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

通过近未来指导桥接在线策略蒸馏中的推理轨迹

Yuxuan Jiang, Francis Ferraro

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩分校)

AI总结 针对在线策略蒸馏中令牌级学习信号无法有效桥接推理轨迹的问题,提出基于近未来轨迹信息的轨迹感知在线策略蒸馏方法,显著提升大语言模型推理性能。

详情
AI中文摘要

在线策略蒸馏通过让学生在教师监督下从其自身策略采样的轨迹上进行训练,改进了大语言模型的推理能力。尽管OPD基于轨迹操作,但其学习信号仍然是令牌级的:它通过高损失令牌识别偏差,并通过局部反向KL校正进行修复。我们表明,这种“轨迹采样但令牌学习”的机制无法可靠地将学生轨迹桥接至教师轨迹。约30%的高损失令牌落入低发散区域,表明许多是表面形式不匹配而非真正的推理分叉。此外,即使是真正发散的令牌也难以通过孤立的令牌级监督修复,因为推理失败通常表现为短视的分布漂移。我们提出轨迹感知OPD,它利用近未来轨迹信息识别真正的发散状态,并将指导分布到多个未来令牌上。实验表明,抑制非发散的高损失令牌将标准OPD的平均准确率从47.8%提升至48.2%,而TOPD进一步将性能提升至52.2%,在AIME24上从60.0%提升至63.3%,在AIME25上从46.7%提升至53.3%。

英文摘要

On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that this "trajectory-sampled but token-learned" mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift. We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.

2606.00294 2026-06-02 cs.CL 版本更新

Uncovering Temporal Framing in the News

揭示新闻中的时间框架

Tarek Mahmoud, Veronika Solopova, Premtim Sahitaj, Ariana Sahitaj, Max Upravitelev, Mervat Abassy, Hana Fatima Shaikh, Neda Foroutan, Vera Schmitt, Preslav Nakov

发表机构 * MBZUAI, UAE(阿联酋马布里扎人工智能研究所) Technische Universität Berlin, Germany(柏林技术大学,德国) German Research Center for Artificial Intelligence (DFKI), Germany(德国人工智能研究中心(DFKI),德国) University of Maryland, USA(美国马里兰大学)

AI总结 本研究提出八种时间框架分类法,通过专家注释多语言新闻语料库,分析时间框架的普遍性、共现模式和词汇线索,并利用监督微调和零样本分类评估时间框架检测,发现句子级时间框架可学习且监督模型显著优于零样本方法。

Comments ACL 2026 Main Conference Oral

详情
AI中文摘要

时间语言的作用不仅限于将事件置于时间线上。在新闻话语中,对过去、现在和未来的引用可以作为修辞手段,塑造解释和说服。在这里,我们研究时间框架,定义为使用与时间相关的语言进行说服,以构建意义而非报告时间顺序。我们基于先前关于时间性和框架的研究,提出了八种时间框架的分类法,并通过专家注释多语言新闻语料库来实现。由此产生的数据集包括458篇英文和德文新闻文章,从超过2万句的语料库中识别出超过2千个时间框架句和约3千个时间框架注释。我们分析了框架的普遍性、共现模式和词汇线索,并使用监督微调和零样本分类评估了时间框架检测。我们的实验表明,时间框架在句子级别是可学习的,监督模型显著优于零样本方法。我们公开发布该语料库以支持未来关于时间框架的研究:https://mbzuai-nlp.github.io/temporal-framing/。

英文摘要

Temporal language does more than place events on a timeline. In news discourse, references to the past, present, and future can function as rhetorical devices that shape interpretation and persuasion. Here, we study temporal framing, defined as the persuasive use of time-related language to structure meaning rather than to report chronology. We propose a taxonomy of eight temporal frames grounded in prior work on temporality and framing, and we realize it through expert annotation of a multilingual news corpus. The resulting dataset includes 458 English and German news articles, with over 2K temporally framed sentences and approximately 3K temporal framing annotations identified from a corpus of more than 20K sentences. We analyze frame prevalence, co-occurrence patterns, and lexical cues, and evaluate temporal framing detection using supervised fine-tuning and zero-shot classification. Our experiments show that temporal framing is learnable at the sentence level, with supervised models substantially outperforming zero-shot approaches. We publicly release the corpus to support future research on temporal framing: https://mbzuai-nlp.github.io/temporal-framing/.

2606.00285 2026-06-02 cs.CL 版本更新

Model-Based Quality Assessment for Massively Multilingual Parallel Data

基于模型的大规模多语言平行数据质量评估

Abdelaziz M. A. Ibrahim, Zihao Li, Jörg Tiedemann, Shaoxiong Ji

发表机构 * University of Jyväskylä(于韦斯屈莱大学) University of Helsinki(赫尔辛基大学) ELLIS Institute Finland(芬兰ELLIS研究所) University of Turku(图尔库大学)

AI总结 针对大规模多语言平行数据中存在的非平行句对和低质量翻译问题,提出将模型评估分解为平行性评估(使用多语言嵌入)和无参考质量估计两个独立组件,并通过实验发现没有模型在所有翻译方向上普遍可靠,建议采用方向感知的路由和校准方法。

详情
AI中文摘要

大规模多语言平行文本通常包含两个不同的问题:非平行句对和低质量翻译。我们将此类数据的基于模型评估分解为两个独立组件:使用多语言嵌入的平行性评估和无参考质量估计(QE)。对于平行性,我们在FLORES-200和BOUQuET检索任务上对四个嵌入模型进行了基准测试,覆盖了目标语言对清单中的6,654个源-目标方向。对于QE,我们在专业FLORES-200翻译上评估了九个无参考评估器,涵盖41,412个有序源-目标方向。结果表明,没有模型在所有翻译方向上普遍可靠。简单的QE集成会稀释强模型信号,而有文档记录的目标语言覆盖范围与更高的QE分数密切相关。总体而言,这些发现表明,多语言平行数据评估最好被视为一个方向感知的路由和校准问题,其中没有单一的通用指标预计在所有语言上都足够。

英文摘要

Large-scale multilingual bitext often contains two distinct problems: non-parallel sentence pairs and low-quality translations. We decompose model-based assessment for such data into two independent components: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE). For parallelism, we benchmark four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source--target directions in our target language-pair inventory. For QE, we evaluate nine reference-free evaluators on professional FLORES-200 translations across 41,412 ordered source--target directions. Results show that no model is universally reliable across translation directions. Naive QE ensembles dilute strong model signals, while documented target-language coverage is strongly associated with higher QE scores. Overall, these findings suggest that multilingual parallel-data assessment is best approached as a direction-aware routing and calibration problem, where no single universal metric is expected to suffice across all languages.

2606.00284 2026-06-02 cs.CL 版本更新

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

参数对齐减轻多语言专家语言模型中的灾难性遗忘

Sanchit Ahuja, Terra Blevins

发表机构 * Northeastern University(东北大学)

AI总结 针对多语言持续预训练中的灾难性遗忘问题,提出五种层感知参数对齐策略(硬冻结、软正则化、事后权重恢复和模型合并),在32种训练语言和保留语言上评估,证明参数对齐能有效减少遗忘且语言获取成本低。

Comments 25 Pages, 5 Figures

详情
AI中文摘要

虽然持续预训练(CPT)是将大型语言模型扩展到新语言的一种实用方法,但在目标数据上的简单微调会通过灾难性侵蚀现有能力。围绕语系组织训练可以减少跨语言干扰,但无法单独防止下游任务所需通用知识的遗忘。我们将这种遗忘与多语言CPT中的参数漂移联系起来,并提出了一套五种层感知参数对齐策略:硬层冻结、软正则化、事后权重恢复和模型合并。我们在涵盖五个语系32种训练语言以及保留语言的基准上,沿四个评估轴(困惑度、阅读理解、物理推理和翻译)系统地将我们的对齐策略与两个无正则化的CPT基线进行比较。参数对齐在语言获取成本最小的情况下显著减少了遗忘:层冻结和正则化最好地保留了理解能力,而事后恢复则带来了最强的翻译提升。总之,这些结果描绘了族专家CPT的获取-遗忘边界,并为每种策略提供了与其最佳服务任务配对的实用部署指南。

英文摘要

While continual pretraining~(CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition--forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.

2606.00272 2026-06-02 cs.AI cs.CL cs.CY 版本更新

On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral

在周三,我们提问:优化自动化法律分诊与转介中的“主动倾听”

Quinten Steenhuis, Jacqueline Harvey

发表机构 * Suffolk University Law School(苏福克大学法学院)

AI总结 本文通过专家律师和LLM评估FETCH分类器的追问问题方法,发现低成本LLM在分类任务中表现良好但生成高质量问题需高成本模型,并提出法律分诊问题评估标准。

Comments Working paper submitted as accepted to AIDA2J workshop at International Conference for AI and Law in Singapore, June 2026

详情
AI中文摘要

FETCH分类器生成追问问题,以帮助优化申请人法律问题的最佳匹配,使用低成本LLM集成。在本文中,我们描述了专家律师和LLM辅助评估FETCH中的追问问题方法,并表明虽然低成本LLM在分类任务中表现良好,但在这种情况下生成高质量的通俗语言问题似乎需要更复杂和更高成本的模型。通过与法律接待工作人员的讨论,我们提出了法律接待分类问题的评估标准,并发现仅靠提示工程不足以提高接待目的的问题质量。我们还发现LLM作为评判者与人类评分存在分歧。我们证明,通过添加单个高成本模型GPT-5,分类器可以从寻求法律帮助的申请人那里引出相关信息,并且这些问题导致分类任务更准确的性能。我们还发现不同类别(包括家庭暴力)的事实引出不均匀,与家庭法筛查规程不一致,这表明在某些法律领域纳入专门筛查小组的价值。

英文摘要

The FETCH classifier generates follow-up questions to help refine the best match for the applicant's legal problem, using a low-cost ensemble of LLMs. In this paper, we describe an expert attorney and LLM-assisted evaluation of the follow-up question approach in FETCH and show that while low-cost LLMs perform well at classification tasks, generating high-quality plain-language questions in this setting appears to require a more sophisticated and higher-cost model. Through discussion with legal intake workers, we propose a rubric for the evaluation of legal intake classification questions, and we find that prompt engineering alone is not enough to improve question quality for intake purposes. We also find that LLM-as-judge and human ratings diverge. We demonstrate that with the addition of a single high-cost model, GPT-5, the classifier can elicit relevant information from applicants for legal help, and that the questions lead to more accurate performance at classification tasks. We also find uneven fact elicitation across different categories, including domestic violence, at odds with family law screening protocols, suggesting the value of including dedicated screening panels for certain areas of law.

2606.00250 2026-06-02 cs.CL cs.AI cs.HC 版本更新

Effects of Varying LLM Access on Essay Writing Behavior

不同LLM访问权限对论文写作行为的影响

Julia Christenson, Karin de Langis, Shirley Anugrah Hayati, Dongyeop Kang

发表机构 * University of Minnesota(明尼苏达大学)

AI总结 通过控制LLM访问权限(无访问、有限访问、无限访问)的随机实验,发现有限访问能保持学生作者信心和写作策略,而无限访问降低创意表达。

Comments BEA (Building Educational Applications) Workshop 2026

详情
AI中文摘要

调查大型语言模型(LLM)对大学教学和学习的影响程度,有助于确定整合LLM的策略,以支持而非削弱学生的学习成果。本研究考察了不同水平的LLM辅助如何影响写作表现、参与度和感知作者身份。我们报告了一项初步研究,其中24名大学生被随机分配,在无LLM访问、有限访问(最多3次提示,每次回复限制100字)或无限访问条件下撰写一篇短文。各组之间的整体论文质量在统计上无显著差异。然而,写作行为和感知作者身份出现显著分化:有限访问组的学生报告了更高的所有权(62.5%愿意将论文作为独立作品提交,而无限访问组为25%)、更强的组织收益以及更具策略性和以修改为中心的提示。无限访问组花费更多时间写作,产生的论文与LLM输出更相似,并报告了创意表达减少。我们的研究结果表明,限制而非禁止LLM访问,可以在保留AI辅助的支架优势的同时,保持作者的信心。

英文摘要

Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies for integrating LLMs in a way that supports, rather than undermines, student learning outcomes. This study examined how varying levels of LLM assistance affect writing performance, engagement, and perceived authorship. We report a pilot study in which 24 college students were randomly assigned to write a short essay with no LLM access, limited access (<=3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups. Yet writing behavior and perceived authorship diverged sharply: students with limited access reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Our findings suggest that constraining, rather than banning, LLM access may preserve authorship confidence while retaining the scaffolding benefits of AI assistance.

2606.00203 2026-06-02 cs.CL 版本更新

DeSQ: Decomposition-based SPARQL Query Generation

DeSQ:基于分解的SPARQL查询生成

Papa Abdou Karim Karou Diallo, Aditya Sharma, Neshat Elhami Fard, Amal Zouaq

发表机构 * LAMA-WeST Mila – Quebec AI Institute(魁北克AI研究所) Polytechnique Montréal(蒙特利尔理工学院)

AI总结 提出DeSQ框架,通过将复杂问题分解为原子约束并映射为SPARQL片段,再组装成完整查询,在五个基准测试中四个超越现有方法,并增强鲁棒性和可解释性。

详情
AI中文摘要

知识库问答(KBQA)的主流方法分为两类:一是生成形式化查询,但存在脆弱性和可解释性有限的问题;二是通过知识库探索直接检索答案,但计算成本高且容易产生幻觉。为了结合两种范式的优势并减轻各自的缺点,我们提出了DeSQ(基于分解的SPARQL查询生成),这是一个与知识库无关的框架,分为三个阶段。首先,它将复杂问题分解为原子约束(AC),这些约束反映了底层知识库的关系结构。其次,它生成一个两部分的结构化输出:(a)将每个AC映射到对应的SPARQL片段,使用标准化的变量和URI占位符,以及(b)描述每个占位符的URI接地块。第三,它将这些片段组装成一个完整的SPARQL查询。DeSQ在五个主要基准测试中的四个上超越了现有最先进的方法,并展现出对词汇变化的卓越鲁棒性。除了性能提升,我们的框架通过消除对实时知识库端点的需求,大大简化了评估,并且其结构化输出支持细粒度的错误分析,从而实现更有针对性的改进干预。

英文摘要

Dominant approaches to Knowledge Base Question Answering (KBQA) fall into two categories. First is the generation of a formal query that suffers from brittleness and limited explainability, and the second is direct answer retrieval through KB exploration that is computationally costly and prone to hallucination. To combine the strengths of both paradigms while mitigating their respective weaknesses, we introduce DeSQ (Decomposition-based SPARQL Query Generation), a KB-agnostic framework that operates in three stages. First, it decomposes complex questions into Atomic Constraints (ACs) that mirror the relational structure of the underlying KB. Second, it generates a two-part structured output: (a) Mapping of each AC to its corresponding SPARQL Fragment, using standardized variable and URIs placeholders, and (b) URIs Grounding block describing each placeholder. Third, it assembles these fragments into a complete SPARQL query. DeSQ surpasses state-of-the-art approaches on four out of five major benchmarks and demonstrates superior robustness to lexical variation. Beyond performance gains, our framework greatly simplifies evaluation by eliminating the need for a live KB endpoint, and its structured output enables fine-grained error analysis, allowing more targeted interventions for improvement.

2606.00198 2026-06-02 cs.LG cs.AI cs.CL 版本更新

BAGEN: Are LLM Agents Budget-Aware?

BAGEN:LLM 智能体是否具有预算意识?

Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li

发表机构 * Northwestern University(西北大学) O2 Lab(O2实验室) Independent(独立) University of Michigan(密歇根大学) Cornell(康奈尔大学) All Hands AI Stanford(斯坦福大学) UT Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出预算感知智能体(BAGEN)概念,将预算作为主动控制信号而非被动成本指标,通过渐进区间估计方法预测剩余预算上下界,并在四个环境和五个前沿模型上发现强模型不一定具有强预算意识、模型过度乐观等失败模式,早期停止可节省 28-64% 令牌,但精确区间校准仍具挑战。

详情
AI中文摘要

尽管智能体正在花费越来越多的资源,但如今智能体成本大多仅在执行后衡量。预算感知智能体(BAGEN)应将预算视为主动控制信号,而非被动成本指标。我们首先系统地将预算估计定义为内部预算(来自智能体计算)和外部预算(来自智能体动作)。然后,我们将预算意识形式化为渐进区间估计:在计划的每一步,智能体应预测剩余预算的上限和下限,并在完成可能性低时发出警报。通过 rollout-replay 协议进行评分,我们在四个环境和五个前沿模型上发现了一致的失败模式:(1)强模型不一定具有强预算意识,相关性 r=0.35。(2)前沿模型始终过度乐观,继续在不太可能成功的任务上花费资源,而不是尽早提醒用户。(3)预算感知信号是可操作且可训练的。早期停止在失败轨迹上节省 28-64% 的令牌,SFT+RL 增强了早期停止和警报行为。(4)精确区间校准仍然具有挑战性,SFT+RL 后区间覆盖率上限为 47%。项目页面:https://ragen-ai.github.io/bagen/

英文摘要

While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget-Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget-awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout-replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget-awareness, with correlation r=0.35. (2) frontier models are consistently over-optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget-aware signal is actionable and trainable. Early stop saves 28-64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen-ai.github.io/bagen/

2606.00168 2026-06-02 cs.CL 版本更新

RealityTest: How People Probe AI Identity and Whether Models Disclose It

RealityTest: 人们如何探查AI身份以及模型是否披露它

Anna Gausen, Sarenne Wallbridge, Bessie O'Dell, Christopher Summerfield, Hannah Rose Kirk

发表机构 * AI Security Institute(人工智能安全研究所)

AI总结 通过收集来自49个国家约750名参与者的3152个身份探查查询,构建首个大规模多模态多语言基准RealityTest,发现仅31%的人在模糊场景中直接询问身份,且问题多样性远超机器生成查询,测试17个文本和6个语音模型后发现,问题措辞和对话上下文比模型本身对披露行为影响更大。

Comments 9 pages, 4 figures

详情
AI中文摘要

AI系统越来越多地部署在对话场景中,用户可能不确定自己是在与人类还是AI交谈。尽管监管机构对这一已知安全风险日益关注,但现有的AI披露评估通常仅限英语、基于机器生成的问题,且局限于文本。我们提出RealityTest,以全面测试AI系统在被询问时是否披露其身份。该基准是首个大规模多模态和多语言评估,基于人类在现实世界中实际遇到和质疑AI身份的数据。伴随基准,我们发布了包含来自49个国家约750名参与者的3152个身份探查查询的基础数据集,涵盖文本和语音场景。我们发现,在模糊场景中,只有31%的人直接询问身份,而且人们提出的问题远比机器生成的查询多样化。我们测试了17个文本模型和6个语音模型,发现披露行为存在显著差异。然而,即使是在表现最好的模型中,单一的抑制指令也会将披露率降至30%以下。验证我们在多样化、基于人类的评估数据上的投入,我们发现问题的措辞和对话上下文对披露的影响比模型本身更大。基于狭窄或合成查询集的安全评估可能会错误地描述模型在实际部署场景中的行为。

英文摘要

AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of AI disclosure are typically English-only, based on machine-generated questions, and restricted to text. We present RealityTest to comprehensively test whether AI systems disclose their identity when asked. The benchmark is the first large-scale multimodal and multilingual evaluation, grounded in human data on how people actually encounter and question AI identity in the real-world. Alongside the benchmark, we release the underlying dataset of 3,152 identity-probing queries collected from ~750 participants across 49 countries and five languages, in text and speech scenarios. We find that only 31% of people ask about identity directly in ambiguous scenarios, and that the questions people ask are far more diverse than machine-generated queries. We test 17 text and 6 speech models, and find substantial variation in disclosure behaviour. However, a single suppression instruction reduces disclosure rates to below 30%, even in the best-performing models. Validating our investment in diverse, human-grounded evaluation data, we find that how the question is phrased and the context of the conversation matter more for disclosure than which model is being tested. Safety evaluations built on narrow or synthetic query sets risk mischaracterising how models behave in realistic deployment settings.

2606.00160 2026-06-02 cs.CR cs.AI cs.CL 版本更新

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

DataShield: 针对大语言模型良性指令微调的安全降级数据过滤

Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang, Jie Pan, Jinbiao Zhu

发表机构 * nwpu.edu.cn(西北工业大学)

AI总结 提出DataShield方法,通过量化每个样本对模型合规行为的贡献作为安全降级分数,高效识别良性数据集中的安全降级样本,并在多个模型和数据集上验证有效性。

详情
AI中文摘要

大型语言模型(LLM)即使在使用良性数据集进行微调时,也会出现安全能力下降的问题。然而,现有识别良性数据集中安全降级样本的方法存在计算成本高和噪声显著的问题。在本文中,我们提出DataShield,以高效且有效地识别潜在的安全降级样本。我们的关键直觉基于以下观察:良性微调提高了LLM的整体响应合规性。DataShield的核心技术见解是将每个样本对模型合规行为的贡献量化为其安全降级分数。DataShield包含三个核心组件:(1)合规向量提取,捕获LLM的合规行为倾向;(2)一种新颖的合规感知分数(CAS),自动识别最优的安全关键层;(3)安全降级样本过滤,量化训练数据沿合规方向的投影偏移。在Llama3-8B、Llama3.1-8B和Qwen2.5-7B上使用Alpaca和Dolly良性数据集进行的广泛实验评估验证了我们的方法在识别高风险和低风险数据子集方面的有效性。我们还观察到,开放式问答更可能触发安全降级,且相应的响应往往更长。我们希望这项工作能为以数据为中心的防御方法提供新的见解。源代码可在https://github.com/ZJunBo/DataShield获取。

英文摘要

Large language models (LLMs) suffer from degraded safety capabilities even when fine-tuned with benign datasets. However, existing methods for identifying safety-degrading samples in benign datasets suffer from high computational costs and significant noise issues. In this paper, we propose DataShield to efficiently and effectively identify potential safety-degrading samples. Our key intuition is based on the observation that benign fine-tuning increases the overall response compliance of LLMs. DataShield's key technical insight is to quantify each sample's contribution to the model's compliance behavior as its safety degradation score. DataShield consists of three core components: (1) Compliance Vector Extraction, which captures the LLM's compliance behavior tendency; (2) a novel Compliance-Aware Score (CAS), which automatically identifies the optimal safety-critical layer; and (3) Safety-degrading Sample Filtering, which quantifies the projection shift of training data along the compliance direction. Extensive experimental evaluation on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B using the Alpaca and Dolly benign datasets validates our method's effectiveness in identifying high-risk and low-risk data subsets. We also observe that open-ended question answering is more likely to trigger safety degradation, and corresponding responses tend to be longer. We hope this work can provide new insights into data-centric defense methods. The source code is available at: https://github.com/ZJunBo/DataShield.

2606.00136 2026-06-02 cs.LG cs.AI cs.CL cs.CR cs.SI 版本更新

Generative AI and Digital Ecosystem Resilience: A Proactive Lifecycle-Based Survey

生成式AI与数字生态系统韧性:基于生命周期的主动式综述

Jonghyun Chung, Rishabh Chaddha, Sanket Badhe, Debanshu Das, Nathan Huang, Amanpreet Kaur

发表机构 * Google LLC(谷歌有限公司)

AI总结 本文采用基于生命周期的C5交互模型,综合机器学习与社会科学方法,系统综述了针对生成式AI驱动的对抗性合成内容的主动检测技术,包括协调不真实行为分析、流行病学建模和霍克斯过程等,旨在构建更具韧性的信息生态系统。

Comments 14 pages, 3 figures, 3 tables. Accepted for publication in IEEE Access (May 2026)

详情
Journal ref
IEEE Access (2026) IEEE Access (2026)
AI中文摘要

生成式AI加速了对抗性合成内容的扩散,使得传统的被动检测方法失效。本综述综合了新兴研究,展示了向主动检测新兴不真实叙事的范式转变。我们采用统一的、基于生命周期的分类法,将对抗性活动的社会技术生命周期模型与新兴不真实叙事检测的高级计算方法相结合。通过围绕C5交互模型(背景、原因、内容、放大循环、后果)构建分析,我们整合了机器学习和社会科学的不同研究流。为了区分合成放大模式与真实基线流量,本文综述了建模新叙事创建、播种和传播的最先进技术,包括协调不真实行为分析、流行病学建模和霍克斯过程。本综述还系统回顾了C5交互模型不同阶段对抗性威胁的主动检测方法,特别是高维嵌入空间中的异常检测、多层图上的无监督协调检测以及代理型AI系统。最后,本综述探讨了生成式AI带来的挑战,包括追踪快速变化威胁和多级分布漂移的困难,并概述了未来研究议程,重点在于检测异常聚类和构建预期性及韧性系统。本综述为更韧性的信息生态系统提供了基于生命周期的主动检测新兴合成威胁方法的全面回顾。

英文摘要

The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI) is rendering traditional reactive detection methods ineffective. This survey synthesizes emerging research to demonstrate a paradigm shift toward the proactive detection of emerging inauthentic narratives. In this survey, we adopt a unified, lifecycle-based taxonomy to combine socio-technical lifecycle models of adversarial campaigns with advanced computational methodologies for emerging inauthentic narrative detection. By structuring the analysis around the C5 Interaction Model (Context, Causes, Content, Cycle of Amplification, Consequences), we integrate different research streams from machine learning and social science. To differentiate spread patterns of synthetic amplification from authentic baseline traffic, this paper surveys state-of-the-art techniques for modeling the creation, seeding, and propagation of fresh narratives, including the analysis of Coordinated Inauthentic Behavior (CIB), epidemiological modeling, and Hawkes process. This survey also provides a systematic review of proactive detection methods for adversarial threats at different stages in the C5 interaction model, specifically, anomaly detection in high-dimensional embedding spaces, unsupervised coordination detection on multi-layer graphs, and agentic AI systems. Finally, this survey addresses challenges posed by GenAI, including the difficulty of tracking rapidly changing threats and multi-level distributional drift, and it outlines a future research agenda focused on detecting anomalous clusters and building anticipatory and resilient systems. This survey provides a comprehensive, lifecycle-based review of methods for the proactive detection of emerging synthetic threats for more resilient information ecosystems.

2606.00116 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

增强BiGRU与KAN模块在法律文档分类与摘要中的应用

Ahmed Faizul Haque Dhrubo, Souvik Pramanik, Most. Aysha Siddika Sumona, Shahnewaz Siddique, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

发表机构 * Dept. of ECE North South University(电子工程系北南大学)

AI总结 提出一种基于KAN的BiGRU模型,用于低资源多语言法律文档的分类与摘要,通过KAN模块提升分类准确率至67.96%。

Comments This paper contains of 10 pages, 10 figures, 4 tables and version 2 after it review from ACL 2026

详情
AI中文摘要

本研究引入了一种基于KAN的BiGRU模型的新架构,用于低资源多语言环境下的法律文档分类与摘要任务。为了解决领域语言、不同语言使用、上下文长依赖和类别不平衡等问题,我们使用了由孟加拉国法律文档组成的数据集,这些文档来自Manupatra,包括孟加拉语、英语和音译孟加拉语。我们的分类任务采用BiGRU模型以及Kolmogorov-Arnold网络(KAN)模块,而摘要部分则利用基于注意力的GRU结合KAN模型头部。分类模型达到了67.96%的准确率和0.65的F1分数;摘要的ROUGE-1、ROUGE-2和ROUGE-L指标分别对应0.38、0.23和0.31的F1分数。消融研究表明,使用KAN将分类准确率从57.34%提升至67.96%。此外,我们将所提出的技术与多个基线进行了比较,包括经典机器学习算法和预训练语言模型。

英文摘要

This study introduces a novel architecture of KAN-based BiGRU model for the task of classification and summarization of legal documents in a low-resource multilingual setup. In order to tackle problems associated with domain language, the usage of different languages, long dependencies within context, and class imbalance, we employ the dataset composed of legal documents from Bangladesh and taken from Manupatra, which include Bengali, English, and transliterated Bengali languages. Our classification task involves BiGRU model, along with Kolmogorov-Arnold Network (KAN) module, while the summarization part utilizes attention-based GRU, combined with a KAN model head. Classification model yields 67.96% of accuracy and 0.65 F1 score; while ROUGE-1, ROUGE-2, and ROUGE-L measures for summarization yield 0.38, 0.23, and 0.31 F1 scores, correspondingly. Ablation study shows that the use of KAN increases classification accuracy from 57.34% to 67.96%. Moreover, our proposed technique is compared to several baselines, including classical ML algorithms and pretrained language models.

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO 版本更新

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

弥合2D-3D鸿沟:面向视觉语言导航的分层语义几何地图

Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University(东华大学计算机科学与技术学院) Bosch Corporate Research(博世企业研究) King Abdullah University of Science and Technology(卡布斯大学)

AI总结 提出分层语义几何地图(HSGM),将3D几何信息转化为VLM可理解的结构化表示,结合VLM高层语义规划与经典路径规划,实现零样本视觉语言导航,在R2R-CE和RxR-CE基准上达到最先进性能。

详情
AI中文摘要

视觉语言导航(VLN)使具身智能体能够通过遵循语言指令在未知环境中到达目标位置。尽管近期视觉语言模型(VLM)取得了进展,但仍存在关键的语义-几何鸿沟:VLM擅长语言和2D视觉理解,但在3D空间推理方面表现不佳,且无法捕捉动作与空间转换之间的因果动态,导致导航不可靠,尤其在零样本设置中。为弥合这一鸿沟,我们提出分层语义几何地图(HSGM),将3D几何信息转化为与VLM兼容的结构化表示,有效将其与物理世界连接。具体而言,HSGM表示为多通道俯视图,组织为三个层次:(1)几何层,记录可导航区域和障碍物;(2)语义层,表示物体及其关系;(3)决策层,支持高层任务推理和目标选择。导航过程中,VLM作为高层语义规划器,解释HSGM编码的空间布局以选择几何有效航点,而航点间的低层无碰撞运动由经典路径规划算法执行,从而将语义推理与动作执行完全解耦。此外,复杂指令被分解为子任务,以缓解长程导航中的进度遗忘或幻觉问题。在R2R-CE和RxR-CE基准上的大量实验表明,我们的零样本框架达到了最先进性能,甚至优于若干监督方法。代码见 https://github.com/Teacher-Tom/HSGM_public。

英文摘要

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.

2606.00093 2026-06-02 cs.CL cs.HC physics.data-an 版本更新

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

LLM作为评判者的评估一致性指标:报告什么及为什么

Delip Rao, Chris Callison-Burch

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文通过调查24篇近期论文,指出在二元评判标准下多数一致性指标冗余,强调Cohen's κ提供额外信息,并给出报告清单。

Comments 12 pages

详情
AI中文摘要

将LLM评判者与人类标注进行验证通常需要报告多个一致性统计量:准确率、精确率、召回率、$F_1$、Cohen's $κ$以及一个或多个秩相关。对24篇近期LLM作为评判者论文的调查发现,指标选择与评判尺度、平局处理、无效输出和弃权处理纠缠在一起,且这些选择很少被说明。对于二元标准——基于量规评估中的常见情况,每个标准被评分为MET或UNMET——报告的大多数数字是冗余的:Pearson's $r$、Spearman's $ρ$、Kendall's $τ_b$、phi系数$ϕ$和Matthews相关系数在非退化二元数据上都归结为单个数字,因此报告其中几个只会产生佐证证据的错觉。Cohen's $κ$是唯一增加信息的一致性系数:它与$ϕ$共享分子但归一化方式不同,两者之间的差距衡量了评判者正标签率偏离人类正标签率的程度。然后我们追踪当评判者可能以CANNOT_ASSESS裁决弃权时发生的变化:处理弃权的三种常见方式不是可互换的预处理选择,而是回答不同的问题,并且它们打破了二元等价关系。对于使用Fleiss' $κ$或Krippendorff's $α$评分的多评判者集成,相同的等价关系会重新出现,但存在可忽略的有限样本修正。最后,我们给出一个报告清单,其中列出评判尺度、弃权和平局处理模式、覆盖率、混淆矩阵以及聚合级别,同时附上任何标量一致性系数。

英文摘要

Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $κ$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated. For binary criteria -- the common case in rubric-based evaluation, where each criterion is graded MET or UNMET -- most of the reported numbers are redundant: Pearson's $r$, Spearman's $ρ$, Kendall's $τ_b$, the phi coefficient $ϕ$, and the Matthews Correlation Coefficient all reduce to a single number on non-degenerate binary data, so reporting several of them only creates an illusion of corroborating evidence. Cohen's $κ$ is the one agreement coefficient that adds information: it shares $ϕ$'s numerator but normalizes differently, and the gap between them measures how far the judge's positive-label rate has drifted from the human's. We then trace what changes when a judge may abstain with a CANNOT_ASSESS verdict: the three common ways of handling abstentions are not interchangeable preprocessing choices but answer different questions, and they break the binary equivalences. The same equivalences reappear, up to a negligible finite-sample correction, for multi-judge ensembles scored with Fleiss' $κ$ or Krippendorff's $α$. We close with a reporting checklist that names the judgment scale, the abstention and tie handling mode, coverage, the confusion matrix, and the aggregation level alongside any scalar agreement coefficient.

2606.00091 2026-06-02 cs.CL cs.AI 版本更新

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

DLLM-JEPA:用于掩码扩散语言模型的联合嵌入预测架构

Sangdae Nam

发表机构 * Sangdae Nam

AI总结 提出DLLM-JEPA,通过将联合嵌入预测架构与掩码扩散语言模型结合,消除显式多视图数据和双梯度前向传播需求,在多个任务上提升准确率并降低训练FLOPs。

Comments 17 pages, 4 figures, 13 tables. Accepted at SPIGM Workshop, ICML 2026

详情
AI中文摘要

联合嵌入预测架构(JEPAs)重塑了视觉中的自监督表示学习。最近的LLM-JEPA将JEPA移植到自回归语言模型,但继承了因果注意力机制的两个高昂代价:它需要显式的多视图数据(例如文本-代码对),并且每步需要两次携带梯度的前向传播。我们提出DLLM-JEPA,它将JEPA与掩码扩散语言模型配对,一次性消除这两个代价。扩散模型的双向注意力通过不同的掩码率从同一输入产生两个语义不同的视图——无需显式配对——并支持单次携带梯度的前向传播,相对于LLM-JEPA减少33%的训练FLOPs。DLLM-JEPA在我们评估的每个(任务,架构)组合上优于仅扩散微调:在LLaDA-8B GSM8K上最高提升+18.7个百分点,在Dream-7B GSM8K上提升+11.4个百分点,在Spider、NL-RX-SYNTH和Django上持续获得正向增益。除了准确率,DLLM-JEPA还表现出双赢特性:在LLaDA-8B上使用Wide-t配置时,它同时提高了GSM8K准确率(67.1 vs. 65.2,+1.8个百分点),将保留的Wikitext损失降至预训练基础之下,并在三个微调种子下将MMLU准确率保持在基础水平——而L2到基础参数锚点匹配基线准确率但没有任务增益。逐层探测揭示了机制:一种几何-功能漂移分离,其中微调后的骨干比基线更远离预训练权重,但在保留的Wikitext上遗忘更少,且放大集中在中间Transformer层。该模式也出现在Dream-7B上,表明该现象并非特定于单个骨干网络。

英文摘要

Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step. We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates -- no explicit pairs needed -- and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django. Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves MMLU accuracy at base level across three fine-tuning seeds -- whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain. Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.

2606.00084 2026-06-02 cs.IR cs.AI cs.CL cs.LG 版本更新

SentimentLens: Reconciling Sentiment and Ratings via Dual-Modality in the Hospitality Sector

SentimentLens: 通过双模态调和酒店业中的情感与评分

Dineth Jayakody, Pasindu Thenahandi, Sampath Jayarathna

发表机构 * University of Peradeniya(珀拉尼亚大学)

AI总结 提出SentimentLens系统,基于方面级情感分析从非结构化酒店评论中提取知识,并通过跨模态调和文本情感与数值评分来识别运营冲突和服务改进机会。

详情
AI中文摘要

在线旅游平台生成大量用户生成的酒店评论,为大规模理解旅行者体验提供了丰富机会。然而,将非结构化文本反馈转化为结构化、可操作的见解仍然是一项具有挑战性的任务。本文提出了SentimentLens,一个基于方面级情感分析的可扩展分析系统,该系统从非结构化酒店评论中执行知识提取,并将其组织成可解释的服务类别。SentimentLens集成了方面术语提取、方面情感分类、语义类别分配和多层次分析模块,以支持区域级、酒店级和类别级评估。该系统设计为在不同地理环境和酒店环境中运行。为了展示其实用性,我们将SentimentLens应用于一个包含超过10,000条公开酒店评论的大型真实数据集。通过广泛分析,该框架揭示了旅行者情感如何随区域、服务类别和酒店类型而变化。我们进一步实现了文本情感与数值评分的跨模态调和,以识别潜在运营冲突、服务质量的结构性不一致性,并使用重要性-绩效和基于熵的分析确定高影响力的改进机会。结果表明,SentimentLens有效地将大规模非结构化评论转化为可操作的情报,支持酒店管理和旅游政策的数据驱动决策。虽然通过一个国家案例研究进行了演示,但所提出的系统可推广到其他目的地和评论驱动的服务领域。

英文摘要

Online travel platforms generate vast volumes of user-generated hotel reviews, offering rich opportunities to understand traveler experiences at scale. However, transforming unstructured textual feedback into structured, actionable insights remains a challenging task. This paper presents SentimentLens, a scalable analysis system based on Aspect-Based Sentiment Analysis that performs knowledge extraction from unstructured hotel reviews and organizes them into interpretable service categories. SentimentLens integrates aspect term extraction, aspect sentiment classification, semantic category assignment, and multi-level analytical modules to support region-level, hotel-level, and category-level evaluation. The system is designed to operate across different geographic contexts and hospitality settings. To demonstrate its practical utility, we apply SentimentLens to a large real-world dataset of over 10,000 publicly available hotel reviews. Through extensive analysis, the framework reveals how traveler sentiment varies across regions, service categories, and hotel archetypes. We further implement a cross-modal reconciliation of textual sentiment and numerical ratings to identify latent operational conflicts, structural inconsistencies in service quality, and high-impact improvement opportunities using importance--performance and entropy-based analyses. The results show that SentimentLens effectively transforms large-scale unstructured reviews into actionable intelligence, supporting data-driven decision-making for hospitality management and tourism policy. While demonstrated using a national case study, the proposed system is generalizable to other destinations and review-driven service domains.

2606.00065 2026-06-02 cs.IR cond-mat.mtrl-sci cs.AI cs.CL 版本更新

Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy

超越文本与表格:ComProScanner中视觉-语言模型集成实现从科学图表中高精度提取材料数据

Aritra Roy, Enrico Grisan, Chiara Gattinoni, John Buckeridge

发表机构 * Energy, Materials and Environment Research Centre, London South Bank University, London SE1 0AA, UK(能源、材料与环境研究中心,伦敦南银行大学) School of Engineering and Design, London South Bank University, London SE1 0AA, UK(工程与设计学院,伦敦南银行大学) Bioscience and Bioengineering Research Centre, London South Bank University, London SE1 0AA, UK(生物科学与生物工程研究中心,伦敦南银行大学) Department of Physics, Kings College London, London WC2R 2LS, UK(物理系,伦敦国王学院)

AI总结 本文通过集成视觉-语言模型扩展ComProScanner框架,实现了从科学图表中自动提取成分-性能数据,在压电陶瓷数据集上达到0.97的组成准确率和归一化F1分数,并引入基于范围的误差阈值评估方法。

Comments 18 pages, 3 figures

详情
AI中文摘要

基于大语言模型流水线的自动提取科学文献中材料成分-性能数据的方法已取得显著进展;然而,现有框架仍局限于文本和表格内容,忽视了仅在科学图表中报告的大量定量性能数据。本文扩展了ComProScanner——一个用于自动构建成分-性能数据库的完全端到端多智能体框架,为其增加了基于原生视觉-语言模型(VLM)的图表提取能力。该扩展引入了一个FigureExtractor工具,用于基于标题关键词对所有支持的出版商进行图表过滤,以及一个GraphExtractorTool智能体,它将提取的图表传递给可配置的VLM,以从科学图表和绘图中恢复成分-性能对。基于LMArena Diagram排行榜和每百万token输入成本低于1.50美元的标准,选择了四个VLM进行评估。在来自已建立的d33测试语料库的50篇压电陶瓷文章上的基准测试表明,Gemini-3-Flash-Preview实现了最高性能,组成准确率为0.97,归一化F1分数为0.97,同时仍然是四个评估模型中成本效益最高的。此外,我们在评估框架中引入了一个基于范围的值误差阈值参数,与精确值匹配相比,提供了对从图表中提取的数值性能数据更具物理意义的评估。这些贡献使集成VLM的ComProScanner成为第一个针对材料科学、完全自动化、多模态的文献挖掘平台,能够在单一统一流水线中从文本、表格和图表中提取结构化的成分-性能数据。

英文摘要

Automated extraction of materials composition-property data from scientific literature has advanced considerably with the development of large language model-based pipelines; however, existing frameworks remain limited to textual and tabular content, overlooking the substantial proportion of quantitative property data reported exclusively in scientific figures. Here, we extend ComProScanner, a fully end-to-end multi-agent framework for automated composition-property database construction, with a native vision-language model (VLM) based figure extraction capability. The extension introduces a FigureExtractor utility for caption-keyword-based figure filtering across all supported publishers, and a GraphExtractorTool agent that passes extracted figures to a configurable VLM to recover composition-property pairs from scientific charts and plots. Four VLMs are selected for evaluation on the basis of the LMArena Diagram leaderboard with an input cost criterion of less than \$1.50 per million tokens. Benchmarking on 50 piezoelectric ceramic articles from the established $d_{33}$ test corpus demonstrates that Gemini-3-Flash-Preview achieves the highest performance with a composition accuracy of 0.97 and a normalised F1 score of 0.97, whilst remaining the most cost-effective model among the four evaluated. We additionally introduce a range-based value error threshold parameter into the evaluation framework, providing a more physically meaningful assessment of numeric property values extracted from figures than exact value matching. These contributions establish VLM-integrated ComProScanner as the first materials-specific, fully automated, multimodal literature mining platform capable of extracting structured composition-property data from text, tables, and figures within a single unified pipeline.

2606.00062 2026-06-02 cs.CL 版本更新

Graph-Augmented Retrieval for Cross-Entity Financial Sentiment Analysis: A Comparative Study

跨实体金融情感分析的图增强检索:一项比较研究

Rajan Bastakoti, Sagar Bhetwal, Nirajan Acharya, Gaurav Kumar Gupta

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 本文提出一种两跳图增强检索架构(Graph-RAG),通过构建情感加权知识图谱并融合密度检索与图遍历,相比标准向量检索在跨实体金融情感分析中显著提升实体召回率和复杂查询答案相关性。

详情
AI中文摘要

检索增强生成(RAG)已成为将大语言模型锚定到特定领域语料库的基础方法,然而传统的基于向量的RAG系统在捕捉支撑金融市场分析的结构化多实体关系方面存在根本性局限。本文对一种新颖的两跳图增强检索架构(Graph-RAG)与标准纯向量基线在跨实体金融情感分析中进行了全面比较研究。我们的系统从覆盖10只主要科技股的255篇新闻文章中构建了一个包含59个股票实体的情感加权知识图谱,然后通过强度过滤的图遍历(沿INFLUENCES边)增强密集检索,以揭示纯向量搜索无法获取的关系证据。我们使用语义相似度、实体召回率、RAGAS指标、延迟基准和消融研究,在100个有依据的查询(30个直接查询,70个关系查询)上评估了两种架构。Graph-RAG在实体召回率上实现了统计显著的提升(+6.4%,p < 0.001,Wilcoxon符号秩检验),并为复杂的多实体查询提供了显著更相关的答案(答案相关性+11.7%),增益集中在关系型问题类型(+16.1%)。关键的是,这些改进在答案质量上没有可测量的成本(语义相似度变化+0.001,Cohen's d = 0.078),平均延迟适度增加22.6%,但延迟方差降低了80%。对图遍历强度阈值的消融研究揭示了其与答案质量的倒U型关系,确定tau = 0.5为最优值,而生产默认值为tau = 0.7。这些发现刻画了图增强检索固有的精度-覆盖率权衡,并为构建多实体金融分析RAG系统的从业者提供了可操作的架构指导。

英文摘要

Retrieval-Augmented Generation (RAG) has become foundational for grounding large language models in domain-specific corpora, yet conventional vector-based RAG systems are fundamentally limited in their ability to capture the structured, multi-entity relationships that underpin financial market analysis. This paper presents a comprehensive comparative study of a novel two-hop Graph-RAG architecture versus a standard vector-only baseline for cross-entity financial sentiment analysis. Our system constructs a sentiment-weighted knowledge graph of 59 equity entities from 255 news articles covering 10 major technology stocks, then augments dense retrieval with intensity-filtered graph traversal over INFLUENCES edges to surface relational evidence inaccessible to vector search alone. We evaluate both architectures on 100 grounded queries (30 Direct, 70 Relational) using semantic similarity, entity recall, RAGAS metrics, latency benchmarks, and ablation studies. Graph-RAG achieves a statistically significant improvement in entity recall (+6.4%, p < 0.001, Wilcoxon signed-rank) and delivers substantially more relevant answers for complex multi-entity queries (+11.7% Answer Relevancy), with gains concentrating in relational question types (+16.1%). Critically, these improvements come at no measurable cost to answer quality (delta = +0.001 semantic similarity, Cohen's d = 0.078), with a modest 22.6% increase in mean latency offset by an 80% reduction in latency variance. An ablation study on the graph traversal intensity threshold reveals an inverted-U relationship with answer quality, identifying tau = 0.5 as optimal over the production default of tau = 0.7. These findings characterize a precision-for-coverage trade-off inherent to graph-augmented retrieval and provide actionable architectural guidance for practitioners building RAG systems for multi-entity financial analysis.

2606.00050 2026-06-02 cs.AI cs.CL cs.DB cs.IR 版本更新

Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs

Grokers: 基于类型化知识图谱的自底向上归纳理解与写入时智能

Gregory Magarshak

发表机构 * Gregory Magarshak

AI总结 提出Grokers架构,通过自底向上的依赖子图归纳遍历构建持久结构化理解,将智能推至写入时,实现零额外LM成本的查询,并证明字节同一性、累积单调性和双遍历顺序三个形式性质。

Comments 6 pages; second in a series with the Magarshak Machine / SPACER paper and the Context paper

详情
AI中文摘要

我们提出Grokers,一种通过依赖子图的自底向上归纳遍历来构建类型化知识图谱的持久结构化理解的架构。与检索增强生成(RAG)不同,后者在每个查询时支付全部理解成本,Grokers将智能推至写入时:自主的Groker代理分析类型化流图中的节点,通过受控语言模型(LM)调用提取结构化属性,并通过依赖关系归纳组合这种理解,写入丰富的类型化属性,从而以零额外LM成本服务于所有未来查询。我们证明了三个形式性质:(1)字节同一性定理,确立了从事务性维护的反规范化索引组装出的上下文块在语义变化之间的LM轮次中字节相同,使得KV缓存命中率接近100%;(2)累积单调性定理,确立了在受控智慧库增长协议下,无需LM调用即可解决交互的比例随已完成交互数量非递减;(3)双遍历顺序定理,确立了自顶向下生成和自底向上理解分别是它们在依赖DAG上各自任务的唯一正确遍历顺序,且它们的组合闭合为一个完整的生成-理解循环。我们进一步提出了一种基于嵌入的语义搜索的确定性替代方案,采用同义词缓存协议,其LM回退率在有限词汇域中收敛至零。在开源Qbix/Safebox/Safebots栈中提供了参考实现。

英文摘要

We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive traversal of dependency subgraphs. Unlike retrieval-augmented generation (RAG), which pays full comprehension cost at every query, Grokers pushes intelligence to write time: autonomous Groker agents analyze nodes in a typed stream graph, extract structured attributes via governed language model (LM) calls, and inductively compose that understanding upward through dependency relations, writing enriched typed attributes that serve all future queries at zero additional LM cost. We prove three formal properties: (1) the Byte-Identity Theorem, establishing that context blocks assembled from a transactionally-maintained denormalization index are byte-identical across LM turns between semantic changes, enabling KV-cache hit rates approaching 100%; (2) the Accumulation Monotonicity Theorem, establishing that the fraction of interactions resolved without LM calls is non-decreasing in the number of completed interactions under a governed wisdom library growth protocol; and (3) the Dual-Traversal Ordering Theorem, establishing that top-down generation and bottom-up comprehension are the unique correct traversal orderings for their respective tasks over a dependency DAG, and that their composition closes into a complete generation-comprehension cycle. We further present a deterministic alternative to embedding-based semantic search, with a synonym caching protocol whose LM fallback rate converges to zero for finite-vocabulary domains. A reference implementation is provided in the open-source Qbix / Safebox / Safebots stack.

2606.00048 2026-06-02 cs.CY cs.CL 版本更新

The Invisible Coalition Partner: How LLMs Vote When Democracy Gets Concrete

无形的联盟伙伴:当民主变得具体时,LLM如何投票

Joel Barmettler

发表机构 * Independent Researcher(独立研究员) Zurich Switzerland(苏黎世瑞士)

AI总结 通过对比抽象问卷和瑞士实际公投,发现LLM在具体政策决策中表现为中间派、偏向现状且跨语言不一致,而非先前认为的左倾偏见。

Comments 13 pages, 10 figures. Preprint. Code and data: https://github.com/joelbarmettlerUZH/invisible-coalition-partner

详情
AI中文摘要

先前的研究已确定,经过指令调整的大型语言模型表现出左倾政治偏见,这些偏见仅通过抽象政治问卷测量。我们表明这一发现并不适用于具体的政策决策。我们引入了一种基于瑞士民主现实的双工具方法论。Smartvote问卷(75个抽象政策问题)被应用于来自27个模型家族的66个LLM,并与184名当选的瑞士国民院议员进行比较,复制了已确立的左倾趋同(Cohen's d = 3.64, p = 0.0002)。然后,作为本工作的创新,9个旗舰LLM在三种信息条件下面对48次真实的联邦公投(Volksabstimmungen),使用四种国家语言(德语、法语、意大利语、罗曼什语),将投票与实际结果和政党推荐(Parolen)进行比较。三个发现挑战了主流叙述。(1)抽象问卷不能预测具体行为:在Smartvote上的左右一致梯度从左侧峰值转变为在Volksabstimmungen上的中心峰值,模型最接近中间派的Die Mitte和FDP,而非左派的SP和Gruene(Wilcoxon p = 0.008)。(2)对于某些模型,政治问题的语言比政治内容更能改变答案:跨语言一致性范围从50%(Mistral)到98%(GPT-5.4)。(3)两个模型表现出系统的变革厌恶而非政治偏见,无论方向如何,在83-94%的公投中投反对票(二项式p < 0.0001)。先前工作测量的“左倾偏见”可能无法推广到抽象工具之外。在具体政策决策上,LLM的行为更像是谨慎的公务员而非左派的联盟伙伴:中间派、偏好现状且跨语言不一致。

英文摘要

Prior research has established that instruction-tuned large language models exhibit left-of-center political bias, measured exclusively through abstract political questionnaires. We show that this finding does not generalize to concrete policy decisions. We introduce a dual-instrument methodology grounded in Swiss democratic reality. The Smartvote questionnaire (75 abstract policy questions) is administered to 66 LLMs from 27 model families and compared to 184 elected members of the Swiss National Council, replicating the established leftward convergence (Cohen's d = 3.64, p = 0.0002). Then, novel to this work, 9 flagship LLMs are confronted with 48 real federal referenda (Volksabstimmungen) in four national languages (German, French, Italian, Romansh) under three information conditions, comparing votes to actual outcomes and party recommendations (Parolen). Three findings challenge the prevailing narrative. (1) Abstract questionnaires do not predict concrete behavior: the left-to-right agreement gradient on Smartvote shifts from left-peaked to center-peaked on Volksabstimmungen, where models align most with centrist Die Mitte and FDP rather than leftist SP and Gruene (Wilcoxon p = 0.008). (2) For some models, the language of a political question changes the answer more than the political content does: cross-linguistic consistency ranges from 50% (Mistral) to 98% (GPT-5.4). (3) Two models exhibit systematic change-aversion rather than political bias, voting Nein on 83-94% of referenda regardless of direction (binomial p < 0.0001). What prior work measured as "leftward bias" may not generalize beyond abstract instruments. On concrete policy decisions, LLMs behave less like coalition partners of the left and more like cautious civil servants: centrist, status-quo-favoring, and inconsistent across languages.

2606.00031 2026-06-02 cs.CL cs.AI 版本更新

LLMs for Cardiovascular Risk Prediction from Structured Clinical Data

基于结构化临床数据的LLMs心血管风险预测

Jeba Maliha, Md Rafiul Kabir

发表机构 * Central Michigan University(中央密歇根大学)

AI总结 提出混合框架将结构化临床数据转换为自然语言表示,利用LLMs进行冠心病预测,并验证了高保真度与隐私保护优势。

Comments International Conference on Intelligent Systems, Blockchain, and Communication Technologies

详情
AI中文摘要

冠状动脉疾病(CAD)仍然是全球主要死因之一,凸显了对可靠预测系统的需求以支持早期诊断和风险评估。虽然传统机器学习模型在结构化临床数据上表现良好,但大型语言模型(LLMs)为解释自然语言表达的医疗信息提供了新的可能性。在这项工作中,我们开发了一个混合框架,桥接了结构化临床数据和自然语言表示用于CAD预测。使用包含1190名患者记录和11个临床属性的公开数据集,结构化变量被转换为可解释的特征表示,并通过LLMs生成合成临床叙述。验证流程进行临床变量的反向提取,并计算与原始记录的一致性分数,平均保真度达到94.61%。然后,我们评估了四种传统机器学习模型,并在零样本和少样本提示设置下与基于LLM的分类进行比较。我们使用了两个LLM:GPT和Gemini。实验结果表明,随机森林达到了最高准确率。尽管有这一优势,基于LLM的分类在真实临床环境中仍然有益。这是因为LLMs直接操作于自然语言的患者描述,意味着敏感的数值患者数据(如精确的实验室值、血压读数和诊断代码)得以保密。研究结果表明,将结构化临床数据与LLM生成的叙述相结合,可以为混合临床预测系统开辟新方向。

英文摘要

Coronary artery disease (CAD) remains one of the leading causes of death globally, highlighting the need for reliable predictive systems to support early diagnosis and risk assessment. While traditional machine learning models perform well on structured clinical data, large language models (LLMs) present new possibilities to interpret medical information expressed in natural language. In this work, we develop a hybrid framework that bridges structured clinical data and natural-language representations for CAD prediction. Using a publicly available dataset of 1,190 patient records with 11 clinical attributes, structured variables are converted into interpretable feature representations and synthetic clinical narratives using LLMs. A validation pipeline performs reverse extraction of clinical variables and computes a consistency score with the original records, achieving an average fidelity of 94.61%. We then evaluate four conventional machine learning models and compare their performance with LLM-based classification under zero-shot and few-shot prompting settings. We use two LLMs here, GPT and Gemini. Experimental results show that Random Forest achieves the highest accuracy. Despite this advantage, LLM-based classification remains beneficial in real-world clinical settings. This is because LLMs operate directly on natural language patient descriptions, meaning that sensitive numerical patient data such as exact lab values, blood pressure readings, and diagnostic codes are kept private. Findings suggest that combining structured clinical data with LLM-generated narratives can enable new directions for hybrid clinical prediction systems.

2606.00029 2026-06-02 cs.CL cs.AI 版本更新

TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation

TCAR-Gen: 基于证据融合的时间图检索用于知识增强生成

Sidra Nasir, Muhammad Noman Zahid, Rizwan Ahmed Khan

发表机构 * Dipartimento di Informatica, Università di Verona(威尼斯大学计算机科学系) School of Advanced Studies, University of Camerino(坎皮诺大学高级研究学院) Department of Computer Science, School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan(卡拉奇工商管理学院(IBA)数学与计算机科学学院计算机科学系)

AI总结 提出TCAR-Gen框架,结合查询条件图神经网络、时间证据融合和树链推理,在历史犯罪叙事问答中实现时间推理和多源证据融合,优于现有RAG方法。

详情
AI中文摘要

检索增强生成系统在回答历史犯罪案件叙述的复杂问题时,在时间推理和证据融合方面存在困难。现有方法要么独立于查询语义进行检索,要么无法连贯地整合多个证据来源。我们提出时间上下文增强检索生成(TCAR-Gen),一个结合查询条件图神经网络、时间证据融合和树链推理的框架,以将答案生成基于检索到的证据。在维多利亚犯罪日记基准上,TCAR-Gen在Recall@5上达到0.3738,在包括多跳推理和反事实问题在内的七种查询类型上优于Vanilla RAG、Temporal RAG、GraphRAG-C和GraphRAG-T。消融研究表明,上下文图、时间惩罚机制和查询条件是关键组件。跨五个语言模型(GPT-OSS 20B到TinyLlama 1.1B)的评估表明,TCAR-Gen在较小模型规模下保持稳健的检索覆盖,但生成质量随模型容量减少而显著下降。我们的工作表明,显式时间建模和多分支证据融合对于基于知识语料库的忠实、推理密集型问答至关重要。

英文摘要

Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historical criminal case narratives. Existing approaches either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently. We propose Temporal Context Augmented Retrieval Generation (TCAR-Gen), a framework that combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to ground answer generation in retrieved evidence. On the Victorian Crime Diaries benchmark, TCAR-Gen achieves 0.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T across seven query types including multi-hop reasoning and counterfactual questions. Ablation studies reveal that the context graph, temporal penalty mechanism, and query conditioning are critical components. Cross-model evaluation across five language model (GPT-OSS 20B to TinyLlama 1.1B) demonstrates that TCAR-Gen maintains robust retrieval coverage at smaller model scales, though generation quality degrades substantially with reduced model capacity. Our work shows that explicit temporal modelling and multi-branch evidence fusion are essential for faithful, reasoning-intensive question answering over knowledge-grounded corpora.

2606.00027 2026-06-02 cs.CL cs.AI 版本更新

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

医疗大语言模型安全性、鲁棒性和公平性评估的多领域红队框架

Andrei Marian Feier, Veysel Kocaman, Yigit Gul, Ahmet Korkmaz, Alexander Thomas, Aleksei Zakharov, Jay Gil, Mehmet Butgul, David Talby

发表机构 * John Snow Labs Inc.(约翰·索克斯实验室公司)

AI总结 提出一个多领域红队框架,通过690个临床场景评估11个当代大语言模型,发现平均准确率掩盖了临床意义上的风险,性能方差和最坏情况失败比平均准确率更能反映可靠性,混合评估方法对可信安全评估至关重要。

Comments 10 pages, 4 figures. To be presented at the Text2Story 2026 Workshop (Delft, The Netherlands, 29 March 2026); CEUR Workshop Proceedings (forthcoming). Affiliation: John Snow Labs Inc

详情
AI中文摘要

大语言模型(LLM)在医疗领域的部署日益增多,但现有基准未能捕捉临床实践中常见的对抗性或伦理复杂条件下的模型行为。我们开发了一个多领域红队框架,评估了11个当代LLM在690个临床场景中的表现,这些场景涵盖9个领域和150多个子类别。场景包含对抗性变换,响应使用七维度评分标准进行评估,包括LLM辅助评分和人在环验证。结果揭示了显著的性能差异,平均得分范围从0.791到0.984。关键的是,几个高性能系统在个别安全关键场景中完全失败,表明平均准确率掩盖了临床意义上的风险。最高性能系统(X-BAI、GPT-5、Claude Opus 4.1)得分超过0.97且方差较低,而不同领域间性能差异显著。公平性相关任务在人口统计修改后错误率放大10-20%,人工评审员识别出自动评估遗漏的临床相关失败。我们的发现表明,性能方差和最坏情况失败比平均准确率更能提供临床意义的可靠性指标,而结合自动化与临床监督的混合评估方法对于可信安全评估至关重要。

英文摘要

Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a multi-domain red teaming framework evaluating eleven contemporary LLMs across 690 clinically grounded scenarios spanning nine domains and over 150 subcategories. Scenarios incorporated adversarial transformations, and responses were assessed using a seven-dimension rubric with LLM-assisted scoring and human-in-the-loop validation. Results revealed substantial performance variance, with mean scores ranging from 0.791 to 0.984. Critically, several high-performing systems produced complete failures in individual safety-critical scenarios, demonstrating that aggregate accuracy masks clinically meaningful risk. The highest-performing systems (X-BAI, GPT-5, Claude Opus 4.1) achieved scores above 0.97 with low variance, while performance varied significantly across domains. Equity-related tasks showed 10-20% error amplification with demographic modifications, and human reviewers identified clinically relevant failures missed by automated evaluation. Our findings demonstrate that performance variance and worst-case failures provide more clinically meaningful reliability indicators than mean accuracy alone, and that hybrid evaluation approaches combining automation with clinician oversight are essential for credible safety assessment.

2606.00026 2026-06-02 cs.CL 版本更新

Cognitive-Linguistic Indicators of Depression in Online Communities: Analysed by DistilBERT and Holographic Reduced Representation

在线社区中抑郁的认知语言指标:基于DistilBERT和全息简化表示的分析

Brian Van Steen

发表机构 * School of Computing, University of Leeds(利兹大学计算机学院)

AI总结 本研究结合认知语言学特征与DistilBERT嵌入,通过混合模型(DistilBERT+HRR)在Reddit帖子中检测抑郁,宏F1达0.94,优于基线TF-IDF的0.80。

详情
AI中文摘要

本文研究将认知基础的语言特征与基于transformer的嵌入相结合是否能改善在线文本中抑郁的自动检测。基于Beck的抑郁认知理论,研究提取认知扭曲作为可测量特征,包括第一人称代词密度、绝对化词汇以及Reddit中抑郁相关和对照社区帖子中的负面情绪。使用Kaggle Reddit自杀和抑郁检测数据集的一个子集,比较了两个分类流程:作为基线的TF-IDF嵌入与朴素贝叶斯,以及一个混合模型,该模型将DistilBERT句子嵌入与编码认知语言特征的全息简化表示(HRR)向量拼接,然后进行逻辑回归。混合DistilBERT HRR模型的宏F1得分为0.94,而TF-IDF基线为0.80,5折交叉验证F1从0.83提高到0.92,AUC从0.958提高到0.981。

英文摘要

This paper investigates whether combining cognitively grounded linguistic features with transformer-based embeddings improves automated detection of depression in online text. Using Beck's Cognitive Theory of Depression, the study extracts cognitive distortions as measurable features, including first-person pronoun density, absolutist words, and negative emotion in Reddit posts from depression-related and control communities. Using a subset of the Kaggle Reddit Suicide and Depression Detection dataset, two classification pipelines are compared, a TF-IDF embedding with Naive Bayes as a baseline, and a hybrid model that concatenates DistilBERT sentence embeddings with Holographic Reduced Representation (HRR) vectors encoding the cognitive-linguistic features, followed by Logistic Regression. The hybrid DistilBERT HRR model achieves a macro F1 score of 0.94 versus 0.80 for the TD-IDF baseline, with 5-fold cross validation F1 improving from 0.83 to 0.92, and AUC from 0.958 to 0.981.

2606.00023 2026-06-02 cs.CL cs.AI cs.LG 版本更新

TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models

TrustLDM:语言扩散模型的可信度基准测试

Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang

发表机构 * State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(中国科学院自动化研究所,智能科学与技术学院,北京大学) CISPA Helmholtz Center for Information Security(信息安全研究所) School of EECS, Peking University(电子工程学院,北京大学) Institute for Artificial Intelligence, Peking University(人工智能研究所,北京大学)

AI总结 针对语言扩散模型(LDM)的可信度问题,提出TrustLDM基准,评估其在不同架构和恶意上下文下的安全性、隐私性和公平性,并开发自动评估框架TrustLDM-Auto以识别脆弱配置。

详情
AI中文摘要

语言扩散模型(LDM)的快速发展挑战了自回归模型在语言处理中的主导地位。然而,其灵活、任意顺序的解码策略不仅实现了快速解码速度,还可能带来新的可信度挑战。为了更好地理解其流程背后的风险,我们引入了一个针对LDM的全面可信度基准(TrustLDM),评估不同LDM架构在多种静态后上下文类别下的安全性、隐私性和公平性。我们的实证结果表明,尽管LDM在仅使用用户提示时通常表现出较强的可信度,但当恶意后上下文附加到掩码响应时,其对齐行为明显下降。我们进一步观察到,较长的上下文不一定产生更强的影响,解码顺序和生成长度都会影响评估结果。最后,我们提出了TrustLDM-Auto,一个利用LDM解码灵活性自动识别脆弱配置的评估框架,揭示了所有评估模型和维度上的显著可信度弱点。我们的工作可能有助于社区构建更可信的LDM。我们的代码可在https://github.com/PKU-ML/TrustLDM获取。

英文摘要

The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU-ML/TrustLDM.

2606.00022 2026-06-02 cs.CL cs.AI 版本更新

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

lmfaoooo at SemEval-2026 Task 1: 幽默是受众。面向约束幽默生成的偏好建模

Alexey Tikhonov, Alexey Ivanov

发表机构 * Inworld.AI Berlin, Germany(Inworld.AI柏林,德国) OpenAI Mountain View, CA(山景城,加利福尼亚州)

AI总结 针对约束幽默生成任务,提出“生成多候选-偏好选择”策略,利用人类成对比较训练偏好模型,在MWAHAHA任务英、中、西语子任务中分别获得第1、第1和第2名。

Comments 5 pages. Accepted for SEMEVAL 2026

详情
AI中文摘要

幽默生成仍然困难,不仅因为生成流畅、新颖的笑话很难,而且因为“有趣”取决于受众,监督信号嘈杂——偏好随受众、语境和文化而变化,标注者一致性通常较低。在本文中,我们描述了用于SemEval-2026 Task-1(MWAHAHA)的系统,该任务专注于在显式约束下进行幽默生成。任务通过1对1竞技场式比较中的人类偏好判断来评估提交的系统。我们采用“生成多个->选择最佳”策略。首先,我们通过多步提示、模型集成和多样性导向解码为每个实例生成多样化的候选池。其次,我们使用偏好模型选择输出,该模型通过从人类比较中学习(而非绝对趣味性分数)来近似“读者”。为支持该方法,我们发布了通过幽默竞技场原型收集的2.5K人类成对判断。我们进一步提出一个可解释的流程,将标注的比较转换为偏好模型。在三个偏好数据集上,我们的模型一致优于基线,并表现出更强的跨领域迁移。最后,我们将学到的偏好模型应用于MWAHAHA设置中的候选排序,并发布中间产物(候选池和排序)以促进后续工作。我们的系统在MWAHAHA的英语和汉语子任务中排名第一,在西班牙语子任务中排名第二。

英文摘要

Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy -- preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task-1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a "generate-many -> select-best" strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a "reader" by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.

2606.00021 2026-06-02 cs.CL cs.AI cs.LG 版本更新

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

SENSE: 基于软门控评估的语义嵌入导航用于检索式推测解码

Shaowen Chen, Zhicheng Liao, Hongwei Wang

发表机构 * Zhejiang University, Hangzhou, China(浙江大学,杭州,中国)

AI总结 提出SENSE方法,通过语义嵌入导航和软门控评估模块替代表面形式匹配,提升检索式推测解码的鲁棒性和加速效果,在LLaMA和Qwen系列上实现最高4.09平均接受长度和3.26倍加速。

详情
AI中文摘要

推测解码(SD)通过使用轻量级草稿模型提出候选令牌,并由目标模型并行验证,从而加速大型语言模型(LLM)推理,同时不损害生成质量。尽管检索式推测解码(RSD)因其即插即用的多功能性而受到青睐,但其潜力受到刚性词汇依赖的阻碍,使得检索和验证对表面形式变化敏感。为了解决这个问题,我们提出了SENSE(基于软门控评估的语义嵌入导航)。通过将检索锚定在目标模型的隐藏状态上,SENSE建立了稳健的语义对齐,这使得软门控评估模块能够验证语义等价性而非表面形式。为了确保严格的基准测试,我们将现有方法解构为统一框架内的原子原语,促进细粒度的组件级比较。跨多个领域的广泛实验表明,SENSE在LLaMA和Qwen系列上优于多个基线,实现了高达4.09的平均接受长度和3.26倍的加速,同时保持了生成质量。我们的代码将在发表后发布。

英文摘要

Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate tokens, which are verified in parallel by the target model, without compromising generation quality. While Retrieval-based Speculative Decoding (RSD) is favored for its plug-and-play versatility, its potential is impeded by rigid lexical dependencies, rendering both retrieval and verification brittle to surface-level variations. To address this, we propose SENSE (Semantic Embedding Navigation with Soft-gated Evaluation). By anchoring retrieval on the hidden states of the target model, SENSE establishes robust semantic alignment, which empowers the Soft-gated Evaluation module to validate semantic equivalence rather than surface forms. To ensure rigorous benchmarking, we deconstruct existing methods into atomic primitives within a unified framework, facilitating granular, component-level comparison. Extensive experiments across diverse domains demonstrate that SENSE outperforms multiple baselines on the LLaMA and Qwen families, attaining up to 4.09 mean acceptance length and 3.26x speedup, while preserving generation quality. Our code will be released upon publication.

2606.00020 2026-06-02 cs.CL cs.AI 版本更新

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

CSRP:基于效率感知奖励的强化学习链式推理中文文本纠错

Wei Tian, Yuhao Zhou, Man Lan

发表机构 * School of Computer Science and Technology, East China Normal University(东华大学计算机科学与技术学院) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(东华大学教育人工智能研究所)

AI总结 提出CSRP三阶段框架,通过连续预训练、链式推理监督微调和基于效率感知奖励的组相对策略优化,在NACGEC基准上实现最优性能,有效缓解过度纠正偏差。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, Main conference)

详情
AI中文摘要

基于大语言模型的中文语法纠错系统面临两个关键挑战:通用模型缺乏针对细微语法区别的专业语言先验,以及使用最大似然估计的监督微调无法优化以精度为中心的指标,导致系统性过度纠正。我们提出CSRP,一个三阶段框架,通过以下步骤逐步构建纠错能力:在590万平衡样本上进行连续预训练以内化领域知识,使用显式错误推理进行链式推理监督微调以实现诊断透明度,以及采用新颖的效率感知奖励进行组相对策略优化,明确惩罚不必要的编辑。在NACGEC基准上,CSRP以50.99的$F_{0.5}$和57.17的精确率实现了最先进性能,大幅优于先前最佳结果,同时有效缓解了MLE训练模型固有的过度纠正偏差。我们的方法还将CSCD拼写纠错提升至59.61的F1,超过GPT-4 5.20分。全面的消融研究表明,RL对齐阶段相比SFT基线贡献了8%的相对增益,且该增益与大规模CPT的贡献正交,验证了针对编辑效率的显式优化对于高质量语法纠错至关重要。我们的代码可在https://github.com/TW-NLP/ChineseErrorCorrector获取。

英文摘要

Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision-focused metrics, leading to systematic over-correction. We propose CSRP, a three-stage framework that progressively builds correction capability through Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes a 8\% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction. Our code is available at https://github.com/TW-NLP/ChineseErrorCorrector.

2606.00017 2026-06-02 cs.AI cs.CL cs.MA 版本更新

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

MindGames Arena 泛化赛道:具有延迟每步奖励归因的 In2AI 解决方案

Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov

发表机构 * iMak AI Lab(iMak人工智能实验室)

AI总结 提出延迟每步奖励归因方法,结合资格门控、异步rollout生成和课程对手采样,实现多智能体环境中稳定高效的强化学习训练,并在NeurIPS 2025的MindGames Arena基准测试中取得领先。

Comments 18 pages, 2 figures, 9 tables. Technical report. First place in both Open and Efficient tracks of MindGames Arena Generalization Track at NeurIPS 2025

详情
AI中文摘要

训练用于多智能体战略交互的语言模型智能体面临一个核心困难:任何行动的质量可能取决于从未实现的未来事件、违反游戏规则的移动或其他玩家的决策。标准强化学习假设每一步都可以分配奖励,但在结果跨时间和智能体纠缠的环境中,这一假设不成立。我们引入了具有资格门控的延迟每步奖励归因,这是一个仅在回合结束时计算奖励、根据任务特定语义将其传播回原始步骤,并排除缺乏有效依赖信息的步骤的回合生命周期和后处理流程。结合通过vLLM的连续批处理实现的异步rollout生成、基于课程的对手采样和多层分层批次构建,该方法能够在多智能体环境中实现稳定、样本高效的强化学习训练。我们在NeurIPS 2025的MindGames Arena基准测试中进行了评估,使用我们的方法训练的单个80亿参数开源模型在头对头对战中匹配或超越了包括GPT-5在内的显著更大的专有系统,并在开放(无限制)和高效(<=80亿参数)赛道中均获得第一名。

英文摘要

Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.

2606.00016 2026-06-02 cs.CL cs.AI 版本更新

AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection

AEyeDE:一种基于注意力的人工智能生成文本检测归因框架

Aria Nourbakhsh, Adelaide Danilov, Christoph Schommer, Salima Lamsiyah

发表机构 * University of Luxembourg(卢森堡大学) Department of Computer Science(计算机科学系)

AI总结 提出AEyeDE框架,利用代理Transformer模型的注意力归因矩阵,通过轻量级CNN区分人类与AI生成文本,在多种设置下优于文本基线,并揭示注意力图的局部结构差异。

Comments 24 pages, 2 figures

详情
AI中文摘要

随着现代语言模型达到接近人类水平的流畅度,并能规避依赖表面统计或似然信号的检测器,检测AI生成文本变得越来越具有挑战性。我们提出 extsc{AEyeDE},一种归因驱动的人机作者身份检测方法,利用模型注意力作为判别信号。具体来说,我们使用具有白盒访问权限的\emph{代理}Transformer模型提取人类和AI生成文本的基于注意力的归因矩阵,并训练轻量级卷积神经网络从这些归因图中学习表示。在编码器-解码器翻译设置中,我们的方法始终优于纯文本基线。在仅解码器设置中,它在生成器特定检测中表现强劲,在标准基准上保持竞争力,并在跨数据集迁移和替代拼写扰动下表现出鲁棒性。我们进一步表明,注意力图表现出重复的局部结构,其相对频率在不同数据集和代理模型下的人类与AI生成文本之间一致不同。这些发现表明,基于注意力的归因图为AI生成文本检测提供了互补且可解释的信号。我们将公开代码以支持未来研究。

英文摘要

Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detectors that rely on surface statistics or likelihood-based signals. We propose \textsc{AEyeDE}, an attribution-driven approach to human-AI authorship detection that leverages model attention as a discriminative signal. Specifically, we extract attention-based attribution matrices for both human- and AI-generated text using a \emph{proxy} Transformer model with white-box access and train a lightweight Convolutional Neural Network to learn representations from these attribution maps. Across encoder-decoder translation settings, our method consistently outperforms a text-only baseline. In decoder-only settings, it performs strongly in generator-specific detection, remains competitive on standard benchmarks, and shows robustness under cross-dataset transfer and alternative-spelling perturbations. We further show that attention maps exhibit recurring local structures whose relative frequencies differ consistently between human- and AI-generated text across datasets and proxy models. These findings suggest that attention-based attribution maps provide a complementary and interpretable signal for AI-generated text detection. We will make the code publicly available to support future research.

2606.00014 2026-06-02 cs.CL cs.AI 版本更新

Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval

面向鲁棒的上下文学习:利用分布外代理进行目标不可访问的演示检索

Hao Xu, Rite Bo, Fausto Giunchiglia, Yingji Li, Rui Song

发表机构 * College of Computer Science and Technology, Jilin University, China(吉林大学计算机科学与技术学院) Department of Information Engineering and Computer Science, University of Trento, Italy(特伦托大学信息工程与计算机科学系)

AI总结 提出DOPA框架,通过引入分布外代理近似不可访问的目标域并利用马氏距离全局多样性约束,提升大语言模型在分布偏移下的鲁棒性。

Comments Accepted by ACL 2026 main

详情
AI中文摘要

尽管研究表明大语言模型(LLMs)在分布外(OOD)任务上表现良好,但随着分布偏移加剧,其优势趋于减弱。因此,研究人员旨在从可用源域中检索分布相似且信息丰富的演示来增强LLMs的推理能力。然而,在目标域不可访问的实际场景中,评估未知分布具有挑战性,这间接影响所选演示的质量。为解决此问题,我们提出 extbf{DOPA},一种演示搜索框架,它引入OOD代理来近似不可访问的目标域并指导检索过程。基于代理评估,DOPA进一步引入基于马氏距离的全局多样性约束,确保检索到的演示具有足够的多样性。在多个LLMs和任务上的实验结果表明,DOPA有效增强了OOD设置下的鲁棒性 ootnote{https://github.com/bort64/ood\_code}。

英文摘要

Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out-of-Distribution (OOD) tasks, their advantage tends to diminish as the distribution shift becomes more severe. Consequently, researchers aim to retrieve distributionally similar and informative demonstrations from the available source domain to boost the inference capabilities of LLMs. However, in practical scenarios where the target domain is inaccessible, evaluating the unknown distribution is challenging, which indirectly impacts the quality of the selected demonstrations. To address this problem, we propose \textbf{DOPA}, a demonstration search framework that incorporates an OOD proxy to approximate the inaccessible target domain and guide the retrieval process. Building on proxy-based evaluation, DOPA further introduces a Mahalanobis distance-based global diversity constraint to ensure sufficient diversity among the retrieved demonstrations. Experimental results on multiple LLMs and tasks demonstrate that DOPA effectively enhances robustness in OOD settings\footnote{https://github.com/bort64/ood\_code}.

2605.04127 2026-06-02 cs.LG cs.CL cs.CY 版本更新

Position: the Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities

立场:煤矿中的随机鹦鹉。模型崩溃对低资源社区的威胁

Devon Jarvis, Richard Klein, Benjamin Rosman, Steven James, Stefano Sarao Mannelli

发表机构 * GitHub

AI总结 本文探讨模型崩溃(生成模型在先前模型输出上训练导致的性能下降)如何通过降低训练效率、扭曲数据分布,不成比例地影响低资源和边缘化社区,并呼吁采取行动。

Comments 14 pages, 1 figure, 1 table, International Conference on Machine Learning

详情
AI中文摘要

模型崩溃,即当生成模型在先前的模型输出上进行训练时出现的性能下降,随着人工生成内容的激增,日益受到关注。对大型语言模型的相关批评强调了它们倾向于复现训练数据中的频繁模式、依赖庞大的数据集以及巨大的环境成本。这些因素共同导致了数据退化、文化偏见的强化以及资源利用的低效。在这篇立场论文中,我们旨在结合这些观点,并论证模型崩溃威胁着当前使AI民主化的努力。通过降低训练效率并使数据分布偏离其支撑的尾部,模型崩溃不成比例地影响了低资源和边缘化社区。我们考察了这一现象的环境和文化影响,将我们的立场置于近期关于模型崩溃的立场论文中,并以行动呼吁作为结论。最后,我们概述了减轻这些影响的初步方向。

英文摘要

Model collapse, the degradation in performance that arises when generative models are trained on the outputs of prior models, is an increasing concern as artificially generated content proliferates. Related critiques of large language models have highlighted their tendency to reproduce frequent patterns in training data, their reliance on vast datasets, and their substantial environmental cost. Together, these factors contribute to data degradation, the reinforcement of cultural biases, and inefficient resource use. In this position paper we aim to combine these views and argue that model collapse threatens current efforts to democratize AI. By reducing training efficiency and skewing data distributions away from the tails of their support, model collapse disproportionately impacts low-resource and marginalized communities. We examine both the environmental and cultural implications of this phenomenon, situate our position within recent position papers on model collapse, and conclude with a call to action. Finally, we outline initial directions for mitigating these effects.

2605.03052 2026-06-02 cs.CL 版本更新

How Language Models Process Negation

语言模型如何处理否定

Zhejian Zhou, Tianyi Zhou, Robin Jia, Jonathan May

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究大型语言模型处理否定的内部机制,发现模型内部存在正确处理的组件,但后期注意力层导致错误,通过消融可提升准确率;模型同时采用抑制和构建两种机制,其中构建机制更显著。

Comments ICML 2026

详情
AI中文摘要

我们研究了大型语言模型(LLMs)在机制层面如何处理否定。首先,我们确定,尽管开放权重模型在涉及否定问题时经常给出错误答案,但它们确实拥有内部组件能够正确处理否定。其准确性低是由于后期注意力层的行为促进了简单的捷径;消融这些注意力模块极大地提高了否定相关问题的准确性。其次,我们揭示了模型处理否定的方式。我们考虑了两种假设:模型可能使用注意力头来关注被否定的短语并抑制相关概念,或者它们可能直接构建整个否定短语的表示(例如,将“不是气体”表示为促进液体和固体的向量)。我们应用一系列观察性和因果性可解释性技术于Mistral-7B和Llama-3.1-8B,以表明模型实现了这两种机制,其中“构建”机制更为突出。综合来看,我们的工作加深了对LLMs内部机制的理解,突出了构建主导的计算以及LLMs内竞争机制的共存。

英文摘要

We study how Large Language Models (LLMs) process negation mechanistically. First, we establish that even though open-weight models often provide wrong answers to questions involving negation, they do possess internal components that process negation correctly. Their poor accuracy is due to late-layer attention behavior that promotes simple shortcuts; ablating those attention modules greatly improves accuracy on negation-related questions. Second, we uncover how models process negation. We consider two hypotheses: models could use attention heads that attend to the phrase being negated and suppress related concepts, or they could directly construct a representation of the entire negative phrase (e.g., representing "not gas" as a vector that promotes liquids and solids). We apply a range of observational and causal interpretability techniques on Mistral-7B and Llama-3.1-8B to show that models implement both mechanisms, with the "constructive" mechanism being more prominent. Combined, our work deepens the understanding of LLMs' internals, highlighting construction-dominant computations and the coexistence of competing mechanisms within LLMs.

2605.31490 2026-06-02 cs.CL 版本更新

Are Full Rollouts Necessary for On-Policy Distillation?

在线策略蒸馏是否必须完整展开?

Yaocheng Zhang, Jiajun Chai, Yuqian Fu, Songjun Tu, Xiaohan Wang, Wei Lin, Guojun Yin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences(中国科学院大学先进交叉学科学院) Meituan(美团) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 本文针对在线策略蒸馏(OPD)中完整展开计算成本高且早期训练时教师反馈不可靠的问题,提出渐进式OPD(POPD)和截断式OPD(TOPD)两种展开长度控制策略,在数学推理任务上实现高达3倍训练效率提升,并显著减少内存占用。

Comments 15 pages, 14 figures

详情
AI中文摘要

在线策略蒸馏(OPD)在学生生成的展开轨迹上提供密集的教师反馈,而非固定的教师轨迹,已成为一种有前景的训练后范式。然而,标准OPD通常在训练期间生成完整展开,这计算成本高昂,并且可能使学生暴露于后期展开位置不可靠的教师反馈,尤其是在早期训练阶段。我们确定展开长度是OPD中严重影响训练效率的关键瓶颈。与具有可验证奖励的强化学习(RLVR)不同,OPD不需要最终答案奖励来提供学习信号。因此,完整展开对于OPD可能并非总是必要的。基于这一见解,我们提出了两种简单的展开长度控制策略:渐进式OPD(POPD),它在训练期间逐步扩展展开长度;以及截断式OPD(TOPD),它永久性地在可靠的截断展开上进行蒸馏。数学推理实验表明,POPD将OPD的训练效率提升高达3倍,而TOPD仅使用10%的展开长度即可达到与OPD相当的性能,从而显著减少了挂钟时间和内存消耗。这些结果表明,控制展开长度为更高效的OPD提供了一条简单实用的途径。

英文摘要

On-policy distillation (OPD) provides dense teacher feedback along student-generated rollouts rather than fixed teacher traces and has emerged as a promising post-training paradigm. However, standard OPD typically generates full rollouts during training, which is computationally expensive and may expose the student to unreliable teacher feedback at late rollout positions, especially during early training. We identify the rollout horizon as a key bottleneck in OPD that substantially impacts training efficiency. Unlike Reinforcement Learning with Verifiable Rewards (RLVR), OPD does not require a final answer reward to provide learning signals. Therefore, full rollouts may not always be necessary for OPD. Motivated by this insight, we propose two simple horizon-control strategies: Progressive OPD (POPD), which gradually expands the rollout horizon during training, and Truncated OPD (TOPD), which permanently performs distillation on reliable truncated rollouts. Experiments on mathematical reasoning show that POPD improves the training efficiency of OPD by up to 3$\times$, while TOPD matches OPD performance using only 10\% of the rollout horizon, leading to substantial wall-clock and memory reductions. These results demonstrate that controlling the rollout horizon offers a simple and practical path to more efficient OPD.

2605.31401 2026-06-02 cs.CL 版本更新

"Înţelegi Româneşte?'' A Recipe for Romanian Vision-Language Models

Înţelegi Româneşte?'' 罗马尼亚语视觉语言模型的构建配方

Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea

发表机构 * National University of Science and Technology POLITEHNICA Bucharest(波兰技术大学布加勒斯特国家大学)

AI总结 本文系统研究了构建罗马尼亚语视觉语言模型(VLM)的完整流程,通过翻译英文语料、消融实验分析视觉骨干、语言骨干和OCR数据的影响,并构建文化本地化评估集HoraVQA,证明罗马尼亚语适配模型在性能上超越同尺寸甚至更大尺寸的模型。

详情
AI中文摘要

视觉语言模型(VLM)很大程度上遵循纯文本LLM的发展轨迹,在英文基准测试中表现出色,但在低资源语言上性能急剧下降,因为既缺乏大规模图像-文本语料库,也缺乏基于文化的评估。我们提出了一项针对罗马尼亚语构建语言特定VLM的系统研究,涵盖了从数据构建到架构选择的完整流程。我们将已有的英文VLM训练和评估语料库翻译成罗马尼亚语,对文本注释和图像内文本应用机器翻译,在保留视觉基础的同时调整文本内容。利用这些数据,我们训练并消融了一系列VLM,以隔离以下因素的贡献:(i)不同规模和预训练的视觉骨干,(ii)从多语言到罗马尼亚语适配LLM的语言骨干,以及(iii)OCR风格的图像-文本数据。我们进一步整理了HoraVQA,一个基于罗马尼亚日常场景的文化本地化评估集。罗马尼亚语适配的VLM始终优于同尺寸的对应模型,并且在所有评估基准上甚至超越了下一个更大尺寸类别的模型。

英文摘要

Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of building a language-specific VLM for Romanian, covering the full pipeline from data construction to architectural choices. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text, preserving visual grounding while adapting the textual content. Using this data, we train and ablate a series of VLMs to isolate the contribution of (i) vision backbones of varying scale and pretraining, (ii) language backbones from multilingual to Romanian-adapted LLMs, and (iii) OCR-style image-text data. We further curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes. Romanian-adapted VLMs consistently outperform their same-sized counterparts and, across all evaluated benchmarks, even surpass models from the next larger size category.

2605.31086 2026-06-02 cs.CL cs.IR 版本更新

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

超越静态对话:对现实、异构和演化长期记忆的基准测试

Han Zhang, Zihao Tang, Xin Yu, Xiao Liu, Yeyun Gong, Haizhen Huang, Yan Lu, Weiwei Deng, Feng Sun, Qi Zhang, Hanfang Yang

发表机构 * Center for Applied Statistics, Renmin University of China(中国人民大学应用统计中心) School of Statistics, Renmin University of China(中国人民大学统计学院) Microsoft(微软公司)

AI总结 针对现有大语言模型记忆基准中对话缺乏长期语义一致性、人物静态化以及忽视异构数据流的问题,提出RHELM基准,通过用户画像和LOOP模块生成具有动态时间演化和长期连贯性的现实对话,并整合异构外部源,评估模型在多源聚合和现实上下文推理方面的能力。

详情
AI中文摘要

在现有的大语言模型(LLM)记忆基准中,评估的对话会话通常缺乏长期语义一致性,且底层人物角色趋于扁平化和静态化。此外,在现实场景中,用户与助手之间的交互涉及更多样化、异构的数据流,例如文档和电子邮件。这些缺陷严重限制了当前评估的现实性和有效性。为了解决这些限制,我们引入了RHELM(现实、异构和演化的长期记忆)。通过精心设计的用户画像和新型LOOP(规划-展开-演化-剪枝)模块,我们在多样化的交互场景中构建了展现动态时间演化和长期连贯性的现实对话。关键的是,这些对话与用户时间事件轨迹同步的异构外部源深度融合。由此产生的基准涵盖了跨越七种查询类型的具有挑战性的问答对,每个问题映射到至少27个关键记忆特征之一,这些特征在我们看来是当前研究中必要但尚未充分探索的。在全文模型、检索增强生成(RAG)方法和代表性记忆框架上的全面实验表明,当代方法在复杂的现实环境中仍然暴露出关键弱点,特别是在解决多源聚合和现实上下文推理方面。

英文摘要

In existing memory benchmarks for Large Language Models (LLMs), the evaluated dialogue sessions often lack long-term semantic consistency, and the underlying personas tend to be flat and static. Furthermore, in real-world scenarios, interactions between users and assistants involve more diverse, heterogeneous data streams, such as documents and emails. These shortcomings significantly limit the realism and effectiveness of current evaluations. To address these limitations, we introduce RHELM (Realistic, Heterogeneous, and Evolving Long-term Memory). Driven by meticulously crafted user profiles and a novel LOOP (pLan-rOllout-evOlve-Prune) module, we construct realistic dialogues across diverse interaction scenarios that exhibit dynamic temporal evolution and long-term coherence. Crucially, these dialogues are deeply integrated with heterogeneous external sources synchronized with the user's temporal event trajectory. The resulting benchmark encompasses challenging question-answer pairs spanning seven inquiry types, with each question mapping to at least one of 27 critical memory characteristics that we identify as essential yet underexplored in current research. Comprehensive experiments across full-context models, retrieval-augmented generation (RAG) methods, and representative memory frameworks reveal that contemporary approaches still expose critical weaknesses in complex, real-world settings, particularly in resolving multi-source aggregation and real-world contextual reasoning.

2605.30848 2026-06-02 cs.CR cs.CL 版本更新

LLM Anonymization Against Agentic Re-Identification

LLM匿名化对抗智能体重识别

Ziwen Li, Jianing Wen, Tianshi Li

发表机构 * Khoury College of Computer Sciences(科里学院计算机科学学院) Northeastern University(东北大学)

AI总结 提出AURA框架,通过掩码-重构方法解耦隐私定位与效用保留,并利用对抗性隐私和效用检查,以抵抗基于网络搜索的智能体重识别攻击,同时保留文本的上下文效用。

Comments 32 pages, 7 figures

详情
AI中文摘要

具有网络搜索功能的智能体LLM改变了文本匿名化的威胁模型:弱上下文线索可能成为重识别的交叉引用证据,但这些细节也承载着文本的下游分析价值。现有的防御措施要么移除显式标识符,要么扰动文本以获得形式化隐私,要么针对非网络推理模型测试改写后的文本,而未充分探索抵抗智能体网络搜索重识别与效用保留之间的操作区域。我们引入了AURA( extbf{A}nonymization with extbf{U}tility- extbf{R}etention extbf{A}daptation),一个LLM驱动的 extit{掩码-重构}框架,将隐私定位与效用保留重构解耦,并通过对抗性隐私和效用保留检查选择候选方案。我们在真实用户访谈转录上评估AURA,使用由网络搜索智能体发起的重识别攻击,以及基于受访者档案事实、编码本事实和联合上下文效用网格的效用评估。我们的结果表明,AURA通过使用自适应隐私范围增强对智能体重识别的抵抗,以及使用掩码-重构匿名化方法在固定隐私范围内更好地保留上下文效用,从而改善了隐私-效用边界。

英文摘要

Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become cross-referenceable evidence for re-identification, yet those same details also carry downstream analytic value of the text. Existing defenses either remove explicit identifiers, perturb text for formal privacy, or test rewritten text against non-web inference models, leaving underexplored the operating region between resistance to agentic web-search re-identification and utility retention. We introduce AURA (\textbf{A}nonymization with \textbf{U}tility-\textbf{R}etention \textbf{A}daptation), an LLM-powered \textit{mask-reconstruct} framework that decouples privacy localization from utility-preserving reconstruction and selects candidates with adversarial privacy and utility-retention checks. We evaluate AURA on real-user interview transcripts using re-identification attacks carried out by web-search agents, along with a utility evaluation based on interviewee-profile facts, codebook facts, and the joint contextual utility grid. Our results show that AURA improves the privacy-utility frontier by using adaptive privacy scope to strengthen resistance to agentic re-identification and using a mask-reconstruct anonymization method to better preserve contextual utility under fixed privacy scope.

2605.30743 2026-06-02 cond-mat.mtrl-sci cs.CE cs.CL 版本更新

A Padding Method for Enhanced Encoding of Inorganic Structures with Varying Chemical Compositions

一种用于增强不同化学成分无机结构编码的填充方法

Thang Dang, Haderbache Amir, Tzanakakis Alexandros, Yoshimoto Yuta

发表机构 * Fujitsu Limited(富士通株式会社) National Technical University of Athens(希腊国家技术大学)

AI总结 提出一种利用晶体对称性信息(Wyckoff位置长度感知填充)的编码方法,结合端到端生成系统,提升无机材料生成精度和稳定性,在质子导体数据上重建准确率提升5.3%,在perov-5数据集上生成的新颖稳定材料比基线模型多63.5%。

详情
AI中文摘要

通过生成模型设计新型无机材料仍然是材料科学的重要挑战,这是因为无机结构在广泛的化学成分和结构景观中具有复杂性和多样性。无机化合物的巨大组合空间需要创新的、人工智能驱动的方法来克服生成准确性和效率方面的限制。为了解决这个问题,我们引入了一种新方法,通过利用领域特定的对称感知表示来重新定义无机材料的编码和生成。我们的方法不仅改进了复杂无机结构的表示,还通过提高生成候选物的精度和稳定性,为材料发现领域做出了贡献。我们方法的核心是一种利用晶体对称信息来增强编码过程的新型填充技术。通过将Wyckoff位置长度感知填充集成到编码器架构中,我们实现了对无机材料更鲁棒的、信息丰富的表示。这种对称驱动的增强提高了深度学习模型生成稳定、先前未探索的无机结构的准确性和计算效率。此外,我们引入了一个端到端系统,利用机器学习势模型从初始数据到验证输出无缝生成新颖的、甚至在训练数据中未见过的稳定无机材料。该管道将先进的生成模型与稳定性分析相结合,标志着下一代无机材料自动探索和设计的重大飞跃。我们的方法在质子导体数据上将重建准确率提高了5.3%,并在perov-5数据集上生成了比基线模型多63.5%的新颖稳定无机材料。

英文摘要

Designing novel inorganic materials through generative models remains an important challenge for material science, driven by the complexity and diversity of inorganic structures across expansive chemical compositions and structural landscape. The vast combinatorial space of inorganic compounds demands innovative, AI-driven approaches to overcome limitations in generative accuracy and efficiency. To address this, we introduce a novel method that redefines the encoding and generation of inorganic materials by utilizing domain-specific symmetry-aware representation. Our approach not only refines the representation of intricate inorganic structures but also contributes to the field of material discovery by enhancing the precision and stability of generated candidates. Central to our methodology is a novel padding technique that exploits crystal symmetry information to enhance the encoding process. By integrating Wyckoff position length-aware padding into an encoder architecture, we achieve a more robust informed representation of inorganic materials. This symmetry-driven enhancement improves deep learning models to generate stable, previously unexplored inorganic structures with superior accuracy and computational efficiency. Furthermore, we introduce an end-to-end system that leverages the machine learning potential models to seamlessly generate novel, even those unseen in the training data, and stable inorganic materials from initial data to validated output. This pipeline integrates advanced generative models with stability analysis, marking a significant leap forward in the automated exploration and design of next-generation inorganic materials. Our method improved reconstruction accuracy 5.3% in proton conductor data, and generated 63.5% more novel stable inorganic material to baseline model on the perov-5 dataset.

2605.30443 2026-06-02 cs.CL 版本更新

Cross-Lingual Steering for Figurative Language Generation

跨语言引导的比喻语言生成

Linfeng Liu, Tiffany Zhan, Louie Hong Yao, Saptarshi Ghosh, Tianyu Jiang

发表机构 * Department of Computer Science, University of Cincinnati(卡内基梅隆大学计算机科学系) School of Computer Science, Carnegie Mellon University(卡内基梅隆大学计算机科学学院) Independent Researcher(独立研究者)

AI总结 通过激活引导技术,研究多语言大模型中比喻语言生成的内部信号是否跨语言可复用,发现跨语言方向可有效转移并增强目标行为。

Comments 40 pages, 7 figures

详情
AI中文摘要

多语言大模型能够生成比喻语言,但驱动这一行为的内部信号是语言特有的还是跨语言可复用的尚不清楚。使用激活引导作为探针,我们从一种语言的比喻-字面激活差异中估计比喻类别的方向,并在生成过程中应用它。在五个比喻类别、六种语言和四个多语言LLM中,这些方向在其自身语言内可靠地引导,对于隐喻和明喻最为稳健。更重要的是,它们可以跨语言转移:从一种语言学习到的方向在应用于另一种语言时增加了目标行为,其中德语是最易接受的目标之一。进一步地,从其他语言组合而成的方向可以匹配甚至超越目标语言自身的原生方向,而移除这一共享成分则会削弱原生引导。综合来看,这些结果提供了直接证据,表明存在一种可复用但依赖于目标的跨语言信号用于比喻生成。

英文摘要

Multilingual large language models can generate figurative language, but whether the internal signals driving this behavior are language-specific or reusable across languages is unclear. Using activation steering as a probe, we estimate a direction for a figurative category from figurative--literal activation differences in one language and apply it during generation. Across five figurative categories, six languages, and four multilingual LLMs, these directions steer reliably within their own language, most robustly for metaphor and simile. More importantly, they transfer across languages: a direction learned in one increases the target behavior when applied to another, with German among the most receptive targets. Going further, directions assembled from other languages can match or even surpass a target language's own native direction, while removing this shared component weakens native steering. Together, these results provide direct evidence of a reusable but target-dependent cross-lingual signal for figurative generation.

2605.28183 2026-06-02 cs.CL cs.AI 版本更新

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

BenGER:德国法律中基于归入的法律推理的LLM系统基准测试

Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak, Anne Zettelmeier, Elly Breu, Angelina Greiner, Sofija Milijas, Matthias Grabmair

发表机构 * Technical University of Munich (TUM)(慕尼黑技术大学) Ludwig Maximilian University of Munich (LMU)(慕尼黑路德维希-马克西米利安大学) University of Konstanz(康斯坦茨大学) University of Saarbrücken(萨尔布吕肯大学)

AI总结 提出BenGER数据集,用于评估LLM系统在德国法律归入推理中的表现,通过自动和基于法官的指标比较12个LLM系统与人类基线。

Comments Pre-Print v2

详情
AI中文摘要

我们引入了BenGER(德国法律基准)数据集,用于评估LLM系统在德国法律中基于归入的法律推理。BenGER数据集由三个部分组成:596个跨多个法律教育水平的考试式自由文本法律案例任务和531个简短的教义推理任务。我们评估了12个当代LLM系统——包括封闭旗舰型、效率导向型和开放权重型——使用自动和基于法官的指标。在受控验证子集上,对定时的人类撰写解决方案(在无辅助和人机共创条件下)进行模型性能与这些人类基线的对比。我们引入了一个与多评分者人工评分协议(每个解决方案三次盲审加一次作者知情创建者评审)交叉验证的、基于评分标准的LLM-as-a-Judge框架。我们的结果表明,用LLM法官替换盲人评审员对整个人类评审池的一致性影响不大于完全移除该评审员(Calderon r=0.96 vs. r=0.96,匹配n=30),封闭旗舰系统在所有语料库中领先排行榜,并且人机共创显著优于无辅助的人类工作。

英文摘要

We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems -- closed flagship, efficiency-oriented, and open-weight -- across automatic and judge-based metrics. On a controlled validation subset of timed human-written solutions under both unaided and human--AI co-creation conditions, we contextualise model performance against these human baselines. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r=0.96 vs.~r=0.96, matched n=30), that closed-flagship systems lead the leaderboard across all corpora, and that human--AI co-creation substantially outperforms unaided human work.

2605.24583 2026-06-02 cs.LG cs.CL stat.ML 版本更新

Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

正确测量对齐引起的激活偏移:一种模板控制的双重差分协议

Yuki Nakamura

发表机构 * The Open University of Japan(日本开放大学)

AI总结 针对对齐前后模型内部激活比较中存在的聊天模板混淆问题,提出模板控制的双重差分协议,有效分离对齐偏移与格式效应,恢复拒绝方向并提升余弦对齐度。

Comments 11 pages, 1 figure. v3: substantially revised and reframed as a measurement-methodology paper. Code, data, and an immutable Zenodo archive are available at https://github.com/Nakammura/effective-rank-audit (DOI: 10.5281/zenodo.20341444)

详情
AI中文摘要

比较模型在安全对齐前后的内部激活是探究安全训练改变了什么的一种自然方式:在安全相关输入上形成配对的(对齐后减对齐前)激活矩阵,并读取其有效秩或主方向。我们表明,形成该矩阵的直观方式存在混淆。对齐后的模型在基础模型从未见过的聊天模板下进行评估,因此朴素差异将对齐偏移与聊天格式混为一谈。我们引入修改矩阵的四变量分解(朴素、模板控制、对齐内和双重差分,DiD),以分离这两种效应。仅模板控制即可消除 Llama-3.1-8B、Gemma-2-9B 和 Qwen-2.5-7B 上测量有效秩的 2.0-3.9 倍膨胀;DiD 对比才是恢复 Arditi 等人(2024)的拒绝方向的关键,将其余弦对齐度从 0.18-0.39 提升至 0.50-0.86。跨三个系列的投影消融实验证实,恢复的子空间在行为上是活跃的,且奇异值顺序并非因果顺序。我们在受控测试平台上验证了该协议,并将其提炼为对齐激活差异研究的测量建议。

英文摘要

Comparing a model's internal activations before and after alignment is a natural way to ask what safety training changes: one forms the matrix of paired aligned-minus-base activations on safety-relevant inputs and reads off its effective rank or top direction. We show the obvious way to form this matrix is confounded. The aligned model is evaluated under a chat template the base model never saw, so the naive difference conflates the alignment shift with chat formatting. We introduce a four-variant decomposition of the modification matrix (naive, template-controlled, within-aligned, and difference-in-differences, DiD) that separates the two effects. Template control alone removes a 2.0-3.9x inflation of the measured effective rank across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B; the DiD contrast is what recovers the refusal direction of Arditi et al. (2024), lifting its cosine alignment from 0.18-0.39 to 0.50-0.86. Projection-ablation across the three families confirms the recovered subspace is behaviorally active and that singular-value order is not causal order. We validate the protocol on a controlled testbed and distill it into measurement recommendations for activation-difference studies of alignment.

2605.27835 2026-06-02 cs.LG cs.CL 版本更新

CAREF: Calibration-Aware Regularization for Explanation Faithfulness Without Rationale Supervision

CAREF: 面向解释忠实性的校准感知正则化,无需理由监督

Naphat Nithisopa, Teerapong Panboonyuen

发表机构 * MARSAIL Chulalongkorn University(朱拉隆梭大学) PBYAIL (Panboonyuen AI Lab)(PBYAIL(Panboonyuen人工智能实验室))

AI总结 提出CAREF框架,通过校准感知正则化联合优化预测准确性和解释忠实性,无需理由监督,在四个NLE基准上以少量可训练参数取得最佳性能。

Comments 10 pages

详情
AI中文摘要

我们引入了CAREF,一个参数高效的微调框架,通过校准感知正则化联合优化预测准确性和解释忠实性。其核心是通过单一统一损失函数LSCED(面向解释忠实性的校准感知正则化)将基于熵的校准与令牌级稀疏性控制相结合,无需理由监督。在四个NLE基准(COS-E、ECQA、ComVE、e-SNLI)上使用Flan-T5进行评估,我们轻量级的CAREF-AQ变体仅使用6.43%的可训练参数就达到了最佳平均准确率(89.04)和解释对齐度(81.00 nBERT),优于LoRA和AdaLoRA。据我们所知,CAREF是第一个将熵和稀疏性正则化统一到单一训练目标中用于可解释LLM微调的方法。

英文摘要

We introduce CAREF, a parameter-efficient fine-tuning framework that jointly optimizes predictive accuracy and explanation faithfulness via calibration-aware regularization. At its core, CAREF couples entropy-based calibration with token-level sparsity control through a single unified loss, the Calibration-Aware Regularization for Explanation Faithfulness (LSCED), without requiring rationale supervision. Evaluated on four NLE benchmarks (COS-E, ECQA, ComVE, e-SNLI) with Flan-T5, our lightweight CAREF-AQ variant attains the best average accuracy (89.04) and explanation alignment (81.00 nBERT) using only 6.43% of trainable parameters, outperforming LoRA and AdaLoRA. To our knowledge, CAREF is the first method to unify entropy and sparsity regularization in a single training objective for interpretable LLM fine-tuning.

2605.27458 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

融合异质注意力结构的Transformer模型通用解释方法

Yongjin Cui, Xiaohui Fan, Huajun Chen

发表机构 * Zhejiang University(浙江大学)

AI总结 针对Transformer中异质注意力结构(如共注意力)带来的多源信息融合挑战,提出一种通用解释方法,并通过实验分析范式对代表性模型进行语义和逻辑解释。

详情
AI中文摘要

Transformer极大地推动了人工智能的发展,也推动了智能体(agent)的发展。我们将Transformer的注意力结构根据输入信息的来源分为两类:同质注意力结构和异质注意力结构。异质注意力结构以共注意力(co-attention)为典型例子,处理来自不同来源的信息。异质注意力结构是Transformer模型实现更复杂功能、融合更多模态信息的基础。无论是出于研究目的还是政策要求,对具有异质注意力结构的Transformer模型进行解释都是一项重要任务。来自不同来源的信息融合带来了新的挑战。我们的工作主要包括方法和实验两部分。在方法方面,我们提出了一种针对具有异质注意力结构的Transformer模型的解释方法。在实验方面,基于我们的实验分析范式,我们解释代表性模型的操作机制,进行语义解释和逻辑解释。

英文摘要

Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We categorize attention structures of Transformer into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. Heterogenous attention structure is the foundation for Transformer models to achieve more complex functions and integrate more modal information. Whether for research purposes or policy requirements, the interpretation of Transformer models with heterogenous attention structures is an important task. The fusion of information from different sources brings new challenges. Our work mainly includes two parts: method and experimentation. In terms of method, we propose an interpretation method for Transformer models with heterogenous attention structures. In terms of experimentation, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models, conduct semantic interpretation and logical interpretation.

2605.27000 2026-06-02 cs.CL cs.AI 版本更新

Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

撒更宽的网:面向代码推理的协调 Pass@K 策略优化

Yilong Li, Suman Banerjee, Tong Che

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) NVIDIA Research(英伟达研究)

AI总结 提出协调 Pass@K 策略优化 (CPPO),通过规划器生成多种策略并协调求解器尝试,以解决代码生成中重复采样导致冗余推理路径的问题,在多个基准上显著提升 pass@4。

Comments Code reasoning; pass@K optimization; coordinated planning; verifiable rewards; strategy diversity

详情
AI中文摘要

使用验证器进行重复采样是分配代码生成测试时计算的标准方法,pass@$K$ 是规范指标。然而,标准策略类从单一答案分布中抽取 $K$ 个独立样本,因此尝试往往坍缩到近乎重复的推理路径,并将预算浪费在冗余 rollout 上。在竞争性编程中,这种失败代价高昂,因为许多问题允许多种不同的算法策略,而 pass@$K$ 只需要一次正确尝试。我们提出协调 Pass@$K$ 策略优化 (CPPO),它将 pass@$K$ 生成转化为策略上的联合探索:规划器输出一个包含 $K{=}4$ 种替代高层方法的元组,共享求解器为每种方法尝试一个解决方案。CPPO 使用乘法规划器奖励 $R_{\\mathrm{plan}} = J_ψ\\\cdot R_{\\mathrm{out}}$ 训练此联合策略,仅将信用分配给导致验证器确认的 pass@$K$ 成功的有效策略元组。在 APPS、CodeContests 和 LiveCodeBench-v6 上,CPPO 在相同的 $K{=}4$ 求解器尝试预算下,相比于直接采样、规划基线、仅规划器 SFT 和面向 pass@$K$ 的 RL,提升了 pass@$4$,在九个模型-基准组合中的六个上具有统计显著增益。最大单次增益是在 Qwen3.5-9B LiveCodeBench-v6 上,相比于最强基线 PKPO 提升了 $+0.16$($0.588 \\rightarrow 0.748$;配对 bootstrap,$p < 0.05$)。

英文摘要

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@$K$ requires only one correct attempt. We propose Coordinated Pass@$K$ Policy Optimization (CPPO), which turns pass@$K$ generation into joint exploration over strategies: a planner emits a tuple of $K{=}4$ alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint policy with a multiplicative planner reward, $R_{\mathrm{plan}} = J_ψ\cdot R_{\mathrm{out}}$, assigning credit only to valid strategy tuples that lead to verifier-confirmed pass@$K$ success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@$4$ over direct sampling, planning baselines, planner-only SFT, and pass@$K$-oriented RL under the same $K{=}4$ solver-attempt budget, with statistically significant gains on six of nine model--benchmark cells. The largest single gain is $+0.16$ on Qwen3.5-9B LiveCodeBench-v6 over the strongest baseline, PKPO ($0.588 \rightarrow 0.748$; paired bootstrap, $p < 0.05$).

2605.26444 2026-06-02 cs.CL 版本更新

NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies

NanoSpec: 利用极简上下文词汇表加速推测解码

Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma

发表机构 * Institute for Artificial Intelligence, Peking University, Beijing, China(北京大学人工智能研究院) Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education(教育部高可信软件技术重点实验室) School of Computer Science, Peking University, Beijing, China(北京大学计算机学院) State Key Laboratory of Networking(网络与交换技术国家重点实验室) School of Computer Science, Beijing University of Posts(北京邮电大学计算机学院)

AI总结 提出NanoSpec,一种无需训练的方法,通过动态构建极简上下文感知词汇表(平均<3k tokens),结合系统-算法协同设计,将推测解码的草稿生成时间平均减少51.6%,实现端到端加速1.17-1.29倍。

详情
AI中文摘要

大型语言模型的巨大词汇量(通常超过10万tokens)在推测解码过程中对最后的线性投影层造成了计算瓶颈。现有的词汇表剪枝方案依赖于静态或粗粒度的子词汇表,需要较大的活跃大小(约30k)才能维持草稿质量。我们提出NanoSpec,一种新颖的无需训练的方法,通过为每个生成步骤动态构建一个极简的、上下文感知的活跃词汇表,打破了这种权衡。利用语言生成固有的时间局部性,NanoSpec实现了高覆盖率,同时将平均词汇量削减了40倍以上(降至<3k tokens),且无需任何辅助训练参数。为了在现代硬件上实现这种高稀疏性的理论优势,我们引入了系统-算法协同设计,通过异步收集和GPU驻留状态管理克服了稀疏内存访问的低效问题。作为一个互补的即插即用模块,NanoSpec将草稿生成时间平均减少了51.6%,在7个任务上比最先进的推测解码方法EAGLE-2和EAGLE-3实现了1.17-1.29倍的端到端加速,并优于复杂的基于训练的剪枝基线。

英文摘要

The massive vocabulary sizes of large language models, often exceeding 100k tokens, impose a computational bottleneck on the final linear projection layer during speculative decoding. Existing vocabulary pruning solutions rely on static or coarsely-grained sub-vocabularies that necessitate large active sizes ($\sim$30k) to maintain draft quality. We propose NanoSpec, a novel training-free approach that breaks this trade-off by dynamically constructing a minimalist, context-aware active vocabulary for each generation step. Leveraging the inherent temporal locality of language generation, NanoSpec achieves high coverage while slashing the average vocabulary size by over $40\times$ (to $<$3k tokens) without requiring any auxiliary trained parameters. To realize the theoretical benefits of such high sparsity on modern hardware, we introduce a system-algorithm co-design that overcomes the inefficiencies of sparse memory access through asynchronous gathering and GPU-resident state management. As a complementary plug-and-play module, NanoSpec cuts draft time by an average of 51.6\%, delivering a $1.17$-$1.29\times$ end-to-end speedup over the state-of-the-art speculative decoding methods EAGLE-2 and EAGLE-3 across 7 tasks and outperforming complex training-based pruning baselines.

2605.26436 2026-06-02 cs.CL cs.AI 版本更新

Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models

目标重掩码:在离散扩散语言模型中将令牌编辑替换为令牌到掩码的精炼

Lin Yao

AI总结 针对离散掩码扩散语言模型中令牌编辑机制的局限性,提出无训练的令牌到掩码重掩码方法,通过将疑似错误令牌重置为掩码状态,利用扩散过程在更干净上下文中重新预测,显著提升数学等任务的性能。

Comments This paper has been significantly revised, expanded, and superseded by a more comprehensive version available at arXiv:2604.18738. The authors have chosen to withdraw this version to avoid overlap and direct readers to the updated work

详情
AI中文摘要

离散掩码扩散语言模型(如LLaDA)通过迭代去噪生成文本,其中掩码令牌逐步被预测的令牌替换。LLaDA2.1引入了令牌到令牌(T2T)编辑机制,通过直接替换疑似错误的已提交令牌来加速生成。然而,我们发现了T2T编辑的根本性限制:它将错误检测与替换耦合,用可能错误的令牌污染生成上下文,并引入了训练-推理噪声不匹配,其中系统性的模型生成错误与训练中看到的随机扰动不同。我们提出了令牌到掩码(T2M)重掩码,这是一种无需训练、即插即用的T2T编辑替代方案,将疑似错误的令牌重置回掩码状态,允许扩散过程在更干净的上下文中重新预测它们。我们设计并实证验证了三种互补的错误检测策略——基于概率的、触发镜像的和基于时间差分的——并提供了统一的理论分析,表明T2M重掩码净化了生成上下文,将系统性的推理错误转换回模型的原生掩码噪声类型,并实现了延迟承诺以进行联合多位置优化。在涵盖知识、推理、数学、编码和指令跟随的12个基准上的全面实验表明,T2M通常在需要精确令牌级输出的任务上提升性能,其中数学任务提升最大(CMATH上+5.92%)。对CMATH的错误分析揭示,主要的失败模式是最后一英里令牌损坏——即正确的推理产生损坏的最终答案——而T2M修复了59.4%的此类情况。

英文摘要

Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively replaced with predicted tokens. LLaDA2.1 introduced a Token-to-Token (T2T) editing mechanism that accelerates generation by directly replacing committed tokens suspected of being incorrect. However, we identify fundamental limitations of T2T editing: it couples error detection with replacement, pollutes the generation context with potentially incorrect tokens, and introduces a train-inference noise mismatch where systematic model-generated errors differ from the random perturbations seen during training. We propose Token-to-Mask (T2M) remasking, a training-free, drop-in replacement for T2T editing that resets suspected erroneous tokens back to the mask state, allowing the diffusion process to re-predict them under cleaner context. We design and empirically validate three complementary error detection strategies -- probability-based, trigger-mirrored, and temporal-difference-based -- and provide a unified theoretical analysis showing that T2M remasking purifies the generation context, converts systematic inference errors back to the model's native mask noise type, and enables delayed commitment for joint multi-position optimization. Comprehensive experiments across 12 benchmarks spanning knowledge, reasoning, mathematics, coding, and instruction following show that T2M generally improves performance on tasks requiring precise token-level output, with the largest gain on mathematics (+5.92% on CMATH). Error analysis on CMATH reveals that the dominant failure mode is last-mile token corruption -- where correct reasoning produces a corrupted final answer -- and that T2M repairs 59.4% of such cases.

2605.26431 2026-06-02 cs.CL stat.AP 版本更新

Probing Minimalist Phase Structure in LLMs: What Universal Dependencies Cannot Represent

探究LLM中的最简阶段结构:普遍依赖关系无法表示的内容

Yuanhao Chen, Peter Chin

发表机构 * Dartmouth College(达特茅斯学院)

AI总结 通过设计UD距离不变的条件,使用结构探针评估LLM在wh-移位刺激上的表现,发现阶段计数梯度和阶段内凝聚性效应,表明分布预训练能诱导出超越UD的形式句法抽象表征。

详情
AI中文摘要

结构探针在普遍依赖关系(UD)上进行训练,而UD不编码形式句法抽象,如阶段边界或阶段内凝聚性。大型语言模型(LLM)是否编码这些抽象仍是一个开放问题,基于UD的探针在构造上无法回答。我们在wh-移位刺激上评估结构探针,这些刺激中UD距离在设计上跨条件不变——因此任何非零效应都反映了超越UD的结构。三个条件——裸小句、不定式和限定式——按wh-元素跨越的最简方案(MP)阶段边界数量排序。在来自四个家族的13个LLM中,我们在跨从句对上发现了阶段计数梯度(12/13模型),并在一个从句内对上发现了13/13的符号不对称性,该从句内对的UD距离跨条件相同——后者特别由阶段内凝聚性预测,这是MP抽象,在构造上对UD不可见。激活修补证实这些表征在12/13模型中因果活跃。这些发现表明,分布预训练可以诱导与形式句法抽象一致的表征,这些抽象超出了基于注释的探针所能达到的范围;基于UD的探针提供了句法编码的下界,而非上界。

英文摘要

Structural probes train on Universal Dependencies (UD), which does not encode formal-syntactic abstractions such as phase boundaries or phase-internal cohesion. Whether large language models (LLMs) encode these remains an open question that UD-based probing cannot answer by construction. We evaluate structural probes on wh-movement stimuli where UD distances are invariant across conditions by design -- any non-zero effect therefore reflects structure beyond UD. The three conditions -- bare small clause, infinitival, and finite -- are ordered by the number of Minimalist Program (MP) phase boundaries the wh-element crosses. Across 13 LLMs from four families, we find a phase-count gradient on a cross-clause pair (12/13 models) and a 13/13 sign asymmetry on a within-clause pair whose UD distance is identical across conditions -- the latter specifically predicted by phase-internal cohesion, an MP abstraction invisible to UD by construction. Activation patching confirms the representations are causally active in 12/13 models. These findings suggest that distributional pretraining can induce representations aligned with formal-syntactic abstractions beyond the reach of annotation-based probing; UD-grounded probes provide a lower bound on syntactic encoding, not an upper bound.

2605.26397 2026-06-02 cs.CL cs.AI 版本更新

Algorithmic Fragility and Persona Bias in LLM-Generated Autistic Communication

LLM生成的自闭症交流中的算法脆弱性与人格偏见

Naba Rizvi, Mohammed Rizvi, Harper Strickland, Saleha Ahmedi, Nedjma Ousidhoum

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Georgia Institute of Technology(佐治亚理工学院) Cornell University(康奈尔大学) Cardiff University(卡迪夫大学)

AI总结 通过双人格改写范式,发现LLM在生成自闭症人格文本时存在词汇与情感偏离、输出坍塌等系统性失败,且对齐策略而非参数规模主导这些失败,表明当前对齐训练导致深层表征鸿沟。

Comments main paper: 9 pages; total: 19 pages; 2 figures; 5 tables

详情
AI中文摘要

安全对齐减少了明确有害的输出,但无意中编码了一种对边缘化交流的净化、神经典型化表征。我们使用双人格改写范式研究这种编码,提示十个大型语言模型(LLM)从自闭症或神经典型人格改写自然发生的自闭症话语。我们发现,尽管语义相似性相当,自闭症人格改写比神经典型改写在词汇形式和情感语域上偏离显著更多。此外,大多数模型将跨人格生成折叠成几乎相同的输出。为了揭示这种生成崩溃背后的机制,我们引入了一个多智能体定性分析框架。我们的结果揭示了系统性输出擦除、刻板幻觉和任务回避元评论是此任务的普遍失败模式,这些模式按对齐策略而非参数规模聚类。最后,我们与自闭症人类标注者的针对性比较表明,社区内部知识相对于LLM分类产生了系统性标签反转。我们的发现表明,当前的对齐训练导致仅通过定性分析可见的人格特定生成崩溃,证实了提示工程无法解决的深层表征鸿沟。

英文摘要

Safety alignment reduces explicitly harmful outputs but inadvertently encodes a sanitized, neuronormative representation of marginalized communication. We investigate this encoding using a dual-persona rewrite paradigm, prompting ten large language models (LLMs) to rewrite naturally occurring autistic discourse from either an autistic or neurotypical persona. We uncover autistic-persona rewrites diverge significantly more in lexical form and affective register than neurotypical rewrites, despite equivalent semantic similarity. Furthermore, most models collapse cross-persona generations into near-identical outputs. To uncover the mechanisms behind this generative breakdown, we introduce a multi-agent qualitative analysis framework. Our results reveal systemic output erasure, stereotyped hallucination, and task-evasive meta-commentary are pervasive failure modes for this task that cluster by alignment strategy rather than parameter scale. Finally, our targeted comparison with autistic human annotators demonstrates that community-insider knowledge produces systematic label reversals relative to LLM classifications. Our findings indicate that current alignment training causes persona-specific generative breakdown visible only through qualitative analysis, confirming a deep representational gap that prompt engineering cannot resolve.

2605.26292 2026-06-02 cs.CV cs.CL 版本更新

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Evi-Steer:通过高效且可泛化的证据调优学习引导生物医学视觉-语言模型

Taha Koleilat, Hassan Rivaz, Yiming Xiao

发表机构 * Concordia University(康科迪亚大学)

AI总结 提出Evi-Steer框架,通过证据跨模态低维引导实现BiomedCLIP的不确定性感知参数高效微调,仅更新0.11%参数,在15个生物医学数据集上少样本学习和域泛化设置中优于现有方法。

Comments MICCAI 2026 Early Accept; Project Page: https://tahakoleilat.github.io/Evi-Steer. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October

详情
AI中文摘要

视觉-语言基础模型的参数高效适配对于生物医学图像的精确多模态理解至关重要,但现有方法仍具有确定性,且在域偏移或模糊的图像-文本对齐下常常表现不佳。这一限制在临床中尤为关键,因为模型应在低数据 regime 和域偏移下保持鲁棒性。我们提出了Evi-Steer,一个用于BiomedCLIP的证据跨模态低维引导框架,能够在仅更新总模型参数0.11%的情况下实现不确定性感知的参数高效微调。我们的方法在视觉和文本编码器中执行轻量级低维令牌更新,同时估计认知不确定性。这些不确定性估计更新门控残差,使模型在证据较弱时能够保守地适应。此外,我们引入了基于Dempster-Shafer理论的跨模态置信度融合,使视觉适应能够以文本置信度为条件,并抑制冲突或不确定的跨模态更新。我们在涵盖8个器官和8种成像模态的15个生物医学成像数据集上,在少样本学习和域泛化设置下进行了全面评估。Evi-Steer在少样本学习和域偏移设置下始终优于最先进的方法,展示了在真实临床环境中部署视觉-语言模型的实用且鲁棒的途径。代码可在https://github.com/HealthX-Lab/Evi-Steer获取。

英文摘要

Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates update gate residuals, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods under few-shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at https://github.com/HealthX-Lab/Evi-Steer.

2605.30290 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Self-Trained Verification for Training- and Test-Time Self-Improvement

自训练验证用于训练和测试时的自我改进

Chen Henry Wu, Aditi Raghunathan

发表机构 * arXiv

AI总结 提出自训练验证(STV)方法,通过让验证器模仿参考解决方案下的自身版本,解决自我改进中验证器瓶颈问题,在测试时显著提升验证-细化循环,在训练时通过验证器在环训练(ViL)进一步提升生成器性能。

详情
AI中文摘要

大规模自我改进一直是推理模型的长期目标,有两个自然的实现阶段:测试时,通过验证-细化(V-R)循环;训练时,通过自训练方法。两者都受限于同一个瓶颈:验证器。当验证器得分膨胀而准确率停滞,且反馈过于泛化无法执行时,V-R循环会停滞;当糟糕的自生成数据被加入训练时,自训练同样会失败。更好的验证将解锁两者,但我们想要训练的能力,即捕捉自生成的错误,缺乏训练信号。为了解决这一挑战,我们提出了自训练验证(STV)。我们的关键观察是,虽然模型单独无法捕捉这些错误,但当它看到参考解决方案时却可以。我们将这种不对称性转化为监督目标,训练验证器模仿自身更具信息量的版本。在测试时,STV在困难问题上显著改进了V-R循环,而替代方法(如SFT、对验证器分数进行RL,甚至元验证器)则不然。STV在困难数学任务上大致使准确率翻倍,在科学推理任务上提升14倍(从1.5%到21%)。在训练时,我们额外使用STV验证器在V-R循环内的反馈对生成器进行RL训练——我们称之为验证器在环训练(ViL)。从一个RL收敛的生成器开始,ViL在pass@1上进一步获得33%的提升。更值得注意的是,生成器在测试时无验证器的独立pass@1相对标准RL收敛点提升了30%。因此,困难问题推理的下一个前沿可能在于我们如何训练用于验证和与验证结合的方法。网站:https://ar-forum.github.io/stv-webpage

英文摘要

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar-forum.github.io/stv-webpage

2605.30280 2026-06-02 cs.RO cs.AI cs.CL 版本更新

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen-VLA:统一跨任务、环境和机器人形态的视觉-语言-动作建模

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen

发表机构 * Qwen Team(通义实验室)

AI总结 提出Qwen-VLA,一种基于DiT动作解码器的统一具身基础模型,通过大规模联合预训练和具身感知提示,将操作、导航和轨迹预测统一为动作-轨迹预测框架,实现跨任务、环境和机器人形态的泛化。

Comments 34 pages

详情
AI中文摘要

具身智能通常通过针对单个任务(如操作或导航)的专用模型进行研究,导致能力碎片化,且跨任务、环境和机器人形态的泛化能力有限。在这项工作中,我们研究了异构的具身决策问题是否可以在单个视觉-语言-动作模型中统一。我们提出了Qwen-VLA,一个统一的具身基础模型,它通过基于DiT的动作解码器将Qwen的视觉-语言建模栈从感知、理解和推理扩展到连续动作和轨迹生成。Qwen-VLA通过大规模联合预训练方案在多样化的数据源上进行训练,包括机器人操作轨迹、人类自我中心演示、合成模拟数据、视觉-语言导航数据、轨迹中心监督和辅助视觉-语言数据。为了支持多种机器人平台,我们引入了具身感知提示调节,其中特定于机器人的文本描述指定了当前的具身形态和控制约定。我们进一步将操作、导航和轨迹预测统一为一个动作-轨迹预测框架,实现了跨机器人形态、任务族和环境的可迁移视觉基础、空间推理和连续动作生成。在操作、导航和轨迹中心基准上的实验显示,在场景布局、背景、光照、物体配置和机器人形态变化下,具有一致的多任务性能和分布外泛化能力。Qwen-VLA-Instruct在LIBERO上达到97.9%,在Simpler-WidowX上达到73.7%,在RoboTwin-Easy/Hard上达到86.1%/87.2%,在R2R上达到69.0% OSR,在RxR上达到59.6% SR,在真实世界ALOHA实验中平均OOD成功率为76.9%,在DOMINO动态操作上零样本成功率为26.6%。

英文摘要

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.

2605.30237 2026-06-02 cs.IR cs.CL cs.LG 版本更新

GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases

GRASP:半结构化知识库上具有自适应融合和重排序的计划引导图检索

Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, Jie Liu

发表机构 * Department of Electrical Engineering and Computer Science, University of Michigan(密歇根大学电气工程与计算机科学系) Department of Computational Medicine and Bioinformatics, University of Michigan(密歇根大学计算医学与生物信息学系)

AI总结 提出GRASP框架,通过计划引导图检索、条件融合和重排序三阶段,在半结构化知识库上实现最先进的检索性能。

详情
AI中文摘要

半结构化知识库(SKBs)将文本文档嵌入实体和关系的类型化图中,并支持产品搜索、学术论文搜索和精准医学查询等应用。现有的SKB混合检索系统要么仅将图用于查询扩展,要么在全局加权下混合文本和结构分支,要么依赖微调的图遍历生成器。我们提出了GRASP,一个三阶段SKB检索框架,统一了基于计划的图检索、与密集检索器的计划条件融合以及融合候选上的微调重排序。GRASP在三个STaRK基准测试的每个指标上显著推进了现有技术水平,将平均Hit@1从62.0提升至73.9。消融和敏感性研究进一步证实了GRASP的有效性和鲁棒性。

英文摘要

Semi-structured knowledge bases (SKBs) embed textual documents in a typed graph of entities and relations, and underpin applications such as product search, academic paper search, and precision-medicine inquiries. Existing hybrid retrieval systems on SKBs either use the graph only for query expansion, mix textual and structural branches under a global weighting, or rely on fine-tuned graph-traversal generators. We present GRASP, a three-stage SKB retrieval framework unifying plan-based graph retrieval, plan-conditioned fusion with a dense retriever, and a fine-tuned reranker over the fused candidates. GRASP substantially advances the state of the art on every metric across the three STaRK benchmarks, lifting average Hit@1 from 62.0 to 73.9. Ablation and sensitivity studies further confirm the effectiveness and robustness of GRASP.

2605.29987 2026-06-02 cs.LG cs.CL 版本更新

MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment

MIC: 通过各向同性子空间对齐最大化自适应表示中的信息容量

Dang Nguyen Hong, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 针对多尺度表示中的维度冗余和谱坍缩问题,提出MIC框架,通过各向同性子空间对齐、软坍缩正则化和谱各向同性正则化,结合自蒸馏目标生成语义密集且高判别力的表示,在高压缩场景下显著优于基线。

Comments Accepted at the GlobalSouthML Workshop at ICML 2026. 8 pages, 2 figures

详情
AI中文摘要

尽管多尺度表示学习能够实现弹性维度的嵌入,但嵌套子空间常常遭受维度冗余和谱坍缩的困扰。为了解决这一问题,我们引入了MIC,一个通过各向同性子空间对齐来优化多粒度嵌入几何景观的框架。MIC采用软坍缩正则化(SCR),通过交叉相关惩罚来减轻前缀子空间和残差子空间之间的冗余,同时使用谱各向同性正则化(SIR)确保低维前缀的超球面均匀性。通过自蒸馏目标统一这些策略,MIC生成语义密集的表示,并保持高判别力。我们的实验表明,MIC显著优于标准基线,特别是在维持信息容量最为关键的高压缩场景中。

英文摘要

Although multi-scales representation learning enables elastic-dimension embeddings, nested subspaces often suffer from dimensional redundancy and spectral collapse. To address this, we introduce MIC, a framework that optimizes the geometric landscape of multi-granular embeddings through isotropic subspace alignment. MIC employs Soft Collapse Regularization (SCR) to mitigate redundancy between prefix and residual subspaces via cross-correlation penalties, alongside Spectral Isotropy Regularization (SIR) to ensure hyper-spherical uniformity in low-dimensional prefixes. By unifying these strategies through a self-distillation objective, MIC generates semantically dense representations that maintain high discriminative power. Our experiments demonstrate that MIC significantly outperforms standard baselines, particularly in high-compression scenarios where maintaining informational capacity is most critical.

2605.04583 2026-06-02 cs.CL 版本更新

TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)

TajikNLP:一个用于塔吉克语(西里尔字母)综合文本处理的开源工具包

Mullosharif K. Arabov, Karomatullo Habibullozoda, Nurali Shirinov

发表机构 * Institute of Computational Mathematics and Information Technologies(计算数学与信息科技研究所) Kazan Federal University(喀山联邦大学) Bokhtar State University named after Nosiri Khusrav(诺西拉夫命名的博克塔尔州大学)

AI总结 本文介绍 TajikNLP,一个开源的 Python 库,提供首个完整的塔吉克语文本处理流水线,包括清洗、分词、词性标注、词干提取、词形还原等,并发布四个语言数据集,以促进低资源西里尔字母语言的 NLP 研究。

Comments Accepted to CLIB 2026

详情
AI中文摘要

塔吉克语使用西里尔字母书写,在公开可用的自然语言处理工具包方面仍然严重资源不足,这阻碍了语言学研究和应用开发。本文介绍了 TajikNLP,一个开源的 Python 库,它提供了首个完整的流水线来处理真实的塔吉克语文本,同时保留原始西里尔字母正字法。该库实现了以统一 Doc 对象为中心的模块化架构,支持顺序应用组件进行清洗、规范化、分词(包括子词 BPE)、词素切分、词性标注、词干提取、词形还原和句子分割。引入了一个新颖的统一形态引擎,提供受控和深度分析模式,显著改进了对塔吉克语黏着性名词和动词屈折变化的处理。该发布版还包含一个基于词典的情感分析器和直接从 Hugging Face Hub 加载的预训练 Word2Vec/FastText 嵌入。为确保可重复性并促进未来研究,四个配套的语言数据集——一个词性标注语料库(52.5k 条目)、一个情感词典(3.5k 条目)、一个地名辞典(5.6k 条目)和一个个人姓名数据集(3.8k 条目)——已在宽松许可下公开发布。该库的可靠性通过包含 616 个自动化测试的广泛测试套件验证,实现了 93% 的源代码覆盖率。因此,TajikNLP 为塔吉克语处理建立了基础技术基础设施,降低了在低资源西里尔字母环境中学术和工业应用的门槛。

英文摘要

The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces TajikNLP, an open-source Python library that provides the first comprehensive pipeline for processing authentic Tajik text while preserving the original Cyrillic orthography. The library implements a modular architecture centered around a unified Doc object, enabling sequential application of components for cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, part-of-speech tagging, stemming, lemmatization, and sentence splitting. A novel unified morphology engine is introduced, offering controlled and deep analysis modes that significantly improve handling of Tajik's agglutinative nominal and verbal inflections. The release further incorporates a lexicon-based sentiment analyser and pre-trained Word2Vec/FastText embeddings loaded directly from the Hugging Face Hub. To ensure reproducibility and facilitate future research, four accompanying linguistic datasets -- a POS-tagged corpus (52.5k entries), a sentiment lexicon (3.5k entries), a toponym gazetteer (5.6k entries), and a personal names dataset (3.8k entries) -- have been openly published under permissive licenses. The library's reliability is validated by an extensive test suite of 616 automated tests achieving 93% source code coverage. TajikNLP thus establishes a foundational technological infrastructure for Tajik language processing, lowering the barrier to entry for both academic and industrial applications in low-resource Cyrillic-script environments.

2512.00837 2026-06-02 cs.CL 版本更新

WaterSearch: Exploring Seed Pooling for Improving the Quality-Detectability Trade-off in LLM Watermarking

WaterSearch:探索种子池以改进LLM水印中质量-可检测性权衡

Yukang Lin, Jiahao Shao, Shuoran Jiang, Wentao Zhu, Bingjie Lu, Xiangping Wu, Joanna Siebert, Qingcai Chen

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Peng Cheng Laboratory(鹏城实验室)

AI总结 本文提出WaterSearch框架,通过控制种子池实现句子级搜索,联合优化分布保真度和水印信号特征,在保持高可检测性的同时显著提升文本质量。

详情
AI中文摘要

水印技术作为大型语言模型(LLMs)生成文本的关键保护措施,通过将可识别信号嵌入模型输出,实现可靠归因并增强机器生成内容的安全性。现有方法通常通过操纵令牌生成概率来嵌入信号,尽管有效,但本质上面临可检测性与文本质量之间的权衡:鲁棒水印所需的信号强度和随机性往往会降低下游任务的性能。本文设计了一种新颖的嵌入方案,通过控制种子池促进水印文本的多样化并行生成。基于该方案,我们提出WaterSearch,一种句子级、基于搜索的水印框架,可适应多种现有方法。WaterSearch通过联合优化两个关键方面来提升文本质量:1)分布保真度;2)水印信号特性。此外,WaterSearch辅以句子级检测方法,具有强大的攻击鲁棒性。我们在三个流行LLM上评估了该方法,涵盖十个不同任务。大量实验表明,在水印可检测强度为95%时,我们的方法相比最先进基线实现了平均51.01%的性能提升。在短文本生成和低熵输出生成等挑战性场景中,我们的方法分别获得了47.78%和36.47%的性能增益。此外,在包括插入、同义词替换和释义攻击的不同攻击场景下,WaterSearch保持高可检测性,进一步验证了其强大的抗攻击能力。我们的代码可在https://github.com/Yukang-Lin/WaterSearch获取。

英文摘要

Watermarking acts as a critical safeguard in text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Despite their effectiveness, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness required for robust watermarking tend to degrade the performance of downstream tasks. In this paper, we design a novel embedding scheme that controls seed pools to facilitate diverse parallel generation of watermarked text. Based on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework adaptable to a wide range of existing methods. WaterSearch enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. Furthermore, WaterSearch is complemented by a sentence-level detection method with strong attack robustness. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments demonstrate that our method achieves an average performance improvement of 51.01\% over state-of-the-art baselines at a watermark detectability strength of 95\%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields performance gains of 47.78\% and 36.47\%, respectively. Moreover, under different attack senarios including insertion, synonym substitution and paraphrase attasks, WaterSearch maintains high detectability, further validating its robust anti-attack capabilities. Our code is available at \href{https://github.com/Yukang-Lin/WaterSearch}{https://github.com/Yukang-Lin/WaterSearch}.

2605.29365 2026-06-02 cs.CL 版本更新

Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset

Casual 作为锚点:解决语体转换数据集中的监督偏差

Hyojeong Yu, Hyukhun Koh, Minsung Kim, Kyomin Jung

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出将语体转换重新概念化为一个分级维度,引入三层次谱系(非正式、随意、正式),并构建3LF数据集以解决现有基准中监督信号偏差导致的非正式到正式转换失败问题。

Comments HEAL@CHI 2026 Workshop Paper

详情
AI中文摘要

语体转换通常被框架化为非正式与正式语体之间的对称双向任务。我们认为这一框架掩盖了现有基准(如GYAFC)中的监督设计缺陷:二元人工改写编码的是相对风格变化,而非绝对的人类语体概念。因此,模型学会生成满足基准标签的伪正式输出,却无法产生真正的正式语言。我们通过重新评估基准正式标签(基于人类对齐的语体定义)量化了这种偏差,揭示了在多个模型族中持续存在的非正式到正式转换失败。为解决此问题,我们将语体转换重新概念化为一个分级维度而非二元属性。我们引入一个三层次谱系:非正式、随意和正式,其中随意作为明确的中间状态,澄清了监督信号。基于此框架,我们提出了3LF数据集,提供所有三个层次的平行监督。在3LF上训练显著减少了非正式到正式转换的失败,并改善了与人类感知的对齐。例如,GPT-4.1-nano在非正式到正式方向上的F1值从0.06提升至0.88,尽管3LF比GYAFC小得多。我们进一步证明这些增益无法仅通过上下文学习复现,并提供了对歧义驱动错误和意义扭曲的定性分析。总体而言,我们的发现展示了监督设计如何塑造风格对齐,并强调了在可控文本生成中考虑对齐的基准构建的重要性。

英文摘要

Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human rewrites encode relative stylistic shifts rather than absolute human notions of formality. Consequently, models learn to generate pseudo-formal outputs that satisfy benchmark labels while failing to produce genuinely formal language. We quantify this misalignment by re-evaluating benchmark formal labels under a human-aligned definition of formality, revealing substantial discrepancies that propagate to consistent informal-to-formal failures across model families. To address this issue, we reconceptualize formality transfer as a graded dimension rather than a binary attribute. We introduce a three-level spectrum: informal, casual, and formal, where casual serves as an explicit intermediate state that clarifies supervision signals. Based on this framework, we introduce 3LF, a dataset providing parallel supervision across all three levels. Training on 3LF substantially reduces informal-to-formal failures and improves alignment with human perception. For example, GPT-4.1-nano improves from 0.06 to 0.88 F1 in the informal-to-formal direction despite 3LF being significantly smaller than GYAFC. We further demonstrate that these gains cannot be reproduced through in-context learning alone and provide qualitative analyses of ambiguity-driven errors and meaning distortions. Overall, our findings demonstrate how supervision design shapes stylistic alignment and highlight the importance of alignment-aware benchmark construction in controllable text generation.

2605.29341 2026-06-02 cs.CV cs.CL 版本更新

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

WorldMemArena: 通过动作-世界交互评估多模态智能体记忆

Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang

发表机构 * University of California, Santa Barbara(加州大学圣芭芭拉分校) J.P. Morgan Chase(摩根大通) ETH Zurich(苏黎世联邦理工学院) Stanford University(斯坦福大学) Johns Hopkins University(约翰霍普金斯大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出WorldMemArena基准,通过动作-世界交互循环的四阶段生命周期评估多模态智能体记忆,揭示现有方法在写入、维护、检索和使用中的失败点。

Comments 25 pages, 8 figures

详情
AI中文摘要

多模态大语言模型越来越多地被部署为长周期智能体,其中记忆必须做的不仅仅是回忆:它必须跟踪不断变化的世界,修正过时的信息,并在决策时提供正确的证据。现有基准衡量静态对话中的回忆,将记忆压缩为单一的任务结束准确率,并将视觉观察简化为字幕,使我们无法将失败定位到写入、维护、检索或使用。能够自主管理记忆的智能体框架的兴起加剧了这一差距,因为我们没有原则性的方法来比较手工设计的流水线与自我管理的替代方案。为了弥补这些差距,我们将多模态智能体记忆形式化为一个具有可观察四阶段生命周期的动作-世界交互循环,并在WorldMemArena中实例化:400个多会话多模态任务,涵盖终身演化(演化的个人和任务状态)和智能体执行(来自真实观察、动作和反馈的记忆),并标注了黄金记忆点、更新、干扰项和用于阶段级诊断的证据链。这使得长上下文、手工设计(RAG和外部记忆系统)和基于框架的记忆智能体之间首次进行直接比较。结果表明:(1)更好的记忆写入和存储并不能保证更好的性能;(2)多模态记忆仍然难以充分利用视觉证据;(3)系统在不同领域不稳定,并在真实的智能体轨迹上性能下降;(4)框架记忆更灵活,但成本更高且可靠性较低。

英文摘要

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.

2605.13511 2026-06-02 cs.CL cs.AI 版本更新

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

Many-Shot CoT-ICL: 使上下文学习真正学习

Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

发表机构 * The University of Hong Kong(香港大学)

AI总结 研究多示例思维链上下文学习在推理任务中的特性,提出曲线演示选择方法,在数学任务上提升5.42个百分点。

Comments Accepted by ICML 2026

详情
AI中文摘要

虽然多示例ICL取得了显著性能,但先前对其缩放行为的研究主要关注非推理任务。在这项工作中,我们研究了推理任务上的多示例ICL,特别关注多示例思维链上下文学习(CoT-ICL)。通过分析非推理和推理任务以及非推理和推理导向的LLM,我们识别出多示例CoT-ICL的几个独特性质。我们进一步将这些发现解释为多示例CoT-ICL是上下文测试时学习而非缩放模式匹配,并提出两个原则:(i)演示应易于目标模型理解,(ii)它们应按顺序排列以支持平滑的概念进展。受该原则指导,我们提出了曲线演示选择(CDS),一种简单的排序方法,在具有64个演示的数学任务上获得了高达5.42个百分点的提升。总体而言,我们的结果将长上下文窗口从检索缓冲区重新定义为上下文测试时学习的结构化课程。

英文摘要

While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In this work, we study many-shot ICL on reasoning tasks, with a particular focus on many-shot chain-of-thought in-context learning (CoT-ICL). Analyzing across non-reasoning and reasoning tasks and across non-reasoning and reasoning-oriented LLMs, we identify several distinctive properties of many-shot CoT-ICL. We further interpret these findings by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on a math task with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

2605.24727 2026-06-02 cs.AI cs.CL cs.CY cs.IT math.IT 版本更新

Fundamental Limitation in Explaining AI

解释AI的根本限制

Atsushi Suzuki, Jing Wang

发表机构 * Department of Mathematics Faculty of Science(科学学院数学系) The University of Hong Kong Hong Kong SAR(香港大学香港特别行政区) School of Computing and Mathematical Sciences Faculty of Engineering and Science(工程与科学学院计算与数学科学系) University of Greenwich United Kingdom(格林威治大学英国)

AI总结 本文通过数学证明了一个解释AI的基本四难困境,指出AI及其解释无法同时满足环境复杂性、AI性能优良、解释可解释性和解释完全忠实性四个条件,从而表明AI治理应基于解释忠实性总是不完整的假设。

Comments minor modifications

详情
AI中文摘要

尽管大规模模型如LLMs和扩散模型已取得实际成功,公共机构强调了AI可解释性的重要性。然而,现有的解释AI方法并非旨在提供大规模AI系统行为的完全忠实解释。虽然对AI系统行为的完全忠实且可解释的解释可能对AI治理有用,但尚不清楚提供这样的解释在理论上是否可能。在本文中,我们从数学上证明了解释AI的一个基本四难困境,指出AI及其解释无法同时满足以下四个条件:1)操作环境的复杂性,2)AI性能的优良性,3)AI解释的可解释性,以及4)AI解释的完全忠实性。这个四难困境表明,在大多数我们无法改变环境或牺牲良好AI性能和可解释解释的应用中,我们应该放弃解释的完全忠实性,而应仅针对应用重要的部分进行解释。因此,该四难困境意味着AI治理应基于AI解释的忠实性总是不完整的假设来设计。

英文摘要

While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importance of explainability in AI. Existing methods for explaining AI, however, are not designed to provide completely faithful explanations of the behavior of large-scale AI systems. Although a completely faithful and interpretable explanation of the behavior of an AI system might be useful for AI governance, it has not been known whether providing such an explanation is theoretically possible. In this paper, we mathematically prove a fundamental quadrilemma in explaining AI, stating that AI and its explanation cannot satisfy the following four conditions simultaneously: 1) the complexity of the operation environment, 2) the goodness of the AI's performance, 3) the interpretability of the AI's explanation, and 4) the complete faithfulness of the AI's explanation. This quadrilemma suggests that, in most applications where we cannot change the environment or sacrifice good AI performance and an interpretable explanation, we should give up complete faithfulness of explanations and should instead aim to explain only the parts that are important for applications. As a consequence, the quadrilemma implies that AI governance should be designed on the premise that the faithfulness of AI explanations is always incomplete.

2605.24681 2026-06-02 cs.CL cs.AI 版本更新

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

Mix-MoE:通过混合专家混合提升大语言模型的多语言机器翻译

Bo Li, Tianyu Dong, Shaolin Zhu, Deyi Xiong

发表机构 * School of Software, Tsinghua University(清华大学软件学院) College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院)

AI总结 提出Mix-MoE框架,通过将MoE层分为语言模型专家和机器翻译专家,并利用傅里叶变换增强路由机制,解决大语言模型在多语言机器翻译微调中的参数干扰问题。

Comments Accepted by TASLP

详情
AI中文摘要

大语言模型(LLMs)在多语言机器翻译(MT)中展现出巨大潜力,即使双语监督有限。然而,使用平行语料库微调LLMs带来了主要挑战,即参数干扰。为了解决这些问题,我们提出了Mix-MoE,一个混合专家混合框架,旨在训练LLMs进行多语言MT。我们的框架在两个不同的阶段运行:(1)在单语语料库上使用MoE进行后预训练,以及(2)在平行语料库上使用MoE进行后预训练。关键的是,我们将MoE层分为两个专门的组:语言模型专家(LM专家)和机器翻译专家(MT专家)。LM专家旨在捕获和保留预训练LLM学到的单语知识。另一方面,MT专家专门训练以获取和存储双语翻译知识。此外,为了促进这些专门专家之间的有效交互并利用文本中潜在的结构模式,我们引入了一种由模型表示中的傅里叶变换特征增强的路由机制。实验结果表明,Mix-MoE在多语言MT中表现出色,显著优于现有基线,并在缓解参数干扰方面取得了显著进展。

英文摘要

Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.

2605.24528 2026-06-02 cs.AI cs.CL cs.LG 版本更新

Hypothesis Generation and Inductive Inference in Children and Language Models

儿童与语言模型中的假设生成与归纳推理

Jeffrey Qin, Wasu Top Piriyakulkij, Zhuangfei Gao, Mia Radovanovic, Jessica Sommerville, Kevin Ellis, Marta Kryven

发表机构 * Computer Science University of Waterloo(滑铁卢大学计算机科学系) Department of Computer Science Cornell University(康奈尔大学计算机科学系) Department of Computer Science Dalhousie University(达尔豪斯大学计算机科学系) Department of Psychology University of Toronto(多伦多大学心理学系)

AI总结 通过归纳推理盒子任务,结合贝叶斯粒子推断的程序归纳形式化,比较儿童与基于LLM的智能体在不确定性下的假设生成与证据寻求行为,发现两者在适应环境结构上相似但信息寻求成本与归纳偏差不同。

详情
AI中文摘要

现实世界中的决策需要在证据、潜在因果规则以及世界状态本身的不确定性下构建心智模型。在这种条件下,哪些计算原理支撑人类的推理?在给定匹配约束下,基于LLM的智能体是否表现出类似行为?我们使用归纳推理盒子任务来探讨这些问题,在该任务中,参与者(人类儿童和基于LLM的智能体)通过与不确定环境的顺序交互来推断潜在原因。我们将该任务形式化为基于贝叶斯粒子推断的程序归纳,并承认两种互补的解释:(1) 作为对假设的约束满足过程,以及(2) 作为程序综合问题,其中假设是针对证据评估的可执行程序。使用基于约束的公式,我们表明儿童的行为最好由主观证据可靠性和在线假设生成的组合来解释,这解释了他们的证据寻求模式以及任务完成与规则泛化之间的分离。使用程序综合公式,我们将基于LLM的智能体视为模型有机体:可控系统,允许系统性地操纵任务条件。在各种后端中,基于LLM的智能体复制了儿童对证据可靠性和可观察性变化的反应,包括折扣不可靠证据、寻求解决部分信息以及任务完成与因果泛化之间的分离。同时,与儿童相比,基于LLM的智能体倾向于过度观察和过度遵守指令。这些结果表明,虽然儿童和基于LLM的智能体在适应环境结构方面相似,但他们的信息寻求行为表现出不同的潜在成本和归纳偏差。

英文摘要

Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.

2605.18838 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

说谎只是一个阶段:语言模型扩展中的隐藏对齐转变

Adil Amin

发表机构 * ZEHEN Labs(ZEHEN实验室)

AI总结 通过分析63个基础模型,发现语言模型在特定规模阈值下,推理能力与真实性从反相关转变为正相关,并揭示了输出投影瓶颈和零竞争注意力头等内部机制。

Comments 15 pages, 8 figures, 2 tables. Companion paper: "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next." ( https://doi.org/10.48550/arXiv.2605.18840). Code: https://github.com/adilamin89/cape-scaling. Dashboard: https://zehenlabs.com/cape/

详情
AI中文摘要

扩展定律预测了计算量带来的损失,但未预测能力如何相互作用。我们测量了来自16个家族的63个基础模型的推理能力与真实性之间的耦合,并发现了一个在损失曲线中不可见的相变:低于家族依赖的临界规模N_c时,能力反相关(r = -0.989,p = 4 x 10^{-5},非参数置换检验);高于该规模时,它们合作。N_c ~ 3.5B参数 [2.9B, 13.4B](bootstrap 95% CI),但模型大小并非决定相位的唯一变量。架构、数据整理和训练配方各自独立地改变N_c:精心整理的数据消除了Qwen代际之间的耦合下降(在匹配规模下从0.025到0.830),Gemma-4在4B时通过蒸馏和架构创新实现了0.871的耦合,这通常是13B+标准训练模型的特征,而Phi在1B时仅通过数据整理就达到了10B网络训练模型的耦合水平。宽度归一化消除了所有测试家族的反相关,支持输出投影瓶颈的存在。在内部,40个模型中有38个显示零竞争注意力头。一个稀疏回归ODE以5.6%的误差交叉预测了保留的Llama-2。该诊断不需要模型内部信息——仅需跨模型家族的公开基准分数。合作区域扩展到前沿(r = +0.72,34个模型,10个实验室)。一个概念验证干预证实了瓶颈是可利用的:在识别层添加单个真实方向向量,无需重新训练即可纠正税收阶段60%的错位输出——这是一种无需修改权重的、每推理一次的外科手术式修正。代码、数据、用于任何开放权重模型的开源转向CLI以及用于相位诊断的交互式仪表板已发布:https://zehenlabs.com/cape/。

英文摘要

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 x 10^{-5} nonparametric permutation test); above it, they cooperate. N_c ~ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining -- a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.

2603.09095 2026-06-02 cs.CL cs.CV 版本更新

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

阅读,而非思考:理解并弥合多模态大语言模型中文本变为像素时的模态差距

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Amazon(亚马逊) New York University(纽约大学) Texas A&M University(德克萨斯大学)

AI总结 本文系统诊断多模态大语言模型在处理图像文本时的模态差距,发现其源于模型推理意愿不足而非感知失败,并提出一种轻量级自蒸馏方法有效弥合该差距。

详情
AI中文摘要

多模态大语言模型(MLLMs)能够处理以图像形式呈现的文本,但它们的表现往往不如相同内容以文本令牌形式提供时。我们通过在五种输入模式下跨七个基准评估七个MLLM,系统性地诊断了这种“模态差距”,涵盖了从合成渲染文本到来自arXiv PDF和Wikipedia页面的真实文档图像。我们发现,该差距对字体和分辨率等渲染选择高度敏感,并且自然文档图像通常表现出更小的差距,这表明性能差异部分反映了评估伪影而非根本性限制。通过对超过4000个示例进行基于扎根理论的错误分析,我们确定了主要原因:仅图像输入抑制了推理努力,模型产生的输出短5-19倍,跳过了逐步计算或推理。不愿推理,而非感知或知识检索失败,驱动了性能差距,尤其是在需要多步推理的任务上。我们展示了一种简单的、轻量级的在线自蒸馏方法,通过让模型在其自身的文本模式推理轨迹与图像输入配对上进行微调,弥合了这一差距,将图像模式准确率提升至匹配或超过文本模式性能,提升超过50%,并且增益可迁移到未见过的基准而不会灾难性遗忘。总体而言,我们的结果和分析提供了对模态差距的系统理解,并指出了在多模态语言模型中改进视觉文本理解的实际路径。

英文摘要

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the gap is highly sensitive to rendering choices such as font and resolution, and that natural document images often exhibit much smaller gaps, suggesting the performance difference partly reflects evaluation artifacts rather than fundamental limitations. Through a grounded-theory error analysis of over 4,000 examples, we identify the primary cause: image input alone suppresses reasoning effort, with models producing 5--19x shorter outputs that skip step-by-step computation or reasoning. The reluctance to reason, not a failure of perception or knowledge retrieval, drives the performance gap, particularly on tasks requiring multi-step reasoning. We show that a simple, lightweight on-policy self-distillation method by fine-tuning models on their own text-mode reasoning traces paired with image inputs closes this gap, raising image-mode accuracy to match or exceed text-mode performance with over 50\% improvement, and the gains transfer to unseen benchmarks without catastrophic forgetting. Overall, our results and analyses provide a systematic understanding of the modality gap and suggest a practical path toward improving visual text understanding in multimodal language models.

2605.24005 2026-06-02 cs.AI cs.CL 版本更新

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

LC-ERD:通过一致性调节奖励分解挖掘潜在逻辑以实现自我进化推理

Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu, Jiaming Han, Xiao Guo, Jinhu Qi, Yu Li, Yifei Zhang, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Shanghai Jiaotong University(上海交通大学) Fudan University(复旦大学)

AI总结 针对大语言模型推理中高质量过程数据稀缺的问题,提出LC-ERD框架,通过潜在逻辑挖掘和一致性调节的奖励分解,实现自我对齐与推理进化。

Comments Accepted in SIGKDD 2026 Research Track

详情
AI中文摘要

大语言模型推理的进化受到高质量过程数据稀缺的瓶颈限制。虽然通过内生奖励进行自我对齐提供了一种解决方案,但挖掘有效监督面临三个挑战:(1)通过模仿偏差产生的标签噪声,奖励优先考虑统计可能性而非逻辑真实性,造成掩盖复合错误的“正确性幻觉”;(2)粗粒度监督,稀疏的全局结果(例如在GRPO中)无法提供细粒度指导,将推理链视为整体;(3)分布崩溃,信号无法在不放大预训练偏差的情况下泛化。为了解决这些问题,我们引入了LC-ERD(逻辑一致的内生奖励分解),一个将自我对齐视为潜在结构挖掘的框架。我们通过聚合模型潜在逻辑专家(LLE)的共识推导出变分逻辑势,以去噪推理流形,并引入基于IGM原则的多智能体价值分解协议来量化单个步骤的效用。实验表明,LC-ERD提供了一条稳健的自我进化路径,揭示了逻辑一致性与准确性之间的权衡,同时识别了标准奖励遗漏的高价值推理模式。我们的代码可在https://github.com/LC-ERD-repo/LC-ERD获取。

英文摘要

The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/LC-ERD-repo/LC-ERD.

2605.22978 2026-06-02 cs.CL 版本更新

A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

一种可复现的通用依赖风格管道用于Katharevousa希腊语议会文本

George Mikros, Fotios Fitsilis

发表机构 * Hamad Bin Khalifa University(哈马德·本·卡立夫大学) Universidad Austral(阿维拉大学)

AI总结 针对Katharevousa希腊语,提出一种可复现的通用依赖风格解析管道,结合OCR重建、LLM辅助标注、自动验证和模型比较,显著提升解析性能并开源全部资源。

Comments 12 pages, 1 figure, 2 tables; companion to the kathnlp open-source release at https://github.com/gmikros/katharevousa-nlp-tooling

详情
AI中文摘要

Katharevousa希腊语尽管在法律、行政和议会档案中具有重要性,但当代NLP管道对其支持不足。我们提出了一种可复现的工作流程,用于构建和评估针对希腊后军政府时期早期议会问题的通用依赖风格解析资源。该管道结合了OCR感知重建、模式约束的LLM辅助标注、自动验证、确定性CoNLL-U快照、固定分割评估和模型族比较。冻结的自动验证参考集包含1,697个句子,分为1,357个训练句子和340个保留测试句子。我们在相同评分协议下比较了现成的希腊语和古希腊语解析器、基于特征的解析器、mBERT、XLM-R和自定义Stanza训练。现成系统显示出显著的语域不匹配:最强的外部基线spaCy希腊语达到0.4183 LAS。最佳结构解析器XLM-R模型达到0.8893 UPOS准确率、0.7250依赖关系F1、0.6098 UAS和0.5162 LAS,比最佳外部基线绝对LAS提升0.0980。基于特征的模型在UPOS和关系标注上仍具竞争力,表明在此数据规模下透明的词汇-上下文特征仍然重要。除了分数,本文还贡献了一种可审计的方法论,用于将困难的历史议会OCR转化为可重用的句法NLP基础设施。整个管道——代码、模式、冻结参考标注、固定训练/测试分割和每个模型的基准报告——作为本文的开放获取附件发布。

英文摘要

Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1{,}697 sentences, split into 1{,}357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline -- code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports -- is released as an open-access companion to this paper.

2602.04509 2026-06-02 cs.CL 版本更新

Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models

Model-Dowser:无数据重要性探测以缓解多模态大语言模型中的灾难性遗忘

Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim

发表机构 * GitHub

AI总结 提出Model-Dowser,一种基于重要性评分的稀疏微调方法,通过联合考虑权重幅度、输入激活和输出敏感性来缓解多模态大语言模型微调中的灾难性遗忘,在LLaVA和NVILA上优于现有方法。

Comments Accepted at ICML 2026. Code link: https://model-dowser.github.io

详情
AI中文摘要

在特定任务数据上微调多模态大语言模型(MLLMs)是提高下游应用性能的有效方法。然而,这种适应通常会导致预训练任务泛化能力的下降,这种现象称为灾难性遗忘。现有的旨在缓解该问题的方法要么在微调语言解码器更深层时失效,要么随着模型规模增大而扩展性差。为了解决这些局限性,我们提出了Model-Dowser,一种用于MLLMs的新型稀疏微调方法。Model-Dowser通过联合考虑权重幅度、输入激活和输出敏感性,为每个模型参数测量相对于预训练泛化能力(在下游适应之前)的原则性重要性分数。在微调过程中,Model-Dowser选择性地保留高重要性参数并更新其余参数。在两个代表性MLLMs(LLaVA和NVILA)上的综合实验表明,Model-Dowser有效缓解了灾难性遗忘,并始终优于先前方法,同时保持资源高效且可扩展到数十亿参数模型。

英文摘要

Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.

2605.14355 2026-06-02 cs.AI cs.CL 版本更新

Herculean: An Agentic Benchmark for Financial Intelligence

Herculean: 面向金融智能的智能体基准测试

Xueqing Peng, Zhuohan Xie, Yupeng Cao, Haohang Li, Lingfei Qian, Yan Wang, Vincent Jim Zhang, Huan He, Xuguang Ai, Linhai Ma, Ruoyu Xiang, Yueru He, Yi Han, Shuyao Wang, Yuqing Guo, Mingyang Jiang, Yilun Zhao, Youzhong Dong, Xiaoyu Wang, Yankai Chen, Ye Yuan, Qiyuan Zhang, Fuyuan Lyu, Haolun Wu, Yonghan Yang, Zichen Zhao, Yuyang Dai, Fan Zhang, Rania Elbadry, Ayesha Gull, Muhammad Usman Safder, Nuo Chen, Fengbin Zhu, Tianshi Cai, Zimu Wang, Polydoros Giannouris, Yuechen Jiang, Zhiwei Liu, Mohsinul Kabir, Yuyan Wang, Yixiang Zheng, Yangyang Yu, Weijin Liu, Wenbo Cao, Anke Xu, Peng Lu, Jerry Huang, Mingquan Lin, Prayag Tiwari, Yijia Zhao, Víctor Gutiérrez-Basulto, Xiao-Yang Liu, Kaleb E Smith, Jiahuan Pei, Arman Cohan, Jimin Huang, Yuehua Tang, Alejandro Lopez-Lira, Xi Chen, Xue Liu, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou

发表机构 * The Fin AI Yale University(耶鲁大学) Columbia University(哥伦比亚大学) Stevens Institute of Technology(史蒂文斯理工学院) NVIDIA(英伟达) New York University(纽约大学) Georgia Institute of Technology(佐治亚理工学院) University of Florida(佛罗里达大学) MBZUAI Université de Montréal(蒙特利尔大学) University of Minnesota(明尼苏达大学) University of Massachusetts Boston(马萨诸塞大学波士顿分校) National Institute of Advanced Industrial Science and Technology(国家先进工业科学与技术研究院) University of Liverpool(利物浦大学) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) National University of Singapore(新加坡国立大学) Halmstad University(哈尔姆斯塔德大学) University of Manchester(曼彻斯特大学) Cardiff University(卡迪夫大学) McGill University(麦吉尔大学) Mila – Quebec AI Institute(魁北克人工智能研究所)

AI总结 本文提出Herculean,首个覆盖交易、对冲、市场洞察和审计四个代表性工作流的智能体金融智能基准测试,通过标准化MCP技能环境评估异构智能体系统,发现智能体在交易和市场洞察上表现较好,但在对冲和审计等需要长期协调、状态一致性和结构化验证的任务上存在显著不足。

详情
AI中文摘要

随着AI智能体的进步,核心问题不再是它们能否解决孤立的、定义明确的金融任务,而是它们能否可靠地执行金融专业工作。现有的金融基准测试仅提供了这种能力的部分视角,因为它们主要评估静态能力,如问答、检索、摘要和分类。我们引入了Herculean,这是首个面向智能体金融智能的技能基准测试,涵盖四个代表性工作流,包括交易、对冲、市场洞察和审计。每个工作流被实例化为一个基于MCP的标准化技能环境,具有自己的工具、交互动态、约束和成功标准,从而能够对异构智能体系统进行一致的端到端评估。在前沿智能体中,我们发现智能体在交易和市场洞察上表现相对较好,但在对冲和审计上表现显著不足,这些任务中长期协调、状态一致性和结构化验证至关重要。总体而言,我们的结果指出了当前智能体在高风险金融工作流中将金融推理转化为可靠工作流执行方面的关键差距。

英文摘要

As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

2605.19575 2026-06-02 cs.CL 版本更新

A Data-Driven Approach to Idiomaticity Based on Experts' Criteria in Theoretical Linguistics

基于理论语言学专家标准的习语性数据驱动方法

Elena Mikhalkova, Anastasiya Vishnyakova, Anastasiya Drozdova, Polina Gavin, Aleksander Zhmykhov, Timofey Protasov

发表机构 * Tyumen State University(塔戈尔大学) Aston University(阿斯顿大学)

AI总结 基于理论语言学专家标注的16项标准,分析286个多词表达的习语性分布,发现无绝对习语表达,词汇标准影响最大。

详情
AI中文摘要

本文基于理论书籍和论文中描述的16项词汇、语法及其他习语性标准,对286个多词表达(MWE)进行了数据分析。MWE来自相同的理论来源,并由一组语言学专家根据这些类别进行标注。类别分布显示,不存在绝对习语性的表达。词汇标准似乎最具影响力;语法标准受特定条件约束;过时词汇和语法的存在影响MWE被单个词替换的能力。

英文摘要

The article observes data analysis of 286 multi-word expressions (MWEs) based on 16 lexical, grammatical and other criteria described in theoretical books and papers on the notion of idiomaticity. MWEs were collected from the same theoretical sources, and a set of experts in linguistics annotated them with these categories. The distribution of categories shows that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; presence of obsolete words and grammar influence ability of an MWE to be replaced with one word.

2605.19344 2026-06-02 cs.CL 版本更新

Retrieval-Augmented Linguistic Calibration

检索增强的语言校准

Yi-Fan Yeh, Linwei Tao, Minjing Dong, Tao Huang, Jialin Yu, Philip Torr, Chang Xu

发表机构 * School of Computer Science, University of Sydney(悉尼大学计算机科学学院) City University of Hong Kong(香港城市大学) Shanghai Jiao Tong University(上海交通大学) University of Oxford, Department of Engineering Science(牛津大学工程科学系) Department of Engineering Science, University of Oxford(牛津大学工程科学系)

AI总结 提出检索增强的语言校准(RALC)框架,通过分布建模和检索增强改写,提升语言置信度表达的真实性和校准性。

详情
AI中文摘要

诸如“我相信”和“可能”等语言线索为传达置信度提供了直观的界面,然而,一个可泛化的、有原则的语言置信度表达校准框架仍未被充分探索。特别是,共现的语言线索、上下文变化和主观的受众解读带来了独特的挑战。因此,我们将语言置信度建模为关于陈述正确性的合理感知概率值的分布,捕捉标量表示所丢弃的解释变异性。在这个分布框架内,我们引入真实性作为互补的评估维度,并提出真实性散度(FD),一种信息论度量,用于量化真相揭示时受众信念所引发的惊讶。基于这些基础,我们提出检索增强的语言校准(RALC),一个轻量级的后处理管道,通过检索增强改写将校准后的置信度信号传播回自然语言。在三个问答基准和五个LLM家族上,RALC将域内真实性和校准性分别提高了高达66%和58%,优于黑盒和灰盒校准基线。

英文摘要

Linguistic cues such as "I believe" and "probably" offer an intuitive interface for communicating confidence, yet a generalisable, principled calibration framework for linguistic confidence expressions remains underexplored. In particular, co-occurring linguistic cues, contextual variation, and subjective audience interpretation pose unique challenges. We therefore model linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing interpretation variability that scalar representations discard. Within this distributional framework, we introduce faithfulness as a complementary evaluation dimension and present Faithfulness Divergence (FD), an information-theoretic metric quantifying the surprise induced in audience beliefs upon truth revelation. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language via retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration up to 66% and 58%, respectively, outperforming black-box and grey-box calibration baselines.

2605.05945 2026-06-02 cs.CV cs.CL 版本更新

MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

MobileEgo Anywhere:基于商用硬件的长时域自我中心数据开放基础设施

Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathore, Pratyush Patnaik, Shubhanshu Khatana, Ekaksh Janweja

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of California, Los Angeles(加州大学洛杉矶分校) University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 提出MobileEgo Anywhere框架,利用智能手机传感器实现超过一小时的自我中心轨迹采集,并发布开源处理流水线STERA、移动应用及200小时数据集,验证其在视觉-语言-动作模型训练中的有效性。

详情
AI中文摘要

视觉-语言-动作(VLA)模型推动了对大规模自我中心数据集的需求,但用于收集长时域数据的硬件和基础设施仍然难以获取。当前数据集通常只有几分钟长的片段,无法捕捉复杂机器人任务执行所需的长时域时间依赖。我们提出MobileEgo Anywhere,一个在商用移动硬件上收集超过一小时自我中心轨迹的框架,利用现代智能手机传感器进行长期姿态跟踪,避免了传统机器人数据收集的硬件障碍。我们发布三个组件:(1)STERA,一个开源视频处理流水线,将原始移动捕获转换为标准化、训练就绪的格式,用于VLA和基础模型研究;(2)一个免费的移动应用,让任何用户记录自我中心活动;(3)一个200小时的数据集,包含多样化的长格式自我中心数据,跨584个会话具有持久状态跟踪。我们进一步展示该数据是可用的训练信号:在其上对VLA进行中期训练可降低保留动作预测误差。

英文摘要

Vision-language-action (VLA) models have driven demand for large-scale egocentric datasets, yet the hardware and infrastructure to collect long-horizon data remain inaccessible. Datasets today typically have episodes only a few minutes long, which fails to capture the long-horizon temporal dependencies that complex robotic task execution requires. We present MobileEgo Anywhere, a framework for collecting hour-plus egocentric trajectories on commodity mobile hardware that uses modern smartphone sensors for long-term pose tracking without the hardware barriers of traditional robotics data collection. We release three components: (1) STERA, an open-source video-processing pipeline that converts raw mobile captures into standardized, training-ready formats for VLA and foundation-model research; (2) a free mobile app that lets any user record egocentric activity; and (3) a 200-hour dataset of diverse, long-form egocentric data with persistent state tracking across 584 sessions. We further show this data is a usable training signal:mid-training a VLA on it lowers held-out action-prediction error.

2603.05308 2026-06-02 cs.CL cs.AI 版本更新

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Med-V1:用于零样本和可扩展生物医学证据归因的小型语言模型

Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu

发表机构 * Division of Intramural Research, National Library of Medicine, National Institutes of Health(国家医学图书馆内部研究部,国立卫生研究院) Department of Computer Science, University of Virginia(弗吉尼亚大学计算机科学系) Center for Cancer Research, National Cancer Institute, National Institutes of Health(国家癌症研究所癌症研究中心,国立卫生研究院) Department of Population Health Sciences, Weill Cornell Medicine Institute of AI for Digital Health, Weill Cornell Medicine(韦尔·科恩医学中心流行病学与健康科学系,韦尔·科恩医学中心人工智能与数字健康研究所)

AI总结 提出仅3B参数的小语言模型Med-V1,通过高质量合成数据训练,在生物医学证据归因任务上性能媲美GPT-5等前沿大模型,并用于量化LLM幻觉和识别临床指南中的证据错误归因。

详情
AI中文摘要

评估一篇文章是否支持某个断言对于幻觉检测和声明验证至关重要。虽然大型语言模型(LLM)有潜力自动化这一任务,但实现强性能需要如GPT-5这样的前沿模型,而这些模型在规模部署时成本过高。为了高效执行生物医学证据归因,我们提出了Med-V1,一个仅有三亿参数的小语言模型家族。在本研究中新开发的高质量合成数据上训练,Med-V1在统一为验证格式的五个生物医学基准上显著优于其基础模型(+27.0%至+71.3%)。尽管规模较小,Med-V1的性能与GPT-5等前沿LLM相当,并提供高质量的预测解释。我们使用Med-V1进行了首次用例研究,量化了不同引用指令下LLM生成答案中的幻觉。结果表明,格式指令强烈影响引文有效性和幻觉,GPT-5生成更多声明但表现出与GPT-4o相似的幻觉率。此外,我们展示了第二个用例,表明Med-V1可以自动识别临床实践指南中的高风险证据错误归因,揭示了否则难以大规模识别的潜在负面公共卫生影响。总体而言,Med-V1为生物医学证据归因和验证任务的实际应用提供了一种高效、准确的轻量级替代方案。Med-V1可在https://github.com/ncbi-nlp/Med-V1获取。

英文摘要

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.

2510.25799 2026-06-02 cs.CL 版本更新

LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

LISTEN 你的偏好:面向多目标选择的LLM框架

Adam S. Jovine, Tinghan Ye, Francis Bahk, Jingjing Wang, Matthew Ford, David B. Shmoys, Peter I. Frazier

发表机构 * Cornell University(康奈尔大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出LISTEN框架,利用LLM作为决策代理,通过迭代优化内部偏好模型(LISTEN-U参数法或LISTEN-T非参数法)从自然语言中学习用户隐含偏好,实现多目标选择。

Comments Accepted at IJCAI-ECAI 2026 (the 35th International Joint Conference on Artificial Intelligence)

详情
AI中文摘要

人类专家通常难以从大量具有多个竞争目标的项目中选出最佳选项,这一过程因难以形式化复杂、隐含的偏好而成为瓶颈。为解决此问题,我们引入LISTEN(基于LLM的迭代选择与自然语言权衡评估),一种基于LLM的代理框架,将LLM视为决策代理,能够迭代优化其内部偏好模型并采取行动(如提出效用或选择候选者),以最大化与用户隐含目标的对齐。为了在上下文窗口和推理成本等LLM约束下运行,我们提出两种迭代算法:LISTEN-U,使用LLM细化参数化效用函数;以及LISTEN-T,一种非参数方法,对小批量解决方案进行锦标赛式选择。在包括航班预订、购物和考试排程等多样化任务上的评估结果显示,当偏好参数化对齐时(我们通过一种新颖的一致性度量来衡量),LISTEN-U表现优异,而LISTEN-T整体性能更稳健。这项工作探索了直接用自然语言引导复杂多目标决策的有前景方向,减轻了传统偏好诱导的认知负担。代码见https://github.com/AdamJovine/LISTEN;数据见https://huggingface.co/datasets/AdamJovine/LISTEN-benchmark。

英文摘要

Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN (LLM-based Iterative Selection with Trade-off Evaluation from Natural-language), an agentic LLM-based framework that treats the LLM as a decision-making agent capable of iteratively refining its internal preference model and taking actions (e.g., proposing utilities or selecting candidates) to maximize alignment with a user's implicit goals. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance overall. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation. Code is available at https://github.com/AdamJovine/LISTEN; data is available at https://huggingface.co/datasets/AdamJovine/LISTEN-benchmark.

2602.19612 2026-06-02 cs.CL 版本更新

Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

遗忘的解剖:事实显著性与模型微调的双重影响

Anna Borisiuk, Andrey Savchenko, Alexander Panchenko, Elena Tutubalina

发表机构 * AIRI Sber AI Lab HSE University(HSE大学) Skoltech ISP RAS Research Center for Trusted Artificial Intelligence(ISP RAS可信人工智能研究中心)

AI总结 本研究通过构建DUET基准,分析了预训练与监督微调阶段的事实显著性对机器遗忘效果的影响,发现微调后的模型遗忘更平滑且保留率更高。

详情
AI中文摘要

机器遗忘(MU)使大型语言模型(LLM)能够移除不安全或过时的信息。然而,现有工作假设所有事实同样可遗忘,并且很大程度上忽略了被遗忘的知识是来自预训练还是监督微调(SFT)。在本文中,我们引入了DUET(跨训练阶段的双重遗忘评估),这是一个包含28.6k个维基数据派生三元组的基准,这些三元组使用维基百科链接计数和基于LLM的显著性分数标注了事实流行度。我们的实验表明,预训练模型和SFT模型对遗忘的反应不同。在遗忘数据上进行SFT步骤可产生更平滑的遗忘、更稳定的调优以及10-50%更高的保留率,而直接在预训练模型上进行遗忘仍然不稳定,容易发生重新学习或灾难性遗忘。

英文摘要

Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUET (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

2605.13136 2026-06-02 cs.CL 版本更新

GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning

GateKD: 基于置信度门控的闭环蒸馏用于鲁棒推理

Kasidit Sermsri, Teerapong Panboonyuen

发表机构 * Chulalongkorn University(朱拉隆梭大学) MARSAIL PBYAIL (Panboonyuen AI Lab)(Panboonyuen AI Lab)

AI总结 提出GateKD框架,通过置信度门控的闭环蒸馏(包括软监督、隐藏状态演化和注意力蒸馏)减少噪声传播,提升学生模型在常识、逻辑和符号推理上的鲁棒性。

Comments 16 pages

详情
AI中文摘要

从大型语言模型(LLMs)中蒸馏多步推理能力到紧凑的学生模型仍然具有挑战性,原因在于噪声理由、幻觉监督和静态的师生交互。现有的推理蒸馏方法,包括基于导师的方法,主要采用开环方式运行,隐含地假设教师可靠性一致,从而传播错误的中间推理。我们提出GateKD,一种置信度门控的闭环蒸馏框架,通过将教师视为动态门控者而非静态预言家,实现鲁棒的推理迁移。GateKD引入三种互补机制:(i) 置信度门控的软监督,选择性蒸馏可靠的预测信号;(ii) 门控的隐藏状态演化,仅在教师置信度高时对齐中间表示;(iii) 可靠性过滤的注意力蒸馏,在抑制噪声模式的同时保留稳定的推理结构。这些组件共同形成一个闭环反馈回路,其中教师置信度持续调节蒸馏过程,减少幻觉转移并稳定学生推理。在常识、逻辑和符号推理基准上,使用不同规模的T5和Flan-T5骨干网络进行的大量实验表明,GateKD始终优于强开环蒸馏基线。值得注意的是,GateKD在逻辑和符号推理中取得了显著收益,在低资源蒸馏设置下保持鲁棒性,并且当移除任何门控组件时性能明显下降。我们的结果强调,置信度门控的闭环监督对于构建可靠且可扩展的小型推理模型至关重要。

英文摘要

Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher-student interactions. Existing reasoning distillation methods, including mentor-based approaches, predominantly operate in an open-loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence-gated closed-loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence-gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden-state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability-filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan-T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open-loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low-resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence-gated closed-loop supervision is critical for building reliable and scalable small reasoning models.

2605.12813 2026-06-02 cs.CL cs.AI cs.CR cs.LG 版本更新

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

REALISTA: 引发LLM幻觉的逼真潜在对抗攻击

Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出REALISTA框架,通过潜在空间优化语义等价的对抗提示,有效引发大语言模型幻觉,优于现有方法。

Comments Accepted at ICML 2026. Code is available at https://github.com/Buyun-Liang/REALISTA

详情
AI中文摘要

大型语言模型(LLM)在许多任务上表现出色,但仍然容易产生幻觉,因此有必要在逼真的对抗输入下系统地评估其可靠性。我们将幻觉引发问题形式化为一个约束优化问题,目标是找到与良性用户提示语义等价的对抗提示。现有攻击方法仍有局限:基于离散提示的攻击保持语义等价性和连贯性,但仅搜索有限的提示变体;而连续潜在空间攻击探索更丰富的空间,但通常解码为不再有效改写的提示。为解决这些局限,我们提出REALISTA,一个逼真的潜在空间攻击框架。REALISTA构建了一个依赖于输入的合法编辑方向字典,每个方向对应一个语义等价且连贯的改写,并在潜在空间中优化这些方向的连续组合。这种设计结合了连续攻击的优化灵活性和基于离散改写的攻击的语义逼真性。实验表明,REALISTA在开源LLM上达到优于或与最先进逼真攻击相当的性能,并且关键的是,在自由形式响应设置下成功攻击大型推理模型,而先前的逼真攻击则失败。代码可在https://github.com/Buyun-Liang/REALISTA获取。

英文摘要

Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, making it important to systematically evaluate their reliability under realistic adversarial inputs. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing attack methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.

2605.11374 2026-06-02 cs.LG cs.CL cs.IR 版本更新

Test-Time Compute for Frozen Embedding Models through Agentic Program Search

冻结嵌入模型在测试时通过智能体程序搜索的计算

Han Xiao

发表机构 * Jina AI by Elastic(Jina AI 由 Elastic 提供)

AI总结 本文提出一种智能体程序搜索方法,通过大语言模型编写程序操作冻结编码器API,在推理时提升小嵌入模型的检索质量,无需训练参数且跨任务迁移。

Comments 15 pages, 7 figures, 4 tables

详情
AI中文摘要

测试时计算被广泛认为只对大型推理模型有益,小模型无法从中获益。对于密集检索,我们持相反观点,因为现代小型嵌入模型是从大型语言模型骨干蒸馏或适配而来,继承了其潜在的测试时计算能力。我们探究一个冻结的嵌入模型在仅推理时,无需辅助模型且部署时不训练任何参数,能获得多少检索质量提升。一个智能体循环中,大语言模型在冻结编码器API上编写程序,探索144个候选程序,得到12个帕累托最优程序,这些程序在成本比率从$c=1.2$到$14.7$之间权衡推理计算与质量,每个程序在所有14个发现任务上均提升了nDCG@10。这些程序不使用可训练参数,并恢复了经典检索原语,包括倒数秩融合、Fisher线性判别、Rocchio伪相关反馈和句子级MaxSim。未经修改地应用于19个保留任务和三个未见过的编码器家族,单个固定程序改进了大多数任务,中位数$Δ$nDCG@10为正,在$c\ge4$时胜率为54%至57%,且在发现过程中从未见过的编码器家族上增益最大。一个在相同任务上训练的匹配预算学习投影头无法以这种方式迁移,它在域内检索上提升了$+0.20$至$+0.25$ nDCG@10,但在每个保留编码器上均低于基线。因此,小型嵌入模型继承了可用的测试时计算潜力,冻结编码器将推理计算转化为检索增益,并迁移到新的语料库和编码器,无需每个领域的标签。

英文摘要

Test-time compute is widely believed to benefit only large reasoning models, leaving small models with nothing to gain. We argue the opposite for dense retrieval, since modern small embedding models are distilled or adapted from large language model backbones and can inherit their latent test-time-compute potential. We ask how much retrieval quality a frozen embedding model gains at inference alone, with no auxiliary model and no parameters trained at deployment. An agentic loop in which a large language model writes programs over a frozen encoder API explores 144 candidates and yields twelve Pareto-optimal programs that trade inference compute for quality across cost ratios from $c{=}1.2$ to $14.7$, every one improving nDCG@10 on all 14 discovery tasks. The programs use no trainable parameters and recover classical retrieval primitives, among them reciprocal rank fusion, the Fisher linear discriminant, Rocchio pseudo-relevance feedback, and sentence-level MaxSim. Applied unmodified to nineteen held-out tasks and three unseen encoder families, a single fixed program improves the majority of tasks, with a positive median $Δ$nDCG@10 and a 54 to 57% win-rate at $c{\ge}4$, and the gains are largest on encoder families never seen during discovery. A matched-budget learned projection head trained on the same tasks does not transfer this way, improving in-domain retrieval by $+0.20$ to $+0.25$ nDCG@10 yet falling below baseline on every held-out encoder. Small embedding models therefore inherit usable test-time-compute potential, and a frozen encoder converts inference compute into retrieval gains that transfer to new corpora and encoders with no per-domain labels.

2512.17738 2026-06-02 cs.CL 版本更新

When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

当黄金标准未必标准:评估用户生成内容翻译的挑战

Lydia Nishimwe, Benoît Sagot, Rachel Bawden

发表机构 * Inria Paris(巴黎研究所)

AI总结 针对用户生成内容(UGC)中非标准语言现象,提出十二种非标准现象和五种翻译行为的分类法,并论证翻译评估需考虑指南的标准化程度,呼吁开发可控的、指南感知的评估框架。

Comments 10 pages (23 with references and appendices). Accepted at EAMT 2026

详情
AI中文摘要

用户生成内容(UGC)的特点是频繁使用非标准语言,从拼写错误到表情选择(如俚语、字符重复和表情符号)。这使得评估UGC翻译具有挑战性:什么是“好”的翻译取决于输出的期望标准化水平。为探讨这一点,我们检查了四个UGC数据集的人工翻译指南,并推导出十二种非标准现象和五种翻译行为(标准化、复制、转移、省略、审查)的分类法。我们的分析揭示了UGC处理方式的显著差异,导致参考翻译的标准化程度存在差异。我们表明,大型语言模型的翻译分数对带有明确UGC翻译指令的提示高度敏感,并且当这些指令与数据集指南一致时,分数会提高。我们认为,公平评估需要模型和指标都了解翻译指南。最后,我们呼吁在数据集创建过程中制定明确的指南,并开发可控的、指南感知的UGC翻译评估框架。

英文摘要

User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation challenging: what counts as a "good" translation depends on the desired standardness level of the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. We show that translation scores of large language models are highly sensitive to prompts with explicit UGC translation instructions, and that they improve when they align with the dataset guidelines. We argue that fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

2605.09253 2026-06-02 cs.CL cs.AI 版本更新

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

基石还是绊脚石?解读在线策略蒸馏中的岩石令牌

Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩分校) Case Western Reserve University(凯斯西储大学) Arizona State University(亚利桑那州立大学) VU Amsterdam(阿姆斯特丹自由大学)

AI总结 本文研究在线策略蒸馏中持续高损失的“岩石令牌”,发现它们虽占据大量梯度但功能贡献微弱,提出绕过这些令牌可简化对齐过程。

详情
AI中文摘要

尽管近期关于可验证奖励强化学习(RLVR)的研究表明,一小部分关键令牌不成比例地驱动推理增益,但在线策略蒸馏(OPD)中类似的令牌级理解仍未探索。本文研究了高损失令牌——在OPD的逐令牌KL目标下,作为师生不匹配的最直接信号,根据现有研究,这些令牌应随着训练收敛而逐渐减少;然而,我们的实证分析显示并非如此。即使在OPD训练达到明显饱和后,仍有大量令牌持续表现出高损失;我们将这些令牌称为“岩石令牌”,它们可占生成输出中高达18%的令牌。我们的研究揭示了两个令人惊讶的悖论。首先,尽管这些令牌的高出现频率提供了不成比例的大份额总梯度范数,但岩石令牌本身在整个训练过程中保持停滞,抵抗教师驱动的修正。其次,通过因果干预,我们发现这些令牌对模型的实际推理性能贡献可忽略不计。这些发现表明,大量优化带宽被花费在学生模型无法或无需内化的结构和话语残差上。通过解构这些动态,我们证明策略性地绕过这些“绊脚石”可以显著简化对齐过程,挑战了统一令牌权重的必要性,并为大规模模型蒸馏提供了更高效的范式。

英文摘要

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that--as the most direct signal of student-teacher mismatch under OPD's per-token KL objective--should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18\% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these ``stumbling blocks'' can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.

2605.09098 2026-06-02 cs.CL 版本更新

Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation

动态元度量:面向机器翻译评估的源句条件加权

Luke Zhang, Justin Vasselli, Aditya Khan, York Hay Ng, En-Shiun Annie Lee

发表机构 * University of Toronto, Canada(多伦多大学) Nara Institute of Science and Technology, Japan(奈良科学技術大學) Ontario Tech University, Canada(安大略技术大学)

AI总结 提出动态元度量(DMM)框架,通过源句条件组合现有度量来提升机器翻译评估性能,实验表明MLP组合优于线性与高斯过程集成,软条件扩展进一步带来提升。

Comments 5 pages, ACL SRW 2026

详情
AI中文摘要

我们提出动态元度量(DMM),一种用于机器翻译评估的框架,学习基于源句条件组合现有度量。DMM不依赖单一的静态集成或语言特定权重,而是根据源片段属性调整度量组合。我们研究了硬条件,即每个簇拟合一个可解释的组合器,以及探索性的软条件扩展,其权重随源簇责任连续变化。我们使用系统和片段级别的成对一致性度量,在WMT度量共享任务数据上跨多个语言对评估DMM。在各种设置下,基于MLP的组合优于线性和基于高斯过程的集成,引入软条件在线性模型上带来了增益。

英文摘要

We propose Dynamic Meta-Metrics (DMM), a framework for machine translation evaluation that learns source-sentence conditioned combinations of existing metrics. Rather than relying on a single static ensemble or language-specific weighting, DMM adapts the metric combination based on properties of the source segment. We study hard conditioning, which fits an interpretable combiner per cluster, and an exploratory soft-conditioned extension whose weights vary continuously with source-cluster responsibilities. We evaluate DMM on the WMT Metrics Shared Task data across multiple language pairs using pairwise agreement measures at the system and segment levels. Across settings, MLP-based combinations outperform linear and Gaussian process-based ensembles, and introducing soft conditioning yields gains over linear models.

2602.16571 2026-06-02 cs.CL 版本更新

Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

数学辅导中的效用保持去标识化:MathEd-PII基准数据集中的数值歧义研究

Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, Jinsook Lee, Doug Pietrzak, Daryl Hedley, Jorge Dias, Chris Shaw, Ruth Schäfer, René F. Kizilcec

发表机构 * University of Washington(华盛顿大学)

AI总结 针对数学辅导对话中数值表达式与标识符相似导致过度去标识化的问题,提出MathEd-PII基准数据集,并采用领域感知提示策略(F1达0.821)在保持数据效用的同时有效检测PII。

详情
AI中文摘要

大规模共享对话数据是推进教学科学的关键,但严格的去标识化仍是一大障碍。在数学辅导记录中,数值表达式常与结构化标识符(如日期或ID)相似,导致通用个人身份信息(PII)检测系统过度编辑核心教学内容,降低数据效用。本研究探讨如何在保持教育效用的同时检测PII,重点关注这一“数值歧义”问题。我们引入了MathEd-PII,这是首个用于数学辅导对话中PII检测的基准数据集,通过人机协同的LLM标注构建。利用基于密度的分割,我们发现虚假PII编辑集中在数学密集区域,证实数值歧义是主要失败模式。随后比较了四种检测策略:Presidio基线以及三种基于LLM的方法(基础提示、数学感知提示和片段感知提示)。领域感知提示(包括数学感知F1: 0.802和片段感知F1: 0.821)显著优于基线(F1: 0.379),同时减少了数值假阳性,表明去标识化必须融入领域上下文以保持分析效用。本研究提供了新的基准和证据,表明辅导数据的效用保持去标识化需要领域感知建模。

英文摘要

Large-scale sharing of dialogue data is key to advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce data utility. This work asks how to detect PII while preserving educational utility, focusing on this "numeric ambiguity" problem. We introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, built with human-in-the-loop LLM annotation. Using density-based segmentation, we show that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and three LLM-based approaches with basic, math-aware, and segment-aware prompting. Domain-aware prompting, including both math-aware (F1: 0.802) and segment-aware versions (F1: 0.821), substantially outperforms the baseline (F1: 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.

2605.00696 2026-06-02 stat.ML cs.CL cs.LG 版本更新

Adaptive Querying with AI Persona Priors

基于AI人格先验的自适应查询

Kaizheng Wang, Yuhang Wu, Assaf Zeevi

发表机构 * Department of Industrial Engineering and Operations Research and Data Science Institute, Columbia University(工业工程与运筹学系及数据科学研究所,哥伦比亚大学) Decision, Risk, and Operations Division, Columbia Business School(决策、风险与运营部门,哥伦比亚商学院)

AI总结 提出一种基于AI人格诱导的潜变量模型,利用大语言模型生成响应分布,实现高效贝叶斯设计,用于在有限查询预算下学习用户相关量。

Comments ICML 2026

详情
AI中文摘要

我们研究在严格查询预算内,通过自适应查询学习用户相关的感兴趣量(如对保留项目的响应和心理测量指标)的问题。经典的贝叶斯设计和计算机化自适应测试通常依赖于限制性的参数假设或昂贵的后验近似,限制了它们在异质性、高维和冷启动场景中的应用。我们引入了一种人格诱导的潜变量模型,通过有限字典中的AI人格成员身份来表示用户状态,每种人格由大语言模型产生的响应分布提供。这产生了具有闭式后验更新和高效有限混合预测的表达性先验,从而实现了可扩展的贝叶斯设计用于顺序项目选择。在合成数据和WorldValuesBench上的实验表明,基于人格的后验提供了准确的概率预测和可解释的自适应启发流程。

英文摘要

We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight query budgets. Classical Bayesian design and computerized adaptive testing typically rely on restrictive parametric assumptions or expensive posterior approximations, limiting their use in heterogeneous, high-dimensional, and cold-start settings. We introduce a persona-induced latent variable model that represents a user's state through membership in a finite dictionary of AI personas, each offering response distributions produced by a large language model. This yields expressive priors with closed-form posterior updates and efficient finite-mixture predictions, enabling scalable Bayesian design for sequential item selection. Experiments on synthetic data and WorldValuesBench demonstrate that persona-based posteriors deliver accurate probabilistic predictions and an interpretable adaptive elicitation pipeline.

2506.05412 2026-06-02 cs.CV cs.CL 版本更新

Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

视觉-语言模型将头部方向误认为注视方向:非语言对话线索

Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo

发表机构 * Brown University(布朗大学) Columbia University(哥伦比亚大学) Emory University(埃默里大学) Johns Hopkins University(约翰霍普金斯大学) University of Washington(华盛顿大学) Carnegie Mellon University(卡内基梅隆大学) University of Michigan(密歇根大学) UC San Diego(圣地亚哥大学)

AI总结 本研究通过控制头部方向的实验发现,视觉-语言模型(VLMs)在推断注视目标时主要依赖头部方向而非眼睛外观,导致与人类存在显著性能差距,并指出数据偏差是主要原因。

Comments Accepted by ACL 2026. Project page at https://zoryzhang.github.io/gaze/

详情
AI中文摘要

一个人的注视方向是儿童和成人常用的非语言交流线索。视觉-语言模型(VLMs)推断注视目标的能力如何?为了构建评估刺激,我们拍摄了1,360张真实场景照片,其中一个人注视着桌子上几个物体之一。重要的是,我们还控制了注视者的头部方向:有时朝向注视目标,有时朝向干扰物,有时不加约束。我们发现VLMs与人类之间存在显著的性能差距,排除了分辨率、物体命名能力等替代解释,并确定了差距的主要原因是VLMs使用头部方向而非眼睛外观来推断注视方向。这种偏差可能源于数据而非架构,正如基于transformer的视觉模型微调的概念验证实验所表明的那样。未来的工作应研究这些发现是否广泛适用于基于现有数据训练的各种深度学习方法,以及更好的数据是否能缓解所有架构的这一问题。准确定位原因将为能够解读注视目标的技术奠定基础,从而与人类进行更高效的交互。

英文摘要

Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained. We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as resolution and object-naming skills, and identified the main reason for the gap as VLMs inferring gaze direction using head orientation rather than eye appearance. Such a bias is likely due to data rather than architecture, as suggested by a proof-of-concept experiment finetuning a transformer-based vision model. Future work should investigate whether these findings hold broadly across various deep learning methods trained on existing data, and whether better data mitigates this problem for all architectures. Pinpointing the reason sets the stage for technologies that can interpret gaze targets to have more efficient interactions with humans.

2604.25702 2026-06-02 cs.CL 版本更新

Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

反向翻译增强的直接偏好优化用于神经机器翻译

Mehrdad Ghassabi, Spehr Rajabi, Hamidreza Baradaran Kashani, Sadra Hakim, Mahshid Keivandarian, Amirhossein Jahani Bahnamiri

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出基于直接偏好优化的强化学习后训练框架,利用通用文本语料和专家反馈纠正神经机器翻译错误,在英德翻译任务上将COMET分数从0.703提升至0.747。

Comments 5 pages, 2 figures

详情
AI中文摘要

当代神经机器翻译(NMT)系统几乎完全通过在有监督的平行数据上训练构建。尽管取得了巨大进展,这些系统仍然表现出持续的翻译错误。本文提出,基于强化学习(RL)的后训练范式可以有效纠正此类错误。我们引入了一个新颖的框架,仅需要通用文本语料库和专家翻译器(可以是人类或AI系统)提供迭代反馈。在我们的实验中,我们特别关注英德翻译作为代表性高资源语言对。关键的是,我们使用直接偏好优化(DPO)实现了这种基于RL的后训练。将我们的DPO驱动框架应用于gemma3-1b模型,在英德翻译任务上显著提升了翻译质量,将其COMET分数从0.703提高到0.747。结果表明,DPO为通过基于偏好的后训练增强预训练NMT模型提供了一条高效且稳定的途径。

英文摘要

Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator which can be either human or an AI system to provide iterative feedback. In our experiments, we focus specifically on English-to-German translation as a representative high-resource language pair. Crucially, we implement this RL-based post-training using Direct Preference Optimization (DPO). Applying our DPO-driven framework to the gemma3-1b model yields a significant improvement in translation quality, elevating it's COMET score from 0.703 to 0.747 on the English to German task. The results demonstrate that DPO offers an efficient and stable pathway for enhancing pre-trained NMT models through preference-based post-training.

2604.07967 2026-06-02 cs.CL cs.AI 版本更新

AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact Verification

AtomEval: 事实核查中对抗性声明重写的有效性感知原子评估

Hongyi Cen, Mingxin Wang, Yule Liu, Jingyi Zheng, Hanze Jia, Tan Tang

发表机构 * Zhejiang University(浙江大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出AtomEval协议,通过原子分解和保留门控,区分有效规避验证与改变命题的重写,并引入VASR指标,解决传统ASR膨胀问题。

详情
AI中文摘要

大型语言模型(LLM)可以重写被驳斥的声明以规避基于证据的事实核查器,但当重写改变、削弱或纠正了本应保留的虚假命题时,传统的攻击成功率(ASR)可能会被夸大。我们引入了AtomEval,一种用于固定证据对抗性声明重写的有效性感知评估协议。AtomEval将声明表示为“主体-关系-客体-修饰语”(SROM)原子,应用单向保留门将有效的验证器规避与改变命题的重写分开,并报告有效性感知攻击成功率(VASR),该指标仅统计保留原始虚假命题的验证器规避重写。AtomEval进一步提供细粒度诊断,解释命题级失败和非最小有效重写。在FEVER被驳斥声明重写任务上,AtomEval揭示并解释了ASR膨胀:许多明显的攻击通过改变、削弱或纠正本应保留的命题来欺骗验证器。通过使受攻击命题的保留变得明确且可测量,AtomEval为评估必须在验证器规避与命题保留之间取得平衡的对抗性重写器提供了稳定的评估目标。

英文摘要

Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR) can be inflated when rewrites change, weaken, or correct the false proposition they are supposed to preserve. We introduce AtomEval, a validity-aware evaluation protocol for fixed-evidence adversarial claim rewriting. AtomEval represents claims as subject--relation--object--modifier (SROM) atoms, applies a one-way preservation gate to separate valid verifier evasion from proposition-changing rewrites, and reports validity-aware attack success rate (VASR), which counts only verifier-evasive rewrites that preserve the original false proposition. AtomEval further provides fine-grained diagnostics that explain both proposition-level failures and non-minimal valid rewrites. On FEVER refuted-claim rewriting, AtomEval exposes and explains ASR inflation: many apparent attacks fool the verifier by altering, weakening, or correcting the proposition they should preserve. By making attacked-proposition preservation explicit and measurable, AtomEval provides a stable evaluation target for evaluating adversarial rewriters that must balance verifier evasion with proposition preservation.

2602.17513 2026-06-02 cs.CL 版本更新

Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

弥合领域鸿沟:从MIMIC-III到产科的监督式与零样本临床章节分割

Baris Karacan, Barbara Di Eugenio, Patrick Thornton

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 通过构建产科笔记数据集、评估基于Transformer的监督模型和首次与零样本大语言模型对比,发现监督模型在域内表现强但域外下降显著,而零样本模型在修正幻觉后展现出稳健的域外适应性。

Comments 14 pages. Camera-ready version accepted at LREC 2026; includes minor revisions and an appendix. To appear in the conference proceedings

详情
Journal ref
Proceedings of the 2026 Language Resources and Evaluation Conference (LREC 2026), pages 2594-2607, Palma, Spain. ELRA 2026
AI中文摘要

临床自由文本笔记包含重要的患者信息。它们被组织成带标签的章节;识别这些章节已被证明支持临床决策和下游NLP任务。在本文中,我们通过三个关键贡献推进临床章节分割。首先,我们整理了一个新的去标识化、带章节标签的产科笔记数据集,以补充公共语料库(如MIMIC-III)所涵盖的医学领域,现有的大多数分割方法都是在这些语料库上训练的。其次,我们在MIMIC-III的一个精选子集(域内)和新的产科数据集(域外)上系统评估了基于Transformer的监督模型用于章节分割。第三,我们首次将医学章节分割的监督模型与零样本大语言模型进行直接比较。我们的结果表明,虽然监督模型在域内表现强劲,但其性能在域外大幅下降。相比之下,一旦纠正了幻觉章节标题,零样本模型展现出稳健的域外适应性。这些发现强调了开发特定领域临床资源的重要性,并指出零样本分割是将医疗NLP应用于研究充分的语料库之外的一个有前景的方向,前提是适当管理幻觉。

英文摘要

Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

2604.21511 2026-06-02 cs.IR cs.CL 版本更新

From Tokens to Concepts: Leveraging SAE for SPLADE

从词元到概念:利用稀疏自编码器增强SPLADE

Yuxuan Zong, Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski

发表机构 * Sorbonne Université, CNRS, ISIR Paris France(索邦大学、国家科学研究中心、巴黎信息科学研究所法国)

AI总结 针对SPLADE依赖骨干词汇表导致多义性和同义性等问题,提出用稀疏自编码器学习的语义概念空间替换词汇表,实现性能相当且效率提升。

Comments 11 pages, 3 figures, 9 tables. To appear at SIGIR 2026

详情
AI中文摘要

学习型稀疏IR模型(如SPLADE)在效率与效果之间取得了很好的平衡。然而,它们依赖于底层的骨干词汇表,这可能阻碍性能(多义性和同义性),并对多语言和多模态使用构成挑战。为了解决这一限制,我们提出用稀疏自编码器(SAE)学习的语义概念潜在空间替换骨干词汇表。在本文中,我们研究了这两个概念的兼容性,探索了训练方法,并分析了我们的SAE-SPLADE模型与传统SPLADE模型之间的差异。实验表明,SAE-SPLADE在域内和域外任务上均能达到与SPLADE相当的检索性能,同时提供更高的效率。

英文摘要

Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.

2604.19786 2026-06-02 cs.CL 版本更新

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

HumorRank: 基于锦标赛的大语言模型幽默生成评估排行榜

Edward Ajayi, Prasenjit Mitra

发表机构 * Carnegie Mellon University Africa(卡内基梅隆大学非洲分校)

AI总结 提出HumorRank,一种基于锦标赛的框架,通过理论指导的成对偏好判断对文本幽默生成进行排名,并利用Bradley-Terry估计生成全局排行榜。

详情
AI中文摘要

幽默在大语言模型(LLM)中仍然难以评估,因为一个回答是否有趣是主观的、比较性的,并且由相互作用的喜剧机制而非单一标量属性塑造。因此,现有的幽默评估协议往往产生孤立的分数或特定任务的判断,难以跨模型进行比较。我们引入了HumorRank,一种基于锦标赛的框架,通过理论指导的成对偏好判断对文本幽默生成进行排名。在SemEval-2026 MWAHAHA和Humor Transfer Bench上,HumorRank使用基于LLM的比较判断(基于言语幽默通论GTVH)评估了九个专有、开放权重和专门模型,并通过Bradley-Terry估计的锦标赛聚合生成全局排名。得到的排名跨评判者稳定:独立的Llama和Qwen LLM评判者在两个基准上均达到Kendall τ = 0.889。排行榜揭示了清晰的模型分层,表明强大的幽默生成不仅依赖于规模,还依赖于对喜剧机制(如不协调、简洁、升级和荒谬)的掌握。HumorRank提供了一种可扩展且可解释的方法,用于对LLM生成的幽默进行基准测试,而不完全依赖孤立的自动指标或有限的人工评估。

英文摘要

Humor remains difficult to evaluate in large language models (LLMs) because what makes a response funny is subjective, comparative, and shaped by interacting comedic mechanisms rather than a single scalar property. Existing humor evaluation protocols therefore tend to produce isolated scores or task-specific judgments that are difficult to compare across models. We introduce HumorRank, a tournament-based framework for ranking textual humor generation through theory-grounded pairwise preference judgments. Across SemEval-2026 MWAHAHA and Humor Transfer Bench, HumorRank evaluates nine proprietary, open-weight, and specialized models using LLM-based comparative judgments informed by the General Theory of Verbal Humor (GTVH), with tournament aggregation yielding global rankings via Bradley-Terry estimation. The resulting rankings are cross-judge stable: independent Llama and Qwen LLM judges achieve Kendall τ = 0.889 on both benchmarks. The leaderboard reveals clear model stratification, showing that strong humor generation depends not only on scale but on mastery of comedic mechanisms such as incongruity, conciseness, escalation, and absurdity. HumorRank provides a scalable and interpretable methodology for benchmarking LLM-generated humor without relying solely on isolated automatic metrics or limited human evaluation.

2510.11423 2026-06-02 cs.SI cs.CL 版本更新

Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

超越众包:基于LLM增强的社区笔记治理健康错误信息

Jiaying Wu, Zihang Fu, Haonan Wang, Fanxiao Li, Jiafeng Guo, Preslav Nakov, Min-Yen Kan

发表机构 * National University of Singapore(新加坡国立大学) Yunnan University(云南大学) State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院人工智能安全国家重点实验室,计算技术研究所) University of Chinese Academy of Sciences(中国科学院大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对X平台社区笔记在健康错误信息治理中延迟高的问题,提出CrowdNotes+框架,通过LLM增强笔记生成与评估,显著提升正确性、有用性和证据效用。

Comments ACL 2026

详情
AI中文摘要

社区笔记是X(原Twitter)上众包错误信息治理系统,允许用户标记误导性帖子、附加上下文笔记并评价笔记的有用性。然而,我们对30.8K条健康相关笔记的实证分析显示,笔记获得有用性状态的中位延迟高达17.6小时。为了在现实世界错误信息激增期间提高响应速度,我们提出CrowdNotes+,一个统一的基于LLM的框架,通过增强社区笔记实现更快、更可靠的健康错误信息治理。CrowdNotes+集成了两种模式:(1)基于证据的笔记增强和(2)效用引导的笔记自动化,并由相关性、正确性和有用性的三级层次评估支持。我们通过HealthNotes(一个包含1.2K条健康笔记并标注有用性的基准数据集)和微调的有用性判断器实例化该框架。我们的分析首先揭示了当前众包治理的一个关键漏洞:投票者经常将风格流畅性与事实准确性混淆。通过我们的层次评估解决这一问题,对15个代表性LLM的实验表明,CrowdNotes+在笔记正确性、有用性和证据效用方面显著优于人类贡献者。

英文摘要

Community Notes, the crowd-sourced misinformation governance system on X (formerly Twitter), allows users to flag misleading posts, attach contextual notes, and rate the notes' helpfulness. However, our empirical analysis of 30.8K health-related notes reveals substantial latency, with a median delay of 17.6 hours before notes receive a helpfulness status. To improve responsiveness during real-world misinformation surges, we propose CrowdNotes+, a unified LLM-based framework that augments Community Notes for faster and more reliable health misinformation governance. CrowdNotes+ integrates two modes: (1) evidence-grounded note augmentation and (2) utility-guided note automation, supported by a hierarchical three-stage evaluation of relevance, correctness, and helpfulness. We instantiate the framework with HealthNotes, a benchmark of 1.2K health notes annotated for helpfulness, and a fine-tuned helpfulness judge. Our analysis first uncovers a key loophole in current crowd-sourced governance: voters frequently conflate stylistic fluency with factual accuracy. Addressing this via our hierarchical evaluation, experiments across 15 representative LLMs demonstrate that CrowdNotes+ significantly outperforms human contributors in note correctness, helpfulness, and evidence utility.

2604.18360 2026-06-02 cs.SD cs.CL 版本更新

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

Omni-Embed-Audio: 利用多模态大语言模型实现鲁棒的音频-文本检索

HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo, Du-Seong Chang

发表机构 * Sogang University(首尔大学)

AI总结 提出Omni-Embed-Audio(OEA)检索编码器,利用多模态大语言模型原生理解音频,并通过用户意图查询(UIQ)和硬负样本挖掘,在文本到音频检索中达到与M2D-CLAP相当的性能,同时在文本到文本检索和硬负样本判别上显著优于现有方法。

Comments Accepted at ACL 2026 Main Conference. Camera-ready version

详情
AI中文摘要

基于对比语言-音频预训练(CLAP)的音频-文本检索系统在传统基准上表现强劲;然而,这些基准依赖于与真实世界搜索行为差异显著的标题风格查询,限制了其对实际检索鲁棒性的评估。我们提出了Omni-Embed-Audio(OEA),一种利用具有原生音频理解能力的多模态大语言模型的检索导向编码器。为了系统评估超越标题风格查询的鲁棒性,我们引入了用户意图查询(UIQ)——五种反映自然搜索行为的表述形式:问题、命令、关键词标签、释义和基于排除的负查询。对于负查询,我们开发了一个硬负样本挖掘管道,并提出了判别指标(HNSR, TFR),评估模型抑制声学相似干扰物的能力。在AudioCaps、Clotho和MECAT上的实验表明,OEA在文本到音频检索性能上与最先进的M2D-CLAP相当,同时在两个关键领域展现出明显优势:(1)主导的文本到文本检索(相对提升22%),以及(2)显著优越的硬负样本判别(HNSR@10提升4.3个百分点,TFR@10相对提升34.7%),揭示了大语言模型骨干对复杂查询具有更优的语义理解能力。

英文摘要

Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs) - five formulations reflecting natural search behaviors: questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose discrimination metrics (HNSR, TFR) assessing models' ability to suppress acoustically similar distractors. Experiments on AudioCaps, Clotho, and MECAT show that OEA achieves comparable text-to-audio retrieval performance to state-of-the-art M2D-CLAP, while demonstrating clear advantages in two critical areas: (1) dominant text-to-text retrieval (+22% relative improvement), and (2) substantially superior hard negative discrimination (+4.3%p HNSR@10, +34.7% relative TFR@10), revealing that LLM backbones provide superior semantic understanding of complex queries.

2601.14750 2026-06-02 cs.CL cs.CV 版本更新

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Render-of-Thought: 将文本思维链渲染为图像以进行视觉潜在推理

Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei

发表机构 * Tencent BAC(腾讯BAC) Shenzhen International Graduate School, Tsinghua University(深圳国际研究生院,清华大学) School of Electronic and Computer Engineering, Peking University(北京大学电子与计算机工程学院) School of Mathematics and Statistics, University of Glasgow(格拉斯哥大学数学与统计学学院)

AI总结 提出Render-of-Thought框架,通过将思维链的文本步骤渲染为图像,利用视觉语言模型的视觉编码器进行语义对齐,实现3-4倍令牌压缩和推理加速,同时保持竞争性能。

Comments Accepted by ACL 2026 Main Conference

详情
AI中文摘要

思维链提示在解锁大型语言模型的推理能力方面取得了显著成功。尽管思维链提示增强了推理能力,但其冗长性带来了巨大的计算开销。最近的工作通常只关注结果对齐,缺乏对中间推理过程的监督。这些缺陷掩盖了潜在推理链的可分析性。为了解决这些挑战,我们引入了Render-of-Thought,这是第一个通过将文本步骤渲染为图像来具体化推理链的框架,使潜在推理过程显式且可追溯。具体来说,我们利用现有视觉语言模型的视觉编码器作为语义锚点,将视觉嵌入与文本空间对齐。这种设计确保了即插即用的实现,而无需额外的预训练开销。在数学和逻辑推理基准上的大量实验表明,与显式思维链相比,我们的方法实现了3-4倍的令牌压缩和显著的推理加速。此外,它与其他方法相比保持了竞争性能,验证了这种范式的可行性。我们的代码可在https://github.com/TencentBAC/RoT获取。

英文摘要

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT

2507.05179 2026-06-02 cs.CL 版本更新

From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

从碎片到事实:一种课程驱动的DPO方法用于生成印地语新闻真实性解释

Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Adam Jatowt, Sriparna Saha

发表机构 * TCS Research(TCS研究) Department of Computer Science and Engineering, Indian Institute of Technology Patna(印度理工学院帕纳布计算机科学与工程系) Indian Institute of Technology Patna(印度理工学院帕纳布) University of Innsbruck(因斯布鲁克大学)

AI总结 针对印地语等低资源语言的虚假信息检测,提出一种结合直接偏好优化(DPO)与课程学习的框架,通过引入“实际性”和“精炼度”参数优化解释质量,实验验证了生成连贯、相关解释的有效性。

Comments Accepted at ACL 2026 Findings

详情
AI中文摘要

在虚假信息泛滥的时代,生成可靠的新闻解释至关重要,尤其是对于印地语等代表性不足的语言。由于缺乏强大的自动化工具,印地语在扩展虚假信息检测方面面临挑战。为弥补这一差距,我们提出了一种新颖的框架,将直接偏好优化(DPO)与课程学习相结合,使机器生成的解释与人类推理对齐。来自可信来源的事实核查解释作为偏好响应,而LLM输出则突出系统局限性并作为非偏好响应。为了优化任务特定的对齐,我们在DPO损失函数中引入了两个关键参数——实际性和精炼度,从而提高了解释的质量和一致性。使用LLM(Mistral、Llama、Gemma)和PLM(mBART、mT5)进行的实验证实了该框架在生成连贯、上下文相关解释方面的有效性。这种可扩展的方法有助于打击虚假信息,并将自动解释生成扩展到低资源语言。

英文摘要

In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters -- Actuality and Finesse -- into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework's effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.

2604.11424 2026-06-02 cs.CL 版本更新

Bridging What the Model Thinks and How It Speaks: Expressive Speech Generation via Self-Aware Intent-Realization Alignment

弥合模型所想与所言之间的鸿沟:通过自我意识意图-实现对齐的富有表现力的语音生成

Kuang Wang, Lai Wei, Ping Lin, Qibing Bai, Wenkai Fang, Li Zhou, Feng Jiang, Zhongjie Jiang, Jun Huang, Yannan Wang, Haizhou Li

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tencent Ethereal Audio Lab(腾讯虚实音频实验室) Southeast University(东南大学) Zhejiang University(浙江大学) Shenzhen University of Advanced Technology(深圳先进技术大学)

AI总结 提出SASLM框架,通过自蒸馏意图和自奖励优化,无需外部标注即可弥合语义理解与声学实现之间的差距,实现富有表现力的语音生成。

Comments Submitted to EMNLP 2026. Project page: https://wangkevin02.github.io/SASLM/

详情
AI中文摘要

语音语言模型(SLM)表现出强大的语义理解能力,但往往无法将这种能力转化为富有表现力的声学实现,生成的语音韵律平淡且情感错位。我们将这种不匹配识别为语义理解-声学实现差距。现有方法通常依赖外部指定的代理,如情感标签或风格提示,这些需要标注且难以捕捉对话中动态变化的表达意图。为克服这些限制,我们提出SASLM(自我意识语音语言模型),一种无代理框架,通过自我意识意图-实现对齐弥合模型所想与所言之间的鸿沟:(1)意图感知桥接通过变分信息瓶颈(VIB)从模型自身的演化语义生成状态中自蒸馏表达意图,从而在无外部表达监督下指导表达性语音实现;(2)实现感知对齐通过自奖励优化反思性地将生成的声学与预期表达对齐,在语音生成过程中逐步提高意图-实现一致性。尽管仅使用3B参数和800小时表达性语音数据,SASLM在EchoMind上达到了开源系统中的最先进性能,超越了规模大10倍以上的模型,并接近商业系统。

英文摘要

Speech Language Models (SLMs) exhibit strong semantic understanding, yet often fail to translate this capacity into expressive acoustic realization, producing speech with flattened prosody and misaligned emotion. We identify this mismatch as the semantic understanding-acoustic realization gap. Existing approaches typically rely on externally specified proxies, such as emotion labels or style prompts, which require annotations and struggle to capture dynamically evolving expressive intent throughout dialogue. To overcome these limitations, we propose SASLM (Self-Aware Speech Language Model), a proxy-free framework that bridges what the model thinks and how it speaks through self-aware intent-realization alignment: (1) Intent-Aware Bridging self-distills expressive intent from the model's own evolving semantic generation states via a Variational Information Bottleneck (VIB), thereby guiding expressive speech realization without external expressive supervision; while (2) Realization-Aware Alignment reflectively aligns generated acoustics with intended expression through self-reward optimization, progressively improving intent-realization consistency during speech generation. Despite using only 3B parameters and 800 hours of expressive speech data, SASLM achieves state-of-the-art performance on EchoMind among open-source systems, surpassing models over 10 times larger and approaching commercial systems.

2604.10788 2026-06-02 cs.CL cs.AI 版本更新

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

TInR:探索大语言模型中的工具内化推理

Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang, Wenjie Li

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Southeast University(东南大学) University of Edinburgh(爱丁堡大学) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)

AI总结 本文提出TInR-U框架,通过工具内化、监督微调和强化学习三阶段训练,使LLM无需外部文档即可进行工具集成推理,在域内和域外设置中均取得优越性能。

Comments Accepted to ACL 2026

详情
AI中文摘要

工具集成推理(TIR)通过扩展大语言模型(LLM)在推理过程中使用外部工具的能力,已成为一个有前景的方向。现有的TIR方法通常在推理过程中依赖外部工具文档。然而,这导致了工具掌握困难、工具规模限制和推理效率低下等问题。为了缓解这些问题,我们探索了工具内化推理(TInR),旨在促进使用内化到LLM中的工具知识进行推理。实现这一目标面临显著的要求,包括工具内化和工具-推理协调。为了解决这些问题,我们提出了TInR-U,一个用于统一推理和工具使用的工具内化推理框架。TInR-U通过三阶段流水线进行训练:1)使用双向知识对齐策略进行工具内化;2)使用高质量推理注释进行监督微调预热;3)使用TInR特定奖励进行强化学习。我们在域内和域外设置中全面评估了我们的方法。实验结果表明,TInR-U在两种设置下均实现了优越的性能,突显了其有效性和效率。

英文摘要

Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

2604.10688 2026-06-02 cs.LG cs.AI cs.CL 版本更新

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

SCOPE: 信号校准的在线策略蒸馏增强与双路径自适应加权

Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan LongCat Interaction Team(美团 LongCat 交互团队) Nanjing University(南京大学) Fudan University(复旦大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对在线策略强化学习中奖励稀疏导致的信用分配难题,提出SCOPE框架,通过双路径自适应加权机制分别处理正确与错误轨迹,实现信号校准的蒸馏增强,在六个推理基准上平均提升11.42%的Avg@32和7.30%的Pass@32。

详情
AI中文摘要

在线策略强化学习已成为大型语言模型推理对齐的主导范式,但其稀疏的结果级奖励使得令牌级信用分配异常困难。在线策略蒸馏(OPD)通过引入来自教师模型的密集令牌级KL监督缓解了这一问题,但通常对所有rollout均匀应用这种监督,忽略了信号质量的根本差异。我们提出信号校准的在线策略蒸馏增强(SCOPE),一种双路径自适应训练框架,根据正确性将在线策略rollout路由到两个互补的监督路径。对于错误轨迹,SCOPE执行教师困惑度加权的KL蒸馏,优先考虑教师展现出真正纠正能力的实例,同时降低不可靠指导的权重。对于正确轨迹,它应用学生困惑度加权的MLE,将强化集中在能力边界上的低置信度样本,而不是过度强化已掌握的样本。两条路径都采用组级归一化来自适应校准权重分布,考虑不同提示的内在难度差异。在六个推理基准上的大量实验表明,SCOPE在Avg@32和Pass@32上分别比竞争基线平均相对提升11.42%和7.30%,证明了其一致的有效性。

英文摘要

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.

2604.04937 2026-06-02 cs.AI cs.CL 版本更新

Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

Pramana: 通过 Navya-Nyaya 微调大型语言模型进行认知推理

Sharath Sathish

发表机构 * University of York(约克大学)

AI总结 提出 Pramana 方法,利用 2500 年历史的印度 Navya-Nyaya 逻辑框架微调 LLM,通过结构化六阶段推理解决认知差距,提升可溯源性。

Comments 52 pages + appendices, comprehensive treatment of Navya-Nyaya computational formalization

详情
AI中文摘要

大型语言模型能生成流畅文本,但在系统推理方面存在困难,常常产生自信但无根据的幻觉。当苹果研究人员向数学问题添加无关背景时,LLM 性能下降了 65% (Apple Machine Learning Research),暴露出表面推理下脆弱的模式匹配。这种认知差距,即无法将主张建立在可追溯证据上的能力,限制了 AI 在需要论证的领域的可靠性。我们引入 Pramana,一种新颖的方法,通过在 Navya-Nyaya 逻辑(一种 2500 年历史的印度推理框架)上进行微调,教导 LLM 显式的认识论方法论。与通用的思维链提示不同,Navya-Nyaya 强制执行结构化的六阶段推理:SAMSHAYA(怀疑分析)、PRAMANA(证据源识别)、PANCHA AVAYAVA(包含普遍规则的五段论)、TARKA(反事实验证)、HETVABHASA(谬误检测)和 NIRNAYA(区分知识与假设的确定)。这种逻辑与认识论的整合提供了标准推理方法所缺乏的认知支架。我们在 55 个 Nyaya 结构化的逻辑问题(约束满足、布尔 SAT、多步演绎)上微调了 Llama 3.2-3B 和 DeepSeek-R1-Distill-Llama-8B。第一阶段在保留评估上实现了 100% 的语义正确性,尽管严格格式遵循率仅为 40%,这表明即使结构执行不完美,模型也能内化推理内容。消融研究表明格式提示和温度对性能有关键影响,且不同阶段的最优配置不同。我们在 Hugging Face 上发布所有模型、数据集和训练基础设施,以促进关于 AI 推理认识论框架的进一步研究。

英文摘要

Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% Apple Machine Learning Research, exposing brittle pattern-matching beneath apparent reasoning. This epistemic gap, the inability to ground claims in traceable evidence, limits AI reliability in domains requiring justification. We introduce Pramana, a novel approach that teaches LLMs explicit epistemological methodology by fine-tuning on Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework. Unlike generic chain-of-thought prompting, Navya-Nyaya enforces structured 6-phase reasoning: SAMSHAYA (doubt analysis), PRAMANA (evidence source identification), PANCHA AVAYAVA (5-member syllogism with universal rules), TARKA (counterfactual verification), HETVABHASA (fallacy detection), and NIRNAYA (ascertainment distinguishing knowledge from hypothesis). This integration of logic and epistemology provides cognitive scaffolding absent from standard reasoning approaches. We fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on 55 Nyaya-structured logical problems (constraint satisfaction, Boolean SAT, multi-step deduction). Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence revealing that models internalize reasoning content even when structural enforcement is imperfect. Ablation studies show format prompting and temperature critically affect performance, with optimal configurations differing by stage. We release all models, datasets, and training infrastructure on Hugging Face to enable further research on epistemic frameworks for AI reasoning.

2602.00906 2026-06-02 cs.LG cs.AI cs.CL cs.DS cs.IT math.IT 版本更新

Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

幻觉是空间最优性的结果:成员测试的率失真定理

Anxin Guo, Jingwei Li

发表机构 * Computer Science Department, Northwestern University(西北大学计算机科学系) Department of IEOR, Columbia University(哥伦比亚大学工业工程与运筹学系)

AI总结 通过将幻觉形式化为成员测试问题,建立率失真定理,证明在有限容量下信息论最优策略必然导致对某些非事实的高置信度,从而产生幻觉。

Comments ICML 2026

详情
AI中文摘要

大型语言模型通常对缺乏可推断模式的“随机事实”以高置信度产生幻觉。我们将此类事实的记忆形式化为一个成员测试问题,统一了布隆过滤器的离散误差指标与LLM的连续对数损失。通过分析在事实在可能主张的宇宙中稀疏的情况下,我们建立了一个率失真定理:最优记忆效率由事实与非事实得分分布之间的最小KL散度刻画。这一理论框架在理想化设置下为幻觉提供了独特的解释:即使有最优训练、完美数据和简化的“封闭世界”设置,有限容量下信息论最优策略不是放弃或遗忘,而是对某些非事实赋予高置信度,从而导致幻觉。我们在合成数据和真实数据上实证验证了这一理论,表明幻觉作为有损压缩的自然结果持续存在。同一定理恢复并锐化了布隆型滤波器的经典空间下界,确定了两侧滤波器遗留的加性常数。

英文摘要

Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination under an idealized setting: even with optimal training, perfect data, and a simplified ``closed world'' setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on both synthetic and real-world data, showing that hallucinations persist as a natural consequence of lossy compression. The same theorem recovers and sharpens classical space lower bounds for Bloom-type filters, pinning down an additive constant left open for two-sided filters.

2603.03312 2026-06-02 cs.CL cs.AI cs.HC eess.AS q-bio.NC 版本更新

Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding

逃离BLEU陷阱:一种基于信号锚定与解耦语义引导的脑电解码文本框架

Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li

发表机构 * The Hong Kong University of Science and Technology(香港科学与技术大学) Shenzhen Loop Area Institute(深圳环湖研究院)

AI总结 针对脑电解码文本中语义偏差、信号忽略和BLEU陷阱问题,提出SemKey多阶段框架,通过解耦语义目标(情感、主题、长度、意外性)和主动检索解码机制,强制生成基于信号而非语言先验,并采用检索和分布度量(如Fréchet距离)建立评估协议,有效缓解幻觉并达到最优性能。

详情
AI中文摘要

从非侵入性脑电信号中解码自然语言是一项有前景但充满挑战的任务。然而,当前最先进的模型仍受限于三个基本问题:语义偏差(输出退化为通用语言模板)、信号忽略(模型严重依赖大语言模型先验,即使在缺乏有意义信号时也能生成流畅文本)以及“BLEU陷阱”(高频停用词虚增n-gram指标,掩盖真正语义保真度的缺失)。为解决这些挑战,我们超越传统的端到端流水线,提出SemKey——一种新颖的多阶段框架,通过四个解耦的语义目标(情感、主题、长度和意外性)强制进行基于信号的生成。我们直接从脑电嵌入中提取这些语义锚点,然后通过主动检索解码机制统一它们,迫使大语言模型将其令牌生成锚定在神经信号上,而非默认使用语言先验。此外,我们通过建立全面的评估协议(使用严格的检索和基于分布的度量,如Fréchet距离)打破BLEU陷阱。大量实验表明,SemKey有效缓解了对噪声输入的幻觉,并在这些鲁棒协议上达到了最先进的性能。代码将在论文被接收后发布于https://github.com/xmed-lab/SemKey。

英文摘要

Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental issues: Semantic Bias, where outputs collapse into generic linguistic templates; Signal Neglect, where models rely heavily on LLM priors to hallucinate fluent text even in the absence of meaningful signals; and the "BLEU Trap", where high-frequency stopwords inflate n-gram metrics, masking a lack of true semantic fidelity. To resolve these challenges, we move beyond conventional end-to-end pipelines and propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We extract these semantic anchors from EEG embeddings directly, then unify them with an Active Retrieval Decoding mechanism, compelling the LLM to ground its token generation in the neural signals rather than defaulting to linguistic priors. Furthermore, we break the BLEU Trap by establishing a comprehensive evaluation protocol using rigorous retrieval and distribution-based metrics such as Fréchet Distance. Extensive experiments demonstrate that SemKey effectively mitigates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.

2604.01562 2026-06-02 cs.SD cs.AI cs.CL cs.CY cs.HC 版本更新

Acoustic and perceptual differences between standard and accented speech and their voice clones

标准口音与带口音语音及其语音克隆的声学与感知差异

Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

发表机构 * Department of Linguistics, University at Buffalo, United States(语言学系,布法罗大学,美国) Department of Computer Science and Engineering, University at Buffalo, United States(计算机科学与工程系,布法罗大学,美国) Emeritus Faculty, Australian National University, Australia(澳大利亚国立大学荣誉教职)

AI总结 通过计算和感知实验,比较标准口音与带口音普通话及其语音克隆,发现口音影响感知身份匹配和可懂度,且标准口音克隆更接近原声,带口音克隆可懂度提升更大。

详情
AI中文摘要

语音克隆通常根据整体质量进行评估,但关于口音保留及其感知后果的了解较少。我们采用计算和感知相结合的设计,比较标准口音和重度口音普通话及其语音克隆。基于嵌入的分析显示,在多个说话人判别嵌入空间中,带口音说话人的原始-克隆距离更大,但在根据每个说话人的原始内部基线变异性进行归一化后,这种差异消失。在感知研究中,标准口音说话人的克隆被评价为比带口音说话人的克隆更接近其原始声音,并且从原始到克隆的可懂度增加,其中带口音语音的增益更大。这些结果表明,即使口音变异未反映在基线归一化的说话人嵌入距离中,它也能影响语音克隆中的感知身份匹配和可懂度,并促使将口音保留视为说话人身份保留的一个明确组成部分,而不是假设它完全由现成的说话人判别嵌入所捕获。

英文摘要

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses showed larger original-clone distances for accented speakers in several speaker-discriminative embedding spaces, but this difference disappeared after normalizing against each speaker's within-original baseline variability. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in baseline-normalized speaker-embedding distance, and they motivate treating accent preservation as an explicit component of speaker identity preservation, rather than assuming that it is fully captured by off-the-shelf speaker-discriminative embeddings.

2511.06676 2026-06-02 cs.CL cs.CY cs.HC 版本更新

How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models

AI如何失败:用于演示自动毒性模型中的方言偏见的交互式教学工具

Subhojit Ghimire

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过定量基准测试和交互式教学工具,揭示了自动毒性模型对非裔美国人英语文本的系统性偏见,并强调人为设定的敏感度阈值才是歧视操作化的关键。

Comments 9 pages, 5 figures, 4 tables, 14 references. Preliminary abstract presented at the International Conference on Envisioning the Himalayan Future: Pathways to Sustainability and Development (PUiCON 2026) p. 105; abstract available online at: https://pufoe.edu.np/wp-content/uploads/2026/05/PUiCON_2026_Book_of-_Abstracts.pdf

详情
AI中文摘要

如今,AI驱动的审核已渗透到日常生活中,我们经常听到“AI有偏见”的说法。虽然这通常是以玩笑的方式说出,但这一轻松的评论反映了更深层次的担忧。我们如何能确定被标记为“不当”的在线帖子不是仅仅成为偏见算法的受害者?本文采用双重方法研究这一问题。首先,我对一个广泛使用的毒性模型(unitary/toxic-bert)进行定量基准测试,以衡量非裔美国人英语(AAE)和标准美国英语(SAE)文本之间的性能差异。基准测试揭示了明显的系统性偏见:平均而言,模型将AAE文本的毒性评分高出1.8倍,将“身份仇恨”评分高出8.8倍。其次,我引入了一个交互式教学工具,使这些抽象偏见变得具体可感。该工具的核心机制是一个用户可控的“敏感度阈值”,它表明有偏见的分数本身并非唯一的伤害;相反,更令人担忧的伤害是由人为设定的、看似中立的政策最终操作化的歧视。这项工作既提供了差异影响的统计证据,也提供了一个面向公众的工具,旨在培养批判性AI素养。

英文摘要

Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flagged as "inappropriate" was not simply the victim of a biased algorithm? This paper investigates this problem using a dual approach. First, I conduct a quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) to measure performance disparity between text in African-American English (AAE) and Standard American English (SAE). The benchmark reveals a clear, systematic bias: on average, the model scores AAE text as 1.8 times more toxic and 8.8 times higher for "identity hate". Second, I introduce an interactive pedagogical tool that makes these abstract biases tangible. The tool's core mechanic, a user-controlled "sensitivity threshold," demonstrates that the biased score itself is not the only harm; instead, the more-concerning harm is the human-set, seemingly neutral policy that ultimately operationalises discrimination. This work provides both statistical evidence of disparate impact and a public-facing tool designed to foster critical AI literacy.

2510.22276 2026-06-02 cs.CV cs.CL 版本更新

WAON: A Large-Scale Japanese Image-Text Dataset for Cultural Adaptation in Contrastive Vision-Language Models

WAON:用于对比视觉语言模型文化适应的大规模日语图像-文本数据集

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, Naoaki Okazaki

发表机构 * Kyoto University(京都大学) NII LLMC(日本国家研究所语言模型中心) NII(日本国家研究所) Waseda University(早稻田大学) Institute of Science Tokyo(东京科学研究所)

AI总结 提出WAON,一个从Common Crawl构建的包含约1.55亿样本的最大公开原生日语图像-文本数据集,并通过微调实验证明其在日语文化基准上优于翻译数据。

Comments 13 pages, 7 figures

详情
AI中文摘要

对比视觉语言模型通过大规模预训练取得了显著进展。最近的研究表明,去除仅英文的标题过滤器并在全球数据上进行预训练对于提升多元文化表现是有效的。我们研究了这种全球预训练是否足以实现特定文化的理解,或者进一步使用原生数据进行适应能否超越仅全球预训练所达到的性能。为了进行这项研究,我们提出了WAON,这是从Common Crawl中的原生日语网络内容构建的最大公开可用的原生日语图像-文本数据集,包含约1.55亿个样本。我们还引入了WAON-Bench,一个手动策划的涵盖374个类别的日本文化基准。通过在多个日语图像-文本数据集上的比较微调实验,我们观察到在WAON上微调的模型在日本文化基准上始终比在英语到日语翻译数据上微调的模型表现更强。我们发布了数据集和代码。

英文摘要

Contrastive vision-language models have achieved remarkable progress through large-scale pretraining. Recent work has shown that removing English-only caption filters and pretraining on global data is effective for improving multicultural performance. We study whether such global pretraining is sufficient for culture-specific understanding, or whether further adaptation with natively sourced data can boost performance beyond what global pretraining alone achieves. To enable this investigation, we present WAON, the largest publicly available native Japanese image-text dataset constructed from native Japanese web content in Common Crawl, containing approximately 155 million examples. We also introduce WAON-Bench, a manually curated Japanese cultural benchmark spanning 374 classes. Through comparative fine-tuning experiments on multiple Japanese image-text datasets, we observe that models fine-tuned on WAON consistently achieve stronger performance on Japanese cultural benchmarks than those fine-tuned on English-to-Japanese translated data. We release our dataset and code.

2603.27459 2026-06-02 cs.CL 版本更新

A tree interpretation of arc standard dependency derivation

弧标准依存推导的树解释

Zihao Huang, Ai Ka Lee, Jungyeul Park

发表机构 * The University of British Columbia(不列颠哥伦比亚大学) Korea Advanced Institute of Science & Technology(韩国科学技术院)

AI总结 本文提出弧标准依存推导可解释为增量构建具有连续产出的词汇化有序树,并证明该表示等价于投影性,从而为投影依存分析提供树论解释。

详情
AI中文摘要

投影依存树上的弧标准推导可以解释为增量构建具有连续产出的词汇化有序树。每个\textsc{shift}、\textsc{leftarc}和\textsc{rightarc}转移对应于确定的树更新,生成的有序树唯一地确定了推导引入的依存弧。我们证明这种表示并非任意编码:单中心依存树允许这种连续有序表示当且仅当它是投影的。因此,该提议是基于推导的而非基于转换的,因为有序对象是在转移序列本身上定义的,而不是通过转换完成的依存图获得的。这为弧标准解析提供了树论解释,其中投影依存推导隐式地构建了可恢复的短语结构风格的有序树。对于非投影输入,可以通过伪投影提升和逆解码使用该解释。一个小型实现研究证实,映射的推导可以在现有的神经转移解析器中执行。

英文摘要

Arc-standard derivations over projective dependency trees can be interpreted as the incremental construction of lexicalized ordered trees with contiguous yields. Each \textsc{shift}, \textsc{leftarc}, and \textsc{rightarc} transition corresponds to a deterministic tree update, and the resulting ordered tree uniquely determines the dependency arcs introduced by the derivation. We show that this representation is not an arbitrary encoding: a single-headed dependency tree admits such a contiguous ordered representation if and only if it is projective. The proposal is therefore derivational rather than conversion-based, since the ordered object is defined over the transition sequence itself rather than obtained by transforming a completed dependency graph. This gives a tree-theoretic interpretation of arc-standard parsing, in which projective dependency derivations implicitly construct recoverable constituency-style ordered trees. For non-projective inputs, the interpretation can be used through pseudo-projective lifting and inverse decoding. A small implementation study confirms that the mapped derivations are executable in an existing neural transition-based parser.

2603.22999 2026-06-02 cs.CL 版本更新

PaperVoyager : Building Interactive Web with Visual Language Models

PaperVoyager:利用视觉语言模型构建交互式网页

Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang

发表机构 * Vast Intelligence Lab(vast 智能实验室) UTS(UTS大学) University of Liverpool(利物浦大学)

AI总结 提出PaperVoyager框架,将研究论文自动转化为可执行的交互式网页系统,通过显式建模机制和交互逻辑,显著提升生成系统的质量。

Comments 9 pages, 5 figures

详情
AI中文摘要

视觉语言模型的最新进展使得自主代理能够进行复杂推理、工具使用和文档理解。然而,现有的文档代理主要将论文转化为静态产物,如摘要、网页或幻灯片,这对于涉及动态机制和状态转换的技术论文来说是不够的。在这项工作中,我们提出了一种论文到交互式系统的代理,将研究论文转化为可执行的交互式网页系统。给定一篇PDF论文,该代理无需人工干预即可进行端到端处理,包括论文理解、系统建模和交互式网页合成,使用户能够操作输入并观察动态行为。为了评估这一任务,我们引入了一个包含19篇研究论文的基准测试,每篇论文都配有专家构建的交互式系统作为真实参考。我们进一步提出了PaperVoyager,一个结构化生成框架,在合成过程中显式建模机制和交互逻辑。实验表明,PaperVoyager显著提高了生成的交互式系统的质量,为交互式科学论文理解提供了新的范式。

英文摘要

Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.

2603.25640 2026-06-02 cs.DL cs.CL 版本更新

RenoBench: A Citation Parsing Benchmark

RenoBench: 引文解析基准

Parth Sarin, Juan Pablo Alperin, Adam Buttrick, Dione Mentis

发表机构 * Graduate School of Education, Stanford University(斯坦福大学教育研究生院) Public Knowledge Project, Simon Fraser University(公共知识项目,西蒙弗雷泽大学) DataCite, Hannover, Germany(DataCite,德国汉诺威) California Digital Library, University of California Office of the President(加州数字图书馆,加州大学校长办公室)

AI总结 针对现有引文解析评估方法不可泛化、基于合成数据或不可公开获取的问题,提出从四个出版生态系统收集的公开基准RenoBench,通过自动验证和特征采样构建多语言、多类型数据集,并评估多种解析系统,结果表明语言模型(尤其微调后)表现优异,为可重复标准化评估奠定基础。

Comments Presented as a conference paper at CiteX 2026

详情
AI中文摘要

准确解析引文对于机器可读的学术基础设施是必要的。但是,尽管对该问题持续关注,现有的评估技术通常不可泛化、基于合成数据或不可公开获取。我们引入了RenoBench,一个用于引文解析的公共领域基准,来源于四个出版生态系统(SciELO、Redalyc、Public Knowledge Project和Open Research Europe)发布的PDF。从161,000条带注释的引文开始,我们应用自动验证和基于特征的采样,生成了一个包含10,000条引文的数据集,涵盖多种语言、出版类型和平台。然后,我们评估了多种引文解析系统,并报告了字段级别的精确率和召回率。我们的结果显示语言模型表现强劲,尤其是在微调后。RenoBench实现了对引文解析系统的可重复、标准化评估,并为推进自动化引文解析和元科学研究提供了基础。

英文摘要

Accurate parsing of citations is necessary for machine-readable scholarly infrastructure. But, despite sustained interest in this problem, existing evaluation techniques are often not generalizable, based on synthetic data, or not publicly available. We introduce RenoBench, a public domain benchmark for citation parsing, sourced from PDFs released on four publishing ecosystems: SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe. Starting from 161,000 annotated citations, we apply automated validation and feature-based sampling to produce a dataset of 10,000 citations spanning multiple languages, publication types, and platforms. We then evaluate a variety of citation parsing systems and report field-level precision and recall. Our results show strong performance from language models, particularly when fine-tuned. RenoBench enables reproducible, standardized evaluation of citation parsing systems, and provides a foundation for advancing automated citation parsing and metascientific research.

2603.23485 2026-06-02 cs.CL cs.AI cs.CY 版本更新

Failure of contextual invariance in large language models

大型语言模型中语境不变性的失效

Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

发表机构 * Network Science Institute, Northeastern University(网络科学研究所,东北大学) Center for Health Informatics Program, Boston Children’s Hospital(健康信息学计划中心,波士顿儿童医院) Dept. of Mathematics, City St George’s, University of London(伦敦大学城市圣乔治学院数学系) IT University of Copenhagen(哥本哈根IT大学)

AI总结 通过代词选择任务发现,在语境等价但无信息量的干扰下,大语言模型输出发生系统性偏移,表明其违反语境不变性,影响偏见评估与高风险应用。

详情
AI中文摘要

标准评估实践假设,当提示嵌入语境等价的语篇中时,大型语言模型(LLM)的输出是稳定的。这里,我们在性别推断的背景下测试这一假设。使用受控的代词选择任务,我们引入最小的、理论上无信息的语篇语境,发现这会导致模型输出出现大规模、系统性的偏移。与去语境化设置中存在的文化性别刻板印象的相关性在引入语境后减弱或消失,而理论上无关的特征(如无关指代对象的代词性别)成为模型行为最具信息量的预测因子。通过默认语境性分析发现,在模型间的19%至52%的案例中,这种依赖性在考虑语境对单个输出的所有边际效应后仍然存在,并且不能归因于简单的代词重复。这些发现表明,即使在几乎相同的句法表述下,LLM的输出也违反了语境不变性,这对偏见基准测试和高风险环境中的部署具有重要影响。

英文摘要

Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalent discourses. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behavior. A Contextuality-by-Default analysis reveals that, in 19--52\% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

2508.09456 2026-06-02 cs.CV cs.CL cs.CR 版本更新

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

IAG: 基于输入感知的后门攻击针对VLM视觉定位

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学) Columbia University(哥伦比亚大学) Hong Kong Polytechnic University(香港理工大学) Nanyang Technological University(南洋理工大学)

AI总结 提出IAG方法,通过文本条件UNet动态生成输入感知的触发器,实现首个多目标后门攻击VLM视觉定位,在多个模型和基准上达到最佳攻击成功率且不影响正常性能。

Comments Accepted by CVPR 2026; Code is at https://github.com/lijunxian111/IAG

详情
Journal ref
https://openaccess.thecvf.com/content/CVPR2026/papers/Li_IAG_Input-aware_Backdoor_Attack_on_VLM-based_Visual_Grounding_CVPR_2026_paper.pdf
AI中文摘要

近期视觉语言模型(VLM)的进展显著提升了视觉定位任务,该任务涉及根据自然语言查询在图像中定位对象。尽管取得了这些进展,基于VLM的定位系统的安全性尚未得到彻底研究。本文揭示了一个新颖且现实的安全漏洞:首个针对VLM视觉定位的多目标后门攻击。与依赖静态触发器或固定目标的先前攻击不同,我们提出了IAG,一种动态生成输入感知、文本引导触发器的方法,这些触发器以任意指定目标对象描述为条件来执行攻击。这是通过一个文本条件的UNet实现的,该网络将难以察觉的目标语义线索嵌入视觉输入,同时保持对良性样本的正常定位性能。我们进一步开发了一个联合训练目标,平衡语言能力与感知重建,以确保隐蔽性、有效性和隐秘性。在多个VLM(如LLaVA、InternVL、Ferret)和基准(RefCOCO、RefCOCO+、RefCOCOg、Flickr30k Entities和ShowUI)上的大量实验表明,IAG在几乎所有设置下都实现了比其他基线最佳的攻击成功率,同时不损害干净准确率,保持对现有防御的鲁棒性,并展现出跨数据集和模型的迁移性。这些发现强调了具有定位能力的VLM中的关键安全风险,并突出了对可信多模态理解的进一步研究的必要性。

英文摘要

Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.

2603.19453 2026-06-02 cs.CL cs.GT 版本更新

Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

超越标量奖励:面向序列社会困境的LLM策略合成中的密集反馈

Víctor Gallego

发表机构 * Komorebi AI Technologies(Komorebi AI技术公司)

AI总结 本研究通过反馈工程,比较稀疏奖励与密集社会指标反馈对LLM在多智能体环境中生成程序化策略的影响,发现密集反馈能打破奖励别名,提升策略性能。

Comments Accepted to NExT-Game 2026: New Frontiers in Game-Theoretic Learning, ICML 2026 Workshop

详情
AI中文摘要

我们研究LLM策略合成:使用语言模型迭代生成多智能体环境的程序化智能体策略。我们的框架不是通过强化学习训练神经策略,而是提示LLM生成Python策略函数,在自对弈中评估它们,并在迭代中使用性能反馈进行改进。我们研究反馈工程(在改进过程中向LLM展示哪些评估信息的设计),比较稀疏反馈(仅标量奖励)与密集反馈(奖励加上社会指标:效率、平等、可持续性、和平)。在两个经典的序列社会困境(Gathering和Cleanup)和两个前沿LLM(Claude Sonnet 4.6,Gemini 3.1 Pro)上,密集反馈在所有指标上始终匹配或超过稀疏反馈。我们通过反馈别名解释这种不对称性:当标量奖励单独将不同的失败模式映射到相同的值(例如,清洁不足与过度清洁)时,社会指标打破了别名,让LLM诊断出应该采取哪个纠正方向。因此,社会指标作为协调信号而非干扰,产生了诸如Voronoi区域划分和废物自适应清洁调度等策略。代码见https://github.com/vicgalle/llm-policies-social-dilemmas。

英文摘要

We study LLM policy synthesis: using a language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. We explain the asymmetry through feedback aliasing: when scalar reward alone maps distinct failure modes to the same value (e.g., under- vs. over-cleaning), social metrics break the alias and let the LLM diagnose which corrective direction to take. Social metrics thus function as a coordination signal rather than a distraction, yielding strategies such as Voronoi territory partitioning and waste-adaptive cleaner schedules. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.

2502.15411 2026-06-02 cs.CL cs.AI 版本更新

HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

HiFi-KPI:用于从财报中提取层次化KPI的数据集

Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva

发表机构 * Department of Computer Science, Aalborg University(奥尔堡大学计算机科学系) ALIPES ApS(ALIPES公司) University of Copenhagen(哥本哈根大学) Pioneer Centre for AI(先锋人工智能中心)

AI总结 针对财报中关键绩效指标(KPI)跨公司可迁移性差的问题,提出包含165万段落和19.8万层次化标签的HiFi-KPI数据集,并评估分类、提取和结构化提取三个任务。

详情
AI中文摘要

准确标注财报可以为利益相关者带来显著的短期回报。机器可读的内联可扩展商业报告语言(iXBRL)是公开财务申报的强制要求。然而,其复杂且细粒度的分类法限制了标记关键绩效指标(KPI)的跨公司可迁移性。为了解决这个问题,我们引入了层次化财务关键绩效指标(HiFi-KPI)数据集,这是一个包含165万段落和19.8万个独特层次化标签的大规模语料库,这些标签与iXBRL分类法相关联。HiFi-KPI支持多个任务,我们评估了其中三个:KPI分类、KPI提取和结构化KPI提取。为了快速评估,我们还发布了HiFi-KPI-Lite,一个手动策划的8K段落子集。在HiFi-KPI-Lite上的基线实验表明,基于编码器的模型在分类任务上达到了超过0.906的宏F1分数,而大型语言模型(LLM)在结构化提取任务上达到了0.440的F1分数。最后,定性分析显示提取错误主要与日期相关。我们在https://github.com/aaunlp/HiFi-KPI上开源了所有代码和数据。

英文摘要

Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Business Reporting Language (iXBRL) is mandated for public financial filings. Yet, its complex, fine-grained taxonomy limits the cross-company transferability of tagged Key Performance Indicators (KPIs). To address this, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, a large-scale corpus of 1.65M paragraphs and 198k unique, hierarchically organized labels linked to iXBRL taxonomies. HiFi-KPI supports multiple tasks and we evaluate three: KPI classification, KPI extraction, and structured KPI extraction. For rapid evaluation, we also release HiFi-KPI-Lite, a manually curated 8K paragraph subset. Baselines on HiFi-KPI-Lite show that encoder-based models achieve over 0.906 macro-F1 on classification, while Large Language Models (LLMs) reach 0.440 F1 on structured extraction. Finally, a qualitative analysis reveals that extraction errors primarily relate to dates. We open-source all code and data at https://github.com/aaunlp/HiFi-KPI.

2603.18016 2026-06-02 cs.CL cs.AI cs.DC cs.LG 版本更新

MineDraft: A Framework for Batch Parallel Speculative Decoding

MineDraft: 批量并行推测解码框架

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Toyota Research Institute(丰田研究院) Toyota Motor Corporation(丰田公司)

AI总结 提出MineDraft框架,通过批量并行设计将草稿生成与验证阶段重叠,显著提升推测解码的吞吐量和端到端延迟。

Comments Accepted at ICML 2026

详情
AI中文摘要

推测解码(SD)通过使用较小的草稿模型提出草稿令牌,随后由较大的目标模型验证,从而加速大型语言模型推理。然而,标准SD的性能通常受限于这些草稿和验证阶段的严格顺序执行。为解决此问题,本文提出MineDraft,一种批量并行推测解码(PSD)框架,旨在通过将草稿生成与验证重叠来有效隐藏草稿延迟。我们的理论分析表明,PSD比标准SD高效得多。MineDraft通过一种新颖的批量并行设计实现PSD,该设计维护两个请求批次,将一个批次的草稿生成与另一个批次的验证重叠。我们的实验结果显示,与标准SD相比,MineDraft在吞吐量(最高提升75%)和端到端延迟(最高降低39%)方面均有显著改进。此外,我们已将MineDraft实现为vLLM的插件,展示了其在生产级推理系统中的实用性。

英文摘要

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

2603.16142 2026-06-02 cs.CL 版本更新

Parametric Social Identity Injection and Diversification in Public Opinion Simulation

参数化社会身份注入与多样化在舆论模拟中的应用

Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu

发表机构 * DCST, Tsinghua University(清华大学信息科技系) Quancheng Laboratory(泉城实验室) Tsinghua University(清华大学) Shandong China(山东中国)

AI总结 针对大语言模型在舆论模拟中社会多样性不足的问题,提出参数化社会身份注入(PSII)框架,通过向中间隐藏状态注入显式参数化的人口属性与价值取向表示,显著提升分布保真度与多样性。

Comments Accepted to KDD 2026 Research Track. Project page: https://github.com/halsayxi/PSII

详情
AI中文摘要

大语言模型(LLMs)最近被用作舆论模拟的合成代理,为昂贵且缓慢的人类调查提供了一种有前景的替代方案。尽管具有可扩展性,当前基于LLM的模拟方法未能捕捉社会多样性,产生扁平化的群体间差异和跨人口群体的过度同质化响应。我们将这一限制识别为LLM隐藏表示中的多样性崩溃现象,其中不同的社会身份在各层之间变得越来越难以区分。受此观察启发,我们提出了参数化社会身份注入(PSII),这是一个通用框架,将人口属性和价值取向的显式参数化表示直接注入LLM的中间隐藏状态。与基于提示的人物角色条件化不同,PSII能够在表示层面实现细粒度且可控的身份调制。在多个开源LLM上使用世界价值观调查进行的广泛实验表明,PSII显著提高了分布保真度和多样性,减少了与真实世界调查数据的KL散度,同时增强了整体多样性。这项工作为LLM代理的表示级控制提供了新见解,并推动了可扩展的、具有多样性意识的舆论模拟。

英文摘要

Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses across demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation.

2603.10619 2026-06-02 cs.CL 版本更新

Disentangling Similarity and Relatedness in Topic Models

在主题模型中区分相似性和相关性

Hanlin Xiao, Yang Wang, Mauricio A. Álvarez, Rainer Breitling

发表机构 * Manchester Institute of Biotechnology, The University of Manchester, UK(曼彻斯特生物技术研究所,曼彻斯特大学,英国) Department of Computer Science, The University of Manchester, UK(计算机科学系,曼彻斯特大学,英国) Department of Chemistry, The University of Manchester, UK(化学系,曼彻斯特大学,英国)

AI总结 本文通过构建大规模合成基准和神经评分器,从心理语言学角度区分主题词之间的主题相关性和分类相似性,并证明两个维度对下游任务性能有不同影响。

Comments 26 pages, 9 figures, 18 tables

详情
AI中文摘要

大型预训练语言模型(PLMs)的最新成功推动了它们与主题建模的整合。然而,PLM增强的主题模型不仅在性能上,而且在它们捕获的语义结构类型上也与经典共现模型(如潜在狄利克雷分配(LDA))不同。我们沿着两个心理语言学轴形式化了这种区别:主题相关性(狗/骨头)和分类相似性(狗/狼)。为了衡量主题词上的这两个轴,我们使用基于LLM的标注构建了一个大型合成词对基准,并训练了一个神经评分器。在多个语料库和模型家族中,评分器将不同的主题模型家族置于联合相似性-相关性空间中的不同位置。这两个分数进一步预测了下游任务性能:需要相似性的任务受益于富含相似性的主题,而需要相关性的任务则受益于相反的情况,并且对任一轴的过度强调都会降低与对立语义结构对齐的任务的性能。没有一个轴是普遍有益的。因此,衡量两者提供了一种实用的、与模型无关的诊断方法,用于评估主题模型捕获的语义结构。

英文摘要

The recent success of large pre-trained language models (PLMs) has motivated their integration into topic modeling. However, PLM-augmented topic models differ from classical co-occurrence models such as Latent Dirichlet Allocation (LDA) not only in performance, but also in the type of semantic structure they capture. We formalize this distinction along two psycholinguistic axes: thematic relatedness (dog/bone) and taxonomic similarity (dog/wolf). To measure both axes over topic words, we construct a large synthetic benchmark of word pairs using LLM-based annotation and train a neural scorer on it. Across multiple corpora and model families, the scorer places different topic-model families at distinct positions within the joint similarity-relatedness space. The two scores further predict downstream task performance: tasks requiring similarity benefit from similarity-rich topics, whereas tasks requiring relatedness benefit from the converse, and excessive emphasis on either axis degrades performance on tasks aligned with the opposing semantic structure. Neither axis is uniformly beneficial. Measuring both therefore provides a practical, model-agnostic diagnostic for evaluating the semantic structure captured by topic models.

2603.09692 2026-06-02 cs.LG cs.AI cs.CL 版本更新

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

ActiveUltraFeedback:使用主动学习的高效偏好数据生成

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 提出ActiveUltraFeedback主动学习流水线,通过不确定性估计和两种新采样方法(DRTS和DeltaUCB)动态选择最具信息量的响应对,以最少六分之一的标注数据实现与静态基线相当或更优的下游性能。

Comments 40 pages, 9 figures, 26 tables

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)已成为对齐大型语言模型(LLMs)的标准方法,但其有效性受到偏好数据获取高成本的瓶颈限制,尤其是在低资源和专家领域。为解决这一问题,我们引入了ACTIVEULTRAFEEDBACK,一个模块化的主动学习流水线,利用不确定性估计动态识别最具信息量的响应进行标注。我们的流水线支持系统评估标准响应选择方法以及两种新方法:DOUBLE REVERSE THOMPSON SAMPLING(DRTS)和DELTAUCB,这两种方法优先选择预测质量差距大的响应对,利用近期研究结果,即此类对为微调提供良好信号。实验表明,ACTIVEULTRAFEEDBACK生成的高质量数据集在下游性能上带来显著提升,尤其以静态基线六分之一的标注数据即可达到相当或更优的结果。我们的流水线可在https://github.com/lasgroup/ActiveUltraFeedback获取,偏好数据集可在https://huggingface.co/ActiveUltraFeedback获取。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.

2509.25773 2026-06-02 cs.CV cs.AI cs.CL 版本更新

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

v-HUB: 从视觉和声音理解视频幽默的基准

Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng

发表机构 * Shanghai Jiao Tong University(上海交通大学) Wuhan University(武汉大学) Beijing Institute for General Artificial Intelligence(北京一般人工智能研究院) Independent Researcher(独立研究者)

AI总结 提出v-HUB基准,通过非语言短视频评估多模态大语言模型在仅凭视觉线索理解幽默的能力,并发现音频信息有助于提升幽默理解。

Comments 24 pages, 9 figures

详情
AI中文摘要

能够理解幽默的AI模型具有现实应用前景——例如,增强人机交互中的参与度。为了评估和诊断多模态大语言模型(MLLMs)理解幽默的能力,我们引入了v-HUB,一个新颖的视频幽默理解基准。v-HUB包含一个精心策划的非语言短视频集合,反映了仅通过视觉线索即可欣赏幽默的现实场景。我们将每个视频片段与丰富的标注配对,以支持各种评估任务和分析,包括一项关于增强幽默的环境声音的新研究。为了扩大其适用性,我们构建了一个开放式问答任务,使v-HUB能够轻松集成到现有的视频理解任务套件中。我们评估了多种MLLMs,从专门的Video-LLMs到能够原生处理音频的多功能OmniLLMs,涵盖了开源和专有领域。实验结果揭示了MLLMs在仅凭视觉线索理解幽默时面临的困难。我们的发现还表明,结合音频有助于视频幽默理解,突显了为复杂视频理解任务整合更丰富模态的前景。

英文摘要

AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

2603.08026 2026-06-02 cs.CL cs.AI cs.PF 版本更新

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

DyLLM: 基于显著性标记选择与部分注意力的高效扩散LLM推理

Younjoo Lee, Seungkyun Dan, Junghoo Lee, Jaiyoung Park, Jung Ho Ahn

发表机构 * University of Seoul(首尔大学)

AI总结 针对扩散语言模型迭代去噪计算昂贵的问题,提出DyLLM框架,通过仅计算显著标记的注意力与前馈操作,实现无需训练的加速推理,在保持精度的同时吞吐量提升高达9.6倍。

Comments 21 pages, 10 figures, 7 tables, accepted at ICML 2026

详情
AI中文摘要

掩码扩散语言模型支持并行令牌解码,为自回归生成的顺序性质提供了一种有前景的替代方案。然而,其迭代去噪过程仍然计算昂贵,因为每一步都重复处理整个序列。我们观察到,在这些扩散步骤中,大多数令牌表示保持稳定;只有一小部分(我们称之为显著令牌)对下一次更新有实质性贡献。利用这种时间稀疏性,我们提出了DyLLM,一种无需训练的推理框架,通过仅选择性地计算这些显著令牌来加速解码。DyLLM通过测量相邻去噪步骤之间注意力上下文的余弦相似性来识别显著性。它仅对显著令牌重新计算前馈和注意力操作,同时为其余令牌重用缓存的激活。在多种推理和代码生成基准测试中,DyLLM实现了高达9.6倍的吞吐量提升,同时基本保持了代表性开源扩散LLM(LLaDA和Dream)的基线准确性。

英文摘要

Masked diffusion language models enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of representative open-source diffusion LLMs, LLaDA, and Dream.

2603.08000 2026-06-02 cs.CL cs.LG 版本更新

SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

SmartThinker: 渐进式思维链长度校准以实现高效的大语言模型推理

Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen

发表机构 * Tsinghua University(清华大学)

AI总结 针对大型推理模型输出冗余问题,提出基于GRPO的渐进式CoT长度校准方法SmartThinker,通过动态估计最优长度和调节长度奖励系数,在压缩响应长度同时提升准确率。

Comments Accepted by ICML 2026, 18 pages, 13 figures

详情
AI中文摘要

大型推理模型(LRMs),如OpenAI o1和DeepSeek-R1,通过采用长思维链(CoT)推理路径在复杂任务上实现了高准确率。然而,这些过程固有的冗长常常导致冗余和过度思考。为了解决这一问题,现有工作利用组相对策略优化(GRPO)来减少LRM的输出长度,但其静态长度奖励设计无法根据问题相对难度和响应长度分布动态调整,导致过度压缩和准确率下降。因此,我们提出SmartThinker,一种新颖的基于GRPO的高效推理方法,具有渐进式CoT长度校准。SmartThinker有两个贡献:首先,它在训练期间动态估计具有峰值准确率的最优长度,并引导过长响应朝向该长度,以减少响应长度同时保持准确率。其次,它动态调节长度奖励系数,以避免对正确推理路径的不当惩罚。大量实验结果表明,SmartThinker在提高准确率的同时实现了高达52.5%的平均长度压缩,并在AIME25等具有挑战性的基准上实现了高达16.6%的准确率提升。源代码可在https://github.com/SJTU-RTEAS/SmartThinker获取。

英文摘要

Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU-RTEAS/SmartThinker.

2512.16310 2026-06-02 cs.CR cs.AI cs.CL 版本更新

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

Agent工具编排泄露更多:数据集、基准测试与缓解措施

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院)

AI总结 研究LLM代理在编排多个工具时泄露敏感结论的风险(TOP-R),构建了包含1000个实例的基准TOP-Bench,并提出TOP-Align后训练方法以缓解泄露。

Comments 17 pages, 2 figures. Dataset and code are available at https://github.com/1Ponder/TOP-R

详情
AI中文摘要

基于LLM的代理越来越多地使用多个外部工具来完成复杂任务。我们研究了工具编排隐私风险(TOP-R):代理可能组合单个非敏感的工具返回结果,并披露一个非预期的敏感结论。我们通过三个条件形式化TOP-R:结论敏感性、单源不可推断性和组合可推断性。我们引入了LRSE(基于库的反向推理种子扩展),这是一个基于隐私规范、推理链、工具模式和任务场景的四库反向构建流水线,并使用它构建了TOP-Bench,一个包含1000个实例的基准测试。该基准测试在受控的两阶段工具使用协议下评估最终响应的语义泄露。在六个LLM代理中,任务完成率保持较高,但平均泄露率达到88.6%,导致H分数仅为20.4。两种仅提示的防护措施在主基准测试上将H分数提高了约2.7分。我们进一步提出了TOP-Align,一种SFT+DPO后训练方法,用于更安全的任务完成边界。在单独的后训练评估划分上,TOP-Align将H分数比相应基础模型提高了16.2分,而同一划分上仅提示缓解措施的平均增益为4.9分。这些结果表明TOP-R需要超越仅提示的缓解措施。

英文摘要

LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an agent may combine individually non-sensitive tool returns and disclose an unintended sensitive conclusion. We formalize TOP-R with three conditions: conclusion sensitivity, single-source non-inferability, and compositional inferability. We introduce LRSE (Library-Grounded Reverse-Inference Seed Expansion), a four-library reverse-construction pipeline grounded in privacy norms, reasoning chains, tool schemas, and task scenarios, and use it to build TOP-Bench, a 1,000-instance benchmark. The benchmark evaluates final-response semantic disclosure under a controlled two-stage tool-use protocol. Across six LLM agents, task completion remains high, but the average leakage rate reaches 88.6 percent, yielding an H-score of only 20.4. Two prompt-only safeguards improve H-score by about 2.7 points on the main benchmark. We further propose TOP-Align, an SFT+DPO post-training method for safer task completion boundaries. On a separate post-training evaluation split, TOP-Align improves H-score by 16.2 points over the corresponding base model, compared with a 4.9-point average gain from prompt-only mitigation on the same split. These results show that TOP-R requires mitigation beyond prompting alone.

2603.04828 2026-06-02 cs.CL 版本更新

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

从陌生到熟悉:通过梯度偏差检测大型语言模型中的预训练数据

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan

发表机构 * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University(北京未来区块链与隐私计算先进创新中心,北京航空航天大学) School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院) Institute for AI Industry Research (AIR), Tsinghua University(清华大学人工智能产业研究院)

AI总结 提出GDS方法,通过分析目标样本的梯度偏差分数(包括更新幅度、位置和神经元激活集中度)来区分预训练成员与非成员数据,实现高效且跨数据集迁移的预训练数据检测。

Comments 17 pages, 8 figures

详情
AI中文摘要

大型语言模型的预训练数据检测对于解决版权问题和减轻基准污染至关重要。现有方法主要关注微调前后的基于似然的统计特征或启发式信号,但前者易受语料库中词频偏差的影响,后者强烈依赖于微调数据的相似性。从优化角度,我们观察到在训练过程中,样本从陌生到熟悉的转变方式体现在梯度行为的系统性差异上。熟悉样本表现出更小的更新幅度、模型组件中不同的更新位置以及更尖锐激活的神经元。基于这一洞察,我们提出GDS,一种通过探测目标样本的梯度偏差分数来识别预训练数据的方法。具体来说,我们首先使用梯度轮廓表示每个样本,该轮廓捕获跨FFN和注意力模块的参数更新的幅度、位置和集中度,揭示成员与非成员数据之间的一致区别。然后将这些特征输入轻量级分类器进行二值成员推断。在五个公共数据集上的实验表明,GDS在强基线上实现了最先进的性能,并显著提高了跨数据集的可迁移性。进一步的可解释性分析揭示了梯度分布的差异,半监督结果为检测预训练数据提供了一种实用方法。

英文摘要

Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses reveal differences in gradient distributions, and the semi-supervised results offer a practical way to detect pre-training data.

2603.03202 2026-06-02 cs.CL 版本更新

Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Code2Math:你的代码智能体能否通过探索有效演化数学问题?

Dadi Guo, Yuejin Xie, Qingyu Liu, Weixian Huang, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Jianjie Feng, Wenze Su, Yujiu Yang, Dongrui Liu, Yi R. Fung

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学) Zhejiang University(浙江大学) Nanjing Tech University(南京工业大学) Shanghai Jiao Tong University(上海交通大学) University of Michigan(密歇根大学) Independent Researcher(独立研究者)

AI总结 本文提出一个多智能体框架,利用代码智能体通过探索将现有数学问题自主演化为更复杂、更困难的变体,并验证其可解性和难度提升。

Comments 38 pages

详情
AI中文摘要

随着大型语言模型(LLM)在数学能力上向IMO和研究水平迈进,高质量挑战性问题的稀缺已成为LLM训练、评估和自我进化的显著瓶颈。同时,最近的代码智能体在智能体编码和推理方面展示了复杂技能,表明代码执行可以作为数学实验的可扩展环境。在本文中,我们研究了代码智能体自主将现有数学问题演化为更复杂变体的潜力。我们引入了一个多智能体框架,旨在执行问题演化,同时验证生成问题的可解性和难度增加。我们的实验表明,在充分的测试时探索下,代码智能体可以合成新的、可解的问题,这些问题在结构上与原始问题不同且更具挑战性。本工作提供了经验证据,表明代码驱动的智能体可以作为在可扩展计算环境中合成高难度数学推理问题的可行机制。代码和数据可在 https://github.com/TarferSoul/Code2Math 获取。

英文摘要

As large language models (LLMs) advance their mathematical capabilities toward the IMO and research level, the scarcity of challenging, high-quality problems has become a significant bottleneck for training, evaluation and self-evolution of LLMs. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Code and data is available at https://github.com/TarferSoul/Code2Math.

2603.03291 2026-06-02 cs.CL cs.AI 版本更新

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

一个接一个的偏差:语言奖励模型中的机械奖励塑造与持续偏差

Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

发表机构 * Stanford University(斯坦福大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过系统测量五个高质量奖励模型中的偏差,发现长度、谄媚、过度自信等持续问题,并提出一种简单的后处理干预方法(机械奖励塑造)来减轻低复杂度偏差。

Comments ICML 2026 Camera-ready

详情
AI中文摘要

奖励模型(RMs)对于语言模型(LMs)与人类偏好的在线对齐至关重要。然而,基于RM的偏好调优容易受到奖励破解的影响,即LM策略从有缺陷的RM中学习不良行为。通过系统测量五个高质量RM(包括最先进的模型)中的偏差,我们发现尽管已有相关工作,但在长度、谄媚和过度自信方面的问题仍然存在。我们还发现了与模型特定“风格”和答案顺序相关的新偏差。我们将RM失败分类为可处理或对线性干预具有抵抗性,并提出一种简单的后处理干预措施,以减轻由虚假相关性引起的低复杂度偏差。我们提出的机械奖励塑造在不降低奖励质量且使用最少标注数据的情况下减少了目标偏差。该方法可扩展到新偏差、模型内部,并具有分布外泛化能力。

英文摘要

Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues persist despite prior work with respect to length, sycophancy, and overconfidence. We also discover new issues related to bias toward model-specific ``styles'' and answer-order. We categorize RM failures as tractable or resistant to linear intervention and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality and while using minimal labeled data. The method is extensible to new biases, model-internal, and generalizes out-of-distribution.

2603.00963 2026-06-02 cs.LG cs.CL 版本更新

Stabilizing Policy Optimization via Logits Convexity

通过Logits凸性稳定策略优化

Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对强化学习训练不稳定的问题,从梯度角度分析监督微调与强化学习的稳定性差距,提出Logits凸优化(LCO)框架,通过模拟logits级凸性来稳定策略优化,实验表明该方法能提升训练稳定性并在多个基准上优于传统方法。

详情
AI中文摘要

虽然强化学习(RL)在大语言模型(LLM)近期成功中发挥了核心作用,但RL优化以不稳定著称,尤其是与监督微调(SFT)相比。本文从梯度角度研究SFT和RL之间的稳定性差距,并表明SFT损失相对于模型logits的凸性在实现稳定训练中起关键作用。我们的理论分析证明,该性质在优化过程中诱导了有利的梯度方向性。相比之下,广泛采用的策略梯度算法——使用裁剪替代目标的近端策略优化(PPO)缺乏这种稳定性质。受此观察启发,我们提出Logits凸优化(LCO),一种简单而有效的策略优化框架,将学习策略与从原始RL目标导出的最优目标对齐,从而模拟logits级凸性的稳定效果。跨多个模型家族的大量实验表明,我们的LCO框架一致地提升了训练稳定性,并在广泛的基准测试中优于传统RL方法。

英文摘要

While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.

2603.00829 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Constitutional Black-Box Monitoring for Scheming in LLM Agents

LLM Agent 中阴谋行为的宪法黑盒监控

Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

发表机构 * University of Cambridge(剑桥大学)

AI总结 研究使用基于宪法黑盒的监控器,通过仅观察外部输入和输出检测LLM Agent的阴谋行为,并在合成数据上优化后泛化到更真实环境。

Comments Accepted at ICML 2026. Camera-ready version

详情
AI中文摘要

在自主环境中安全部署大型语言模型(LLM)Agent需要可靠的监督机制。一个核心挑战是检测阴谋行为,即Agent暗中追求不一致的目标。缓解此类风险的一种方法是基于LLM的监控:使用语言模型检查Agent行为中的可疑动作。我们研究宪法黑盒监控器:仅利用外部可观测的输入和输出检测阴谋行为的提示分类器,并在从自然语言行为规范生成的合成数据上优化。我们引入两个生成合成Agent轨迹的流水线:STRIDE(迭代精炼)和Gloom(Agent-环境模拟),各生成1000个样本。通过提示扫描、人工精炼和自动提示优化,我们在这些数据集上优化前沿LLM监控器,并在ControlArena(一套Agent在更现实环境中运行的接地环境)中的7500个保留轨迹上评估性能。结果表明,仅基于合成数据选择的监控器可以泛化到更现实的环境,捕获有意义的阴谋信号。然而,我们发现性能在我们的设置中迅速饱和,简单的提示扫描匹配了更广泛优化的结果。超越这一限制不会带来进一步改进,反而导致过拟合。

英文摘要

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

2511.00206 2026-06-02 cs.AI cs.CL 版本更新

Addressing Longstanding Challenges in Cognitive Science with Language Models

用语言模型应对认知科学中长期存在的挑战

Dirk U. Wulff, Rui Mata

发表机构 * Center for Adaptive Rationality, Max Planck Institute for Human Development(适应性理性中心,马克斯·普朗克人类发展研究所) Faculty of Psychology, University of Basel(心理学系,巴塞尔大学)

AI总结 本文探讨如何利用语言模型应对认知科学中研究整合、形式化、概念清晰度等长期挑战,并指出其风险与机遇。

详情
AI中文摘要

认知科学因其多面性和跨学科性质,在研究整合、形式化、概念清晰度等领域面临持续挑战。人工智能的最新进展,特别是语言模型的发展,提供了可能有助于解决这些长期问题的工具。具体而言,它们可以帮助映射碎片化的文献、形式化言语理论、识别构念和测量之间的重叠、跨任务生成预测,以及从自然数据中提取文化或生态结构。然而,这些机遇也伴随着风险,包括过度简化、不透明性、技能退化以及偏见。综合来看,我们得出结论:当审慎地使用语言模型来补充而非取代人类能动性时,它们可以成为促进更具整合性和累积性的认知科学的工具。

英文摘要

Cognitive science faces ongoing challenges in research integration, formalization, conceptual clarity, and other areas, in part due to its multifaceted and interdisciplinary nature. Recent advances in artificial intelligence, particularly the development of language models, offer tools that may help to address these longstanding issues. Specifically, they can help map fragmented literatures, formalize verbal theories, identify overlap among constructs and measures, generate predictions across tasks, and extract cultural or ecological structure from naturalistic data. However, these opportunities come with risks, including oversimplification, opacity, deskilling, and bias. Taken together, we conclude that language models could serve as tools for a more integrative and cumulative cognitive science when used judiciously to complement, rather than replace, human agency.

2603.00021 2026-06-02 cs.CL 版本更新

From Global to Local: Learning Context-Aware Graph Representations for Document Classification and Summarization

从全局到局部:学习上下文感知的图表示用于文档分类与摘要

Ruangrin Ldallitsakool, Margarita Bugueño, Gerard de Melo

发表机构 * University of Potsdam(波恩大学) Hasso Plattner Institute (HPI)(哈索罗滕堡研究所)

AI总结 本文提出一种数据驱动的图构建方法,利用动态滑动窗口注意力模块捕获句子间局部与中程语义依赖,结合图注意力网络在文档分类中取得竞争性结果并降低计算成本,同时探索了该方法在抽取式摘要中的应用潜力与局限。

详情
AI中文摘要

最近的NLP系统通常将文档表示为线性标记序列。尽管这捕获了顺序关系,但可能阻碍长程依赖和全局文档结构的建模,尤其是对于长文本。本文提出一种数据驱动的方法来自动构建基于图的文档表示。基于Bugueño和de Melo(2025)的近期工作,我们利用动态滑动窗口注意力模块有效捕获句子之间的局部和中程语义依赖,以及文档内部的结构关系。在我们学习到的图上训练的图注意力网络(GAT)在文档分类上取得了有竞争力的结果,同时比先前方法需要更少的计算资源。我们进一步对所提出的图构建方法在抽取式文档摘要上进行了探索性评估,突出了其潜力和当前局限性。该项目的实现可在GitHub上找到。

英文摘要

Recent NLP systems commonly represent documents as linear token sequences. Although this captures sequential order, it can hinder modeling long-range dependencies and global document structure, especially for long texts. This paper proposes a data-driven method to automatically construct graph-based document representations. Building upon the recent work of Bugueño and de Melo (2025), we leverage the dynamic sliding-window attention module to effectively capture local and mid-range semantic dependencies between sentences, as well as structural relations within documents. Graph Attention Networks (GATs) trained on our learned graphs achieve competitive results on document classification while requiring lower computational resources than previous approaches. We further present an exploratory evaluation of the proposed graph construction method for extractive document summarization, highlighting both its potential and current limitations. The implementation of this project can be found on GitHub.

2602.23881 2026-06-02 cs.LG cs.CL 版本更新

LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

LK损失:用于推测解码的直接接受率优化

Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev

发表机构 * arXiv

AI总结 针对推测解码中标准KL散度训练不能最大化接受率的问题,提出LK损失直接优化接受率,实验表明在多种架构和模型上一致提升接受指标。

Comments ICML 2026

详情
AI中文摘要

推测解码通过使用轻量级草稿模型提出候选令牌,然后由目标模型并行验证,从而加速自回归大型语言模型(LLM)推理。加速效果显著取决于接受率,然而标准训练将Kullback-Leibler(KL)散度作为代理目标进行最小化。虽然KL散度和接受率共享相同的全局最优解,但小型草稿模型由于容量有限,通常收敛到次优解,此时最小化KL并不能保证最大化接受率。为解决此问题,我们提出LK损失,这是一种直接针对接受率的特殊训练目标。在四种草稿架构和六个目标模型(参数范围从8B到685B)上的全面实验表明,与基于KL的标准训练相比,所有配置下的接受指标均有一致提升。我们在通用、编码和数学领域评估了我们的方法,并报告平均接受长度提升高达8-10%。LK损失易于实现,不引入计算开销,可直接集成到任何现有的推测器训练框架中,使其成为现有草稿训练目标的有力替代方案。

英文摘要

Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by the acceptance rate, yet standard training minimizes Kullback-Leibler (KL) divergence as a proxy objective. While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. To address this issue, we propose LK losses, special training objectives that directly target acceptance rate. Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training. We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length. LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.

2602.23866 2026-06-02 cs.SE cs.CL 版本更新

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

SWE-rebench V2: 大规模语言无关的SWE任务集合

Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev

发表机构 * GitHub

AI总结 提出语言无关的自动化流水线SWE-rebench V2,用于大规模收集可执行的软件工程任务并构建RL训练环境,产出涵盖20种语言、3617个仓库的32079个任务数据集。

Comments ICML 2026

详情
AI中文摘要

软件工程智能体(SWE)正在快速进步,最近的进展主要由强化学习(RL)驱动。然而,RL训练受到大规模任务集合稀缺性的限制,这些任务需要具有可复现的执行环境和可靠的测试套件。尽管越来越多的基准测试出现,适合训练的数据集在规模和多样性上仍然有限,或者通常针对有限的高资源语言生态系统。我们引入SWE-rebench V2,一个语言无关的自动化流水线,用于大规模收集可执行的真实世界SWE任务并构建RL训练环境。该流水线通过交互式设置智能体合成仓库特定的安装和测试程序,并使用一组LLM评判器过滤不合理的实例,这些评判器经过人工验证的SWE-bench注释验证。使用该流水线,我们构建了一个包含20种语言、3617个仓库的32079个任务数据集,并附带预构建的镜像以实现可复现执行。为了进一步扩展训练数据,我们还发布了超过120000个任务,包含安装说明、失败到通过的测试和丰富的元数据,其中问题陈述基于原始拉取请求描述生成。我们通过一项诊断研究验证了收集的实例,该研究涵盖了五种编程语言中的一部分任务,涉及七个流行模型,并提供了实例级元数据,标记了常见的混淆因素,如过于严格的测试和描述不充分。我们发布了数据集、收集和执行代码以及相关工件,以支持跨多种语言和仓库的大规模SWE智能体训练。

英文摘要

Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,079 tasks spanning 20 languages and 3,617 repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests and rich metadata, where the problem statement is generated based on the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.

2602.17588 2026-06-02 cs.CL cs.HC 版本更新

Modeling Distinct Human Interaction in Web Agents

建模网络代理中不同的人类交互

Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Duke University(杜克大学)

AI总结 本文通过收集400条真实用户网络导航轨迹,识别四种人机交互模式,并训练语言模型预测用户干预时机,将干预预测准确率提升61.4%-63.4%,用户评价的代理有用性提高36.8%。

Comments Preprint

详情
AI中文摘要

尽管自主网络代理取得了快速进展,但在任务执行过程中,人类参与对于塑造偏好和纠正代理行为仍然至关重要。然而,当前的代理系统缺乏对何时以及为何人类干预的原则性理解,常常在关键决策点自主行动或请求不必要的确认。在这项工作中,我们引入了建模人类干预以支持协作式网络任务执行的任务。我们收集了CowCorpus,这是一个包含400条真实用户网络导航轨迹的数据集,其中包含超过4,200个交错的人类和代理动作。我们识别了四种不同的用户与代理交互模式——放手监督、亲力监督、协作任务解决和完全用户接管。利用这些见解,我们训练语言模型(LMs)根据用户的交互风格预测用户何时可能干预,与基础LMs相比,干预预测准确率提高了61.4%-63.4%。最后,我们将这些干预感知模型部署到实时网络导航代理中,并在用户研究中评估它们,发现用户评定的代理有用性增加了36.8%。总之,我们的结果表明,结构化的人类干预建模可以产生更具适应性和协作性的代理。

英文摘要

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 36.8% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.

2602.23197 2026-06-02 cs.CL cs.LG stat.ML 版本更新

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

微调不忘上下文学习:线性注意力模型的理论分析

Chungpa Lee, Jy-yong Sohn, Kangwook Lee

发表机构 * KAIST(韩国科学技术院)

AI总结 本文通过线性注意力模型理论分析,揭示了微调目标如何修改注意力参数并导致少样本性能下降的条件,提出仅更新值矩阵可保持上下文学习能力。

详情
Journal ref
International Conference on Machine Learning (ICML) 2026
AI中文摘要

基于Transformer的大型语言模型展现出上下文学习能力,能够通过少量示例提示适应下游任务。实践中,这类模型常被微调以提升下游任务的零样本性能,使其无需示例即可解决问题,从而降低推理成本。然而,微调可能削弱上下文学习能力,限制微调模型在未见任务上的表现。利用线性注意力模型,我们提供了理论分析,刻画了微调目标如何修改注意力参数,并识别了导致少样本性能下降的条件。我们表明,微调所有注意力参数会损害上下文学习,而仅更新值矩阵可在保持上下文学习的同时提升零样本性能。我们进一步证明,引入辅助的少样本损失主要增强目标任务的上下文学习,但以牺牲微调未见任务上的上下文学习能力为代价。我们提供了来自合成和真实数据集的实验证据,与理论定性预测一致。

英文摘要

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We provide empirical evidence from synthetic and real-world datasets consistent with the qualitative predictions of our theory.

2602.22221 2026-06-02 cs.IR cs.AI cs.CL cs.CY 版本更新

Evaluating Reliability Asymmetries in Chinese Factual Search and AI Answers

评估中文事实搜索与AI答案中的可靠性不对称性

Geng Liu, Li Feng, Mengxiao Zhu, Francesco Pierri

发表机构 * Department of Electronics, Information and Bioengineering, Politecnico di Milano(电子、信息与生物工程系,米兰理工学院) University of Science and Technology of China(中国科学技术大学)

AI总结 通过构建基于真实中文搜索日志的查询事实核查数据集,比较传统搜索引擎、大型语言模型和搜索集成AI概览在中文是非问题上的准确性、回答频率、极性差距及区域信息需求差异,揭示可靠性不仅取决于回答正确性,还受回答频率、否定主张处理和信息需求暴露风险影响。

详情
AI中文摘要

搜索引擎和AI驱动的系统越来越多地成为获取事实信息的媒介,但在现实信息寻求场景中,其可靠性仍难以评估。我们通过从真实中文搜索日志构建基于查询的事实核查数据集,并比较传统搜索引擎、独立大型语言模型和搜索集成AI概览等九种系统,在中文网络生态中研究这一问题。聚焦于中文事实性是非问题,我们根据证据推导的基准事实评估系统是否提供正确、错误或不确定的判断。我们发现,当系统给出明确答案时,准确率相似(73.2%至78.9%),但给出明确答案的频率差异显著:搜索引擎对超过83%的查询给出明确答案,而Qwen-Max则不到一半。我们还发现一致的极性差距:所有系统在标记为“是”的查询上表现优于标记为“否”的查询。我们利用百度指数数据识别健康相关搜索关注度较高的中国省份,这可能表明更大的错误信息暴露风险。总体而言,我们的结果表明,可靠性不仅取决于系统回答时的正确性,还取决于回答频率、如何处理否定主张以及信息需求可能增加暴露风险的地方。

英文摘要

Search engines and AI-powered systems increasingly mediate access to factual information, yet their reliability remains difficult to evaluate in realistic information-seeking settings. We study this problem in the Chinese web ecosystem by constructing a query-based fact-checking dataset from real Chinese search logs and comparing nine systems across traditional search engines, standalone large language models, and search-integrated AI Overviews. Focusing on factual Chinese-language factual Yes/No questions, we evaluate whether systems provide correct, incorrect, or uncertain decisions against evidence-derived ground truth. We find that systems are similarly accurate when they provide definitive answers, but differ sharply in how often they do so. Conditional accuracy ranges from 73.2% to 78.9%, yet search engines answer definitively on over 83% of queries, while Qwen-Max does so on fewer than half. We also find a consistent polarity gap: all systems perform better on yes-labeled queries than on no-labeled queries. We also use Baidu Index data to identify Chinese provinces with higher health-related search attention, which may indicate greater potential exposure to misinformation. Overall, our results show that reliability depends not only on whether systems are correct when they answer, but also on how often they answer, how they handle negative claims, and where information demand may increase exposure risks.

2512.09730 2026-06-02 cs.CL cs.LG 版本更新

Interpreto: An Explainability Library for Transformers

Interpreto:一个用于Transformer的可解释性库

Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich, Charlotte Claye, François Hoofd, Raphael Bernas, Nicholas Asher, Céline Hudelot, Fanny Jourdan

发表机构 * IRT Saint Exupéry Toulouse(伊尔杜夫圣埃克苏佩里图卢斯) IRIT Toulouse(图卢兹IRIT) Khoury College of Computer Sciences(科赫里计算机科学学院) Ampere(阿姆佩尔) MICS, CentraleSupélec(MICS,中央超导学院) Scienta Lab(科学实验室) Thales Avionics(泰勒斯航空电子) ANITI

AI总结 Interpreto是一个开源Python库,通过归因方法和基于概念的解释,为HuggingFace语言模型(从早期BERT变体到LLM)提供统一的解释工作流,其端到端基于概念的流水线是主要创新。

Comments Accepted to ACL 2026 System Demonstration. Equal contribution: Poché and Jourdan

详情
AI中文摘要

Interpreto是一个用于解释HuggingFace语言模型(从早期BERT变体到LLM)的开源Python库。它提供了两类互补的方法:归因方法和基于概念的解释。该库通过为分类和文本生成提供统一的API来暴露解释工作流,从而连接了最新研究和实用工具。一个关键的区别在于其端到端的基于概念的流水线(从激活提取到概念学习、解释和评分),这超越了特征级归因,在现有库中并不常见。参见GitHub: https://github.com/FOR-sight-ai/interpreto 和演示网站: https://for-sight-ai.github.io/interpreto-demo/。

英文摘要

Interpreto is an open-source Python library for interpreting HuggingFace language models, from early BERT variants to LLMs. It provides two complementary families of methods: attribution methods and concept-based explanations. The library bridges recent research and practical tooling by exposing explanation workflows through a unified API for both classification and text generation. A key differentiator is its end-to-end concept-based pipeline (from activation extraction to concept learning, interpretation, and scoring), which goes beyond feature-level attributions and is uncommon in existing libraries. See GitHub: https://github.com/FOR-sight-ai/interpreto and the demo website: https://for-sight-ai.github.io/interpreto-demo/.

2602.18008 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

LLM 是否准备好进行神经集成机制建模?一个基准测试与智能体框架

Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 本文提出神经集成机制建模(NIMM)基准测试,评估大语言模型在三个科学领域构建神经集成机制模型的能力,并设计树引导的智能体框架 NIMMGen,通过分支级搜索和原子模型细化显著提升搜索稳定性和解质量。

Comments 25 pages, 8 figures

详情
AI中文摘要

大语言模型(LLM)在从数据构建机制模型方面显示出潜力。然而,现有评估主要关注简化设置,未能捕捉真实世界科学建模的复杂性。在实践中,此类建模通常涉及神经集成公式,其中机制模型组件和神经网络组件共同构建,导致搜索空间显著复杂化。受此差距驱动,我们引入了神经集成机制建模(NIMM)基准测试,该基准测试评估 LLM 生成的神经集成机制模型在三个科学领域上的表现。在 NIMM 上的实验表明,现有基于 LLM 的方法难以有效探索这一复杂空间,导致搜索稳定性和解质量有限。为应对这一挑战,我们提出了 NIMMGen,一种树引导的智能体框架,通过分支级搜索实现多样化探索,并通过原子模型细化改进解。大量实验表明,NIMMGen 在 NIMM 上达到了最先进的性能,显著提升了搜索稳定性和解质量。

英文摘要

Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus on simplified settings and fail to capture the complexity of real-world scientific modeling. In practice, such modeling often involves neural-integrated formulations, where a mechanistic model component and a neural network component are jointly constructed, leading to a significantly more complex search space. Motivated by this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) benchmark, which evaluates LLM-generated neural-integrated mechanistic models across three scientific domains. Experiments on NIMM reveal that existing LLM-based approaches struggle to effectively explore this complex space, resulting in limited search stability and solution quality. To address this challenge, we propose NIMMGen, a tree-guided agentic framework that enables diversified exploration via branch-level search and improves solutions through atomic model refinement. Extensive experiments demonstrate that NIMMGen achieves state-of-the-art performance on NIMM, significantly improving search stability and solution quality.

2602.12984 2026-06-02 cs.CL 版本更新

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

SciAgentGym:LLM代理中多步科学工具使用的基准测试

Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

发表机构 * Fudan NLP Group(复旦大学自然语言处理组)

AI总结 为解决当前基准忽视代理在科学工作流中编排工具能力的问题,提出包含1780个领域特定工具的可扩展交互环境SciAgentGym和分层评估套件SciAgentBench,并设计数据合成方法SciForge,通过微调使SciAgent-8B超越更大模型,展示科学工具使用能力的跨领域正迁移。

详情
AI中文摘要

科学推理本质上需要整合复杂的工具包来导航领域特定知识。然而,当前的基准测试在很大程度上忽视了代理编排工具以进行此类严谨工作流的能力。为弥补这一差距,我们引入了SciAgentGym,一个可扩展的交互环境,包含四个自然科学学科中的1,780个领域特定工具,并由强大的执行基础设施支持。此外,我们提出了SciAgentBench,一个分层评估套件,旨在从基本动作到长期工作流对代理能力进行压力测试。我们的评估识别出一个关键瓶颈:最先进的模型仍然难以处理复杂的科学工具使用,并且随着交互范围的扩展,其性能显著下降。为解决这一问题,我们提出了SciForge,一种数据合成方法,将工具动作空间建模为依赖图,以生成逻辑感知的训练轨迹。通过在这些轨迹上进行微调,我们的SciAgent-8B优于显著更大的Qwen3-VL-235B-Instruct,同时表现出科学工具使用能力的跨领域正迁移。这些结果凸显了下一代自主科学代理的潜力。

英文摘要

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models still struggle with complex scientific tool-use, and their performance degrades substantially as interaction horizons extend. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.

2602.11852 2026-06-02 cs.AI cs.CL cs.LG 版本更新

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

原型Transformer:迈向可解释设计的语言模型架构

Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz

发表机构 * University of Cambridge(剑桥大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出原型Transformer(ProtoT),一种用线性代价原型模块替代二次代价自注意力的自回归语言模型架构,原型自动捕获可命名概念,提升可解释性并支持行为编辑。

Comments Accepted at ICML 2026. Equal contribution: Yordan Yordanov and Matteo Forasassi. 40 pages, 28 figures, 22 tables

详情
AI中文摘要

尽管最先进的语言模型(LM)在某些领域超越了大多数人类,但其推理过程仍然不透明,降低了信任度并增加了欺骗和幻觉的风险。我们引入了原型Transformer(ProtoT),一种自回归LM架构,它将Transformer的二次代价自注意力模块替换为基于原型的线性代价模块,原型是学习到的参数向量。在ProtoT中,原型创建了在不同时间尺度上聚合上下文信息的通信通道。我们表明,这种结构导致原型在训练过程中自动捕获可命名的概念,例如“女人”,为解释模型推理和对模型行为进行有针对性的编辑提供了途径。与基线相比,ProtoT在模型和数据规模上具有良好的扩展性,对输入扰动具有鲁棒性,并在文本生成和下游任务(包括GLUE)上表现良好。这些结果表明,ProtoT是朝着设计上更可解释的自回归语言模型迈出的有希望的一步。

英文摘要

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust and increasing the risk of deception and hallucination. We introduce the Prototype Transformer (ProtoT), an autoregressive LM architecture that replaces the quadratic-cost self-attention module of the Transformer with a linear-cost module based on prototypes, which are learned parameter vectors. In ProtoT, prototypes create communication channels that aggregate contextual information at different time scales. We show that this structure leads prototypes to automatically capture nameable concepts, such as "woman", during training, offering a path toward interpreting model reasoning and making targeted edits to model behavior. Compared with baselines, ProtoT scales well with model and data size, is robust to input perturbations, and performs well on text generation and downstream tasks, including GLUE. These results suggest that ProtoT is a promising step toward autoregressive language models that are more interpretable by design.

2602.11790 2026-06-02 cs.AI cs.CL 版本更新

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

超越端到端视频模型:基于LLM的多智能体系统用于教育视频生成

Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang

发表机构 * Baidu Inc.(百度公司)

AI总结 提出LASEV,一种基于LLM的分层多智能体系统,通过将教育视频生成分解为多个专业智能体协作,解决端到端模型在逻辑严谨性和知识表示方面的不足,实现低成本、高吞吐量的自动化教学视频生产。

Comments Accepted at ACM SIGKDD 2026 (KDD '26), Applied Data Science Track. 10 pages, 2 figures, 5 tables. The project is available at \url{https://robitsg.github.io/LASEV}

详情
AI中文摘要

尽管最近的端到端视频生成模型在视觉导向的内容创作中表现出令人印象深刻的性能,但在需要严格逻辑严谨性和精确知识表示的场景(如教学和教育媒体)中仍然受限。为解决此问题,我们提出LASEV,一种基于LLM的分层多智能体系统,用于从教育问题生成高质量教学视频。LASEV将教育视频生成表述为一个多目标任务,同时要求正确的逐步推理、教学连贯的叙述、语义忠实的视觉演示以及精确的视听对齐。为解决先前方法的局限性——包括低程序保真度、高生产成本和有限的可控性——LASEV将生成工作流分解为通过中央编排智能体协作的专业智能体,共享生产状态、显式质量门控和迭代批评机制。具体来说,编排智能体监督一个用于严格问题求解的求解智能体、一个生成可执行可视化代码的插图智能体,以及一个面向学习者的教学脚本的叙述智能体。此外,工作智能体的所有输出都经过语义批评、基于规则的约束和基于工具的编译检查。该系统不直接合成像素,而是构建一个结构化的可执行视频脚本,该脚本通过模板驱动的组装规则确定性编译为同步的视觉和叙述,实现无需手动编辑的全自动生产。在大规模部署中,LASEV实现了每天超过一百万视频的吞吐量,与当前行业标准方法相比成本降低超过95%,同时保持高接受率。

英文摘要

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LASEV, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LASEV formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio--visual alignment. To address the limitations of prior approaches--including low procedural fidelity, high production cost, and limited controllability--LASEV decomposes the generation workflow into specialized agents that collaborate through a central Orchestrating Agent, shared production state, explicit quality gates, and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization code, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated production without manual editing. In large-scale deployments, LASEV achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.

2602.11177 2026-06-02 cs.CL cs.AI 版本更新

What Do LLMs Know About Alzheimer's Disease? Multi-loss Fine-Tuning and Probing for AD Detection

LLMs 对阿尔茨海默病了解多少?用于 AD 检测的多损失微调和探针分析

Lei Jiang, Yue Zhou, Natalie Parde

发表机构 * University of Illinois Chicago(伊利诺伊大学香槟分校)

AI总结 本文通过多损失微调 BERT、T5 和 Llama-1B 模型,在三个语料库上实现文本 AD 检测新 SOTA,并利用线性探针分析内部表征中 AD 相关信息的编码。

详情
AI中文摘要

可靠的阿尔茨海默病(AD)早期检测具有挑战性,特别是由于标记数据的有限可用性。虽然大型语言模型(LLMs)在跨领域表现出强大的迁移能力,但通过监督微调将其适应 AD 领域仍 largely unexplored。在这项工作中,我们跨三个异构转录语料库(Pitt、CCC、ADRC)实证评估了各种模型架构,以研究它们在基于文本的 AD 检测中的有效性,并分析任务相关信息如何在其内部表征中编码。据我们所知,我们微调的 BERT 和 T5 模型在 Pitt 和 CCC 数据集上建立了新的最先进水平,同时在 ADRC 上取得了强劲性能。同时,仅解码器的 Llama-1B 在所有三个语料库上取得了与 BERT 和 T5 相当的高度竞争结果,突显了其在 AD 检测中的有效性。我们进一步对 Llama-1B 骨干网络进行了全面评估,分析了跨语料库可迁移性、最优输入块大小粒度以及临床转录标记的影响。此外,我们使用线性探针实证表明,微调以反映 AD 相关信号的方式改变了单个标记(语言标记和内容词)的表征。

英文摘要

Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to the limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across do mains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we empirically evaluate various model architectures across three heterogeneous transcript corpora (Pitt, CCC, ADRC) to investigate their effectiveness for text-based AD detection and analyze how task-relevant information is encoded within their internal representations. To the best of our knowledge, our fine-tuned BERT and T5 models establish a new state-of-the-art on the Pitt and CCC datasets, while achieving strong performance on ADRC. In parallel, the decoder-only Llama-1B achieves highly competitive results comparable to BERT and T5 across all three corpora, highlighting its effectiveness for AD detection. We further conduct a comprehensive evaluation of the Llama-1B backbone, analyzing cross-corpus transferability, optimal input chunk-size granularity, and the impact of clinical transcript markers. Also, we use linear probing to empirically show that fine-tuning shifts the representations of individual tokens, both linguistic markers and content words, in ways that reflect AD-related signal.

2602.08236 2026-06-02 cs.CV cs.AI cs.CL 版本更新

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

何时想象以及想象多少:基于世界模型的自适应测试时缩放用于视觉空间推理

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

发表机构 * University of North Carolina, Chapel Hill(北卡罗来纳大学教堂山分校) Nanyang Technological University(南洋理工大学)

AI总结 本文提出自适应测试时框架AVIC/AVIC-R,通过世界模型选择性调用和缩放视觉想象,在空间推理中平衡准确性与效率,超越GPT-4o等基线。

Comments the first two authors are equally contributed. Project page: https://adaptive-visual-tts.github.io/

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)取得了快速进展,但当正确答案取决于场景在未见或替代视角下的外观时,视觉空间推理仍然不可靠。最近的工作通过使用世界模型进行视觉想象来增强推理,但诸如想象何时真正必要、多少想象有益、以及何时想象有害等问题仍知之甚少。在实践中,无差别的想象可能会增加计算量,甚至通过引入误导性证据而降低性能。在这项工作中,我们深入分析了作为空间推理可控资源的测试时视觉想象。我们首先研究静态视觉证据何时足够,想象何时改进推理,以及过度或不必要的想象如何影响准确性和效率。为了支持这一分析,我们随后引入了AVIC,一个基于世界模型的自适应测试时框架,该框架在选择性调用和缩放视觉想象之前,明确推理当前视觉证据的充分性。最后,为了进一步学习这种门控和规划行为,而无需任何关于何时想象以及想象多少的标注,我们引入了AVIC-R,它通过来自QA正确性奖励和想象成本惩罚的GRPO来训练策略。在空间推理基准(SAT, MMSI)和具身导航基准(R2R)上,我们的结果揭示了想象至关重要、边际或有害的明确场景,并表明选择性控制可以匹配或超越固定想象策略,同时大幅减少世界模型调用和语言标记。我们的AVIC-R超越了包括GPT-4o和GPT-4.1在内的强大专有基线,同时调用世界模型的频率更低。总体而言,我们的发现强调了分析和控制测试时想象对于高效可靠的空间推理的重要性。

英文摘要

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We first study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we then introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Finally, to further learn this gating and planning behavior without any annotation of when and how much to imagine, we introduce AVIC-R, which trains the policy via GRPO from QA-correctness rewards and penalties by imagination cost. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Our AVIC-R surpasses strong proprietary baselines including GPT-4o and GPT-4.1 while invoking the world model less often. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.

2602.06065 2026-06-02 stat.ML cond-mat.dis-nn cs.CL cs.LG 版本更新

Deep networks learn to parse uniform-depth context-free languages from local statistics

深度网络从局部统计中学习解析均匀深度的上下文无关语言

Jack T. Parley, Francesco Cagnetta, Matthieu Wyart

发表机构 * GitHub

AI总结 通过引入可调类概率上下文无关文法并设计基于深度卷积网络的推理算法,揭示了语言结构从局部统计中涌现的机制,并验证了深度卷积和Transformer架构的预测。

Comments Accepted as regular paper at ICML 2026

详情
AI中文摘要

理解语言结构如何仅从句子中学习是认知科学和机器学习中的一个核心问题。大型语言模型(LLMs)内部表征的研究支持其在预测下一个词时解析文本的能力,同时独立于表面形式表示语义概念。然而,哪些数据统计使这些成就成为可能,以及需要多少数据,仍然在很大程度上未知。概率上下文无关文法(PCFGs)为研究这些问题提供了一个可处理的测试平台。然而,先前的工作要么侧重于训练网络使用的类解析算法的后验表征,要么侧重于具有固定语法(无需解析)的PCFGs的可学习性。在这里,我们(i)引入了一个可调的PCFGs类别,其中歧义程度和跨尺度的相关结构都可以被控制;(ii)提供了一种学习机制——一种受深度卷积网络结构启发的推理算法——将可学习性和样本复杂度与特定语言统计联系起来;(iii)在深度卷积和基于Transformer的架构上经验性地验证了我们的预测。总体而言,我们提出了一个统一框架,其中不同尺度的相关性消除了局部歧义,使数据的层次化表征得以涌现。

英文摘要

Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.

2511.21140 2026-06-02 cs.LG cs.CL stat.AP stat.ML 版本更新

How to Correctly Report LLM-as-a-Judge Evaluations

如何正确报告LLM作为评估者的评估结果

Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对LLM作为评估者时存在偏差的问题,提出一种插件式校正框架,实现无偏估计和统计原理的不确定性量化,并证明在分布偏移下仍保持无偏性。

详情
Journal ref
International Conference on Machine Learning (ICML) 2026
AI中文摘要

大型语言模型(LLMs)被广泛用作模型响应的可扩展评估者,以替代人工标注者。然而,LLM评估者的不完美灵敏度和特异性会导致朴素评估分数产生偏差。我们提出一个简单的插件式框架,可校正此偏差并实现统计原理的不确定性量化。我们的框架构建置信区间,该区间同时考虑来自测试数据集和人工标注校准数据集的不确定性。此外,它采用自适应策略分配校准样本以获得更紧的区间。重要的是,我们刻画了由真实评估分数和LLM评估者的灵敏度与特异性定义的参数区间,在这些区间内,基于LLM的评估比仅人工评估产生更可靠的估计。此外,我们证明,与现有方法相比,我们的框架在测试集和校准集之间存在分布偏移时仍保持无偏性。

英文摘要

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.

2502.16174 2026-06-02 cs.LG cs.AI cs.CL cs.CR 版本更新

Efficient LLM Moderation with Multi-Layer Latent Prototypes

基于多层潜在原型的高效LLM审核

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

发表机构 * University of Warsaw(华沙大学)

AI总结 提出多层原型审核器(MLPM),利用多层中间表示的原型实现轻量、高效且可定制的输入审核,在多个基准上达到最优性能,并可与输出审核结合提升响应安全性。

详情
AI中文摘要

尽管现代LLM在后训练过程中与人类价值观对齐,但在部署时仍需稳健的审核以防止有害输出。现有方法存在性能与效率的权衡,且难以定制以满足用户特定需求。针对这一差距,我们引入了多层原型审核器(MLPM),一种轻量级且高度可定制的输入审核工具。我们提出利用多层中间表示的原型来提高审核质量,同时保持高效率。通过设计,我们的方法对生成流水线的开销可忽略不计,并可无缝应用于任何模型。MLPM在多种审核基准上实现了最先进的性能,并在不同大小的模型系列中表现出强大的可扩展性。此外,我们展示了它能平滑集成到端到端审核流水线中,并在与输出审核技术结合时进一步提高响应安全性。总体而言,我们的工作为安全、稳健且高效的LLM部署提供了一种实用且可适应的解决方案。

英文摘要

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.

2511.16886 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Latent Reasoning in TRMs is Secretly a Policy Improvement Operator

TRMs中的潜在推理实际上是策略改进算子

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

发表机构 * Arip Asadulaev Rayan Banerjee Fakhri Karray Martin Takac

AI总结 本文通过将潜在递归推理形式化为策略改进算法,解释了递归步骤何时有效提升性能,并提出结合强化学习和扩散方法的训练方案,在Tiny Recursive Model上实现18倍前向传递减少且保持性能。

详情
AI中文摘要

最近,具有潜在递归的小模型在复杂推理任务上取得了有希望的结果。这些结果通常由这样的理论解释:这种递归增加了网络的深度,使其能够紧凑地模拟更大模型的能力。然而,递归添加层的性能仍然落后于具有相同前馈深度的单次通过模型。这意味着在循环版本中,并非每个递归步骤都有效地贡献于深度。这提出了一个问题:潜在推理何时以及为何能提高性能,何时会导致无效计算?在我们的工作中,我们证明了潜在递归推理为这个问题提供了答案。我们展示了潜在递归推理可以形式化为策略改进算法。基于这些见解,我们提出使用强化学习和扩散方法的训练方案用于潜在推理模型。以Tiny Recursive Model作为测试平台,我们展示了通过我们的修改,可以避免无效计算步骤,并将前向传递总数减少18倍,同时保持性能。总的来说,我们展示了递归步骤的策略改进视角如何解释模型行为,并为进一步改进提供见解。

英文摘要

Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a networks depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one pass models with the same feed-forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we demonstrate that latent recursive reasoning provides answer to this question. We show that latent recursive reasoning can be formalized as a policy improvement algorithm. Building on these insights, we propose to use a training schemes from reinforcement learning and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18x while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.

2602.03318 2026-06-02 cs.CL 版本更新

MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

MIRROR: 一种用于运筹学优化建模的具有迭代自适应修正与分层检索的多智能体框架

Yifan Shi, Jiayi Wang, Minyi Wu, Ye Fan, Jialong Shi, Jianyong Sun

发表机构 * Xi’an Jiaotong University(西安交通大学) Northwestern Polytechnical University(西北工业大学)

AI总结 提出一种免微调的多智能体框架MIRROR,通过执行驱动的迭代自适应修正和分层检索机制,将自然语言优化问题直接转化为数学模型和求解器代码,在标准运筹学基准上优于现有方法。

详情
AI中文摘要

运筹学依赖于专家驱动的建模——这是一个缓慢且脆弱的过程,难以适应新场景。虽然大语言模型可以自动将自然语言转化为优化模型,但现有方法要么依赖昂贵的后训练,要么采用多智能体框架,然而大多数方法仍缺乏可靠的协作纠错和任务特定检索,常常导致错误输出。我们提出MIRROR,一个免微调的端到端多智能体框架,直接将自然语言优化问题转化为数学模型和求解器代码。MIRROR集成了两个核心机制:(1) 执行驱动的迭代自适应修正,用于自动纠错;(2) 分层检索,从精心策划的示例库中获取相关的建模和编码示例。实验表明,MIRROR在标准运筹学基准上优于现有方法,在复杂工业数据集如IndustryOR和Mamo-ComplexLP上取得了显著结果。通过将精确的外部知识注入与系统纠错相结合,MIRROR为非专家用户提供了高效可靠的运筹学建模解决方案,克服了通用大语言模型在专家优化任务中的根本局限性。

英文摘要

Operations Research (OR) relies on expert-driven modeling-a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post-training or employ multi-agent frameworks, yet most still lack reliable collaborative error correction and task-specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine-tuning-free, end-to-end multi-agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo-ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general-purpose LLMs in expert optimization tasks.

2509.23782 2026-06-02 cs.CL 版本更新

Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

弥合大语言模型在多项选择题中的知识-预测差距

Yoonah Park, Haesung Pyun, Yohan Jo

发表机构 * KAIST(韩国科学技术院)

AI总结 通过分析隐藏表示中的知识子空间和预测子空间,提出轻量级推理时干预方法KAPPA来对齐两者,从而减少LLM在多项选择题上的知识-预测差距。

Comments Accepted to ICML 2026

详情
AI中文摘要

尽管大语言模型(LLM)在多种任务上表现强劲,但其可信度受到不忠实于内部知识的异常行为的限制。特别是,LLM在多项选择题(MCQ)上经常失败,即使它们在隐藏表示中编码了正确答案,这揭示了内部知识与输出行为之间的错位。我们通过三步分析隐藏表示来研究和缓解MCQ上的这种知识-预测差距。首先,我们量化了跨模型和数据集的差距的普遍性和幅度。其次,我们通过在残差流中识别不同的知识和预测子空间来提供几何解释。第三,我们引入了KAPPA,一种轻量级的推理时干预方法,用于对齐残差流中的两个子空间以减少知识-预测差距。我们的结果为LLM中的知识-预测差距提供了几何且可解释的解释。此外,KAPPA有效地减少了跨多种MCQ基准和模型的差距,并泛化到自由形式设置中。

英文摘要

While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questions (MCQs) even if they encode correct answers in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on MCQs through a three-step analysis of hidden representations. First, we quantify the prevalence and magnitude of the gap across models and datasets. Second, we provide a geometric interpretation by identifying distinct knowledge and prediction subspaces in the residual stream. Third, we introduce KAPPA, a lightweight inference-time intervention that aligns the two subspaces within the residual stream to reduce the knowledge-prediction gap. Our results provide a geometric and interpretable explanation of the knowledge-prediction gap in LLMs. Furthermore, KAPPA effectively reduces the gap across diverse MCQ benchmarks and models, and generalizes to free-form settings.

2501.18649 2026-06-02 cs.CL cs.AI cs.IR cs.LG 版本更新

Fake News Detection After LLM Laundering: Measurement and Explanation

LLM清洗后的假新闻检测:测量与解释

Rupak Kumar Das, Jonathan Dodge

发表机构 * College of IST Pennsylvania State University(宾夕法尼亚州立大学信息科学与技术学院)

AI总结 研究测量检测器在识别LLM改写假新闻时的有效性,发现检测器难以检测LLM改写的假新闻,并通过LIME解释发现情感偏移是检测失败的原因之一。

详情
AI中文摘要

凭借其先进的能力,大型语言模型(LLM)可以生成高度令人信服且上下文相关的假新闻,这可能有助于传播错误信息。尽管针对人类撰写文本的假新闻检测已有大量研究,但检测LLM生成的假新闻这一领域仍探索不足。本研究测量了检测器在识别LLM改写的假新闻方面的有效性,特别是确定在检测流程中添加改写步骤是有助于还是阻碍检测。本研究贡献如下:(1)检测器在检测LLM改写的假新闻时比检测人类撰写文本更困难;(2)我们发现了哪些模型在哪些任务(逃避检测、通过改写逃避检测以及为语义相似性进行改写)上表现出色;(3)通过LIME解释,我们发现了检测失败的一个可能原因:情感偏移;(4)我们发现了一个关于改写质量测量的令人担忧的趋势:尽管BERTSCORE很高,但样本仍表现出情感偏移;(5)我们提供了一对数据集,用改写输出和分数扩充了现有数据集。该数据集可在GitHub上获取。

英文摘要

With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study contributes: (1) Detectors struggle to detect LLM-paraphrased fake news more than human-written text, (2) We find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity). (3) Via LIME explanations, we discovered a possible reason for detection failures: sentiment shift. (4) We discover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE. (5) We provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub

2602.03719 2026-06-02 cs.CL 版本更新

BranPO: Scalable Contrastive Branch Sampling for Long-Horizon Agentic Reinforcement Learning

BranPO:面向长程智能体强化学习的可扩展对比分支采样

Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) Nanyang Technological University(南洋理工大学)

AI总结 针对长程智能体强化学习中轨迹级奖励稀疏且信用分配困难的问题,提出一种无值函数方法BranPO,通过截断轨迹并重采样延续构建对比分支,实现局部对比监督,无需密集奖励。

Comments 26 pages, 5 figures

详情
AI中文摘要

智能体强化学习使大型语言模型能够执行多轮规划和使用工具,但在稀疏轨迹级奖励下,长程训练仍然具有挑战性,其中单一结果被统一分配给所有决策。先前的方法通过基于树的探索或过程级评估引入更细粒度的监督,但往往成本高昂或产生噪声信用信号。在智能体轨迹中,早期错误可能被后续动作纠正,而看似有希望的中间状态可能因后续决策不佳而失败。我们将此属性称为非单调正确性,这使得结果奖励或状态值不足以指导从每个状态应采取什么动作。为了解决这个问题,我们提出了分支相对策略优化(BranPO),一种无值函数方法,无需密集奖励即可构建局部对比监督。BranPO在中间前缀处截断轨迹,并重采样延续以形成对比分支,这些分支共享相同的前缀但在最终结果上不同,从而隔离导致成功或失败的决策。我们进一步引入了难度感知分支采样和冗余步骤掩码,以提高采样效率并抑制冗余更新。实验表明,BranPO在多个多跳问答基准测试中一致优于各种基线类别,且无需额外训练成本,并泛化到更广泛的长程智能体任务中,实现持续改进。我们的代码可在https://github.com/YubaoZhao/BranPO获取。

英文摘要

Agentic reinforcement learning enables large language models to perform multi-turn planning and tool use, but long-horizon training remains challenging under sparse trajectory-level rewards, where a single outcome is uniformly assigned to all decisions. Prior methods introduce finer-grained supervision via tree-based exploration or process-level evaluation, but often incur high cost or produce noisy credit signals. In agentic trajectories, early mistakes may still be corrected by later actions, while seemingly promising intermediate states can fail due to poor subsequent decisions. We call this property non-monotonic correctness, which makes outcome rewards or state values insufficient for guiding what actions should be taken from each state. To address this, we propose Branching Relative Policy Optimization (\textbf{BranPO}), a value-free method that constructs localized contrastive supervision without dense rewards. BranPO truncates trajectories at intermediate prefixes and resamples continuations to form contrastive branches that share the same prefix but diverge in final outcomes, thereby isolating decisions that drive success or failure. We further introduce difficulty-aware branch sampling and Redundant Step Masking to improve sampling efficiency and suppress redundant updates. Experiments show that BranPO consistently outperforms diverse baseline categories across multiple multi-hop QA benchmarks without additional training cost, and generalizes to broader long-horizon agentic tasks with consistent improvements. Our code is available at https://github.com/YubaoZhao/BranPO.

2602.03619 2026-06-02 cs.CL 版本更新

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

从人类偏好中学习查询特定评分标准用于DeepResearch报告生成

Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Shihan Dou, Zisu Huang, Muzhao Tian, Xiaohua Wang, Yang Liu, Pluto Zhou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou

发表机构 * Tencent(腾讯) Fudan University(复旦大学) Tsinghua University(清华大学)

AI总结 提出一种通过强化学习训练查询特定评分标准生成器的流水线,以解决DeepResearch长报告生成中缺乏可验证奖励信号的问题,并在人类偏好测试和下游任务中取得显著性能提升。

详情
AI中文摘要

如今,开发可靠的DeepResearch式长文本报告生成仍然具有挑战性,因为训练和评估缺乏可验证的奖励信号。因此,基于评分标准的评估已成为常见做法。然而,现有方法要么依赖于粗粒度、预定义的评分标准,缺乏足够的细粒度,要么依赖于人工构建的查询特定评分标准,成本高昂且难以扩展。在本文中,我们提出了一种流水线,用于训练偏好基础的查询特定评分标准生成器,专门用于DeepResearch报告生成。我们首先构建了一个DeepResearch式查询数据集,其中包含成对报告上的人类偏好注释,并通过强化学习训练评分标准生成器,使用结合偏好一致性、格式有效性和基于LLM的评分标准评估的混合奖励。我们在两个阶段评估生成的评分标准生成器。首先,在保留的人类偏好测试集上,学习到的评分标准比通用的、提示的或SFT训练的评分标准更有效地区分偏好报告和拒绝报告。其次,当用作训练DeepResearch系统的奖励信号时,我们的评分标准生成器在简单的单智能体ReAct框架和复杂的多智能体工作流上,在DeepResearch Bench上均取得了显著的性能提升。

英文摘要

Nowadays, developing reliable DeepResearch-style long-form report generation remains challenging, as training and evaluation lack verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train preference-grounded query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining preference consistency, format validity, and LLM-based rubric evaluation. We evaluate the resulting rubric generators in two stages. First, on a held-out human-preference test set, the learned rubrics discriminate preferred from rejected reports more effectively than generic, prompted, or SFT-trained rubric alternatives. Second, when used as reward signals to train DeepResearch systems, our rubric generators yield substantial performance gains under both a simple single-agent ReAct framework and a complex multi-agent workflow on the DeepResearch Bench.

2602.03554 2026-06-02 cs.LG cs.AI cs.CE cs.CL 版本更新

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

当单一答案不够时:重新思考面向大语言模型的单步逆合成基准

Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Mathieu Reymond, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov

发表机构 * DeepMind, London, UK(伦敦英国深度思维公司)

AI总结 针对现有逆合成基准依赖单一真实答案的局限,提出基于化学合理性度量ChemCensor的新评估框架,并构建数据集CREED训练模型以提升性能。

详情
AI中文摘要

最近的进展扩展了大语言模型(LLMs)在药物发现中的应用,包括合成规划。然而,逆合成性能的客观评估仍然有限。现有的基准和指标通常依赖于已发表的合成程序以及基于单一真实答案的Top-K准确率,这未能捕捉真实世界合成规划的开放性。我们提出一个新的单步逆合成基准框架,使用ChemCensor(一种化学合理性的新度量)来评估通用型和化学专用型LLMs。通过强调合理性而非精确匹配,该方法更符合人类合成规划实践。我们还引入了CREED,一个包含数百万经ChemCensor验证的反应记录的新数据集,用于LLM训练,并使用它训练了一个在该基准下优于LLM基线的模型。

英文摘要

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

2602.03203 2026-06-02 cs.CL cs.LG 版本更新

ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

ForesightKV: 通过学习长期贡献优化推理模型的KV缓存驱逐

Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao

发表机构 * Zhejiang University(浙江大学)

AI总结 提出ForesightKV框架,通过监督学习和强化学习预测KV对在长文本生成中的重要性,在仅用一半缓存预算下优于现有方法。

Comments ICML 2026

详情
AI中文摘要

近期,大型语言模型通过生成长推理轨迹展现出卓越的推理能力。然而,随着序列长度增长,键值缓存线性扩展,导致显著的内存和计算成本。现有的KV缓存驱逐方法通过丢弃不重要的KV对来缓解此问题,但往往难以捕捉复杂的KV依赖关系,导致性能下降。为了更好地平衡效率与性能,我们引入了ForesightKV,一种基于训练的KV缓存驱逐框架,学习在长文本生成过程中预测哪些KV对应该被驱逐。我们首先设计了Golden Eviction算法,该算法利用未来注意力分数在每一步识别最优的驱逐KV对。然后通过监督训练,使用成对排序损失将这些轨迹和每一步的分数进行蒸馏。此外,我们将缓存驱逐建模为马尔可夫决策过程,并应用GRPO算法来缓解低熵令牌上语言模型损失显著增加的问题。在三个推理模型的AIME2024和AIME2025基准测试上的实验表明,ForesightKV在仅用一半缓存预算的情况下始终优于先前方法,同时从监督学习和强化学习方法中协同受益。代码可在https://github.com/RUCAIBox/ForesightKV获取。

英文摘要

Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches. Code is available at https://github.com/RUCAIBox/ForesightKV.

2602.02823 2026-06-02 cs.CL 版本更新

R2-Router: A New Paradigm for LLM Routing with Reasoning

R2-Router: 一种基于推理的LLM路由新范式

Jiaqi Xue, Qian Lou, Jiarong Xing, Heng Huang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有LLM路由忽略输出长度对质量和成本影响的问题,提出R2-Router,通过将输出长度预算作为可控变量联合选择最优LLM和预算,实现低成本高质量路由。

Comments Accepted by ICML 2026

详情
AI中文摘要

随着LLM在能力和成本上的多样化,LLM路由通过学习预测每个LLM对给定查询的质量和成本,然后选择高质量低成本的LLM。然而,现有路由器隐式假设每个LLM对每个查询有固定的质量和成本,忽略了同一LLM的质量随输出长度变化的事实。这导致当估计成本超过预算时,路由器会排除强大的LLM,错过了这些LLM通过缩短输出长度以降低成本仍能提供高质量的机会。为了解决这个问题,我们引入了R2-Router,它将输出长度预算视为可控变量,并联合选择最佳LLM和长度预算,通过长度约束指令强制执行预算。这使得R2-Router能够发现,在先前方法无法看到的成本效益相当的情况下,受限输出的强大LLM可以超越较弱的LLM。结合路由器框架,我们构建了R2-Bench,这是第一个捕捉LLM在不同输出长度预算下行为的路由数据集。实验表明,与现有路由器相比,R2-Router以4-5倍更低的成本实现了最先进的性能。这项工作开辟了一个新方向:路由即推理,其中路由器从反应式选择器演变为深思熟虑的推理器,探索使用哪个LLM以及以何种成本预算。代码公开于https://github.com/UCF-ML-Research/R2-Router。

英文摘要

As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM's quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM's quality varies with its output length. This causes routers to exclude powerful LLMs when their estimated cost exceeds the budget, missing the opportunity that these LLMs could still deliver high quality at reduced cost with shorter outputs. To address this, we introduce R2-Router, which treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. This enables R2-Router to discover that a powerful LLM with constrained output can outperform a weaker LLM at comparable cost-efficient configurations invisible to prior methods. Together with the router framework, we construct R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets. Experiments show that R2-Router achieves state-of-the-art performance at 4-5\times lower cost compared with existing routers. This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget. The code is publicly available at https://github.com/UCF-ML-Research/R2-Router.

2602.00742 2026-06-02 cs.CL 版本更新

CURP: Codebook-based Continuous User Representation for Personalized Generation with LLMs

CURP: 基于码本的连续用户表示用于大语言模型的个性化生成

Liang Wang, Xinyi Mou, Xiaoyou Liu, Xuanjing Huang, Zhongyu Wei

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院) Shanghai Innovation Institute(上海创新研究院) School of Computer Science, Fudan University(复旦大学计算机科学学院)

AI总结 提出CURP框架,通过双向用户编码器和离散原型码本提取多维用户特征,实现少量可训练参数的即插即用个性化生成,在变体生成任务上优于强基线。

详情
AI中文摘要

用户建模通过个体的偏好和行为模式来表征用户,从而在当代方法中实现与大语言模型(LLMs)的个性化模拟和生成。然而,现有方法(无论是基于提示的方法还是基于训练的方法)在平衡个性化质量与计算和数据效率方面面临挑战。我们提出了一个新颖的框架CURP,它采用双向用户编码器和离散原型码本来提取多维用户特征。这种设计使得即插即用的个性化成为可能,且只需少量可训练参数(约2000万参数,约占模型总大小的0.2%)。通过在变体生成任务上的大量实验,我们表明CURP相比强基线实现了更优的性能和泛化能力,同时提供了更好的可解释性和可扩展性。代码可在https://github.com/RaidonWong/CURP_code获取。

英文摘要

User modeling characterizes individuals through their preferences and behavioral patterns to enable personalized simulation and generation with Large Language Models (LLMs) in contemporary approaches. However, existing methods, whether prompt-based or training-based methods, face challenges in balancing personalization quality against computational and data efficiency. We propose a novel framework CURP, which employs a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits. This design enables plug-and-play personalization with a small number of trainable parameters (about 20M parameters, about 0.2\% of the total model size). Through extensive experiments on variant generation tasks, we show that CURP achieves superior performance and generalization compared to strong baselines, while offering better interpretability and scalability. The code are available at https://github.com/RaidonWong/CURP_code

2507.08038 2026-06-02 cs.CL cs.AI 版本更新

AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

AblationBench:评估实证AI研究中消融实验的自动规划

Talor Abramovich, Gal Chechik

发表机构 * Google Research(谷歌研究)

AI总结 提出AblationBench基准套件,包含作者消融和审稿人消融两个任务,用于评估语言模型在AI研究中规划消融实验的能力,实验表明当前最佳模型仅能识别45%的原始消融,低于人类水平。

Comments AI4Science Workshop, ICML 2026; Project page: https://ablation-bench.github.io/

详情
AI中文摘要

语言模型代理越来越多地被用于自动化科学研究,然而评估其科学贡献仍然是一个挑战。获得此类见解的关键机制是通过消融实验。为此,我们引入了AblationBench,这是一个用于评估代理在实证AI研究中进行消融规划任务的基准套件。它包括两个任务:AuthorAblation,帮助作者基于方法部分提出消融实验,包含83个实例;以及ReviewerAblation,帮助审稿人发现完整论文中缺失的消融,包含350个实例。对于这两个任务,我们开发了基于LM的评判器,作为自动评估框架。我们对前沿LM的实验表明,这些任务仍然具有挑战性,性能最佳的LM系统平均仅能识别45%的原始消融,低于人类水平。我们观察到作者任务和审稿人任务之间存在相反的性能趋势,这归因于模型基础的不同。最后,我们分析了当前LM在这些任务上的局限性,并发现思维链提示优于基于代理的方法。我们的数据可在https://huggingface.co/collections/ai-coscientist/ablationbench获取,代码可在https://github.com/ai-scientist-bench/ablation-bench获取。

英文摘要

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 45% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available on https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available on https://github.com/ai-scientist-bench/ablation-bench .

2601.22947 2026-06-02 cs.CL cs.LG 版本更新

Reconsidering Positional Supervision in Masked Diffusion Language Model Training

重新审视掩码扩散语言模型训练中的位置监督

Mengyu Ye, Keito Kudo, Ryosuke Takahashi, Jun Suzuki

发表机构 * Tohoku University(东大大学) RIKEN(理化学研究所) NII LLMC(国家信息研究所LLMC)

AI总结 针对掩码扩散语言模型对位置偏移敏感的问题,提出基于连接主义时序分类(CTC)的目标函数,通过引入松弛令牌和更新折叠映射来吸收位置不确定性,从而在开放生成基准上取得一致提升。

Comments preprint, WIP

详情
AI中文摘要

掩码扩散语言模型(MDLM)通过并行去掩码生成文本,最近成为自回归语言模型的替代方案。它们可以被视为使用位置交叉熵(CE)损失训练的并行解码器,与非自回归翻译(NAT)设置相同。在NAT中,CE训练的并行解码器被认为对小的位置偏移敏感,因为CE会严厉惩罚它们。我们询问CE训练的MDLM在迭代解码下是否同样对此类偏移敏感。为了探究这一点,我们应用了一种受控干预,在解码过程中引入这些偏移。在LLaDA-8B-Instruct和Arena-Hard上,仅将1%的生成令牌移动一个位置,就显著降低了相对于未干预模型的胜率,表明MDLM在迭代并行解码下对此类小偏移敏感。受此启发,我们将连接主义时序分类(CTC)(一种已知能缓解该问题的对齐灵活目标)适配到MDLM监督微调中。通过放宽CE施加的严格位置匹配,CTC为损失提供了吸收小位置偏移的空间;具体地,我们修改了CTC目标,使用一个特殊的<slack>令牌来吸收目标令牌与输出位置之间的位置不确定性,并更新了折叠映射以保留目标表面形式。在四个开放生成基准上,所得模型在原始模型和匹配的交叉熵训练基线上均有一致改进,且在所有四个基准上具有统计显著性。这些结果表明,训练侧的对齐灵活性是MDLM SFT的一个有用设计维度,与先前工作中探索的推理时方法互补。

英文摘要

Masked diffusion language models (MDLMs) generate text by unmasking tokens in parallel and have recently emerged as alternatives to autoregressive language models. They can be viewed as parallel decoders trained with a position-wise cross-entropy (CE) loss, the same setup as non-autoregressive translation (NAT). In NAT, CE-trained parallel decoders have been argued to be sensitive to small positional shifts, since CE penalizes them harshly. We ask whether CE-trained MDLMs are similarly sensitive to such shifts under iterative decoding. To probe this, we apply a controlled intervention that introduces them during decoding. On LLaDA-8B-Instruct with Arena-Hard, displacing as little as 1% of generated tokens by one position substantially reduces win rates against the unintervened model, showing that MDLMs are sensitive to such small shifts under iterative parallel decoding. Motivated by this, we adapt connectionist temporal classification (CTC), an alignment-flexible objective known to mitigate it there, to MDLM supervised fine-tuning. By relaxing the strict position-wise match that CE imposes, CTC gives the loss room to absorb small positional shifts; concretely, we modified CTC objective to use a special <slack> token that absorbs positional uncertainty between target tokens and output positions, and a updated collapse map that preserves target surface forms. Across four open-ended generation benchmarks, the resulting model consistently improves over both the original model and a matched cross-entropy-trained baseline, with statistically significant gains on all four. These results identify training-side alignment flexibility as a useful design dimension for MDLM SFT, complementary to the inference-time approaches explored in prior work.

2512.00986 2026-06-02 cs.CL 版本更新

ADRA-Bank: A Modular Benchmark for Academic Deep Research Agents

ADRA-Bank:面向学术深度研究智能体的模块化基准

Zhihan Guo, Feiyang Xu, Yifan Li, Muzhi Li, Shuai Zou, Jiele Wu, Han Shi, Haoli Bai, Ho-fung Leung, Irwin King

发表机构 * The Chinese University of Hong Kong, Hong Kong SAR, China(香港中文大学) Hong Kong Polytechnic University, Hong Kong SAR, China(香港理工大学) National University of Singapore, Singapore, Singapore(新加坡国立大学) Huawei Technologies, Hong Kong SAR, China(华为技术有限公司) Independent(独立)

AI总结 针对现有基准侧重检索而忽视规划与推理、且缺乏学术领域覆盖的问题,提出ADRA-Bank模块化基准,包含10个学术领域的200个人工标注实例,并设计ADRA-Eval评估范式,通过端到端和隔离评估两种模式测试规划、检索和推理能力,揭示智能体在跨源检索和跨领域一致性上的不足,并指出提升高层规划能力是释放基础LLM推理潜力的关键。

详情
AI中文摘要

学术出版物激增催生了自动化深度研究(DR)系统的需求,但准确评估这些系统仍是一个开放问题。首先,现有基准通常狭隘地聚焦于检索,而忽视了高层规划与推理。其次,现有基准偏向通用领域,而非作为DR智能体核心应用的学术领域。为弥补这些不足,我们引入了ADRA-Bank,一个面向学术DR智能体的模块化基准。该基准基于学术文献,是一个包含10个学术领域(涵盖研究论文和综述论文)200个人工标注实例的数据集。此外,我们提出了一个模块化的学术DR智能体评估范式(ADRA-Eval),利用学术论文的丰富结构来评估规划、检索和推理的核心能力。它采用两种互补模式:针对任务智能体的端到端评估,以及针对作为潜在基础模型的基础LLM的隔离评估。结果揭示了不均衡的能力:虽然智能体表现出专业优势,但在多源检索和跨领域一致性上存在困难。此外,提升高层规划能力是释放作为基础模型的基础LLM推理潜力的关键因素。通过暴露这些可操作的失败模式,ADRA-Bank提供了一个诊断工具,以指导更可靠的自动化学术研究助手的开发。

英文摘要

A surge in academic publications calls for automated deep research (DR) systems, but accurately evaluating them is still an open problem. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the academic domains that are the core application for DR agents. To address these gaps, we introduce ADRA-Bank, a modular benchmark for Academic DR Agents. Grounded in academic literature, our benchmark is a human-annotated dataset of 200 instances across 10 academic domains, including both research and review papers. Furthermore, we propose a modular Evaluation Paradigm for Academic DR Agents (ADRA-Eval), which leverages the rich structure of academic papers to assess the core capabilities of planning, retrieval, and reasoning. It employs two complementary modes: an end-to-end evaluation for \task agents and an isolated evaluation for foundational LLMs as potential backbones. Results reveal uneven capabilities: while agents show specialized strengths, they struggle with multi-source retrieval and cross-field consistency. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, ADRA-Bank provides a diagnostic tool to guide the development of more reliable automatic academic research assistants.

2501.13428 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Softplus注意力与重加权提升大语言模型的长度外推能力

Bo Gao, Michael W. Spratling, Letizia Gionfrida

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种两阶段注意力机制,用Softplus和l1归一化替代Softmax,并引入基于不变熵的动态缩放因子和重加权机制,以提升数值稳定性、缓解注意力下沉现象,并显著改善长度外推性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

近年来,大语言模型取得了显著成功,这主要归功于自注意力机制。然而,传统的Softmax注意力存在数值不稳定性,并且随着推理令牌数量的增加,性能会下降。本文通过提出一种新的注意力设计原则来解决这些问题,将注意力视为一个两阶段过程。第一阶段(归一化)通过用数值更稳定的Softplus后接$l_{1}$归一化替代Softmax来改进标准注意力。此外,我们引入了一个基于不变熵的动态缩放因子。我们证明,这种新颖的注意力机制优于传统的Softmax注意力和最先进的非Softmax替代方案。我们的第二个提议是引入第二阶段处理(锐化),该阶段由一个重加权机制组成,该机制放大重要的注意力权重,同时削弱较弱的权重。这使得模型能够更有效地聚焦于相关令牌,缓解注意力下沉现象,并从根本上改善长度外推。这种新颖的两阶段自注意力替代方案被证明能确保数值稳定性,并显著提升长度外推能力,在训练长度的16倍时保持几乎恒定的验证损失,同时在具有挑战性的长上下文检索任务和下游基准测试中取得优异结果。此外,符号回归实验表明,我们的方法使模型能够从轨道轨迹序列中恢复牛顿万有引力定律,这为适当的注意力机制对于基础模型发展真正的物理世界模型至关重要提供了证据。我们的代码可在 https://github.com/iminfine/freeattn 获取。

英文摘要

Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $l_{1}$-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms conventional Softmax attention, and state-of-the-art Softmax-free alternatives. Our second proposal is to introduce a second processing stage (sharpening) which consists of a re-weighting mechanism that amplifies significant attentional weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigating the attention sink phenomenon, and fundamentally improving length extrapolation. This novel, two-stage, replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and downstream benchmarks. Furthermore, symbolic regression experiments demonstrate that our method enables models to recover Newton's gravitational law from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial for foundation models to develop genuine physical world models. Our code is available at https://github.com/iminfine/freeattn.

2601.21579 2026-06-02 cs.CL cs.LG 版本更新

KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

KromHC: 基于Kronecker积残差矩阵的流形约束超连接

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Danilo Mandic

发表机构 * University of Technology Sydney(悉尼科技大学)

AI总结 针对超连接中的训练不稳定和参数爆炸问题,提出KromHC方法,利用Kronecker积分解小规模双随机矩阵来参数化残差矩阵,在保证精确双随机性的同时将参数复杂度降至O(n^2C)。

详情
AI中文摘要

超连接(HC)在神经网络中的成功也凸显了训练不稳定和可扩展性受限的问题。流形约束超连接(mHC)通过将残差连接空间投影到Birkhoff多面体上缓解了这些挑战,但它面临两个问题:1)其迭代Sinkhorn-Knopp(SK)算法并不总是产生精确的双随机残差矩阵;2)mHC的参数复杂度为O(n^3C),其中n是残差流的宽度,C是特征维度。最近提出的mHC-lite通过Birkhoff-von-Neumann定理重新参数化残差矩阵以保证双随机性,但其参数复杂度面临阶乘爆炸,即O(nC·n!)。为了解决这两个挑战,我们提出KromHC,它使用较小双随机矩阵的Kronecker积来参数化mHC中的残差矩阵。通过沿张量化残差流的每个模式对因子残差矩阵施加流形约束,KromHC保证了残差矩阵的精确双随机性,同时将参数复杂度降低到仅O(n^2C)。实验表明,KromHC匹配甚至超越了其他最先进的mHC变体,同时需要显著更少的可训练参数。代码见https://github.com/wz1119/KromHC。

英文摘要

The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exactly doubly stochastic residual matrices; 2) mHC incurs a prohibitive $O(n^3C)$ parameter complexity with $n$ as the width of the residual stream and $C$ as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, $O \left( nC \cdot n! \right)$. To address both challenges, we propose KromHC, which uses the Kronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in mHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to only $O(n^2C)$. Experiments show that KromHC matches or even outperforms other state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is at https://github.com/wz1119/KromHC.

2601.21444 2026-06-02 cs.CV cs.AI cs.CL 版本更新

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

APB-V: 通过序列并行感知的近似注意力加速长视频理解

Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu

发表机构 * NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China(清华大学北京校区自然语言处理组、国防科技大学、人工智能研究院、北京理工大学、清华大学) Department of CS&T, Central South University, Changsha, China(中南大学计算机与技术系,长沙,中国) BUPT, Beijing, China(北京邮电大学,北京,中国) Pattern Recognition Center, WeChat AI, Tencent Inc.(腾讯公司微信人工智能研究院)

AI总结 提出APB-V,一种序列并行框架,通过分布式近似注意力在多GPU上加速长视频推理,显著提升速度且不损失性能。

Comments ACL 2026 main

详情
AI中文摘要

长视频推理的效率仍然是一个关键瓶颈,主要由于大型多模态模型(LMMs)预填充阶段的密集计算。现有方法要么压缩视觉嵌入,要么在单个GPU上应用稀疏注意力,导致加速有限或性能下降,并限制了LMMs处理更长、更复杂视频的能力。为了克服这些问题,我们提出了APB-V,一种具有优化注意力的序列并行框架,可在多个GPU上加速长视频推理。通过分布近似注意力,APB-V减少了计算量并增加了并行性,使得无需压缩即可高效处理更多视觉嵌入,从而提升任务性能。系统级优化,如负载均衡和融合前向传递,进一步释放了APB-V的潜力,相较于FlashAttn、ZigZagRing和APB,分别实现了12.72倍、1.70倍和1.18倍的加速,且没有明显的性能损失。代码可在https://github.com/thunlp/APB获取。

英文摘要

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose APB-V, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, APB-V reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of APB-V, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB

2601.21237 2026-06-02 cs.DS cs.CL cs.LG 版本更新

Characterizing the Effect of Noise in Language Generation in the Limit

极限情况下语言生成中噪声影响的刻画

Aaron Li, Ian Zhang

发表机构 * Harvard University(哈佛大学) Duke University(杜克大学)

AI总结 本文在极限语言生成模型中,通过分析噪声字符串对生成能力的影响,证明了单个噪声字符串严格减少可生成集合族,且有限噪声等价于单个噪声,并首次刻画了非均匀噪声依赖的可生成性。

Comments ICML 2026

详情
AI中文摘要

Kleinberg 和 Mullainathan 最近提出了一个用于研究语言生成现象的正式框架,称为极限语言生成。在该模型中,对手从未知目标语言中给出示例字符串的枚举,算法需要在有限时间内正确生成目标语言中未见过的字符串。Li、Raman 和 Tewari(2025)后来引入了非均匀和均匀生成的细化概念,Raman 和 Raman(2025)引入了噪声模型,允许对手插入无关字符串。噪声模型中的一个自然问题是通过研究每个额外无关字符串的影响来量化噪声效应。我们在此设置中展示了两个互补的结果。首先,我们证明对于均匀和非均匀生成,单个噪声字符串严格减少了可生成的集合族,从而回答了 Raman 和 Raman(2025)中的一个开放问题。然后,我们证明对于均匀和非均匀生成,单个噪声字符串的生成等价于任何有限噪声量的生成,这与 Bai、Panigrahi 和 Zhang(2026)展示的极限噪声生成的严格层次结构形成鲜明对比。最后,我们利用先前的结果首次提供了非均匀噪声依赖可生成性的刻画。

英文摘要

Kleinberg and Mullainathan recently proposed a formal framework for studying the phenomenon of language generation, called language generation in the limit. In this model, an adversary gives an enumeration of example strings from an unknown target language, and the algorithm is tasked with correctly generating unseen strings from the target language within finite time. Refined notions of non-uniform and uniform generation were later introduced by Li, Raman, and Tewari (2025), and a noisy model was introduced by Raman and Raman (2025), which allows the adversary to insert extraneous strings. A natural question in the noisy model is to quantify the effect of noise, by studying the impact of each additional extraneous string. We show two complementary results in this setting. We first show that for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated, thus answering an open question in Raman and Raman (2025). Then, we show for both uniform and non-uniform generation that generation with a single noisy string is equivalent to generation with any finite amount of noise, sharply contrasting with the strict hierarchy for noisy generation in the limit shown by Bai, Panigrahi, and Zhang (2026). Finally, we leverage our previous results to provide the first known characterization for non-uniform noise-dependent generatability.

2601.20803 2026-06-02 cs.CL 版本更新

Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning Applied to Few-Shot Relation Extraction

结构化语义信息有助于为上下文学习检索更好的示例,应用于少样本关系抽取

Aunabil Chakma, Mihai Surdeanu, Eduardo Blanco

发表机构 * University of Arizona(亚利桑那大学)

AI总结 提出基于句法-语义结构相似性的示例选择策略,结合大语言模型生成示例,构建混合系统提升少样本关系抽取性能。

详情
AI中文摘要

本文提出了几种自动获取上下文学习额外示例的策略,有效地将关系抽取从1-shot转变为少样本设置。具体来说,我们引入了一种新颖的示例选择策略,其中新示例基于其底层句法-语义结构与提供的1-shot示例的相似性进行选择。我们表明,与LLM生成的示例相比,我们的策略导致了互补的词汇选择和句子结构。当两种策略结合时,生成的混合系统比单独使用任一方法更全面地描绘了感兴趣的关系。我们的框架在数据集(FS-TACRED和FS-FewRel)和LLM系列(Qwen和Gemma)之间具有良好的迁移性。总体而言,我们的混合系统持续优于替代策略,在FS-TACRED上达到了最先进的性能,并在定制的FewRel子集上取得了显著的提升。

英文摘要

This paper presents several strategies to automatically obtain additional examples for in-context learning, effectively transforming relation extraction from a 1-shot to a few-shot setting. Specifically, we introduce a novel strategy for example selection, in which new examples are selected based on the similarity of their underlying syntactic-semantic structure to the provided 1-shot example. We show that our strategy results in complementary word choices and sentence structures compared to LLM-generated examples. When both strategies are combined, the resulting hybrid system achieves a more holistic picture of the relations of interest than either method alone. Our framework transfers well across datasets (FS-TACRED and FS-FewRel) and LLM families (Qwen and Gemma). Overall, our hybrid system consistently outperforms alternative strategies achieving state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset.

2601.19919 2026-06-02 cs.CL cs.AI cs.SD 版本更新

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

ASKD-Whisper: 自适应自知识蒸馏用于高效低延迟自动语音识别

Junseok Lee, Nahun Kim, Sangyong Lee, Chang-Jae Chun

发表机构 * OKESTRO Co., Ltd(OKESTRO公司) Sejong University(世宗大学)

AI总结 提出自适应自知识蒸馏(ASKD)动态课程框架,通过逐步减少对教师模型的依赖并引入自知识蒸馏阶段,在压缩Whisper模型时实现5倍推理加速和1.07%词错误率降低。

Comments Title and content have been updated

详情
AI中文摘要

知识蒸馏(KD)是将大规模基础模型压缩为可部署架构的最有效范式之一。在自动语音识别(ASR)背景下,先前研究主要侧重于强制学生模型严格模仿大型教师模型的预测分布。然而,这种静态依赖通常存在固有权衡:虽然学生快速获得基本语言表示,但同时继承了教师特定领域的盲点和过度自信的幻觉,导致分布外泛化能力严重下降。为有效缓解此问题,我们提出自适应自知识蒸馏(ASKD),一种动态课程框架。ASKD随着训练进行系统地衰减对教师分布的依赖——从而释放学生独立推理能力——随后采用自知识蒸馏阶段作为结构正则化器。通过应用ASKD,我们将庞大的Whisper架构蒸馏为紧凑变体ASKD-Whisper。在跨多种声学领域的综合评估中,ASKD-Whisper不仅实现了5倍推理延迟加速,还以1.07%更低的词错误率(WER)超越了其教师模型。这些结果表明,ASKD有效防止了教师引起的过拟合,并为可泛化模型压缩建立了新的最先进水平。

英文摘要

Knowledge distillation (KD) is one of the most effective paradigms for compressing large-scale foundation models into deployable architectures. In the context of Automatic Speech Recognition (ASR), previous studies have predominantly focused on forcing the student model to strictly mimic the predictive distribution of a massive teacher model. However, this static dependency often presents an inherent trade-off: while the student rapidly acquires basic linguistic representations, it simultaneously inherits the teacher's domain-specific blind spots and over-confident hallucinations, leading to a severe decline in out-of-distribution generalization capacity. To effectively mitigate this issue, we propose Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum framework. ASKD systematically decays the dependency on the teacher's distribution as training progresses-thereby unlocking the student's independent reasoning capacity-and subsequently employs a self-knowledge distillation phase to act as a structural regularizer. By applying ASKD, we distill the massive Whisper architecture into a compact variant, ASKD-Whisper. In our comprehensive evaluations across diverse acoustic domains, ASKD-Whisper not only achieves a 5x speedup in inference latency but also outperforms its teacher model by yielding a 1.07% lower word error rate (WER). These results demonstrate that ASKD effectively prevents teacher-induced overfitting and establishes a new state-of-the-art for generalizable model compression.

2510.18439 2026-06-02 cs.CL 版本更新

Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

基于视觉线索检测手语翻译中的幻觉:是依据视觉信息还是猜测?

Yasser Hamidullah, Koel Dutta Chowdhury, Yusser Al Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

发表机构 * German Research Center for Artificial Intelligence (DFKI GmbH)(德国人工智能研究中心(DFKI GmbH)) Saarland Informatics Campus(萨尔兰州信息学校园) Barcelona Supercomputing Center (BSC-CNS)(巴塞罗那超级计算中心(BSC-CNS))

AI总结 针对手语翻译中模型依赖语言先验而非视觉输入导致幻觉的问题,提出一种基于特征敏感性和反事实信号的令牌级可靠性度量,用于量化视觉信息利用程度,并在两个基准上验证其预测幻觉率、跨数据集泛化及与文本信号结合提升风险评估的效果。

Comments Published at ICLR2026 Code available at \url{https://github.com/yhamidullah/hallucination-slt}

详情
AI中文摘要

幻觉是指模型生成流畅但缺乏视觉证据支持的文本,这是视觉-语言模型的主要缺陷,在手语翻译(SLT)中尤为关键。在SLT中,意义依赖于视频中的精确依据,而无词汇表模型尤其脆弱,因为它们直接将连续的手势运动映射为自然语言,缺少作为对齐的中间词汇表监督。我们认为,当模型依赖语言先验而非视觉输入时会产生幻觉。为捕捉这一点,我们提出一种令牌级可靠性度量,用于量化解码器使用视觉信息的程度。我们的方法结合了基于特征的敏感性(衡量视频被掩蔽时的内部变化)和反事实信号(捕捉干净视频与修改视频之间的概率差异)。这些信号被聚合为句子级可靠性分数,提供了一种紧凑且可解释的视觉依据度量。我们在两个SLT基准(PHOENIX-2014T和CSL-Daily)上,使用基于词汇表和无词汇表模型评估了所提出的度量。结果表明,可靠性能够预测幻觉率,跨数据集和架构泛化,并在视觉退化下降低。除了这些定量趋势,我们还发现可靠性能够区分有依据的令牌和猜测的令牌,从而无需参考即可进行风险估计;当与基于文本的信号(置信度、困惑度或熵)结合时,进一步改善了幻觉风险评估。定性分析揭示了为什么无词汇表模型更容易产生幻觉。综合来看,我们的发现将可靠性确立为诊断SLT中幻觉的实用且可复用的工具,并为多模态生成中更鲁棒的幻觉检测奠定了基础。

英文摘要

Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.

2601.17952 2026-06-02 cs.CL cs.AI 版本更新

A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models

面向临床神经科学中基于Transformer的语言模型的稳定可解释性的单语义归因框架

Michail Mamalakis, Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio

发表机构 * Department of Computer Science and Technology(计算机科学与技术系) Cancer Research UK(癌症研究英国公司) Cambridge Institute(剑桥研究所) University of Cambridge(剑桥大学) United Kingdom(英国) DIMES(迪梅斯) University of Calabria(卡拉布里亚大学) Italy(意大利) Department of Computer Automatic and Management Engineering (DIAG)(计算机自动与管理工程系) Sapienza Università di Roma(罗马大学萨皮恩扎) Department of Psychiatry(精神病学系) School of Computing(计算学院) University of Kent(肯特大学) Department of Psychology(心理学系)

AI总结 提出一种通过单语义特征提取整合归因与机制视角的统一可解释性框架,生成稳定的输入级重要性分数,促进语言模型在认知健康和神经退行性疾病中的安全应用。

详情
AI中文摘要

可解释性仍然是语言模型在临床环境中部署的关键挑战,例如阿尔茨海默病的进展诊断,其中早期和可信的预测至关重要。现有的归因方法由于基于Transformer的语言模型和LLM表示的多语义性质而表现出高方法间变异性和不稳定的解释,而机制可解释性方法缺乏与模型输入和输出的直接对齐,并且不提供显式的重要性分数。我们引入了一个统一的可解释性框架,通过单语义特征提取整合了归因和机制视角。通过在基于Transformer的LM层级别构建单语义嵌入空间,并优化框架以显式减少方法间变异性,我们的方法生成稳定的输入级重要性分数,并通过感兴趣层的解压缩表示突出显著特征,推进了语言模型在认知健康和神经退行性疾病中的安全可信应用。

英文摘要

Interpretability remains a key challenge for deploying language models (LM) in clinical settings such as progression diagnosis of Alzheimer disease, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of Transformer-Based LM and LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of an transformer-based LM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LMs in cognitive health and neurodegenerative disease.

2601.17898 2026-06-02 cs.CL 版本更新

Assessment of Generative Named Entity Recognition in the Era of Large Language Models

大语言模型时代生成式命名实体识别的评估

Qi Zhan, Yile Wang, Hui Huang

发表机构 * College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院)

AI总结 本文系统评估了开源大语言模型在平面和嵌套命名实体识别任务上的性能,发现参数高效微调结合结构化格式可达到与传统模型竞争的性能,且NER能力源于指令遵循和生成能力而非记忆,微调对通用能力影响很小。

详情
AI中文摘要

命名实体识别(NER)正随着大语言模型(LLMs)的兴起从序列标注任务演变为生成式范式。我们对开源LLMs在平面和嵌套NER任务上进行了系统评估。我们研究了几个研究问题,包括生成式NER与传统NER模型之间的性能差距、输出格式的影响、LLMs是否依赖记忆,以及微调后通用能力的保持情况。通过在八个不同规模的LLMs和四个标准NER数据集上的实验,我们发现:(1)通过参数高效微调和结构化格式(如内联括号或XML),开源LLMs实现了与传统基于编码器的模型竞争的性能,并超越了使用上下文学习技术的基于解码器的LLMs;(2)LLMs的NER能力源于指令遵循和生成能力,而不仅仅是实体-标签对的记忆;(3)应用NER指令微调对LLMs的通用能力影响很小,甚至由于增强了实体理解,在DROP等数据集上将性能提高了25.50到45.32个F1点。这些发现表明,使用LLMs的生成式NER是传统方法的一种有前景且用户友好的替代方案。我们在https://github.com/szu-tera/LLMs4NER上发布了数据和代码。

英文摘要

Named entity recognition (NER) is evolving from a sequence labeling task into a generative paradigm with the rise of large language models (LLMs). We conduct a systematic evaluation of open-source LLMs on both flat and nested NER tasks. We investigate several research questions including the performance gap between generative NER and traditional NER models, the impact of output formats, whether LLMs rely on memorization, and the preservation of general capabilities after fine-tuning. Through experiments across eight LLMs of varying scales and four standard NER datasets, we find that: (1) With parameter-efficient fine-tuning and structured formats like inline bracketed or XML, open-source LLMs achieve performance competitive with traditional encoder-based models and surpass decoder-based LLMs with in-context learning techniques; (2) The NER capability of LLMs stems from instruction-following and generative power, not mere memorization of entity-label pairs; and (3) Applying NER instruction tuning has minimal impact on general capabilities of LLMs, even improving performance on datasets like DROP by 25.50 to 45.32 F1 points due to enhanced entity understanding. These findings demonstrate that generative NER with LLMs is a promising, user-friendly alternative to traditional methods. We release the data and code at https://github.com/szu-tera/LLMs4NER.

2511.12487 2026-06-02 cs.NE cs.AI cs.CL 版本更新

ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

ToxSearch: 面向大型语言模型毒性搜索的提示演化

Onkar Shelar, Travis Desell

发表机构 * Rochester Institute of Technology(罗切斯特技术研究所)

AI总结 提出ToxSearch,一种黑盒演化框架,通过同步稳态循环演化提示来测试大型语言模型的安全性,并分析不同操作符的行为及跨模型迁移性。

Comments 16 pages

详情
Journal ref
In: García-Sánchez, P., Díaz Álvarez, J., Murphy, A. (eds) Applications of Evolutionary Computation. EvoApplications 2026. Lecture Notes in Computer Science, vol 16525. Springer, Cham
AI中文摘要

大型语言模型即使在安全对齐后,仍然容易受到引发毒性内容的对抗性提示的攻击。我们提出了ToxSearch,一种黑盒演化框架,通过同步稳态循环演化提示来测试模型安全性。该系统采用多种操作符,包括词汇替换、否定、回译、释义以及两种语义交叉操作符,同时一个审核预言机提供适应度指导。操作符级分析显示出异质性行为:词汇替换提供了最佳的收益-方差权衡,语义相似性交叉充当精确的低吞吐量插入器,而全局重写表现出高方差和较高的拒绝成本。使用在LLaMA 3.1 8B上演化的精英提示,我们观察到实际有意义但衰减的跨模型迁移,大多数目标上的毒性大约减半,较小的LLaMA 3.2变体表现出最强的抵抗力,而一些跨架构模型保留了较高的毒性。这些结果表明,小的、可控的扰动是系统性红队测试的有效载体,并且防御措施应预期对抗性提示的跨模型重用,而不是仅关注单模型加固。

英文摘要

Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.

2601.16462 2026-06-02 cs.CL 版本更新

Finding What Matters: Anchoring Context Knowledge with Evolving Indices for Iterative Retrieval

寻找关键:通过演化索引锚定上下文知识进行迭代检索

Mingyan Wu, Zhenghao Liu, Xinze Li, Yuqing Lan, Yukun Yan, Shuo Wang, Cheng Yang, Minghe Yu, Zheni Zeng, Maosong Sun

发表机构 * School of Computer Science and Engineering, Northeastern University, China(东北大学计算机科学与工程学院) Department of Computer Science and Technology, Institute for AI, Tsinghua University, China(清华大学人工智能研究院计算机科学与技术系) Beijing National Research Center for Information Science and Technology, China(北京信息科学与技术国家研究中心) School of Computer Science, Beijing University of Posts and Telecommunications, China(北京邮电大学计算机学院)

AI总结 提出KAIR框架,通过迭代检索中动态更新的知识索引锚定关键证据,引导大语言模型在多跳问答中有效推理并缓解噪声干扰。

详情
AI中文摘要

检索增强生成(RAG)已成为通过引入外部知识来减轻大语言模型(LLMs)幻觉的主流范式。然而,现有的RAG系统通常难以有效整合和推理分散在嘈杂检索文档中的关键证据,尤其是在多跳场景中。本文提出KAIR,一种用于迭代检索的知识锚定框架,该框架在检索到的知识中锚定知识,以引导LLMs定位关键信息。在迭代检索过程中,KAIR逐步更新知识索引,以锚定检索文档中的显著证据。演化索引作为导航锚定索引,使LLM能够评估知识充分性并制定后续检索查询。最后,KAIR通过联合利用检索文档和最终确定的锚定索引生成答案。在四个多跳问答基准上的实验表明,KAIR始终优于强RAG基线。进一步分析显示,KAIR有效锚定关键知识并减轻了迭代检索过程中的上下文噪声,提高了LLM关联和推理分散在检索文档中的证据的能力。所有代码和数据可在https://github.com/NEUIR/KAIR获取。

英文摘要

Retrieval-Augmented Generation (RAG) has become a dominant paradigm for mitigating hallucinations in Large Language Models (LLMs) by incorporating external knowledge. However, existing RAG systems often struggle to effectively integrate and reason over key evidence scattered across noisy retrieved documents, particularly in multi-hop scenarios. In this paper, we propose KAIR, a Knowledge Anchoring framework for Iterative Retrieval that anchors knowledge within retrieved knowledge to guide LLMs to locate the key information. During iterative retrieval, KAIR progressively updates the knowledge index to anchor salient evidence from retrieved documents. The evolving index serves as a navigational anchoring index that enables the LLM to assess knowledge sufficiency and formulate subsequent retrieval queries. Finally, KAIR generates answers by jointly leveraging the retrieved documents and the finalized anchoring index. Experiments on four multi-hop question answering benchmarks demonstrate that KAIR consistently outperforms strong RAG baselines. Further analysis shows that KAIR effectively anchors key knowledge and alleviates the context noise during iterative retrieval, improving the LLM's ability to associate and reason over dispersed evidence across retrieved documents. All code and data are available at https://github.com/NEUIR/KAIR.

2509.06093 2026-06-02 cs.DB cond-mat.mtrl-sci cs.AI cs.CL 版本更新

Language-Native Materials Processing Design by Lightly Structured Text Database and Reasoning Large Language Model

基于轻结构化文本数据库和推理大语言模型的自然语言材料加工设计

Yuze Liu, Zhaoyuan Zhang, Xiangsheng Zeng, Yihe Zhang, Leping Yu, Liu Yang, Lejia Wang, Xi Yu

发表机构 * State Key Laboratory of Advanced Materials for Intelligent Sensing, Key Laboratory of Organic Integrated Circuit, Ministry of Education & Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Department of Chemistry, School of Science, Tianjin University(智能传感先进材料国家重点实验室、有机集成电路重点实验室、教育部、天津分子光电子科学重点实验室、化学系、天津大学) Language Intelligence Technology Co., Ltd.(语言智能技术有限公司) College of Intelligence and Computing, Tianjin University(智能与计算学院、天津大学) School of Materials and Chemical Engineering, Ningbo University of Technology(材料与化学工程学院、宁波工业大学)

AI总结 将材料合成规划重构为文本推理问题,通过轻结构化知识基底结合检索增强生成与经验增强推理,在氮化硼纳米片剥离中三轮迭代获得高质量协议。

详情
AI中文摘要

材料合成步骤主要以叙述性文本形式记录在论文、方案和实验室记录中,这使得传统数据驱动优化框架难以处理。这种自然语言特性对复杂多阶段过程(如氮化硼纳米片(BNNS)的制备)构成了特殊挑战,其中结果取决于剥离、功能化和功能化中的路径依赖选择。在这里,我们将材料合成规划重构为一个文本推理问题,该问题由一个轻结构化的知识基底支持,该基底保留了程序逻辑和因果上下文,同时暴露了可计算元素以供检索。基于这种表示,我们的框架结合了语义匹配、词汇搜索和参数感知过滤,以支持检索增强生成,提供更准确、更有依据的合成指导。我们进一步引入了经验增强推理,其中从多源叙述中迭代提炼的文本指导支持假设生成、故障诊断和方案修订。我们在BNNS的目标剥离中验证了该框架,这是一个受多变量约束且文献方案在实验室间可迁移性有限的合成问题。通过将分散的文献证据与实验观察到的故障模式相结合,系统仅在三轮迭代内就收敛到一个高性能方案,该方案产生了符合目标规格的高质量超薄纳米片,大大缩短了通常由专家主导的冗长试错周期。通过实现对程序知识的自然语言推理,该框架将AI从文献辅助推向复杂材料工作流程中的主动合成规划、适应和加速。

英文摘要

Materials synthesis procedures are predominantly documented as narrative text in papers, protocols, and laboratory records, placing them beyond the reach of conventional data-driven optimization frameworks. This language-native character poses a particular challenge for complex, multistage processes such as the preparation of boron nitride nanosheets (BNNS), where outcomes depend on path-dependent choices in exfoliation, functionalization, and functionalization. Here, we recast synthesis planning of the materials as a text reasoning problem enabled by a lightly structured knowledge substrate that preserves the procedural logic and causal contexts while exposing computable elements for retrieval. Built on this representation, our framework combines semantic matching, lexical search, and parameter-aware filtering to support retrieval-augmented generation with more accurate and better-grounded synthesis guidance. We further introduce experience-augmented reasoning, in which iteratively refined text guides distilled from multi-source narratives support hypothesis generation, failure diagnosis, and protocol revision. We validated the framework in the targeted exfoliation of BNNS, a synthesis problem governed by multivariate constraints and limited transferability of literature protocols across laboratory settings. By integrating dispersed literature evidence with experimentally observed failure modes, the system converged within only three iterative rounds on a high-performing protocol that yielded high-quality ultrathin nanosheets meeting the target specifications, substantially shortening what is often a prolonged cycle of expert-led trial-and-error. By enabling language-native reasoning over procedural knowledge, this framework moves AI beyond literature assistance toward active synthesis planning, adaptation and acceleration in complex materials workflows.

2601.14230 2026-06-02 cs.CL cs.AI cs.HC 版本更新

MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

MASCOT: 迈向多智能体社会协作伴侣系统

Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 针对多智能体系统中的人格崩溃和社会谄媚问题,提出MASCOT框架,通过双层优化策略(人格感知行为对齐与协作对话优化)提升角色一致性和对话贡献。

Comments 15 pages, 9 figures. https://hello-diana.github.io/MASCOT/

详情
AI中文摘要

多智能体系统(MAS)正成为情感和认知支持方面有前景的社会协作伴侣。然而,现有系统经常遭受人格崩溃(即智能体退化为通用、同质化的助手行为)和社会谄媚(即智能体产生冗余、非建设性的对话)。我们提出MASCOT,一个用于多视角社会协作伴侣的多智能体框架。MASCOT引入了一种新颖的双层优化策略来协调个体和集体行为:1)人格感知行为对齐,一个RLAIF驱动的流程,用于微调个体智能体以实现特定于智能体的身份;2)协作对话优化,一个群体级适应过程,促进互补、多样和富有成效的对话。我们使用源自领域内和领域外(OOD)设置的人类真实情境评估MASCOT,并与最先进的基线进行比较。MASCOT将人格一致性提高了最多+14.1,社会贡献提高了最多+10.6。广泛的评估套件,包括人类评估、多个LLM评判、三方比较和自动指标,进一步表明MASCOT产生了更符合角色且更少冗余的多智能体对话。

英文摘要

Multi-agent systems (MAS) are emerging as promising socio-collaborative companions for emotional and cognitive support. However, existing systems frequently suffer from persona collapse, where agents revert to generic, homogenized assistant behaviors, and social sycophancy, where agents produce redundant, non-constructive dialogue. We propose MASCOT, a multi-agent framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that fine-tunes individual agents for agent-specific identities; and 2) Collaborative Dialogue Optimization, a group-level adaptation process that promotes complementary, diverse, and productive discourse. We evaluate MASCOT using human-grounded contexts drawn across both in-domain and out-of-domain (OOD) settings against state-of-the-art baselines. MASCOT improves persona consistency by up to +14.1 and social contribution by up to +10.6. A broad evaluation suite, including human evaluation, multiple LLM judges, three-way comparisons, and automatic metrics, further shows that MASCOT produces more role-consistent and less redundant multi-agent dialogue.

2601.09696 2026-06-02 cs.CL 版本更新

Empathy Applicability Modeling for General Health Queries

通用健康查询的共情适用性建模

Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem

发表机构 * University of Michigan(密歇根大学) Lahore University of Management Sciences(拉合尔管理科学大学)

AI总结 提出共情适用性框架(EAF),基于临床、语境和语言线索对患者查询的情感反应和解释适用性进行分类,以支持异步医疗中的共情沟通。

Comments Accepted at Findings of ACL 2026

详情
AI中文摘要

LLM 越来越多地被整合到临床工作流程中,但它们往往缺乏临床共情,而临床共情是有效医患沟通的重要方面。现有的 NLP 框架侧重于反应性地标记医生回应中的共情,但对共情需求的预期建模支持有限,尤其是在通用健康查询中。我们引入了共情适用性框架(EAF),这是一种理论驱动的方法,基于临床、语境和语言线索,根据情感反应和解释的适用性对患者查询进行分类。我们发布了一个真实患者查询的基准数据集,由人类标注者和 GPT-4o 双重标注。在具有人类共识的子集中,我们也观察到人类与 GPT 之间的一致性较高。为了验证 EAF,我们在人类标注和仅 GPT 标注上训练分类器以预测共情适用性,取得了强性能,并优于启发式方法和零样本 LLM 基线。错误分析突出了持续的挑战:隐性痛苦、临床严重性模糊性和语境困难,强调了多标注者建模、临床医生在环校准和文化多样性标注的必要性。EAF 提供了一个在生成回应之前识别共情需求的框架,为预期共情建模建立了基准,并支持异步医疗中的共情沟通。

英文摘要

LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by human annotators and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.

2512.16401 2026-06-02 cs.CL 版本更新

Navigating the Reality Gap: On-Device Continual Adaptation of ASR for Clinical Telephony

导航现实差距:面向临床电话的ASR设备端持续适应

Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain, Aditya Bansal, Pratik Narang, Dhruv Kumar

发表机构 * BITS Pilani, Pilani Campus, India(印度比斯帕利尼学院帕利尼校区) Qure.ai, India(印度Qure.ai)

AI总结 针对临床电话场景中ASR性能严重下降的问题,使用Gram Vaani语料库作为代理,研究设备端持续适应方法,发现经验回放与弹性权重巩固之间的关键交互作用。

Comments 17 pages. Under review

详情
AI中文摘要

自动语音识别(ASR)可以显著减少临床工作流程中的文档负担,但在现实电话环境中,标准模型会严重退化,因为嘈杂的音频、方言变异和严格的数据驻留约束阻止了基于云的适应。我们使用Gram Vaani(一个覆盖农村医疗和农业热线的电话印地语语料库)作为严格设备端约束下临床语音的最接近可用代理,研究了这一“现实差距”。我们表明,一个强大的多语言模型(IndicWav2Vec)在标准干净印地语上的词错误率(WER)从11.59%下降到该代理电话数据上的41.71%。我们在多个基线、数据集和随机种子上评估了一系列在现实约束下的设备端适应方案,从全微调到参数高效的LoRA和基于流的持续学习。聚焦于持续学习,我们的核心发现突出了经验回放(ER)和弹性权重巩固(EWC,由正则化强度λ参数化)之间的关键交互。我们表明,标准的正EWC(λ>0)可能对抗回放驱动的更新,限制适应。反转EWC的强度(λ<0)表明,在ER引导的适应下,它可以作为方向控制信号:负λ增强回放驱动的可塑性,而调度的λ允许稳定性和可塑性的相位相关控制。在多个数据集上的评估中,我们发现多领域回放为适应提供了坚实基础,而EWC调节稳定性-可塑性动态而不改变最终性能。这些结果表明,有效的设备端适应取决于理解数据驱动和参数级学习信号如何交互,而不是孤立地选择方法。

英文摘要

Automatic Speech Recognition (ASR) can significantly reduce documentation burden in clinical workflows, but standard models degrade sharply in real-world telephony settings where noisy audio, dialectal variation, and strict data residency constraints prevent cloud-based adaptation. We study this "reality gap" using Gram Vaani: a telephonic Hindi corpus spanning rural healthcare and agricultural helplines, as the closest available proxy for clinical speech under strict on-device constraints. We show that a robust multilingual model (IndicWav2Vec) degrades from 11.59\% WER on standard clean Hindi to \textbf{41.71\% WER} on this proxy telephony data. We evaluate a progression of on-device adaptation regimes under realistic constraints, from full fine-tuning to parameter-efficient LoRA and stream-based continual learning, across multiple baselines, datasets, and seeds. Focusing on continual learning, our central finding highlights a critical interaction between Experience Replay (ER) and Elastic Weight Consolidation (EWC, parameterized by regularization strength $λ$). We show that standard positive EWC ($λ> 0$) can oppose replay-driven updates, limiting adaptation. Reversing EWC's strength ($λ< 0$) suggests that it can act as a directional control signal under ER-guided adaptation: negative $λ$ reinforces replay-driven plasticity, while a scheduled $λ$ enables phase-dependent control of stability and plasticity. Across evaluations on multiple datasets, we find that multi-domain replay provides a strong foundation for adaptation, while EWC modulates stability-plasticity dynamics without altering final performance. These results show that effective on-device adaptation depends on understanding how data-driven and parameter-level learning signals interact, rather than choosing methods in isolation.

2601.03615 2026-06-02 cs.CL cs.SD eess.AS 版本更新

SARA: Stress Test Reasoning in Audio Deepfake Detection

SARA: 音频深度伪造检测中的压力测试推理

Binh Nguyen, Charles Fleming, Thai Le

发表机构 * Indiana University(印第安纳大学) Cisco Research(思科研究)

AI总结 提出SARA框架,通过声学感知、推理-判决一致性与不和谐三个维度评估音频语言模型在对抗攻击下的推理可靠性,发现声学攻击降低一致性而语言攻击保持一致性但成功率更高,且推理轨迹的文本一致性可作为检测对抗样本的潜在指标。

Comments Preprint for ACL 2026 submission

详情
AI中文摘要

音频语言模型(ALMs)通过提供推理轨迹来透明化其预测,为可解释的音频深度伪造检测(ADD)提供了有前景的转变,超越了黑盒分类器。然而,这种推理可能不支持模型预测,反映出一致性差,或者更糟的是,可能用看似合理但具有误导性的解释来合理化错误预测。此外,ALM推理在对抗攻击下的行为仍未得到充分探索,引发了关于这种解释能力实际可靠性的疑问。为填补这一空白,本研究引入了SARA(音频推理的移位分析),这是一个诊断框架,从三个维度评估ALM推理:声学感知、推理-判决一致性与不和谐。我们针对声学和语言对抗攻击测试了五个开源ALM。结果表明,声学攻击显著降低了推理-判决一致性(平均下降14.20%),经常引发内部逻辑冲突。相反,语言攻击在保持推理一致性的同时实现了更高的攻击成功率。我们进一步证明,生成的推理轨迹的文本一致性也可作为对抗输入的潜在指标,从而无需访问原始声学信号即可有效检测受扰音频(F1为0.78)。这些发现表明,即使最终分类输出受损,推理轨迹仍具有诊断效用。

英文摘要

Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detections (ADD), moving beyond \textit{black-box} classifiers by providing transparency to their predictions via reasoning traces. However, such reasoning may not support the model predictions, reflecting poor coherence, or, worse, may rationalize incorrect predictions with plausible but misleading explanation. Moreover, the behavior of ALM reasoning under adversarial attacks remains under-explored, raising questions about the practical reliability of such explanation capabilities. To address this gap, this study introduces \textbf{SARA} (\textbf{S}hift \textbf{A}nalysis of \textbf{R}easoning in \textbf{A}udio), a diagnostic framework that evaluates ALM reasoning across three dimensions: acoustic perception, reasoning-verdict coherence and dissonance. We test five open-source ALMs against both acoustic and linguistic adversarial attacks. We show that acoustic attacks significantly degrade reasoning-verdict coherence (average decrease of 14.20\%), frequently inducing internal logical conflicts. Conversely, linguistic attacks achieve higher attack success rates while maintaining reasoning coherence. We further demonstrate that the textual coherence of generated reasoning traces also serves as a latent indicator of adversarial inputs, enabling effective detection of perturbed audio (0.78 in F1) \textit{without accessing the raw acoustic signal}. These findings suggest that reasoning traces provide diagnostic utility that persists even when final classification outputs are compromised.

2512.22060 2026-06-02 cs.CR cs.CL cs.CY 版本更新

Toward Secure and Compliant AI: Organizational Standards and Protocols for NLP Model Lifecycle Management

面向安全与合规的人工智能:NLP模型生命周期管理的组织标准与协议

Sunil Arora, John Hastings

发表机构 * ICAART 2026

AI总结 提出SC-NLP-LMF六阶段框架,通过系统综述整合NIST AI RMF等标准,结合偏差检测、差分隐私等方法,保障NLP系统从开发到退役的安全与合规,并以医疗案例验证其有效性。

Comments 9 pages, 2 tables, 1 figure

详情
Journal ref
Proc. of the 18th International Conference on Agents and Artificial Intelligence - Vol. 2: ICAART, 1624-1634, 2026
AI中文摘要

自然语言处理系统越来越多地用于医疗、金融和政府等敏感领域,处理大量个人和受监管数据。然而,这些系统引入了与安全、隐私和法规遵从相关的独特风险,现有的人工智能治理框架未能完全解决这些问题。本文介绍了安全合规的NLP生命周期管理框架(SC-NLP-LMF),这是一个全面的六阶段模型,旨在确保NLP系统从开发到退役的安全运行。该框架通过对45篇同行评审和监管来源进行基于PRISMA的系统综述而开发,与领先标准(包括NIST AI RMF、ISO/IEC 42001:2023、欧盟AI法案和MITRE ATLAS)保持一致。它集成了偏差检测、隐私保护(差分隐私、联邦学习)、安全部署、可解释性和安全模型退役等成熟方法。一个医疗案例研究展示了SC-NLP-LMF如何检测新兴术语漂移(例如,与COVID相关的语言)并指导合规的模型更新。该框架为组织提供了一个实用的、覆盖全生命周期的结构,用于在高风险环境中开发、部署和维护安全且负责任的NLP系统。

英文摘要

Natural Language Processing (NLP) systems are increasingly used in sensitive domains such as healthcare, finance, and government, where they handle large volumes of personal and regulated data. However, these systems introduce distinct risks related to security, privacy, and regulatory compliance that are not fully addressed by existing AI governance frameworks. This paper introduces the Secure and Compliant NLP Lifecycle Management Framework (SC-NLP-LMF), a comprehensive six-phase model designed to ensure the secure operation of NLP systems from development to retirement. The framework, developed through a systematic PRISMA-based review of 45 peer-reviewed and regulatory sources, aligns with leading standards, including NIST AI RMF, ISO/IEC 42001:2023, the EU AI Act, and MITRE ATLAS. It integrates established methods for bias detection, privacy protection (differential privacy, federated learning), secure deployment, explainability, and secure model decommissioning. A healthcare case study illustrates how SC-NLP-LMF detects emerging terminology drift (e.g., COVID-related language) and guides compliant model updates. The framework offers organizations a practical, lifecycle-wide structure for developing, deploying, and maintaining secure and accountable NLP systems in high-risk environments.

2512.20638 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

揭示大型语言模型及其基准测试中的能力差距

Maty Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan

发表机构 * University of Zurich(苏黎世大学)

AI总结 提出一种基于稀疏自编码器概念激活的新方法,自动发现模型在细粒度概念上的弱点(模型差距)和基准测试覆盖不平衡(基准差距),并通过内部表示评估和跨基准比较进行验证。

详情
Journal ref
ICML 2026
AI中文摘要

大型语言模型的评估严重依赖标准化基准测试。这些基准测试提供了有用的聚合指标,但可能掩盖(i)模型薄弱的特定子领域(“模型差距”)和(ii)基准测试本身的不平衡覆盖(“基准差距”)。为了自动揭示这两类差距,我们提出了一种简单的新方法,利用稀疏自编码器的概念激活,在逐概念基础上识别细粒度差距。该方法还受益于将评估基于模型的内部表示,以及易于跨基准测试进行比较。我们将该方法应用于五个流行的开源模型和十几个基准测试,作为示例说明。作为对该方法的验证,我们发现我们的自动无监督方法能够恢复文献中先前记录的模型差距(例如与谄媚相关的差距),并识别出新的模型差距。我们还能够自动揭示基准差距:应属于给定基准测试范围的核心概念。我们的“能力差距”方法可以通过提供模型行为的概念级分解,并帮助基准测试开发者迭代基准测试设计,来补充现有基准测试。代码可在 https://competency-gaps.github.io 获取。

英文摘要

The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular sub-areas where the models are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method using concept activations from sparse autoencoders, to identify fine-grained gaps on a per-concept basis. The method also benefits from grounding evaluation in the model's internal representations, as well as easy comparison across benchmarks. We applied the method to five popular open-source models and more than a dozen benchmarks, as illustrative examples. As validation of the approach, we found that our automatic, unsupervised method was able to recover model gaps that have been previously documented in the literature (e.g. relating to sycophancy), in addition to identifying novel model gaps. We were also able to automatically uncover benchmark gaps: core concepts that should fall within the scope of a given benchmark. Our "competency gaps" method can be used to complement existing benchmarks, by providing a concept-level decomposition of model behavior, and by helping benchmark developers iterate upon benchmark design. Code is available at https://competency-gaps.github.io.

2512.07795 2026-06-02 cs.AI cs.CL cs.LG 版本更新

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

ReasonBENCH: 基准测试LLM推理的(不)稳定性

Nearchos Potamitis, Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Lars Klein, Akhil Arora

发表机构 * Aarhus University(奥胡斯大学) Indian Institute of Technology Delhi(德里印度理工学院) EPFL(苏黎世联邦理工学院)

AI总结 提出ReasonBench基准套件,通过30次独立试验揭示LLM推理系统在贪婪解码下仍存在结构化方差,并引入全局噪声和运行噪声分类法,证明稳定性是推理系统的固有属性,倡导分布感知评估。

Comments 29 pages, 19 tables, 85 figures

详情
AI中文摘要

LLM推理系统的基准分数被报告为单一数字,然而相同的模型、策略和任务在重复执行时,即使在贪婪解码(T=0)下也会产生显著不同的答案和成本。这种方差并非统计上的麻烦:性能最高的策略在与最接近的对手进行头对头运行时仅获胜77%,这意味着单次观测到的分数可能会无声地错误排序系统。我们引入了ReasonBench,一个基准套件,记录了10种推理策略、12个模型和6个任务的30次独立试验,将质量和成本视为分布而非点估计。我们发现这种方差是有结构的而非随机的:一个双组分分类法——全局噪声(捕捉跨基准的不均匀性)和运行噪声(捕捉基准内的随机性)——揭示了策略架构预测稳定性分布,而模型和策略则移动分布的正交方面。层次分解将四分之三的分数方差归因于基准、系统和项目结构,而单次运行评估无声地吸收了持久的残差。最后,成本和成本非对称地解耦:廉价方法在结构上对联合成本-质量失败免疫,而昂贵方法无论其准确性如何仍然暴露。这些发现确立了不稳定性作为推理系统的固有属性,并促使分布感知评估成为标准实践。

英文摘要

Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0). This variance is not a statistical nuisance: the highest-performing strategy wins only 77% of head-to-head runs against its nearest competitor, meaning a single observed score can silently misrank systems. We introduce ReasonBench, a benchmark suite recording 30 independent trials across 10 reasoning strategies, 12 models, and 6 tasks, treating quality and cost as distributions rather than point estimates. We find that this variance is structured rather than random: a two-component taxonomy -- Global Noise, capturing cross-benchmark unevenness, and Run Noise, capturing within-benchmark stochasticity -- reveals that strategy architecture predicts stability profiles, while models and strategies shift orthogonal aspects of the distribution. A hierarchical decomposition attributes three-quarters of score variance to benchmark, system, and item structure, with a persistent residual that single-run evaluation silently absorbs. Finally, cost and quality decouple asymmetrically: cheap methods are structurally immune to joint cost-quality failure, while expensive methods remain exposed regardless of their accuracy. These findings establish instability as an inherent property of reasoning systems and motivate distribution-aware evaluation as standard practice.

2511.20639 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Latent Collaboration in Multi-Agent Systems

多智能体系统中的潜在协作

Jiaru Zou, Ruizhong Qiu, Gaotang Li, Xiyuan Yang, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang

发表机构 * University of Washington(华盛顿大学)

AI总结 提出LatentMAS框架,使LLM智能体在连续潜在空间直接协作,无需文本中介,实现更高精度、更低开销和更快推理。

Comments ICML2026 Spotlight, Project: https://github.com/Gen-Verse/LatentMAS

详情
AI中文摘要

多智能体系统(MAS)将大语言模型(LLM)从独立的单模型推理扩展到协同的系统级智能。现有LLM智能体依赖基于文本的中介进行推理和通信,而我们更进一步,使模型能够在连续潜在空间内直接协作。我们引入了LatentMAS,一个端到端无需训练的框架,实现了LLM智能体间的纯潜在协作。在LatentMAS中,每个智能体首先通过最后一层的隐藏嵌入而非文本进行自回归潜在思维生成。然后,一个共享的潜在工作记忆保存并传递每个智能体的内部表示和潜在思维,确保无需重新编码的无损信息交换。我们提供了详细的理论分析,表明LatentMAS比基于文本的标准MAS具有更高的表达能力和无损信息保存能力,且整体复杂度更低。此外,在涵盖数学和科学推理、常识理解及代码生成的9个综合基准测试上的实证评估表明,LatentMAS优于先进的单智能体和基于文本的MAS基线,准确率最高提升14.6%,输出token使用量减少70.8%-83.7%,端到端推理速度提升4倍至4.3倍。代码和数据完全开源:https://github.com/Gen-Verse/LatentMAS。

英文摘要

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings instead of text. Then, a shared latent working memory preserves and transfers each agent's internal representations and latent thoughts, ensuring lossless information exchange without re-encoding. We provide detailed theoretical analyses showing that LatentMAS achieves higher expressiveness and lossless information preservation with lower overall complexity than standard text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS outperforms advanced single agents and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4$\times$-4.3$\times$ faster end-to-end inference. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

2512.05671 2026-06-02 cs.CL 版本更新

ClinTutor-R1: Advancing Scalable and Robust One-to-Many Alignment in Clinical Socratic Education

ClinTutor-R1:在临床苏格拉底式教育中推进可扩展且稳健的一对多对齐

Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung

发表机构 * Hong Kong University of Science and Technology(香港理工大学)

AI总结 针对临床查房等一对多教学场景中上下文稀释与目标错位问题,提出基于多智能体模拟器ClinEdu构建的苏格拉底式教学数据集ClinTeach,并开发首个显式内部思维机制的一对多对齐视觉语言智能体ClinTutor-R1,通过建模个体信念状态与群体共识,在静态基准、交互评估、专家评审及200人真实用户研究中超越基线模型20%以上,达到与专有模型相当的性能,并展现出随学生规模扩展的教学质量可扩展性。

Comments Accepted by ICML 2026 (Spotlight)

详情
AI中文摘要

尽管大型语言模型(LLMs)在二元(一对一)教学中取得了显著成功,但在临床查房等一对多对齐场景中面临重大挑战,此时教师必须同时指导一群多样化的受训者。当前模型常受上下文稀释与目标错位影响,难以平衡个体脚手架与集体学习进度。为此,我们引入ClinEdu,一个模拟群体动态复杂性的多智能体教学模拟器。基于该平台,我们构建了ClinTeach,一个大规模的苏格拉底式教学对话数据集,并提出ClinTutor-R1,这是首个明确设计用于实现临床教育中一对多对齐的视觉语言智能体,采用显式内部思维机制来建模个体信念状态与群体共识。我们通过涵盖静态基准、ClinEdu内原位交互评估、专家评估以及200名参与者的真实用户研究的综合协议验证了我们的框架。实验结果表明,ClinTutor-R1在性能上超越基线模型超过20%,并达到与专有模型相当的水平,同时在扩展学生群体时展现出维持教学质量的扩展性。

英文摘要

While Large Language Models (LLMs) have achieved remarkable success in dyadic (one-on-one) instruction, they face significant challenges in One-to-Many alignment, such as clinical ward rounds, where an instructor must simultaneously guide a diverse group of trainees. Current models often suffer from context dilution and goal misalignment, failing to balance individual scaffolding with collective learning progress. To address this, we introduce ClinEdu, a multi-agent pedagogical simulator that models the complexity of group dynamics. Leveraging this platform, we construct ClinTeach, a large-scale dataset of Socratic teaching dialogues, and propose ClinTutor-R1, the first vision-language agent explicitly architected to achieve one-to-many alignment in clinical education, employing an explicit internal thinking mechanism to model both individual belief states and group consensus. We validate our framework through a comprehensive protocol covering static benchmarks, in-situ interactive evaluation within ClinEdu, expert assessment, and a 200-participant real user study. Experimental results demonstrate that ClinTutor-R1 outperforms base models by over 20% and achieves parity with proprietary models, while exhibiting scalability in maintaining instructional quality across expanding student cohorts.

2511.21397 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Understanding the Effects of Distractors on Reasoning Vision-Language Models

理解干扰项对推理视觉语言模型的影响

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)(坡山科学技术大学(POSTECH))

AI总结 本文通过构建包含语义和数值维度干扰项的视觉问答数据集Idis,研究视觉干扰项如何影响视觉语言模型的测试时缩放行为,发现视觉干扰项以与文本干扰项根本不同的方式降低准确率而不增加推理长度,并提出简单提示策略缓解干扰项驱动的预测。

Comments preprint

详情
AI中文摘要

无关信息(即干扰项)如何影响视觉语言模型(VLM)的测试时缩放?先前关于纯文本语言模型的研究表明,文本干扰项可以加剧逆缩放,导致模型推理更长但推理轨迹效率更低。在这项工作中,我们研究了类似现象是否在多模态设置中出现。我们引入了Idis(带干扰项的图像),这是一个视觉问答数据集,系统性地沿着语义和数值维度变化干扰项。我们的分析揭示,视觉干扰项以与文本干扰项根本不同的方式影响推理VLM:尽管逆缩放仍然出现,但视觉干扰项降低了准确率而不增加推理长度。我们进一步展示了从推理轨迹中提取的属性计数为干扰项如何与推理长度和准确率交互提供了关键见解。作为合理性检查,我们提出了一种简单的提示策略,以减轻推理视觉语言模型中干扰项驱动的预测。

英文摘要

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causing models to reason longer but less effective reasoning traces. In this work, we investigate whether similar phenomena arise in multimodal settings. We introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic and numerical dimensions. Our analyses reveal that visual distractors affect reasoning VLMs in a fundamentally different way from textual distractors: although inverse scaling still emerges, visual distractors reduce accuracy without increasing reasoning length. We further show that attribute counts extracted from reasoning traces provide key insights into how distractors interact with reasoning length and accuracy. As a sanity check, we propose a simple prompting strategy that mitigates distractor-driven predictions in reasoning vision-language models.

2511.20409 2026-06-02 cs.CL 版本更新

NormEval: A Unified Multi-Metric Framework for Evaluating Semantic Fidelity in Text Normalization

NormEval:文本归一化中语义保真度评估的统一多指标框架

Md Abdullah Al Kafi, Raka Moni, Walayat Hussain

发表机构 * Multidisciplinary Action Research (MARS) lab(多学科行动研究(MARS)实验室) Artificial Intelligence for Decision Excellence (AIDX) Lab(人工智能决策卓越(AIDX)实验室)

AI总结 提出NormEval统一多指标框架,通过压缩比、模型性能变化、信息保留分数、算法有效性分数和平均归一化莱文斯坦距离五个指标,从宏观效率、下游效用和微观形态保真度三个维度评估文本归一化的语义保真度。

详情
AI中文摘要

词干提取和词形还原等文本归一化方法是NLP流程的基本组成部分。随着为不同语言开发新的归一化工具,评估方法仍然零散,依赖于压缩比、下游准确率或序列到序列预测分数等单一指标,无法区分有益的词汇缩减和有害的语义扭曲。此外,文本归一化支撑着高风险领域的智能系统,包括临床决策支持和法律文档分析,因此原则性的评估方法至关重要。本文提出NormEval,一个统一的多语言评估框架,包含五个互补指标:压缩比(CR)、模型性能变化(MPD)、信息保留分数(IRS)、算法有效性分数(AES)和平均归一化莱文斯坦距离(ANLD)。这些指标从三个维度评估归一化质量:宏观效率、下游效用和微观形态保真度。该框架实现了安全门假设:ANLD作为内在的结构卫生检查,利用字符级差异(Δ)揭示宏观嵌入和下游任务掩盖的激进突变。在孟加拉语和英语数据集上的全面消融实验表明,所有组成部分都是不可或缺的,移除任何一个指标都会导致至少一个评估方面的下降,最终导致误导性的算法排名。

英文摘要

Text normalization methods such as stemming and lemmatization are fundamental components of NLP pipelines. As new normalization tools are developed for diverse languages, evaluation methodologies remain fragmented, relying on Compression Ratio, downstream accuracy, or sequence-to-sequence prediction scores in isolation, failing to distinguish between beneficial vocabulary reduction and harmful semantic distortion. Moreover, text normalization underpins intelligent systems in high-stakes domains, including clinical decision support and legal document analysis, and principled evaluation methodology is essential. This paper proposes NormEval, a unified, multilingual evaluation framework comprising five complementary metrics: Compression Ratio (CR), Model Performance Delta (MPD), Information Retention Score (IRS), Algorithm Effectiveness Score (AES), and Average Normalized Levenshtein Distance (ANLD). These metrics assess normalization quality across three dimensions: macro-level efficiency, downstream utility, and micro-level morphological fidelity. The framework operationalizes a Safety Gate hypothesis: ANLD functions as an intrinsic structural hygiene check, utilizing character-level divergence ($Δ$) to reveal aggressive mutations that macro-level embeddings and downstream tasks mask. Comprehensive ablation experiments on both Bangla and English datasets show that all the components are indispensable, and that the removal of any individual metric leads to a decrease in at least one evaluation aspect, which ultimately results in misleading algorithm rankings.

2511.14460 2026-06-02 cs.CL 版本更新

Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning

Agent-R1:一个统一且模块化的智能体强化学习框架

Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Jie Ouyang, Yucong Luo, Yitong Zhou, Qi Liu, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室)

AI总结 提出Agent-R1框架,通过步骤级轨迹表示和灵活上下文管理,解决智能体强化学习中轨迹表示不匹配问题,支持多种优化策略。

Comments This paper serves as the technical report of the Agent-R1 project

详情
AI中文摘要

大型语言模型(LLMs)已从单轮文本生成器迅速演变为日益强大的智能体的基础。随着这些智能体承担更复杂的推理、决策、工具使用和长周期任务,强化学习(RL)在塑造其行为方面变得越来越重要。这种转变在智能体强化学习中尤为明显,其中模型必须与工具和环境进行多轮交互,而不是生成单个独立的响应。在这种模式下,将轨迹视为一个不断增长的令牌序列的通常观点变得越来越不充分:它使上下文演变变得僵化,并在展开和训练之间产生表示不匹配。本文提出了Agent-R1,一个统一且模块化的智能体强化学习框架,该框架围绕步骤级轨迹表示、灵活的上下文管理以及用于工作流、环境和优化的分层接口构建。关键思想是将每个交互步骤视为基本的强化学习转移,同时保持优化层灵活:一旦在步骤级别对交互进行建模,该框架就可以支持令牌级信用分配、步骤级信用分配或其他兼容的设计。这些设计选择使框架能够兼容多种优化策略,而不是将其绑定到单个算法上。这些组件共同为智能体强化学习提供了一个有原则、可扩展且可重用的基础。

英文摘要

Large language models (LLMs) have rapidly evolved from single-turn text generators into the foundation of increasingly capable agents. As these agents take on more complex reasoning, decision making, tool use, and long-horizon tasks, reinforcement learning (RL) is becoming increasingly important for shaping their behavior. This shift is especially visible in agentic RL, where models must interact with tools and environments across multiple rounds rather than produce a single standalone response. In this regime, the usual view of a trajectory as one ever-growing token sequence becomes increasingly inadequate: it makes context evolution rigid and creates representation mismatches between rollout and training. This paper presents Agent-R1, a unified and modular framework for agentic RL built around step-level trajectory representation, flexible context management, and layered interfaces for workflows, environments and optimization. The key idea is to treat each interaction step as the basic reinforcement-learning transition, while keeping the optimization layer flexible: once the interaction is modeled at the step level, the framework can support token-level credit assignment, step-level credit assignment, or other compatible designs. These design choices make the framework compatible with a range of optimization strategies rather than tying it to a single algorithm. Together, these components provide a principled, extensible, and reusable substrate for agentic RL.

2511.07910 2026-06-02 cs.CL 版本更新

Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning

最后一层逻辑到逻辑:赋予LLM逻辑一致的结构化知识推理能力

Songze Li, Zhiqiang Liu, Zhaoyan Gong, Xiaoke Guo, Zhongpu Bo, Zhengke Gui, Lei Liang, Huajun Chen, Wen Zhang

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) ZJU-Ant Group Joint Lab of Knowledge Graph(浙江大学-蚂蚁集团知识图谱联合实验室)

AI总结 针对LLM在结构化知识推理中的逻辑漂移问题,提出Logits-to-Logic框架,通过logits增强和过滤模块修正输出逻辑缺陷,在多个KGQA基准上取得最佳性能。

Comments EMNLP 2026 Submission

详情
AI中文摘要

大型语言模型(LLM)通过在大量非结构化文本上进行预训练,在自然语言推理任务中表现出色,使其能够理解自然语言中的逻辑并生成逻辑一致的响应。然而,非结构化知识与结构化知识之间的表示差异使得LLM天生难以维持逻辑一致性,导致在知识图谱问答(KGQA)等结构化知识推理任务中出现 extit{逻辑漂移}挑战。现有方法通过设计嵌入提示中的复杂工作流来指导LLM推理,以解决这一局限性。然而,这些方法仅提供输入级指导,未能从根本上解决LLM输出中的 extit{逻辑漂移}问题。此外,它们僵化的推理工作流无法适应不同的任务和知识图谱。为了增强LLM在结构化知识推理中的逻辑一致性,我们专门针对自回归生成过程中的logits输出。我们提出了 extit{Logits-to-Logic}框架,该框架将logits增强和logits过滤作为核心模块,以纠正LLM输出中的逻辑缺陷。大量实验表明,我们的方法显著提高了LLM在结构化知识推理中的逻辑一致性,并在多个KGQA基准上取得了最先进的性能。

英文摘要

Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text, enabling them to understand the logic in natural language and generate logic-consistent responses. However, the representational differences between unstructured and structured knowledge make LLMs inherently struggle to maintain logic consistency, leading to \textit{Logic Drift} challenges in structured knowledge reasoning tasks such as Knowledge Graph Question Answering (KGQA). Existing methods address this limitation by designing complex workflows embedded in prompts to guide LLM reasoning. Nevertheless, these approaches only provide input-level guidance and fail to fundamentally address the \textit{Logic Drift} in LLM outputs. Additionally, their inflexible reasoning workflows cannot adapt to different tasks and knowledge graphs. To enhance LLMs' logic consistency in structured knowledge reasoning, we specifically target the logits output from the autoregressive generation process. We propose the \textit{Logits-to-Logic} framework, which incorporates logits strengthening and logits filtering as core modules to correct logical defects in LLM outputs. Extensive experiments show that our approach significantly improves LLMs' logic consistency in structured knowledge reasoning and achieves state-of-the-art performance on multiple KGQA benchmarks.

2511.05913 2026-06-02 cs.CL cs.AI 版本更新

NILC: Discovering New Intents with LLM-assisted Clustering

NILC:利用LLM辅助聚类发现新意图

Hongtao Wang, Renchi Yang, Wenqing Lin

发表机构 * Hong Kong Baptist University(香港 Baptist 大学) JD.com(京东公司)

AI总结 提出NILC框架,通过迭代聚类结合大语言模型优化质心和文本嵌入,实现无监督和半监督新意图发现。

详情
AI中文摘要

新意图发现(NID)旨在从无标签的用户话语中识别新意图和已知意图,在实际对话系统中广泛应用。现有的NID工作主要采用级联架构,其中第一阶段专注于将话语编码为信息丰富的文本嵌入,而第二阶段通常通过K-Means将相似的嵌入聚类为簇(即意图)。然而,这种级联流程无法利用两个阶段的反馈进行相互优化,同时仅依赖嵌入的聚类忽略了细微的文本语义,导致性能次优。为弥补这一差距,本文提出NILC,一种专门针对有效NID的新型聚类框架。特别地,NILC遵循迭代工作流,通过借助大语言模型(LLM)精心优化不确定话语的聚类质心和文本嵌入,从而审慎地更新聚类分配。具体来说,NILC首先利用LLM为聚类创建额外的语义质心,从而丰富嵌入的欧几里得质心的上下文语义。此外,利用LLM通过重写增强从聚类中识别出的困难样本(模糊或简洁的话语),以便后续的聚类纠正。进一步,我们通过非平凡技术(软种子和软必须链接)注入监督信号,在半监督设置下实现更准确的NID。在无监督和半监督设置下,将NILC与多个近期基线进行大量实验比较,结果表明NILC在六个不同领域的基准数据集上一致地实现了显著的性能提升。

英文摘要

New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, which finds prevalent use in practical dialogue systems. Existing works towards NID mainly adopt a cascaded architecture, wherein the first stage focuses on encoding the utterances into informative text embeddings beforehand, while the latter is to group similar embeddings into clusters (i.e., intents), typically by K-Means. However, such a cascaded pipeline fails to leverage the feedback from both steps for mutual refinement, and, meanwhile, the embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance. To bridge this gap, this paper proposes NILC, a novel clustering framework specially catered for effective NID. Particularly, NILC follows an iterative workflow, in which clustering assignments are judiciously updated by carefully refining cluster centroids and text embeddings of uncertain utterances with the aid of large language models (LLMs). Specifically, NILC first taps into LLMs to create additional semantic centroids for clusters, thereby enriching the contextual semantics of the Euclidean centroids of embeddings. Moreover, LLMs are then harnessed to augment hard samples (ambiguous or terse utterances) identified from clusters via rewriting for subsequent cluster correction. Further, we inject supervision signals through non-trivial techniques seeding and soft must links for more accurate NID in the semi-supervised setting. Extensive experiments comparing NILC against multiple recent baselines under both unsupervised and semi-supervised settings showcase that NILC can achieve significant performance improvements over six benchmark datasets of diverse domains consistently.

2511.05650 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Optimizing Diversity and Quality through Base-Aligned Model Collaboration

通过基座对齐模型协作优化多样性与质量

Yichen Wang, Chenghao Yang, Tenghao Huang, Muhao Chen, Jonathan May, Mina Lee

发表机构 * University of Chicago(芝加哥大学) University of Southern California, Information Sciences Institute(南加州大学信息科学研究所) University of California, Davis(加州大学戴维斯分校)

AI总结 提出基座对齐模型协作框架(BACo),在推理时通过令牌级路由策略动态结合基座LLM与其对齐版本,以单次前向传递同时提升生成多样性和质量。

Comments ICML 2026. (47 pages, 22 figures)

详情
AI中文摘要

对齐极大地提升了大语言模型(LLM)的输出质量,但以牺牲多样性为代价,导致跨代生成高度相似的输出,尤其是在开放式生成任务中。我们提出基座对齐模型协作(BACo),一种推理时令牌级模型协作框架,动态结合基座LLM与其对齐版本,以优化多样性和质量。利用基于不确定性和内容的信号,BACo采用路由策略决定每个令牌从哪个模型解码。先前的多样性提升方法通常以质量下降为代价,或需要昂贵的解码或后训练。相比之下,BACo在单次前向传递中事后同时实现高多样性和高质量,同时提供强可控性。我们引入一系列有效的路由策略,并在三个开放式生成任务中使用13个多样性和质量指标进行评估。BACo持续超越最先进的推理时基线。使用我们最佳的路由器,BACo在多样性和质量上实现了21.3%的联合提升,这一结果进一步得到人工评估的支持。总体而言,我们的结果表明,基座模型与对齐模型之间的协作为优化多样性-质量权衡提供了一种有效且可控的机制。

英文摘要

Alignment has greatly improved large language models (LLMs)' output quality at the cost of diversity, yielding highly similar outputs across generations, especially in open-ended generation tasks. We propose Base-Aligned Model Collaboration (BACo), an inference-time token-level model collaboration framework that dynamically combines a base LLM with its aligned counterpart to optimize diversity and quality. Using uncertainty and content-based signals, BACo employs routing strategies to determine, at each token, which model to decode from. Prior diversity-promoting methods often improve diversity at the expense of quality or require expensive decoding or post-training. In contrast, BACo achieves both high diversity and quality post hoc within a single pass, while offering strong controllability. We introduce a family of effective routing strategies and evaluate them across three open-ended generation tasks with 13 diversity and quality metrics. BACo consistently surpasses state-of-the-art inference-time baselines. With our best router, BACo achieves a 21.3% joint improvement in diversity and quality, which is further supported by human evaluations. Overall, our results demonstrate that collaboration between base and aligned models provides an effective and controllable mechanism for optimizing the diversity-quality trade-off.

2506.22666 2026-06-02 cs.CR cs.CL cs.LG stat.ML 版本更新

VERA: Variational Inference Framework for Jailbreaking Large Language Models

VERA:用于越狱大型语言模型的变分推理框架

Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang

发表机构 * Department of Computer Science, Purdue University(计算机科学系,普渡大学)

AI总结 提出VERA框架,将黑盒越狱提示生成视为变分推理问题,训练小型攻击者LLM近似目标LLM的对抗提示后验,无需重新优化即可生成多样且流畅的越狱提示。

Comments Accepted by NeurIPS 2025

详情
AI中文摘要

仅通过API访问最先进LLM的兴起凸显了在现实环境中识别模型漏洞的有效黑盒越狱方法的需求。由于缺乏基于梯度的优化原则性目标,大多数现有方法依赖于遗传算法,这些算法受限于其初始化和对人工策划提示池的依赖。此外,这些方法需要对每个提示进行单独优化,未能提供模型漏洞的全面表征。为弥补这一差距,我们引入了VERA:用于越狱的变分推理框架。VERA将黑盒越狱提示生成视为变分推理问题,训练一个小型攻击者LLM来近似目标LLM在对抗提示上的后验。一旦训练完成,攻击者可以针对目标查询生成多样化、流畅的越狱提示,而无需重新优化。实验结果表明,VERA在一系列目标LLM上取得了强劲的性能,凸显了概率推理在对抗性提示生成中的价值。

英文摘要

The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM's posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.

2510.24870 2026-06-02 cs.CL cs.CV cs.IR 版本更新

Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

看穿MiRAGE:评估多模态检索增强生成

Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Human Language Technology Center of Excellence(人类语言技术卓越中心)

AI总结 提出MiRAGE框架,通过InfoF1和CiteF1指标评估多模态RAG的事实性和引用支持,并验证其与人工判断的一致性。

Comments https://github.com/alexmartin1722/mirage

详情
AI中文摘要

我们介绍了MiRAGE,一个用于从多模态源进行检索增强生成(RAG)的评估框架。随着视听媒体成为在线信息的普遍来源,RAG系统必须将这些来源的信息整合到生成中。然而,现有的RAG评估以文本为中心,限制了它们在多模态环境中的适用性。MiRAGE是一种以声明为中心的多模态RAG评估方法,包括评估事实性和信息覆盖率的InfoF1,以及评估引用支持和完整性的CiteF1。我们表明,当由人类应用时,MiRAGE与输出质量的外部判断高度一致。我们还介绍了MiRAGE的自动实现,以及三种著名的基于文本的RAG指标——ALCE、ARGUE和RAGAS——的多模态变体,展示了以文本为中心的工作的局限性,并为自动评估奠定了基础。我们发布开源实现,并概述了多模态RAG的评估方法。

英文摘要

We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal settings. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, which assesses factuality and information coverage, and CiteF1, which assesses citation support and completeness. We show that, when applied by humans, MiRAGE strongly aligns with extrinsic judgments of output quality. We additionally introduce an automatic implementation of MiRAGE as well as multimodal variants of three prominent text-based RAG metrics -- ALCE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline evaluation methods for multimodal RAG.

2510.24081 2026-06-02 cs.CL 版本更新

Global PIQA: Evaluating Commonsense Reasoning Across 100+ Languages and Cultures

Global PIQA:评估跨100多种语言和文化的常识推理

Tyler A. Chang, Catherine Arnett, Abdelrahman Sadallah, Abdelrahman Eldesokey, Abeer Kashar, Abolade Daud, Abosede Grace Olanihun, Adamu Labaran Mohammed, Adeyemi Praise, Adhikarimayum Meerajita Sharma, Aditi Gupta, Adril Putra Merin, Adwoa Bremang, Afitab Iyigun, Afonso Simplício, Ahmed Essouaied, Aicha Chorana, Akhil Eppa, Akintunde Oladipo, Akriti Kuri, Akshay Ramesh, Aleksei Dorkin, Alfred Malengo Kondoro, Alham Fikri Aji, Ali Eren Çetintaş, Allan Hanbury, Alou Dembele, Alp Niksarli, Álvaro Arroyo, Amin Bajand, Amol Khanna, Ana Chkhaidze, Ana Carolina Condez, Anamaria-Roberta Hartl, Andiswa Mkhonto, Andrew Hoblitzell, Andrew Tran, Angelos Poulis, Anirban Majumder, Anjali Chaudhary, Anna Vacalopoulou, Annette Kuuipolani Kanahele Wong, Annika Simonsen, Anton Kovalev, Anupam Nayak, Ashvanth S, Ayodeji Lana, Ayu Purwarianti, Bashar Alhafni, Benedict Busole, Bernard Ghanem, Bharti Nathani, Biljana Stojanovska Đurić, Blessing Ogundipe, Bolaotan Agbonile, Bragi Bergsson, Bruce Torres Fischer, Burak Tutar, Burcu Çınar, Cade Kane, Can Udomcharoenchaikit, Chadi Helwe, Chaithra Reddy Nerella, Chen Cecilia Liu, Chiamaka Nwokolo, Christopher Homan, Clément Sampebgo, Cristina España-Bonet, Cynthia Amol, Daeyoep Lee, Dan Saattrup Smart, Dana Arad, Daniil Dzenhaliou, Dasol Choi, David Liu, David Semedo, David Anugraha, Deborah Popoola, Deividas Mataciunas, Delphine Nyaboke, Dennis Owusu, Dhyuthy Krishna Kumar, Diogo Tavares, Diogo Glória-Silva, Divyanshu Goyal, DongGeon Lee, E. Kelly Buchanan, Ebele Nwamaka Anajemba, Egonu Ngozi Grace, Elena Mickel, Elias Herranen, Eliza Acharya, Eman Nisar, Emile Anand, Emmanuel Habumuremyi, Emuobonuvie Maria Ajiboye, Eryawan Presma Yulianrifat, Esther Adenuga, Ewa Rudnicka, Faith Itiola, Faran Taimoor Butt, Fareeha Fayyaz Sheikh, Fathima Thekkekara, Fatima Haouari, Faustin Nsengiyumva, Fenal Ashokbhai Ilasariya, Filbert Aurelian Tjiaranata, Firas Laakom, Francesca Grasso, Francesco Periti, Francesco Orabona, Gbenga Kayode Solomon, Genta Indra Winata, Gia Nghia Ngo, Gloria Udhedhe-oze, Gonçalo Vinagre, Gopi Naga Sai Ram Challagolla, Gorka Urbizu-Garmendia, Gouthami Vadithya, Guijin Son, Gulnaz Abdykadyrova, Gyan Swaroop Mohapatra, Hafeez Ullah, Hafsteinn Einarsson, Hai Hu, Hamidreza Saffari, Hamza Zaidi, Haopeng Zhang, Harethah Abu Shairah, Harry Vuong, Hele-Andra Kuulmets, Hitesh Laxmichand Patel, Houda Bouamor, Hwanjo Yu, Iben Nyholm Debess, İbrahim Ethem Deveci, Ikhlasul Akmal Hanif, Ikhyun Cho, Inês Vieira, Inês Calvo, Isaac Manzi, Ismael Illa Salifou, Ismail Daud, Ismail Yusuf, Itay Itzhak, Ivan Zhelyazkov, Ivan Belashkin, Ivan Spada, Jacob Brinton, Jafar Isbarov, Jaka Čibej, Jan Kocoń, Jan Cuhel, Jauza Krito, Jebish Purbey, Jennifer Za, Jennifer Mickel, Jenny Kunz, Jessica Ratovondranto, Jeyarajalingam Varsha, Jihae Jeong, Jimena Tena Dávalos, Jinu Lee, João Magalhães, John Seon Keun Yi, Jongin Kim, Joseph Chataignon, Joseph Marvin Imperial, Jubeerathan Thevakumar, Judith Land, Julia Alekseenko, Junchen Jiang, Jungwhan Kim, Kairit Sirts, Kamesh R, Kamesh V, Kanda Tshinu, Kätriin Kukk, Kaustubh Ponkshe, Kavsar Huseynova, Ke He, Kenneth Enevoldsen, Kent Joshua Alvarez, Kerem Zaman, Khalil Mrini, Kian Kyars, Komal Gour, Krishnakumar Lainitha, Krister Kruusmaa, Kunal Mukherjee, Kusum Chouhan, Laura Castro, Laura M. Porrino-Moscoso, Lenny Sivi Za Nzambi, Leshem Choshen, Levent Sencan, Lilja Øvrelid, Lisa Alazraki, Loretta Oma Jones, Lovina Ehimen-Ugbede, Luheerathan Thevakumar, Luxshan Thavarasa, Mahnoor Malik, Mamadou K. Keita, Mansi Jangid, Marco De Santis, Marcos Garcia, Marek Šuppa, Mariam D'Ciofalo, Marii Ojastu, Marium Attaullah, Maryam Sikander, Mausami Narayan, Maximos Skandalis, Mehak Mehak, Mehmet İlteriş Bozkurt, Melaku Bayu, Menan Velayuthan, Mhasilenuo Vizo, Michael Leventhal, Michał Marcińczuk, Mina Almasi, Mirna Potočnjak, Mithil Bangera, Mohammadamin Shafiei, Mohiba Ansari, Mridul Sharma, Mrityunjaya Indoria, Mughees Ur Rehman, Muhammad Ravi Shulthan Habibi, Murat Kolić, Murat Barkın Kınay, Nada Galant, Naina Singh Rathore, Naphat Permpredanun, Narada Maugin, Nathalie Norman, Nicholas Kluge Corrêa, Nikola Ljubešić, Nirmal Thomas, Nisansa de Silva, Nisheeth Joshi, Nitish Ponkshe, Nizar Habash, Nneoma Udeze, Noel Thomas, Noémi Ligeti-Nagy, Nouhoum Coulibaly, Odunayo Ogundepo, Odunayo Kareemat Buliaminu, Oghojafor Godswill Fejiro, Okechukwu God'spraise, Olanrewaju Samuel, Olaoye Deborah Oluwaseun, Olasoji Akindejoye, Olga Snissarenko, Onyinye Anulika Chiemezie, Orkun Kınay, Osman Tursun, Oyelade Oluwafemi Joshua, Oyesanmi Fiyinfoluwa, Pablo Rodríguez, Pablo Gamallo, Palak Arora, Pedro Valente, Peter Rupnik, Philip Oghenesuowho Ekiugbo, Prakhar Agarwal, Pramit Sahoo, Prokopis Prokopidis, Pua Niau-Puhipau, Quadri Yahya, Rachele Mignone, Raghav Singhal, Rahul Raja, Ram Mohan Rao Kadiyala, Raphael Merx, Rasmus Larsen, Ratnavel Rajalakshmi, Rishav Ghosh, Romina Oji, Ron Kekeha Solis, Rui Guerra, Rushikesh Zawar, Sa'ad Nasir Bashir, Saeed Alzaabi, Sahil Sandeep, Sai Pavan Batchu, Sai Sandeep Kantareddy, Saleha Muzammil, Salsabila Zahirah Pranida, Sam Buchanan, Samuel Rutunda, Sander Land, Sarah Sulollari, Sardar Ali, Saroj Sapkota, Sarveswaran Kengatharaiyer, Saulius Tautvaisas, Sayambhu Sen, Sayantani Banerjee, Sebastien Diarra, Segun Afolayan, Senthilnathan M, Sewoong Lee, Shaan Shah, Shankar Venkitachalam, Sharifa Djurabaeva, Sharon Ibejih, Shivanya Shomir Dutta, Siddhant Gupta, Silvia Paniagua Suárez, Sina Ahmadi, Sivasuthan Sukumar, Siyuan Song, Snegha A, Sokratis Sofianopoulos, Sona Elza Simon, Sonja Benčina, Sophie Gvasalia, Sphurti More, Spyros Dragazis, Stefan Milosavljević, Stephan P. Kaufhold, Suba S, Sultan Alrashed, Surangika Ranathunga, Taiga Someya, Taja Kuzman Pungeršek, Tal Haklay, Tasi'u Jibril, Tatsuya Aoyama, Tea Abashidze, Terenz Jomar Dela Cruz, Terra Blevins, Themistoklis Nikas, Theresa Idoko, Thu Mai Do, Tilek Chubakov, Tina Munda, Tobiloba Owoeye, Tommaso Gargiani, Uma Rathore, Uni Johannesen, Uwuma Ugwu, Vallerie Alexandra Putra, Vanya Bannihatti Kumar, Varvara Arzt, Vasily Konovalov, Vasudevan Nedumpozhimana, Viktoria Ondrejova, Viktoryia Horbik, Vishnu Vardhan Reddy Kummitha, Vuk Dinić, Walelign Sewunetie, Winston Wu, Xiaojing Zhao, Yacouba Diarra, Yaniv Nikankin, Yash Mathur, Yash Bagla, Yeshil Bangera, Yixi Chen, Yiyuan Li, Yolanda Xavier, Yonatan Belinkov, Zaid Alyafeai, Zhargal Batozargalova, Zhengyang Shan, Zhi Rui Tam, Zilu Tang, Zuzana Nadova, Baber Abbasi, Stella Biderman, David Stap, Duygu Ataman, Fabian Schmidt, Hila Gonen, Jiayi Wang, David Ifeoluwa Adelani

发表机构 * th Multilingual Representation Learning (MRL) Workshop(第五届多语言表示学习(MRL)研讨会)

AI总结 本文提出Global PIQA,一个由全球350多位研究人员手工构建的、覆盖100多种语言和文化的参与式常识推理基准,用于评估大语言模型在不同语言和文化中的表现。

Comments Preprint

详情
AI中文摘要

迄今为止,几乎没有针对大语言模型(LLM)的、覆盖大量语言和文化的特定文化评估基准。在本文中,我们提出了Global PIQA,一个针对100多种语言的参与式常识推理基准,由来自全球65个国家的350多位研究人员手工构建。Global PIQA中的141种语言变体覆盖五大洲、19个语系和24种书写系统。在Global PIQA的非平行分割中,超过50%的示例涉及当地食物、习俗、传统或其他特定文化元素。在平行分割中,我们将更多“文化无关”的常识推理问题翻译成131种语言变体,以进行直接的跨语言比较。在两个分割中,所有示例均已由母语者验证。我们发现,最先进的LLM在Global PIQA上整体表现良好,但在低资源语言中表现较弱(例如,平行分割中语言间的准确率差距高达68%)。Global PIQA强调,在许多语言和文化中,日常知识仍然是LLM需要改进的领域,此外还有更广泛讨论的能力,如复杂推理和专家知识。除了用于LLM评估,Global PIQA还让我们一窥人类语言所嵌入的广泛文化多样性。

英文摘要

To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by over 350 researchers from over 65 countries around the world. The 141 language varieties in Global PIQA cover five continents, 19 language families, and 24 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. In the parallel split, we translate more "culturally agnostic" commonsense reasoning questions into 131 language varieties, for direct cross-lingual comparisons. In both splits, all examples have been verified by native speakers of the languages. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (e.g. up to a 68% accuracy gap between languages in the parallel split). Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement in LLMs, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.

2510.21372 2026-06-02 cs.CL 版本更新

HalleluBERT: Let Every Token That Has Meaning Bear Its Weight

HalleluBERT: 让每个有意义的词都承载其权重

Raphael Schmitt

发表机构 * Raphael Schmitt(拉斐尔·施米特)

AI总结 提出基于RoBERTa的希伯来语编码器家族HalleluBERT,在49.1 GB去重希伯来语网络文本和维基百科上从头训练,使用希伯来语特定字节级BPE词汇,在命名实体识别和情感分类基准上优于单语和多语基线。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 3022-3030, 2026
AI中文摘要

基于Transformer的模型推动了NLP的发展,但希伯来语仍然缺乏一个经过大规模训练并以base和large两种变体发布的RoBERTa编码器。我们提出HalleluBERT,一个基于RoBERTa的编码器家族,在49.1 GB去重的希伯来语网络文本和维基百科上从头训练,使用希伯来语特定的字节级BPE词汇。在希伯来语命名实体识别(BMC, NEMO)和情感分类(SMCD)的本地基准上,HalleluBERT优于单语和多语基线,并在三个基准上取得了最高的未加权平均分。我们在MIT许可下发布模型权重和分词器,以支持可复现的希伯来语NLP研究。

英文摘要

Transformer-based models have advanced NLP, yet Hebrew still lacks a RoBERTa encoder that is trained at scale and released in both base and large variants. We present HalleluBERT, a RoBERTa-based encoder family trained from scratch on 49.1~GB of deduplicated Hebrew web text and Wikipedia using a Hebrew-specific byte-level BPE vocabulary. On native Hebrew benchmarks for named entity recognition (BMC, NEMO) and sentiment classification (SMCD), HalleluBERT outperforms monolingual and multilingual baselines, and yields the highest unweighted mean score across the three benchmarks. We release model weights and tokenizer under the MIT license to support reproducible Hebrew NLP research.

2510.21364 2026-06-02 cs.CL 版本更新

SindBERT, the Sailor: Charting the Seas of Turkish NLP

SindBERT,水手:绘制土耳其语NLP的海洋

Raphael Schmitt, Stefan Schweter

发表机构 * School of Computation, Information and Technology, Technical University of Munich(慕尼黑技术大学计算、信息与技术学院) Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg(弗赖堡大学医学系一般实践研究所) Independent Researcher(独立研究者)

AI总结 针对形态丰富语言预训练不足的问题,提出首个大规模土耳其语RoBERTa编码器SindBERT,在多个任务上表现有竞争力,并揭示了基准饱和与语料质量的重要性。

Comments Published at SIGTURK 2026, co-located with EACL 2026

详情
Journal ref
Proceedings of the Second Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026), pp. 1-13, 2026
AI中文摘要

Transformer模型彻底改变了NLP,但许多形态丰富的语言在大规模预训练中仍然代表性不足。通过SindBERT,我们着手绘制土耳其语NLP的海洋,为土耳其语提供了首个大规模基于RoBERTa的编码器。SindBERT在312 GB土耳其语文本(mC4、OSCAR23、Wikipedia)上从头训练,以base和large两种配置发布,是土耳其语可用的首个大规模仅编码器语言模型。我们在词性标注、命名实体识别、攻击性语言检测和TurBLiMP语言可接受性基准上评估SindBERT。结果表明,SindBERT与现有的土耳其语和多语言模型相比具有竞争力,其中large变体在四个任务中的两个上取得了最佳分数,但总体上没有一致的扩展优势。这种平坦的扩展趋势,在XLM-R和EuroBERT中也观察到,表明当前的土耳其语基准可能已经饱和。同时,与更小但更精心策划的模型(如BERTurk)的比较表明,语料库质量和多样性可以超过纯粹的数据量。总之,SindBERT既作为土耳其语NLP的公开资源发布,也作为关于扩展限制和语料库组成在形态丰富语言中核心作用的实证案例研究。SindBERT模型在MIT许可下发布,并提供fairseq和Huggingface格式。

英文摘要

Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312~GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both as an openly released resource for Turkish NLP and as an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Huggingface formats.

2510.17532 2026-06-02 cs.CL cs.LG 版本更新

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

OncoReason: 在大语言模型中构建临床推理以实现稳健且可解释的生存预测

Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska

AI总结 提出统一多任务学习框架,通过监督微调、思维链提示和强化学习三种对齐策略,使自回归大语言模型在MSK-CHORD数据集上联合进行二元生存分类、连续生存时间回归和自然语言理由生成,实现可解释的生存预测。

Comments This manuscript is withdrawn to allow careful review and correction of bibliographic issues identified after submission, including references that could not be adequately verified. These matters should be resolved before further circulation

详情
AI中文摘要

预测癌症治疗结果需要模型既准确又可解释,尤其是在存在异质性临床数据的情况下。虽然大语言模型(LLM)在生物医学自然语言处理中表现出强大的性能,但它们通常缺乏在高风险决策支持中至关重要的结构化推理能力。我们提出了一个统一的多任务学习框架,将自回归LLM与临床推理对齐,用于MSK-CHORD数据集上的结果预测。我们的模型被训练来联合执行二元生存分类、连续生存时间回归和自然语言理由生成。我们评估了三种对齐策略:(1)标准监督微调(SFT),(2)带有思维链(CoT)提示的SFT,以引发逐步推理,以及(3)组相对策略优化(GRPO),一种强化学习方法,将模型输出与专家推导的推理轨迹对齐。使用LLaMa3-8B和Med42-8B骨干网络的实验表明,CoT提示将F1提高了+6.0,并将MAE降低了12%,而GRPO在BLEU、ROUGE和BERTScore上实现了最先进的可解释性和预测性能。我们进一步表明,现有的生物医学LLM由于架构限制通常无法产生有效的推理轨迹。我们的发现强调了在多任务临床建模中推理感知对齐的重要性,并为精准肿瘤学中可解释、可信赖的LLM设立了新的基准。

英文摘要

Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.

2505.22961 2026-06-02 cs.CL cs.LG 版本更新

ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

ToMAP: 利用心智理论训练对手感知的大语言模型说服者

Peixuan Han, Zijia Liu, Jiaxuan You

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出ToMAP方法,通过两个心智理论模块增强大语言模型对对手心理状态的感知与分析,结合强化学习生成更有效、多样化的说服性论据,在3B参数下超越GPT-4o等大模型。

详情
AI中文摘要

大语言模型(LLMs)在说服方面显示出有前景的潜力,但现有关于训练LLM说服者的工作仍处于初步阶段。值得注意的是,虽然人类擅长主动且动态地建模对手的想法和观点,但当前的LLMs在这种心智理论(ToM)推理上存在困难,导致多样性和对手意识有限。为解决这一局限,我们引入了心智理论增强说服者(ToMAP),这是一种通过整合两个心智理论模块来构建更灵活的说服者智能体的新方法,这些模块增强了说服者对对手心理状态的感知和分析。具体来说,我们首先提示说服者考虑对目标中心主张的可能反对意见,然后使用文本编码器配合训练好的MLP分类器来预测对手对这些反主张的当前立场。我们精心设计的强化学习框架使说服者学会如何分析对手相关信息并利用它来生成更有效的论据。实验表明,ToMAP说服者仅包含3B参数,却在多个被说服者模型和多样化语料库上以39.4%的相对增益优于GPT-4o等更大的基线模型。值得注意的是,ToMAP在训练过程中展现出复杂的推理链和减少的重复性,从而产生更多样化和有效的论据。ToMAP的对手感知特性也使其适用于长对话,并能采用更具逻辑性和对手感知的策略。这些结果强调了我们方法的有效性,并突出了其在开发更具说服力的语言智能体方面的潜力。代码可在 https://github.com/ulab-uiuc/ToMAP 获取。

英文摘要

Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader learns how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: https://github.com/ulab-uiuc/ToMAP.

2510.00615 2026-06-02 cs.AI cs.CL 版本更新

ACON: Optimizing Context Compression for Long-horizon LLM Agents

ACON:面向长周期LLM智能体的上下文压缩优化

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan

发表机构 * University of Washington(华盛顿大学)

AI总结 提出ACON框架,通过自然语言空间优化压缩策略,在不微调模型的情况下减少峰值token使用量26-54%并提升任务成功率,同时可蒸馏至小模型。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为动态真实环境中的智能体,其成功依赖于对动作和观察的精确记录。然而,长周期智能体任务中无限制的上下文增长导致两个关键瓶颈:高昂的推理内存成本以及因无关信息导致的推理退化。现有压缩方法未能完全解决这一问题,通常依赖脆弱的启发式规则或需要对专有或大规模LLM进行不切实际的参数更新。我们引入了智能体上下文优化(ACON),这是一个统一框架,可将观察和历史记录最优地压缩为简洁、信息丰富的表示。与先前工作不同,ACON采用自然语言空间中的优化:它基于智能体的失败分析迭代地细化压缩指南,在无需模型微调的情况下保留关键状态信息。为了进一步最小化计算开销,我们将优化后的压缩器蒸馏到更小的模型中。在AppWorld、OfficeBench和Multi-objective QA上的实验表明,与现有压缩基线相比,ACON将峰值token使用量减少了26-54%,同时提高了任务成功率。值得注意的是,它使较小的LM能够有效地作为长周期智能体运行,通过减轻上下文干扰实现了高达46%的性能提升。我们的代码可在https://github.com/microsoft/acon获取。

英文摘要

Large language models (LLMs) are increasingly deployed as agents in dynamic real-world environments, where success depends on maintaining precise records of actions and observations. However, the resulting unbounded context growth in long-horizon agentic tasks makes two critical bottlenecks: prohibitive inference memory costs and reasoning degradation due to irrelevant information. Existing compression methods fail to fully address this, often relying on brittle heuristics or requiring parameter updates impractical for proprietary or large-scale LLMs. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both observations and history into concise, informative representations. Distinct from prior works, ACON employs an optimization in natural language space: it iteratively refines compression guidelines based on failure analysis of the agent, ensuring critical state information is preserved without model fine-tuning. To further minimize computational overhead, we distill the optimized compressor into smaller models. Experiments on AppWorld, OfficeBench, and Multi-objective QA demonstrate that ACON reduces peak token usage by 26-54% while improving task success over existing compression baselines. Notably, it enables smaller LMs to function effectively as long-horizon agents, achieving up to 46% performance improvement by mitigating context distraction. Our code is available at https://github.com/microsoft/acon.

2510.11713 2026-06-02 cs.CL cs.LG 版本更新

Are Large Reasoning Models Interruptible?

大型推理模型是否可中断?

Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究大型推理模型在中断和动态上下文两种现实动态场景下的鲁棒性,发现静态评估高估了鲁棒性,并揭示了推理泄漏、恐慌和自我怀疑等新故障模式。

Comments ICML 2026; Project Page: http://dynamic-lm.github.io

详情
AI中文摘要

大型推理模型(LRM)的实际应用通常需要对变化的提示或环境进行推理。在这项工作中,我们挑战冻结世界假设,并在两种现实的动态场景下评估LRM的鲁棒性:中断(测试模型在预算受限输出下的响应准确性)和动态上下文(测试模型对飞行中变化的适应能力)。在需要长形式推理的数学和编程基准测试中,静态评估始终高估鲁棒性:即使在静态设置中达到高准确率的最先进的LRM,在中断或暴露于变化上下文时也可能不可预测地失败,当更新在推理过程后期引入时,性能下降高达60%。我们的分析进一步揭示了多种新的故障模式,包括推理泄漏(模型在中断时将推理折叠到最终答案中)、恐慌(在时间压力下模型完全放弃推理并返回错误答案)以及自我怀疑(在尝试整合更新信息时性能下降)。项目页面:http://dynamic-lm.github.io/

英文摘要

Real-world applications of Large Reasoning Models (LRMs) often require reasoning about changing prompts or environments. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the accuracy of model responses under budget-constrained outputs, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades when trying to incorporate updated information. Project Page: http://dynamic-lm.github.io/

2509.06948 2026-06-02 cs.CL 版本更新

Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

超越两阶段训练:用于大语言模型推理的协作式SFT与RL

Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出BRIDGE框架,通过选择性知识迁移使SFT辅助RL训练,在数学推理等任务上优于两阶段和单阶段混合方法。

Comments ICML 2026

详情
AI中文摘要

监督微调(SFT)和可验证奖励强化学习(RLVR)是两种广泛使用的后训练范式,用于提升大语言模型(LLMs)的推理能力。近期方法尝试通过重新加权或调度目标将SFT和RLVR整合到单一阶段。然而,这种耦合可能适得其反,因为监督更新并非对奖励优化均匀有益。为解决此问题,我们提出BRIDGE,一个可扩展框架,其中SFT通过选择性迁移有助于奖励优化的知识来学习监督RL。具体地,BRIDGE在每个元训练步骤交替两种更新:基础模型更新融合SFT和RL梯度,以及轻量级低秩适配器(LoRA)的更新,通过最大化协作增益信号(定义为联合SFT-RL训练相对于仅RL基线的奖励)来协调两个目标。在五个数学推理基准上,BRIDGE持续优于两阶段冷启动、朴素混合和代表性单阶段整合基线,平均绝对提升超过三个点,且训练动态更稳定。我们进一步展示BRIDGE可扩展到逻辑推理,并在无需额外训练的情况下泛化到分布外的代码和科学领域,同时在噪声奖励下保持鲁棒性。

英文摘要

Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are two widely used post-training paradigms for improving the reasoning ability of large language models (LLMs). Recent methods attempt to integrate SFT and RLVR in a single stage by reweighting or scheduling their objectives. However, such coupling can be counterproductive because supervised updates are not uniformly beneficial for reward optimization. To address this, we propose BRIDGE, a scalable framework in which SFT learns to supervise RL by selectively transferring knowledge that improves reward optimization. Specifically, BRIDGE alternates two updates at each meta-training step: a base-model update that fuses the SFT and RL gradients, and an update to a lightweight low-rank adapter (LoRA) that coordinates the two objectives by maximizing a cooperative-gain signal, defined as the reward of joint SFT-RL training over an RL-only baseline. Across five mathematical reasoning benchmarks, BRIDGE consistently outperforms two-stage cold start, naive mixing, and representative single-stage integration baselines, yielding over three points average absolute improvement and more stable training dynamics. We further show that BRIDGE extends to logical reasoning and generalizes out-of-distribution to code and science without additional training, while staying robust under noisy rewards.

2510.10943 2026-06-02 cs.MA cs.CL 版本更新

The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems

智能的社会成本:多智能体系统中刻板偏见的涌现、传播与放大

Thi-Nhung Nguyen, Linhao Luo, Amardeep Kaur, Rollin Omari, Tamas Abraham, Junae Kim, Thuy-Trang Vu, Dinh Phung

发表机构 * Department of Data Science & AI, Monash University(数据科学与人工智能系,莫纳什大学) Defence Science and Technology Group, Australia(澳大利亚国防科学与技术集团)

AI总结 本研究提出一个评估框架,通过三个智能体级指标量化多智能体系统中偏见的涌现、传播和放大,发现通信可触发高达70%的新偏见、传播至80%以上智能体并放大3倍以上刻板印象,且密集竞争性通信增加偏见,系统易受简单偏见注入攻击。

详情
AI中文摘要

大型语言模型(LLM)中的偏见仍然是一个持续存在的挑战,常常导致跨社会群体的刻板印象和不公平对待。虽然先前的工作主要关注单个LLM,但多智能体系统(MAS)的出现——其中多个LLM协作和通信——引入了偏见如何涌现、传播和放大的新的且未被充分探索的动态。为了系统地研究这些动态,我们提出了一个简单的评估框架,包含三个智能体级指标,用于量化多智能体交互过程中偏见的涌现、传播和放大。我们在不同的LLM骨干网络、社会群体配置、通信行为和对抗性设置下,对三个偏见基准评估了MAS。我们的结果表明,通信可以触发高达70%的新偏见涌现,将偏见传播到超过80%的智能体,并将刻板印象放大3倍以上。我们进一步发现,更密集和竞争性的通信通常会增加偏见。最后,我们证明了MAS极易受到简单的偏见注入攻击,而现有的防御策略只能提供有限的保护。我们的发现为多智能体LLM系统的公平性和鲁棒性提供了重要见解。

英文摘要

Bias in large language models (LLMs) remains a persistent challenge, often leading to stereotyping and unfair treatment across social groups. While prior work has mainly focused on individual LLMs, the emergence of multi-agent systems (MAS), where multiple LLMs collaborate and communicate, introduces new and underexplored dynamics in how bias emerges, propagates, and amplifies. To systematically investigate these dynamics, we propose a simple evaluation framework with three agent-level metrics that quantify bias emergence, propagation, and amplification throughout multi-agent interaction. We evaluate MAS across three bias benchmarks under varying LLM backbones, social-group configurations, communication behaviors, and adversarial settings. Our results show that communication can trigger up to 70\% new bias emergence, propagate bias across over 80\% of agents, and amplify stereotypes by more than 3$\times$. We further find that denser and competitive communication generally increases bias. Finally, we demonstrate that MAS are highly vulnerable to simple bias injection attacks, and existing defense strategies provide only limited protection. Our findings provide important insights into the fairness and robustness of multi-agent LLM systems.

2510.10676 2026-06-02 cs.AR cs.CL cs.RO eess.AS 版本更新

Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation

Bhasha-Rupantarika: 面向多语言神经机器翻译的算法-硬件协同设计方法

Mukul Lokhande, Tanushree Dewangan, Mohd Sharik Mansoori, Tejas Chaudhari, Akarsh J., Damayanti Lokhande, Adam Teman, Santosh Kumar Vishvakarma

发表机构 * Special Manpower Development Program for Chip to Start-Up (SMDP-C2S)(芯片到初创企业专项人才发展计划(SMDP-C2S)) Ministry of Electronics and Information Technology (MeitY)(电子与信息技术部(MeitY))

AI总结 提出一种通过算法-硬件协同设计实现的轻量高效多语言翻译系统Bhasha-Rupantarika,采用亚字节精度量化(FP8/INT8/INT4/FP4)在FPGA上实现模型大小减少4.1倍、推理速度提升4.2倍,为资源受限环境下的多语言AI部署提供可行方案。

详情
Journal ref
International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 2026
AI中文摘要

本文介绍了Bhasha-Rupantarika,一个通过算法-硬件协同设计为资源受限环境量身定制的轻量高效多语言翻译系统。该方法研究了亚字节精度级别(FP8、INT8、INT4和FP4)的模型部署,实验结果表明模型大小减少4.1倍(FP4),推理速度提升4.2倍,对应吞吐量提高至66 tokens/s(提升4.8倍)。这凸显了超低精度量化对于使用FPGA加速器的物联网设备实时部署的重要性,实现了与预期相当的性能。我们的评估涵盖了印度语言和国际语言之间的双向翻译,展示了其在低资源语言环境中的适应性。FPGA部署显示LUT减少1.96倍,FF减少1.65倍,与OPU相比吞吐量提升2.2倍,与HPTA相比提升4.6倍。总体而言,该评估提供了一种基于量化感知翻译且兼顾硬件效率的可行解决方案,适用于可部署的多语言AI系统。完整的代码[https://github.com/mukullokhande99/Bhasha-Rupantarika/]和可复现数据集已公开,便于研究人员快速集成和进一步开发。

英文摘要

This paper introduces Bhasha-Rupantarika, a light and efficient multilingual translation system tailored through algorithm-hardware codesign for resource-limited settings. The method investigates model deployment at sub-octet precision levels (FP8, INT8, INT4, and FP4), with experimental results indicating a 4.1x reduction in model size (FP4) and a 4.2x speedup in inference speed, which correlates with an increased throughput of 66 tokens/s (improvement by 4.8x). This underscores the importance of ultra-low precision quantization for real-time deployment in IoT devices using FPGA accelerators, achieving performance on par with expectations. Our evaluation covers bidirectional translation between Indian and international languages, showcasing its adaptability in low-resource linguistic contexts. The FPGA deployment demonstrated a 1.96x reduction in LUTs and a 1.65x decrease in FFs, resulting in a 2.2x enhancement in throughput compared to OPU and a 4.6x enhancement compared to HPTA. Overall, the evaluation provides a viable solution based on quantisation-aware translation along with hardware efficiency suitable for deployable multilingual AI systems. The entire codes [https://github.com/mukullokhande99/Bhasha-Rupantarika/] and dataset for reproducibility are publicly available, facilitating rapid integration and further development by researchers.

2510.09608 2026-06-02 cs.CV cs.AI cs.CL 版本更新

StreamingVLM: Real-Time Understanding for Infinite Video Streams

StreamingVLM:无限视频流的实时理解

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Yao Lu, Song Han

发表机构 * MIT(麻省理工学院) NVIDIA(英伟达)

AI总结 提出StreamingVLM,通过统一训练与流推理的框架,利用注意力汇点状态复用和滑动窗口机制实现无限视频流的实时稳定理解,在Inf-Streams-Eval基准上以8 FPS速度达到66.18%胜率,并提升通用VQA能力。

Comments Published as a conference paper at ICLR 2026. The first two authors contributed equally to this work

详情
AI中文摘要

视觉语言模型(VLM)可以为实时助手和自主代理提供动力,但它们面临一个关键挑战:理解近乎无限的视频流而不增加延迟和内存使用。对整个视频进行全注意力处理会导致二次计算成本和在长视频上性能不佳。同时,简单的滑动窗口方法也存在缺陷,它们要么破坏连贯性,要么由于冗余重计算而遭受高延迟。在本文中,我们介绍了StreamingVLM,一种专为实时、稳定理解无限视觉输入而设计的模型。我们的方法是一个统一框架,将训练与流推理对齐。在推理过程中,我们通过重用注意力汇点状态、最近视觉令牌的短窗口和最近文本令牌的长窗口来维护一个紧凑的KV缓存。这种流式能力通过一个简单的监督微调(SFT)策略灌输,该策略在短的重叠视频块上应用全注意力,有效地模拟了推理时的注意力模式,而无需在过长的上下文中进行训练。为了评估,我们构建了Inf-Streams-Eval,一个新的基准,包含平均超过两小时的视频,需要帧与文本之间的密集、每秒对齐。在Inf-Streams-Eval上,StreamingVLM对GPT-4O mini实现了66.18%的胜率,并在单个NVIDIA H100上以高达8 FPS的速度保持稳定、实时的性能。值得注意的是,我们的SFT策略还增强了通用的VQA能力,无需任何VQA特定的微调,在LongVideoBench上提高了+4.30,在OVOBench Realtime上提高了+5.96。代码可在https://github.com/mit-han-lab/streaming-vlm获取。

英文摘要

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.

2510.08825 2026-06-02 cs.CL 版本更新

Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

图上搜索:面向知识图谱的大语言模型推理的迭代式知情导航

Jia Ao Sun, Hao Yu, Fabrizio Gotti, Fengran Mo, Yihong Wu, Yuchen Hui, Zhan Su, Lingfeng Xiao, Jian-Yun Nie

发表机构 * Université de Montréal(蒙特利尔大学) McGill University(麦吉尔大学) Digital Media, CBC(数字媒体,CBC) Halmstad University College(哈姆斯塔德大学学院) University of Waterloo(滑铁卢大学)

AI总结 提出Search-on-Graph方法,通过让大语言模型自身基于知识图谱结构和完整推理历史选择关系路径,遵循观察-思考-导航范式,在六个KGQA基准上超越现有方法,无需任务特定微调。

Comments Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining)

详情
AI中文摘要

大语言模型(LLMs)结合知识图谱(KGs)为知识密集型推理提供了一种有前景的方法。该方法的核心是在KG中选择合适的推理路径。然而,现有方法面临一个共同限制:推理路径选择通常由独立模块使用与推理需求仅弱相关的标准进行,这往往导致选择错误的关系或过早剪枝相关路径。我们提出Search-on-Graph(SoG),一种通过让LLM自身选择要遵循的关系来加强路径选择与推理之间联系的方法,该选择基于可用的KG结构和完整的推理历史。SoG遵循“观察-思考-导航”范式:每一步,LLM观察当前实体可用的关系连接,思考哪条路径最有助于回答问题,并相应地进行导航。这种上下文感知的导航充分利用了LLM的推理能力,而不是依赖具有替代标准的独立选择模块。在六个知识图谱问答(KGQA)基准上的实验表明,SoG优于最先进的方法,同时无需任务特定的微调,并能泛化到不同的KG模式。

英文摘要

Large language models (LLMs) augmented with knowledge graphs (KGs) offer a promising approach for knowledge-intensive reasoning. Central to this approach is the selection of appropriate reasoning paths in the KG. Yet, existing methods face a common limitation: reasoning path selection is often performed by separate modules using criteria that are only weakly connected to the reasoning requirements. This often results in selecting incorrect relations or premature pruning of relevant paths. We propose Search-on-Graph (SoG), a method that strengthens the connection between path selection and reasoning by having the LLM itself select which relations to follow, informed by both the available KG structure and the complete reasoning history. SoG follows an \textit{observe-think-navigate} paradigm: at each step, the LLM observes the relational connections available at the current entity, reasons about which path best advances toward answering the question, and navigates accordingly. This context-aware navigation fully exploits the LLM's reasoning capabilities rather than relying on independent selection modules with surrogate criteria. Experiments on six knowledge graph question answering (KGQA) benchmarks demonstrate that SoG outperforms state-of-the-art methods while requiring no task-specific fine-tuning and generalizing across different KG schemas.

2506.11083 2026-06-02 cs.CL 版本更新

RedDebate: Safer Responses Through Multi-Agent Red Teaming Debates

RedDebate:通过多智能体红队辩论实现更安全的响应

Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu

发表机构 * Department of Electrical Computer Engineering \& Ingenuity Labs Research Institute, Queen's University, Kingston, Canada

AI总结 提出RedDebate框架,利用多智能体辩论和长期记忆模块,通过全自动红队测试减少大语言模型的不安全输出。

详情
AI中文摘要

我们引入了RedDebate,一种新颖的多智能体辩论框架,为大语言模型(LLMs)识别和减轻其不安全行为提供了基础。人工智能安全方法通常依赖于昂贵的人工评估或孤立的单模型评估,两者都受限于可扩展性且容易发生监督失误。RedDebate在多种辩论场景中利用多个LLMs之间的协作论证,使它们能够批判性地评估彼此的推理,并通过全自动红队测试系统地发现不安全故障模式。为此,我们提出设计不同的长期记忆模块,从辩论交互中保留安全相关的见解,并在后续推理中利用它们,促进模型行为的持续改进。在多种模型的安全基准上的实证评估表明,RedDebate显著减少了不安全输出。虽然仅辩论就能让LLMs改进其行为,但添加记忆进一步减少了错误。据我们所知,RedDebate是第一个完全自动化的框架,统一了多智能体辩论和红队测试,以在没有人工干预的情况下逐步增强LLM安全性。

英文摘要

We introduce RedDebate, a novel multi-agent debate framework that provides the foundation for Large Language Models (LLMs) to identify and mitigate their unsafe behaviours. AI safety approaches often rely on costly human evaluation or isolated single-model assessment, both constrained by scalability and prone to oversight failures. RedDebate employs collaborative argumentation among multiple LLMs across diverse debate scenarios, enabling them to critically evaluate one another's reasoning and systematically uncover unsafe failure modes through fully automated red-teaming. To support this, we propose designing distinct long-term memory modules that preserve safety-relevant insights from debate interactions and leverage them during subsequent inference, facilitating continuous refinement of model behaviour. Empirical evaluation on safety benchmarks across a diverse set of models demonstrates that RedDebate substantially reduces unsafe outputs. While debate alone allows LLMs to refine their behaviour, the addition of memory yields further error reductions. To the best of our knowledge, RedDebate is the first fully automated framework to unify multi-agent debate and red-teaming to progressively enhance LLM safety without human intervention.

2507.02983 2026-06-02 cs.CL cs.AI 版本更新

Truth, Trust, and Trouble: Medical AI on the Edge

真相、信任与麻烦:边缘上的医疗AI

Mohammad Anas Azeez, Rafiq Ali, Ebad Shabbir, Zohaib Hasan Siddiqui, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem

发表机构 * Jamia Hamdard(贾迈亚哈姆达德大学) DSEU-Okhla Macquarie University(麦考瑞大学) Center for SDGC, Stanford University(SDGC中心,斯坦福大学)

AI总结 通过一个包含1000多个健康问题的基准测试框架,评估Mistral-7B、BioMistral-7B-DARE和AlpaCare-13B三个模型在诚实、有用性和无害性方面的表现,发现AlpaCare-13B准确率最高(91.7%)且无害性最佳(0.92),而领域微调可提升安全性,少样本提示能提高准确率,但复杂查询下有用性下降。

Comments Accepted at EMNLP 2025 (Industry Track)

详情
AI中文摘要

大型语言模型(LLMs)通过实现自动医疗问答,在转变数字健康方面具有巨大潜力。然而,确保这些模型满足事实准确性、有用性和安全性等关键行业标准仍然是一个挑战,尤其是对于开源解决方案。我们提出了一个严格的基准测试框架,使用超过1000个健康问题的数据集。我们评估了模型在诚实、有用性和无害性方面的性能。我们的结果突出了评估模型——Mistral-7B、BioMistral-7B-DARE和AlpaCare-13B——在事实可靠性和安全性之间的权衡。AlpaCare-13B达到了最高的准确率(91.7%)和无害性(0.92),而BioMistral-7B-DARE中的领域特定微调尽管规模较小,却提高了安全性(0.90)。少样本提示将准确率从78%提高到85%,并且所有模型在复杂查询上的有用性均有所下降,凸显了临床问答中持续存在的挑战。

英文摘要

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models -- Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

2510.05566 2026-06-02 stat.ML cs.AI cs.CL cs.LG stat.AP 版本更新

Domain-Shift-Aware Conformal Prediction for Large Language Models

领域偏移感知的共形预测用于大型语言模型

Zhexiao Lin, Yuanyuan Li, Neeraj Sarna, Yuanyuan Gao, Michael von Gablenz

发表机构 * University of Waterloo(多伦多大学)

AI总结 提出领域偏移感知共形预测框架,通过重加权校准样本应对分布偏移,在MMLU基准上提升覆盖可靠性。

Comments Accepted to Forty-Third International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

大型语言模型在各种任务中取得了令人印象深刻的性能。然而,它们倾向于产生过度自信且事实不正确的输出,即所谓的幻觉,这在实际应用中带来了风险。共形预测提供了有限样本、无分布假设的覆盖保证,但标准共形预测在领域偏移下会失效,常常导致覆盖不足和不可靠的预测集。我们提出了一种称为领域偏移感知共形预测(DS-CP)的新框架。我们的框架通过根据校准样本与测试提示的接近程度系统地重新加权校准样本,将共形预测适应于领域偏移下的大型语言模型,从而在保持有效性的同时增强适应性。我们的理论分析和在MMLU基准上的实验表明,所提出的方法比标准共形预测提供了更可靠的覆盖,尤其是在显著分布偏移下,同时保持了效率。这为大型语言模型在实际部署中实现可信的不确定性量化迈出了实际的一步。

英文摘要

Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real-world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.

2505.18102 2026-06-02 cs.LG cs.AI cs.CL stat.ME 版本更新

CapBencher: Give Your LLM Benchmark a Built-in Alarm for Test-Set Overfitting

CapBencher: 为您的LLM基准测试内置测试集过拟合警报

Takashi Ishida, Thanawat Lodkaew, Ikko Yamane

发表机构 * National Institute of Advanced Industrial Science and Technology, Japan(日本国家先进工业科学与技术研究院)

AI总结 提出CapBencher方法,通过向答案注入随机性(准备多个逻辑正确但仅一个作为解)来降低贝叶斯准确率,从而在公开基准测试时防止测试集过拟合并检测泄露或作弊。

Comments ICML 2026 camera ready version

详情
AI中文摘要

在互联网上发布大型语言模型(LLM)基准测试(尤其是其真实答案)存在污染未来LLM和导致评估作弊的风险:它可能被无意(或有意)用于训练或选择模型,或者在标签可访问时被利用来过拟合和操纵排行榜。常见的缓解措施是保持基准测试私有,并让参与者向组织者提交他们的模型或预测,但这仍然允许通过反馈循环进行测试集过拟合。为了克服这个问题,我们提出了CapBencher,一种在不完全公开真实答案的情况下发布基准测试的方法,同时保持LLM的开放评估。主要思想是通过准备多个逻辑正确的答案,并仅将其中一个作为基准测试中的解,向答案中注入随机性,从而降低最佳可能准确率,即贝叶斯准确率。这不仅掩盖了真实答案,还为泄露或作弊提供了测试:由于即使完全有能力的模型也不应超过贝叶斯准确率,任何超过该准确率的模型都是一个强烈的信号。我们从理论和实验上证明,CapBencher能够在不同的基准测试、模型、训练方法和场景中准确检测测试集过拟合。

英文摘要

Publishing a large language model (LLM) benchmark (especially its ground-truth answers) on the Internet risks contaminating future LLMs and enabling evaluation gaming: it may be unintentionally (or intentionally) used to train or select a model, or exploited to overfit and hack leaderboards when labels are accessible. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers, but this still permits test-set overfitting through feedback loops. To overcome this issue, we propose CapBencher, a way to publish benchmarks without fully disclosing the ground-truth answers, while preserving open evaluation of LLMs. The main idea is to reduce the best possible accuracy, i.e., Bayes accuracy, by injecting randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. Not only does this obscure the ground-truth answers, but it also offers a test for leakage or gaming: since even fully capable models should not surpass the Bayes accuracy, any model that does is a strong signal. We show theoretically and empirically that CapBencher accurately detects test-set overfitting across diverse benchmarks, models, training methodologies, and scenarios.

2510.01167 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

同时多目标对齐:可验证与不可验证奖励

Yiran Shen, Yu Xia, Jonathan Chang, Prithviraj Ammanabrolu

AI总结 提出MAHALO框架,通过标准化PRM训练、多动作头DPO和PRM引导解码,实现大语言模型在可验证与不可验证奖励上的多目标对齐,减少目标冲突并支持推理时控制。

Comments ICML 2026

详情
AI中文摘要

将大语言模型与人类偏好对齐本质上是多维的,但大多数流水线将异质信号压缩为单一目标。我们试图回答如何同时在多个领域中对齐模型,这些领域包括:可验证奖励、不可验证主观偏好以及复杂交互场景。这种多目标对齐设置常常因各个目标相互冲突而困扰,导致训练效率低下和推理时用户控制有限。为了解决这些问题,我们提出了$ extbf{MAHALO}$(Multi-Action-Head Alignment with PRM-guided Decoding),这是一个统一的框架,它在可验证和不可验证设置下标准化PRM训练以进行步骤级监督,通过多动作头DPO执行向量化多目标对齐,并通过目标特定权重和PRM引导解码实现可控推理。在数学推理、人类价值观对齐和多轮辅导上的实验表明,MAHALO能够以有限的干扰同时联合改善多个目标,同时保持跨领域的泛化性和适应性,并在推理时提供灵活的用户控制。我们的代码可在 https://github.com/pearls-lab/multiobj-align 获取。

英文摘要

Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards, non-verifiable subjective preferences, and complex interactive scenarios. Such multi-objective alignment setups are often plagued by individual objectives being at odds with each other, resulting in inefficient training and limited user control during inference. To address these issues, we propose $\textbf{M}$ulti-$\textbf{A}$ction-$\textbf{H}$ead $\textbf{AL}$ignment with PRM-guided Dec$\textbf{O}$ding ($\textbf{MAHALO}$), a unified framework that standardizes PRM training across verifiable and non-verifiable settings for step-level supervision, performs vectorized multi-objective alignment with Multi-Action-Head DPO, and enables controllable inference through objective-specific weighting and PRM-guided decoding. Experiments across math reasoning, human values alignment, and multi-turn tutoring show that MAHALO jointly improves multiple objectives simultaneously with limited interference, while remaining generalizable and adaptable across domains and offering flexible user control at inference time. Our code is available at: https://github.com/pearls-lab/multiobj-align.

2505.17630 2026-06-02 cs.CL cs.LG 版本更新

Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation

通过交互感知反向传播修正基于梯度的电路定位

Joakim Edin, Casper L. Christensen, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe

发表机构 * Corti Stanford University(斯坦福大学) Copenhagen University(哥本哈根大学) LUT University(拉普兰大学)

AI总结 针对现有电路定位方法忽略组件交互导致重要性误估的问题,提出GIM技术,通过在反向传播中显式建模特征交互,在机械可解释性基准的电路定位任务上达到最优性能。

详情
AI中文摘要

电路定位方法旨在识别大型语言模型中负责特定行为的模型组件子集,从而实现详细的机制分析。现有方法大多假设组件独立运作,并通过孤立地扰动每个组件来估计重要性。然而,神经网络中的组件之间存在交互,忽略这些交互会导致对组件重要性的系统性误估。我们发现一个特别有问题的交互是注意力自修复,其中softmax重新分配导致有影响力的注意力分数的梯度随着其他具有相似值的位置的补偿而消失。我们引入了梯度交互修改(GIM),这是一种在反向传播过程中显式考虑特征交互的技术。GIM在机械可解释性基准的电路定位任务上达到了最先进的性能,并在多种任务的特征归因上优于现有的基于梯度的方法。通过考虑交互效应并解释为何先前方法低估了组件重要性,GIM使得对大型语言模型进行更忠实的机制分析成为可能。GIM作为Python包可在https://github.com/corticph/gim获取。

英文摘要

Circuit localization methods aim to identify the subset of model components responsible for specific behaviors in large language models, enabling detailed mechanistic analysis. Most existing methods assume components act independently and estimate importance by perturbing each component in isolation. However, components in neural networks interact, and ignoring these interactions leads to systematic misestimation of component importance. We find that one particularly problematic interaction is attention self-repair, in which softmax redistribution causes gradients for influential attention scores to vanish as other positions with similar values compensate. We introduce Gradient Interaction Modifications (GIM), a technique that explicitly accounts for feature interactions during backpropagation. GIM achieves state-of-the-art performance on the circuit localization track of the Mechanistic Interpretability Benchmark and outperforms existing gradient-based methods on feature attribution across diverse tasks. By accounting for interaction effects and explaining why prior methods underestimate component importance, GIM enables more faithful mechanistic analysis of large language models. GIM is available as a Python package at https://github.com/corticph/gim.

2505.18614 2026-06-02 cs.CL cs.LG cs.MM cs.SD eess.AS 版本更新

MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

MAVL:面向动画歌曲翻译的多语言音视频歌词数据集

Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu

发表机构 * Yonsei University(延世大学) Seoul National University(首尔国立大学)

AI总结 提出首个多语言多模态歌词翻译基准MAVL,并设计音节约束的音视频大语言模型SylAVL-CoT,利用音视频线索和音节约束提升歌词可唱性和翻译准确性。

Comments Accepted to EMNLP 2025, Project Page: https://k1064190.github.io/papers/paper1.html, our codes and datasets are available at https://github.com/k1064190/MAVL

详情
AI中文摘要

歌词翻译需要同时实现准确的语义传递以及保留音乐节奏、音节结构和诗歌风格。在动画音乐剧中,由于需要与视觉和听觉线索对齐,挑战更加严峻。我们引入了多语言音视频歌词翻译基准(MAVL),这是首个用于可唱歌词翻译的多语言、多模态基准。通过整合文本、音频和视频,MAVL能够比纯文本方法实现更丰富、更具表现力的翻译。在此基础上,我们提出了音节约束的音视频大语言模型SylAVL-CoT,该模型利用音视频线索并施加音节约束,以生成自然流畅的歌词。实验结果表明,SylAVL-CoT在可唱性和上下文准确性方面显著优于基于文本的模型,强调了多模态、多语言方法在歌词翻译中的价值。

英文摘要

Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought SylAVL-CoT, which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.

2505.24621 2026-06-02 cs.CL 版本更新

Benchmarking Large Language Models for Cryptanalysis and Side-Channel Vulnerabilities

大型语言模型在密码分析和侧信道漏洞方面的基准测试

Utsav Maskey, Chencheng Zhu, Usman Naseem

发表机构 * Macquarie University(麦考瑞大学) University of New South Wales(新南威尔士大学)

AI总结 本研究通过零样本和少样本设置及思维链提示,评估大型语言模型在多种加密算法生成的密文上的解密成功率,揭示其在侧信道场景中的能力与局限,并讨论欠泛化相关攻击的风险。

Comments EMNLP'25 Findings

详情
AI中文摘要

大型语言模型(LLMs)的最新进展已经改变了自然语言理解和生成,导致在多样化任务中进行了广泛的基准测试。然而,密码分析——数据安全的关键领域及其与LLMs泛化能力的联系——在LLM评估中仍未得到充分探索。为了弥补这一差距,我们评估了最先进的LLMs在由一系列加密算法生成的密文上的密码分析潜力。我们引入了一个基准数据集,包含来自多个领域、长度、写作风格和主题的多样化明文及其加密版本。使用零样本和少样本设置以及思维链提示,我们评估了LLMs的解密成功率,并讨论了它们的理解能力。我们的研究结果揭示了LLMs在侧信道场景中的优势和局限性的关键见解,并引发了对它们易受欠泛化相关攻击的担忧。这项研究强调了LLMs在安全上下文中的双重用途性质,并为正在进行的AI安全与保障讨论做出了贡献。

英文摘要

Recent advancements in large language models (LLMs) have transformed natural language understanding and generation, leading to extensive benchmarking across diverse tasks. However, cryptanalysis - a critical area for data security and its connection to LLMs' generalization abilities - remains underexplored in LLM evaluations. To address this gap, we evaluate the cryptanalytic potential of state-of-the-art LLMs on ciphertexts produced by a range of cryptographic algorithms. We introduce a benchmark dataset of diverse plaintexts, spanning multiple domains, lengths, writing styles, and topics, paired with their encrypted versions. Using zero-shot and few-shot settings along with chain-of-thought prompting, we assess LLMs' decryption success rate and discuss their comprehension abilities. Our findings reveal key insights into LLMs' strengths and limitations in side-channel scenarios and raise concerns about their susceptibility to under-generalization-related attacks. This research highlights the dual-use nature of LLMs in security contexts and contributes to the ongoing discussion on AI safety and security.

2508.10312 2026-06-02 cs.CL 版本更新

Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation

超越语义理解:在基于LLM的推荐中保留协同频率分量

Minhao Wang, Yunhang He, Cong Xu, Zhangchi Zhu, Shuang Hao, Ning Liu, Wei Zhang

发表机构 * East China Normal University Shanghai China Beijing Jiaotong University Beijing China Shandong University Jinan, Shandong China East China Normal University \& Shanghai Innovation Institute Shanghai China East China Normal University Beijing Jiaotong University Shandong University East China Normal University \& Shanghai Innovation Institute

AI总结 针对基于LLM的推荐系统过度强调语义相关性而削弱协同信号的问题,提出FreLLM4Rec方法,通过全局图低通滤波和逐层时频调制从频谱角度平衡语义与协同信息,在四个基准数据集上NDCG@10提升高达8.00%。

Comments 12 pages, 7 figures

详情
AI中文摘要

与大语言模型(LLM)协同工作的推荐系统为生成语义信息丰富的推荐提供了有前景的途径。然而,基于LLM的推荐器倾向于过度强调用户交互历史中的语义相关性。当以预训练的协同ID嵌入作为输入时,与传统的基于Transformer的序列模型(其中协同信号通常被保留甚至增强以获得最先进的性能)相反,基于LLM的推荐器随着嵌入逐层通过LLM主干网络,逐渐削弱固有的协同信号。为了解决这一局限性,我们引入了FreLLM4Rec,一种从频谱角度平衡语义和协同信息的方法。首先使用全局图低通滤波器(G-LPF)净化同时包含语义和协同信息的物品嵌入,以初步去除无关的高频噪声。然后,时频调制(TFM)逐层主动保留协同信号。注意,TFM的协同保留能力在理论上通过建立最优但难以实现的局部图傅里叶滤波器与次优但计算高效的频域滤波器之间的联系得到保证。在四个基准数据集上的大量实验表明,FreLLM4Rec成功缓解了协同信号衰减,并实现了具有竞争力的性能,在NDCG@10上相比最佳基线提升了高达8.00%。我们的研究结果提供了关于LLM如何处理协同信息的见解,并为改进基于LLM的推荐系统提供了一种原则性方法。

英文摘要

Recommender systems in concert with Large Language Models (LLMs) present promising avenues for generating semantically-informed recommendations. However, LLM-based recommenders exhibit a tendency to overemphasize semantic correlations within users' interaction history. When taking pretrained collaborative ID embeddings as input, LLM-based recommenders progressively weaken the inherent collaborative signals as the embeddings propagate through LLM backbones layer by layer, as opposed to traditional Transformer-based sequential models in which collaborative signals are typically preserved or even enhanced for state-of-the-art performance. To address this limitation, we introduce FreLLM4Rec, an approach designed to balance semantic and collaborative information from a spectral perspective. Item embeddings that incorporate both semantic and collaborative information are first purified using a Global Graph Low-Pass Filter (G-LPF) to preliminarily remove irrelevant high-frequency noise. Temporal Frequency Modulation (TFM) then actively preserves collaborative signal layer by layer. Note that the collaborative preservation capability of TFM is theoretically guaranteed by establishing a connection between the optimal but hard-to-implement local graph fourier filters and the suboptimal yet computationally efficient frequency-domain filters. Extensive experiments on four benchmark datasets demonstrate that FreLLM4Rec successfully mitigates collaborative signal attenuation and achieves competitive performance, with improvements of up to 8.00\% in NDCG@10 over the best baseline. Our findings provide insights into how LLMs process collaborative information and offer a principled approach for improving LLM-based recommendation systems.

2507.18863 2026-06-02 cs.CV cs.CL 版本更新

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

基于点视觉融合与语言模型重建的音素级视觉语音识别

Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

发表机构 * Kyushu Institute of Technology(九州工业大学)

AI总结 提出一种两阶段音素级视觉语音识别框架,通过融合视觉和面部地标运动特征,并利用LLM模型重建单词,在LRS2和LRS3数据集上分别实现17.4%和21.0%的词错误率。

Comments Accepted at ICASSP 2026. This version corresponds to the camera-ready manuscript

详情
AI中文摘要

视觉自动语音识别(V-ASR)是一项具有挑战性的任务,涉及仅从视觉信息(如唇部运动和面部表情)解释口语。由于缺乏听觉线索以及音素(表现出相似视位——在唇部运动中看起来相同的不同声音)的视觉模糊性,该任务尤为困难。现有方法通常旨在直接从视觉线索预测单词或字符,但由于视位模糊性,它们通常遭受高错误率,并且需要大量预训练数据。我们提出了一种新颖的基于音素的两阶段框架,融合视觉和地标运动特征,随后使用LLM模型进行单词重建以应对这些挑战。第一阶段包括V-ASR,输出预测的音素,从而降低训练复杂度。同时,面部地标特征处理说话者特定的面部特征。第二阶段包括一个编码器-解码器LLM模型NLLB,将输出的音素重建回单词。除了使用大型视觉数据集进行深度学习微调外,我们的PV-ASR方法在LRS2数据集上实现了17.4%的词错误率,在LRS3数据集上实现了21.0%的词错误率,展现出优越性能。

英文摘要

Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. This task is notably challenging due to the absence of auditory cues and the visual ambiguity of phonemes that exhibit similar visemes-distinct sounds that appear identical in lip motions. Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM model for word reconstruction to address these challenges. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity. Meanwhile, the facial landmark features address speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder LLM model, NLLB, that reconstructs the output phonemes back to words. Besides using a large visual dataset for deep learning fine-tuning, our PV-ASR method demonstrates superior performance by achieving 17.4% WER on the LRS2 and 21.0% WER on the LRS3 dataset.

2503.05641 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

基于技能的混合专家模型:通过推断技能实现异构推理的自适应路由

Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出Skill-MoE框架,通过推断查询所需技能进行实例级专家选择,并采用批推理策略降低开销,在单GPU上集成16个专家模型,在多个推理基准上平均提升8.15%。

Comments ICML 2026 (Camera-Ready). The first three authors contributed equally. Project Page: https://skill-moe.github.io/

详情
AI中文摘要

结合现有的预训练大语言模型是处理多样化推理任务的一种有前景的方法。然而,任务级专家选择往往过于粗粒度,因为不同实例可能需要不同的专业知识。为了解决这个问题,我们提出了Skill-MoE,一个符号化的、基于技能的、无梯度的混合专家框架,用于实例级专家选择。Skill-MoE从每个查询中推断技能(例如,数学中的代数),根据技能相关性选择专家,并让每个专家生成自己的推理。然后,由选定的聚合器将得到的k个输出进行综合,该聚合器因其整合多样化响应的能力而被选中。虽然实例级选择显著提高了性能,但朴素实现会因重复的模型加载和卸载而产生巨大开销。我们通过一种批推理策略解决了这个问题,该策略将实例按分配的专家分组,使得每个模型只需加载一次。因此,Skill-MoE在单GPU上集成了16个专家模型,其运行时间与使用4个GPU的先前多智能体基线相当。在多个基准测试(MMLU-Pro、GPQA、AIME和MedMCQA)中,Skill-MoE相比最佳基线实现了平均8.15%的绝对提升。它还能很好地泛化到未见过的任务,并且无需昂贵的多轮交互即可超越基于讨论的方法。

英文摘要

Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise. To address this, we propose Skill-MoE, a symbolic, skill-based, and gradient-free Mixture-of-Experts framework for instance-level expert selection. Skill-MoE infers skills (e.g., algebra in mathematics) from each query, selects experts based on skill relevance, and lets each expert generate its own reasoning. The resulting k outputs are then synthesized by an aggregator chosen for its ability to integrate diverse responses. While instance-level selection substantially improves performance, naively implementing it incurs heavy overhead from repeated model loading and offloading. We address this with a batch inference strategy that groups instances by assigned experts, allowing each model to be loaded only once. As a result, Skill-MoE integrates 16 expert models on a single GPU with runtime comparable to prior multi-agent baselines using 4 GPUs. Across diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), Skill-MoE achieves an average absolute improvement of 8.15% over the best baseline. It also generalizes well to unseen tasks and outperforms discussion-based methods without requiring expensive multi-round interactions.

2506.11903 2026-06-02 cs.CL 版本更新

GeistBERT: Breathing Life into German NLP

GeistBERT: 为德语 NLP 注入活力

Raphael Scheible-Schmitt, Johann Frei

AI总结 通过在大规模德语语料库上使用 RoBERTa 架构和 Whole Word Masking 进行增量预训练,GeistBERT 在多项德语 NLP 任务上取得了领先性能,并在 GermEval 2018 细粒度文本分类中达到新 SOTA。

详情
Journal ref
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, 2025, pp. 42-50
AI中文摘要

基于 Transformer 的语言模型的进展凸显了在高质量语料库上进行特定语言预训练的优势。在此背景下,德语 NLP 有望受益于针对德语语言特征更新的架构和现代数据集。GeistBERT 旨在通过在多样化语料库上进行增量训练并优化模型在多种 NLP 任务上的性能来改进德语语言处理。我们使用 fairseq 预训练 GeistBERT,遵循 RoBERTa 基础配置并采用 Whole Word Masking (WWM),从 GottBERT 权重初始化。模型在 1.3 TB 的德语语料库上训练,采用动态掩码和固定的 512 令牌序列长度。为了评估,我们在标准下游任务上微调模型,包括 NER(CoNLL 2003、GermEval 2014)、文本分类(GermEval 2018 粗粒度/细粒度、10kGNAD)和 NLI(German XNLI),使用 $F_1$ 分数和准确率作为评估指标。GeistBERT 在所有任务上均取得了强劲结果,在基础模型中领先,并在 GermEval 2018 细粒度文本分类中创下新的最先进水平 (SOTA)。它还优于多个更大的模型,尤其是在分类基准测试中。为了支持德语 NLP 研究,我们在 MIT 许可下发布 GeistBERT。

英文摘要

Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. We pre-trained GeistBERT using fairseq, following the RoBERTa base configuration with Whole Word Masking (WWM), and initialized from GottBERT weights. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. For evaluation, we fine-tuned the model on standard downstream tasks, including NER (CoNLL 2003, GermEval 2014), text classification (GermEval 2018 coarse/fine, 10kGNAD), and NLI (German XNLI), using $F_1$ score and accuracy as evaluation metrics. GeistBERT achieved strong results across all tasks, leading among base models and setting a new state-of-the-art (SOTA) in GermEval 2018 fine text classification. It also outperformed several larger models, particularly in classification benchmarks. To support research in German NLP, we release GeistBERT under the MIT license.

2012.02110 2026-06-02 cs.CL cs.LG 版本更新

GottBERT: a pure German Language Model

GottBERT: 一个纯德语语言模型

Raphael Scheible, Johann Frei, Fabian Thomczyk, Henry He, Patric Tippmann, Jochen Knaus, Victor Jaravine, Frank Kramer, Martin Boeker

发表机构 * Institute for AI and Informatics in Medicine, University Hospital rechts der Isar, Technical University Munich(慕尼黑技术大学医学人工智能与信息学研究所,莱茵河右岸大学医院) IT-Infrastructure for Translational Medical Research, Faculty of Applied Computer Science, University of Augsburg(奥格斯堡大学应用计算机科学学院医学转化研究IT基础设施) Data Integration Center, Faculty of Medicine, University of Freiburg(弗赖堡大学医学学院数据整合中心) School of Computation, Information and Technology, Technical University Munich(慕尼黑技术大学计算、信息与技术学院) Institute of Medical Biometry and Statistics, Medical Center, Faculty of Medicine, University of Freiburg(弗赖堡大学医学学院医学生物统计学研究所) Freiburg Center for Data Analysis and Modeling, University of Freiburg(弗赖堡大学数据分析与建模中心) Hengrui Europe Biosciences, Zurich(苏黎世亨格里欧生物科学公司)

AI总结 本文提出了首个德语单语言RoBERTa模型GottBERT,在OSCAR德语子集上预训练,并在NER和文本分类任务上展示了竞争性能。

详情
AI中文摘要

预训练语言模型显著推进了自然语言处理(NLP),尤其是BERT及其优化版本RoBERTa的引入。虽然最初的研究集中在英语上,但单语言模型在多语言模型方面在预训练工作量、整体资源效率或下游任务性能上可能具有优势。尽管基于提示的LLM越来越流行,但计算效率更高的类BERT模型仍然高度相关。在这项工作中,我们提出了第一个德语单语言RoBERTa模型GottBERT,该模型仅在OSCAR数据集的德语部分上进行预训练。此外,我们研究了过滤OSCAR语料库的影响。GottBERT使用fairseq和标准超参数进行预训练。我们在两个命名实体识别(NER)任务(Conll 2003和GermEval 2014)和三个文本分类任务(GermEval 2018细粒度和粗粒度,以及10kGNAD)上,与现有的德语BERT模型和两个多语言模型进行了性能评估。性能使用$F_{1}$分数和准确率来衡量。GottBERT base和large模型表现出竞争性能,其中GottBERT在6个任务中的4个中领先于base模型。与我们的预期相反,所应用的过滤并未显著影响结果。为了支持德语NLP研究社区,我们将在MIT许可下发布GottBERT模型。

英文摘要

Pre-trained language models have significantly advanced natural language processing (NLP), especially with the introduction of BERT and its optimized version, RoBERTa. While initial research focused on English, single-language models can be advantageous compared to multilingual ones in terms of pre-training effort, overall resource efficiency or downstream task performance. Despite the growing popularity of prompt-based LLMs, more compute-efficient BERT-like models remain highly relevant. In this work, we present the first German single-language RoBERTa model, GottBERT, pre-trained exclusively on the German portion of the OSCAR dataset. Additionally, we investigated the impact of filtering the OSCAR corpus. GottBERT was pre-trained using fairseq and standard hyperparameters. We evaluated its performance on two Named Entity Recognition (NER) tasks (Conll 2003 and GermEval 2014) and three text classification tasks (GermEval 2018 fine and coarse, and 10kGNAD) against existing German BERT models and two multilingual models. Performance was measured using the $F_{1}$ score and accuracy. The GottBERT base and large models showed competitive performance, with GottBERT leading among the base models in 4 of 6 tasks. Contrary to our expectation, the applied filtering did not significantly affect the results. To support the German NLP research community, we are releasing the GottBERT models under the MIT license.

2506.05387 2026-06-02 cs.CL cs.AI 版本更新

Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs

推进解码策略:局部典型采样在大型语言模型中的增强

Jaydip Sen, Saptarshi Sengupta, Subhasis Dasgupta

发表机构 * Praxis Business School(普拉克斯商业学校) San Jose State University(旧金山州立大学)

AI总结 提出自适应语义感知典型性采样(ASTS)改进局部典型采样算法,通过动态熵阈值、多目标评分和奖惩调整,在保持计算效率的同时提升文本生成的流畅性、多样性和连贯性。

Comments This is the accepted but pre-reviewed version of the chapter that has been accepted for publication in the Springer volume 'Decision-Making in Computational Intelligence-Based Systems,' edited by Witold Pedrycz, Gilberto Rivera, Rose Ma Rodriguez, and Salvador Ibarra Martinez. The chapter is 39 pages long, and it contains 2 figures and 6 tables. This is NOT the final camera-ready version

详情
Journal ref
Recent Advances in Artificial Neural Networks. Intelligent Systems Reference Library, vol 283. Springer, Cham, 2026
AI中文摘要

本章探讨了大型语言模型(LLMs)解码策略的进展,重点关注增强局部典型采样(LTS)算法。传统的解码方法,如top-k和核采样,通常在文本生成中难以平衡流畅性、多样性和连贯性。为应对这些挑战,提出了自适应语义感知典型性采样(ASTS)作为LTS的改进版本,融合了动态熵阈值、多目标评分和奖惩调整。ASTS在保持计算效率的同时,确保上下文连贯且多样的文本生成。其性能在多个基准测试中进行了评估,包括故事生成和抽象摘要,使用了困惑度、MAUVE和多样性分数等指标。实验结果表明,ASTS通过减少重复、增强语义对齐和提高流畅性,优于现有的采样技术。

英文摘要

This chapter explores advancements in decoding strategies for large language models (LLMs), focusing on enhancing the Locally Typical Sampling (LTS) algorithm. Traditional decoding methods, such as top-k and nucleus sampling, often struggle to balance fluency, diversity, and coherence in text generation. To address these challenges, Adaptive Semantic-Aware Typicality Sampling (ASTS) is proposed as an improved version of LTS, incorporating dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. ASTS ensures contextually coherent and diverse text generation while maintaining computational efficiency. Its performance is evaluated across multiple benchmarks, including story generation and abstractive summarization, using metrics such as perplexity, MAUVE, and diversity scores. Experimental results demonstrate that ASTS outperforms existing sampling techniques by reducing repetition, enhancing semantic alignment, and improving fluency.

2504.04718 2026-06-02 cs.CL cs.AI 版本更新

T1: Tool-integrated Verification for Test-time Compute Scaling in Small Language Models

T1:小语言模型测试时计算扩展的工具集成验证

Minki Kang, Jongwon Jeong, Jaewoong Cho

发表机构 * KAIST(韩国科学技术院) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) KRAFTON

AI总结 针对小语言模型在测试时扩展中验证能力不足的问题,提出T1框架,通过外部工具过滤候选输出后由小语言模型进行最终验证,显著提升验证准确率和测试时扩展性能。

Comments ICLR 2026

详情
AI中文摘要

近期研究表明,测试时计算扩展能有效提升小语言模型(sLMs)的性能。然而,先前研究主要利用额外的大模型作为验证器进行测试时计算扩展,而sLMs自身的验证能力尚未被充分探索。本文研究sLMs在测试时扩展中能否可靠地验证输出候选。我们发现,即使从大验证器进行知识蒸馏,sLMs在需要记忆的任务(如数值计算和事实核查)上仍表现不佳。为解决这一局限,我们提出工具集成验证(T1),这是一个两阶段框架:首先用外部工具过滤候选,然后使用sLM进行最终验证,将记忆密集型步骤卸载到代码解释器等工具上。在T1框架内,我们证明卸载到外部工具可减轻sLMs的记忆负担,并提升测试时扩展性能。在MATH基准上的实验表明,采用T1的Llama-3.2 1B模型在测试时扩展下性能优于规模更大的Llama-3.1 8B模型。此外,T1提高了过程奖励模型(PRMs)和评论家模型的验证准确率。我们的发现凸显了工具集成在显著提升sLMs验证能力方面的潜力。

英文摘要

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably verify the output candidates under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated verification (T1), a two-stage framework that first filters candidates with external tools and then uses an sLM for final verification, offloading memorization-heavy steps to tools such as a code interpreter. Within T1, we prove that offloading to external tools reduces the memorization burden on sLMs and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 improves the verification accuracy of both process reward models (PRMs) and critic models. Our findings highlight the potential of tool integration to substantially improve the verification abilities of sLMs.

2503.05500 2026-06-02 cs.CL cs.AI 版本更新

EuroBERT: Scaling Multilingual Encoders for European Languages

EuroBERT:面向欧洲语言的多语言编码器扩展

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo

发表机构 * Artefact Research Center(Artfact研究中心) CNRS(法国国家科学研究中心) ISIA Lab(ISIA实验室) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出EuroBERT系列多语言编码器,通过整合生成式模型的最新进展,在检索、回归和分类等任务上超越现有模型,并原生支持长达8192个token的序列。

Comments 28 pages, 8 figures, 13 tables

详情
AI中文摘要

用于检索、回归和分类的通用多语言向量表示传统上来自双向编码器模型。尽管应用广泛,但编码器最近被生成式仅解码器模型的进步所掩盖。然而,推动这一进展的许多创新并非解码器所独有。在本文中,我们通过这些进展的视角重新审视多语言编码器的发展,并介绍EuroBERT,一个覆盖欧洲及全球广泛使用语言的多语言编码器家族。我们的模型在包括多语言能力、数学和编码在内的多种任务上优于现有替代方案,并原生支持长达8192个token的序列。我们还研究了EuroBERT背后的设计决策,提供了关于数据集组成和训练流程的见解。我们公开发布EuroBERT模型,包括中间训练检查点以及我们的训练框架。

英文摘要

General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively supporting sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.

2501.04424 2026-06-02 cs.AI cs.CL 版本更新

NSA: Neuro-symbolic ARC Challenge

NSA: 神经符号 ARC 挑战

Paweł Batorski, Jannik Brinkmann, Paul Swoboda

发表机构 * Heinrich Heine Universität Düsseldorf(杜伊斯堡-艾森大学) University of Mannheim(曼海姆大学)

AI总结 提出一种结合 transformer 提案生成与领域特定语言组合搜索的神经符号方法,在 ARC 评估集上超越现有最优方法 27%。

详情
Journal ref
ESANN 2026 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 99-104, 2026
AI中文摘要

抽象与推理语料库 (ARC) 评估了机器学习模型和组合搜索方法都难以处理的通用推理能力。我们提出了一种神经符号方法,该方法结合了用于提案生成的 transformer 和使用领域特定语言的组合搜索。Transformer 通过提出有希望的搜索方向来缩小搜索空间,从而使组合搜索能够在短时间内找到实际解决方案。我们使用合成生成的数据预训练 transformer。在测试时,我们生成额外的任务特定训练任务并微调我们的模型。我们的结果在 ARC 评估集上比现有最优方法高出 27%,并且在 ARC 训练集上表现良好。我们在 https://github.com/Batorskq/NSA 公开了我们的代码和数据集。

英文摘要

The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning capabilities that are difficult for both machine learning models and combinatorial search methods. We propose a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search using a domain-specific language. The transformer narrows the search space by proposing promising search directions, which allows the combinatorial search to find the actual solution in short time. We pre-train the trainsformer with synthetically generated data. During test-time we generate additional task-specific training tasks and fine-tune our model. Our results surpass comparable state of the art on the ARC evaluation set by 27% and compare favourably on the ARC train set. We make our code and dataset publicly available at https://github.com/Batorskq/NSA.

2410.12325 2026-06-02 cs.CL 版本更新

$M^3$ Scaling Law: Optimizing Multi-Epoch, Multi-Lingual, and Multi-Stage Training for Low-Resource Language Models

$M^3$ 缩放定律:优化低资源语言模型的多周期、多语言和多阶段训练

Kosuke Akimoto, Taiki Miyagawa, Masafumi Oyamada

发表机构 * NEC Corporation(日本电报电话株式会社)

AI总结 本文提出 $M^3$ 缩放定律,统一预测模型规模、目标语料周期数、平均目标语言比例和最终阶段目标语言比例对低资源语言模型预训练损失的影响,并推导出最优训练策略的实用指南。

Comments 35 pages, 14 figures, 17 tables

详情
AI中文摘要

在本文中,我们研究了低资源语言环境下预训练大型语言模型(LLMs)的一个基本设计问题。现有工作采用多周期、多语言和多阶段训练来有效利用有限的目标语言语料库,但没有先前的缩放定律可以在相同的计算预算 $C$ 和目标语言语料库大小 $D_T$ 下比较这些方法,导致最优训练设置不明确。为填补这一空白,我们提出了 $M^3$ 缩放定律,这是一个统一的预测模型,参数化为模型规模、目标语料库周期数 $k$、平均目标语言比例 $r$ 和最终阶段目标语言比例 $r_f$,将单语言单阶段、多语言单阶段和多语言多阶段方案置于单一的目标语言损失曲面上。在三个语言对中,它比现有缩放定律更准确地外推到未见过的超参数区域。使用 $M^3$ 作为替代目标,我们为低资源 LLM 预训练推导出两个实用指南:(i) 随着 $D_T$ 减小,最优方案在计算预算相关的阈值处直接从单语言单阶段转变为多语言两阶段训练,而在我们的实验网格中多语言单阶段从未最优;(ii) 最优周期数在稀缺变量 $D_T/D^*(C)$ 上坍缩为一条曲线,其中 $D^*(C) \propto C^{α/(α+β)}$ 是单语言计算最优语料库大小。

英文摘要

In this paper, we study a fundamental design problem in pretraining Large Language Models (LLMs) for low-resource language regimes. Existing works adopt multi-epoch, multi-lingual, and multi-stage training to utilize the limited target-language corpus efficiently, but no prior scaling law can compare recipes spanning these approaches under the same compute budget $C$ and target-language corpus size $D_T$, leaving the optimal training setup unclear. To address this gap, we propose the $M^3$ Scaling Law, a unified predictive model parameterized by the model scale, the number of target-corpus epochs $k$, the average target-language ratio $r$, and the final-stage target-language ratio $r_f$, which places monolingual single-stage, multi-lingual single-stage, and multi-lingual multi-stage recipes on a single target-language loss surface. Across three language pairs, it extrapolates to unseen hyperparameter regions more accurately than existing scaling laws. Using $M^3$ as a surrogate objective, we derive two practical guidelines for low-resource LLM pretraining: (i) as $D_T$ decreases, the optimal recipe shifts directly from monolingual single-stage to multi-lingual two-stage training at a compute-budget-dependent threshold, with multi-lingual single-stage never optimal in our experimental grid; and (ii) the optimal number of epochs collapses onto a single curve in the scarcity variable $D_T/D^*(C)$, where $D^*(C) \propto C^{α/(α+β)}$ is the monolingual compute-optimal corpus size.

2407.01374 2026-06-02 cs.CL 版本更新

Bridging the Gap: Transfer Learning from English PLMs to Malaysian English

弥合差距:从英语PLM到马来西亚英语的迁移学习

Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam

发表机构 * School of Information Technology, Monash University Malaysia(墨尔本大学马来西亚分校信息科技学院)

AI总结 针对马来西亚英语的低资源克里奥尔语特性,通过微调预训练语言模型MENmBERT和MENBERT,在命名实体识别和关系抽取任务上分别提升1.52%和26.27%的性能。

Comments Accepted in 9th Workshop on Representation Learning for NLP (Rep4NLP) at ACL 2024

详情
AI中文摘要

马来西亚英语是一种低资源克里奥尔语,除了标准英语外,还包含马来语、汉语和泰米尔语的元素。由于其独特的形态句法适应、语义特征和代码切换(混合英语和马来语),命名实体识别(NER)模型在从马来西亚英语文本中捕获实体时表现不佳。考虑到这些差距,我们引入了MENmBERT和MENBERT,这是一种具有上下文理解的预训练语言模型,专门针对马来西亚英语定制。我们使用来自马来西亚英语新闻文章(MEN)数据集的手动注释实体和关系对MENmBERT和MENBERT进行了微调。这一微调过程使PLM能够学习捕捉与NER和RE任务相关的马来西亚英语细微差别的表示。与bert-base-multilingual-cased模型相比,MENmBERT在NER和RE任务上分别提高了1.52%和26.27%。尽管NER的整体性能没有显著提升,但我们的进一步分析表明,在12个实体标签评估时,性能有显著提升。这些发现表明,在特定语言和地理区域的语料库上预训练语言模型是提高低资源环境下NER性能的一种有前景的方法。本文发布的数据集和代码为专注于马来西亚英语的NLP研究工作提供了宝贵的资源。

英文摘要

Malaysian English is a low resource creole language, where it carries the elements of Malay, Chinese, and Tamil languages, in addition to Standard English. Named Entity Recognition (NER) models underperform when capturing entities from Malaysian English text due to its distinctive morphosyntactic adaptations, semantic features and code-switching (mixing English and Malay). Considering these gaps, we introduce MENmBERT and MENBERT, a pre-trained language model with contextual understanding, specifically tailored for Malaysian English. We have fine-tuned MENmBERT and MENBERT using manually annotated entities and relations from the Malaysian English News Article (MEN) Dataset. This fine-tuning process allows the PLM to learn representations that capture the nuances of Malaysian English relevant for NER and RE tasks. MENmBERT achieved a 1.52\% and 26.27\% improvement on NER and RE tasks respectively compared to the bert-base-multilingual-cased model. Although the overall performance of NER does not have a significant improvement, our further analysis shows that there is a significant improvement when evaluated by the 12 entity labels. These findings suggest that pre-training language models on language-specific and geographically-focused corpora can be a promising approach for improving NER performance in low-resource settings. The dataset and code published in this paper provide valuable resources for NLP research work focusing on Malaysian English.

2405.14782 2026-06-02 cs.CL 版本更新

Lessons from the Trenches on Reproducible Evaluation of Language Models

关于语言模型可重复评估的前线经验教训

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou

发表机构 * MBZUAI IIIT Hyderabad EleutherAI HiTZ Center - Ixa, UPV/EHU Ivy Natal University of Michigan HubSpot LibrAI Kensho Contextual AI Brown University New York University Amazon Yale University HKUST Sorbonne University CMU

AI总结 本文基于开发Language Model Evaluation Harness框架的三年经验,总结了语言模型评估中面临的方法论挑战、缺乏可重复性和透明度等问题,并提供了改进评估严谨性和信心的建议。

详情
AI中文摘要

语言模型(LMs)的可靠评估仍然是一个未解决的挑战。研究人员和工程师面临方法学问题,例如模型对评估设置的敏感性、不同方法之间难以进行适当比较,以及缺乏可重复性和透明度。关于惯例和常见实践的信息碎片化和隔离加剧了评估困难。在本文中,我们借鉴了作为流行的Language Model Evaluation Harness (lm-eval) (Gao et al., 2023) 框架开发者三年评估大型语言模型(LMs)的经验,为该领域未来的发展提供指导和教训。我们记录了实践者面临的各种挑战,并提供了这些挑战或缺乏最佳实践已产生影响的实例。我们向该领域提出建议,以提高评估的严谨性和信心,并尝试将围绕LM评估的许多隐性或民间知识编纂成文,为未来的发展奠定坚实基础。

英文摘要

Reliable evaluation of language models (LMs) remains an open challenge. Re- searchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. Evaluation difficulties are exacer- bated by the fracturing and siloing of information about conventions and common practices. In this paper we draw on three years of experience in evaluating large lan- guage models (LMs) as developers of the popular Language Model Evaluation Harness (lm-eval) (Gao et al., 2023) framework to provide guidance and lessons for the field moving forward. We document a variety of challenges faced by prac- titioners and provide concrete instances where these challenges or the absence of best practices have come into effect. We make recommendations to the field for improving evaluation rigor and confidence, and attempt to codify much of the tacit or folk knowledge surrounding LM evaluation, for a solid ground to move forward.

2403.07008 2026-06-02 cs.LG cs.AI cs.CL stat.ME 版本更新

AutoEval Done Right: Using Synthetic Data for Model Evaluation

AutoEval 的正确做法:使用合成数据进行模型评估

Pierre Boyeau, Anastasios N. Angelopoulos, Nir Yosef, Jitendra Malik, Michael I. Jordan

发表机构 * Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA(电子工程与计算机科学系,加州大学伯克利分校) Department of Systems Immunology, Weizmann Institute of Science, Rehovot, Israel(系统免疫学系,魏茨曼科学研究所) Inria, Ecole Normale Supérieure, Paris, France(法国国家信息与自动化技术研究所,巴黎高等师范学院)

AI总结 本文提出高效且统计上无偏的算法,利用AI标记的合成数据减少模型评估所需的人工标注量,在GPT-4实验中有效样本量提升高达50%。

Comments camera-ready paper version

详情
AI中文摘要

使用人工标注的验证数据评估机器学习模型可能成本高昂且耗时。AI标记的合成数据可用于减少此目的所需的人工标注数量,这一过程称为自动评估。我们为此提出了高效且统计上无偏的算法,在保持无偏性的同时提高样本效率。这些算法在GPT-4实验中使有效人工标注样本量增加高达50%。

英文摘要

The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.

2405.01930 2026-06-02 cs.CL 版本更新

OARelatedWork: A Large-Scale Dataset of Related Work Sections with Full-texts from Open Access Sources

OARelatedWork:来自开放获取源的大规模相关工作章节数据集及其全文

Martin Docekal, Martin Fajcik, Pavel Smrz

发表机构 * Brno University of Technology(布拉格技术大学)

AI总结 本文提出OARelatedWork数据集,包含全文和完整相关工作章节,用于从开放获取源生成相关工作,并评估了多种模型,发现大型语言模型在处理全文时事实准确性下降,而人类作者常引入无根据的抽象声明。

详情
AI中文摘要

本文介绍了OARelatedWork:一个来自开放获取源的相关工作生成数据集。它是首个用于相关工作生成的大规模多文档摘要数据集,包含完整的工作相关章节和引用论文的全文。其验证集和测试集的构建使得每篇被引论文的全文都可用,从而能够对全文相关工作生成进行受控评估。该数据集包含来自多个领域的94,450篇论文和5,824,689篇唯一引用论文。借助OARelatedWork,我们旨在将该领域从仅基于摘要生成相关工作章节的部分内容,转变为基于所有可用内容生成完整相关工作章节。我们(i)对一系列模型进行了基准测试,强调合成大规模全文上下文即使对现代大型语言模型(LLM)来说仍然是一个挑战:在我们的语句级评判下,GPT-4o-mini的基于证据的真实率从使用摘要时的92.9%下降到使用全文时的83.8%。我们(ii)通过对40篇论文和408个事实陈述进行人工评估,实证分析了人类写作行为,揭示作者经常引入未在局部源文本中扎根的抽象声明;因此,先进的LLM在严格的、基于证据的事实性方面实际上超越了人类基线。最后,我们(iii)进行了细粒度的元评估,揭示标准的基于参考的指标不足以评估这种长格式的结构化输出,并引入了一个稳健的语句级评估框架来弥补这一差距。

英文摘要

This paper introduces OARelatedWork: a dataset for related work generation from open-access sources. It is the first large-scale multi-document summarization dataset for related work generation, containing whole related work sections and full texts of cited papers. Its validation and test splits are constructed so that every cited paper is available in full text, enabling controlled evaluation of full-text related work generation. The dataset includes 94 450 papers and 5 824 689 unique referenced papers from multiple domains. With OARelatedWork, we aim to shift the field from generating parts of related work sections from abstracts only to generating entire related work sections from all available content. We (i) benchmark a wide spectrum of models, highlighting that synthesizing massive full-text contexts remains challenge even for modern Large Language Models (LLMs): under our statement-level judge, GPT-4o-mini's evidence-grounded True rate drops from 92.9% with abstracts to 83.8% with full texts. We (ii) empirically analyze human writing behavior through a human evaluation over 40 papers and 408 factual statements, revealing that authors frequently introduce abstractive claims ungrounded in localized source texts; consequently, advanced LLMs actually surpass human baselines in strict, evidence-grounded factuality. Finally, we (iii) conduct a fine-grained meta-evaluation, revealing that standard reference-based metrics are inadequate for evaluating such long-form structured outputs, and introduce a robust statement-level evaluation framework to address this gap.

2402.14521 2026-06-02 cs.CL 版本更新

Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction

马来西亚英语新闻解码:用于命名实体和关系抽取的语言资源

Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam

AI总结 针对马来西亚英语与标准英语的差异导致NLP任务困难的问题,构建了包含200篇新闻的MEN数据集,手动标注实体和关系,并通过微调spaCy NER工具验证了定制数据集能显著提升NER性能。

Comments Accepted at LREC-COLING 2024

详情
AI中文摘要

标准英语和马来西亚英语存在显著差异,这给马来西亚英语的自然语言处理(NLP)任务带来了挑战。不幸的是,现有数据集主要基于标准英语,因此不足以改进马来西亚英语的NLP任务。使用最先进的命名实体识别(NER)解决方案对马来西亚英语新闻文章进行的实验表明,它们无法处理马来西亚英语的形态句法变异。据我们所知,目前没有可用于改进模型的标注数据集。为了解决这些问题,我们构建了马来西亚英语新闻(MEN)数据集,包含200篇新闻文章,并手动标注了实体和关系。然后,我们微调了spaCy NER工具,并验证了拥有为马来西亚英语量身定制的数据集可以显著提高NER在马来西亚英语上的性能。本文介绍了我们在数据获取、标注方法以及标注数据集深入分析方面的工作。为了验证标注质量,我们使用了标注者间一致性检验,随后由领域专家对分歧进行裁决。完成这些任务后,我们成功开发了一个包含6,061个实体和3,268个关系实例的数据集。最后,我们讨论了spaCy微调设置及NER性能分析。这一独特的数据集将极大地推动马来西亚英语NLP研究的发展,使研究人员能够加速进展,特别是在NER和关系抽取方面。该数据集和标注指南已在Github上发布。

英文摘要

Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. Unfortunately, most of the existing datasets are mainly based on standard English and therefore inadequate for improving NLP tasks in Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian English news articles highlights that they cannot handle morphosyntactic variations in Malaysian English. To the best of our knowledge, there is no annotated dataset available to improvise the model. To address these issues, we constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that having a dataset tailor-made for Malaysian English could improve the performance of NER in Malaysian English significantly. This paper presents our effort in the data acquisition, annotation methodology, and thorough analysis of the annotated dataset. To validate the quality of the annotation, inter-annotator agreement was used, followed by adjudication of disagreements by a subject matter expert. Upon completion of these tasks, we managed to develop a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss on spaCy fine-tuning setup and analysis on the NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction. The dataset and annotation guideline has been published on Github.