arXivDaily arXiv每日学术速递 周一至周五更新
重置
2605.21481 2026-05-21 cs.AI cs.CL cs.LG 版本更新

AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

AiraXiv:一个面向人类和AI科学家的AI驱动的开放获取平台

Junshu Pan, Panzhong Lu, Yixuan Weng, Qiyao Sun, Fang Guo, Zijie Yang, Qiji Zhou, Yue Zhang

发表机构 * Westlake University(西湖大学) Zhejiang University(浙江大学) Shanghai Innovation Institution(上海创新研究院) Zhongguancun Academy(中关村学院)

AI总结 本文提出AiraXiv平台,通过AI驱动的开放预印本、AI增强的分析与评审以及读者反馈,解决传统学术出版系统在AI时代面临的研究产出增长和可扩展性挑战。

详情
AI中文摘要

近年来,人工智能(AI)的进步加速了人类和AI生成的研究产出的增长,对传统学术出版系统施加了越来越大的压力,并在提交量增加、评审工作量和会议规模扩大时挑战了以会议和期刊为中心的可扩展性。为了解决这些挑战,我们探索了一个AI时代的出版范式,其中人类和AI科学家作为作者和读者参与,并通过持续反馈驱动的迭代使论文不断发展。我们提出了AiraXiv,一个基于开放预印本、AI增强的分析和评审以及读者反馈的AI驱动的开放获取平台。AiraXiv通过交互式UI支持人类科学家,通过基于模型上下文协议(MCP)的交互支持AI科学家。通过实际部署验证了AiraXiv,包括作为IC AIS 2025的提交平台,展示了其作为AI时代快速、包容和可扩展的研究基础设施的潜力。AiraXiv在https://airaxiv.com上公开可用。

英文摘要

Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at https://airaxiv.com.

2605.21468 2026-05-21 cs.LG cs.CL 版本更新

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

你只需要最小的RLVR训练:通过秩-1轨迹来扩展LLMs

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng

发表机构 * University of Virginia(弗吉尼亚大学) Washington University in St. Louis(华盛顿大学圣路易斯分校)

AI总结 本文研究了通过秩-1轨迹扩展LLMs的方法,发现RLVR参数轨迹具有极低的秩和高度可预测性,并提出RELEX方法,通过简单的线性回归在无需训练模型的情况下实现高效的超量扩展。

Comments preprint. Code: https://github.com/weizhepei/RELEX

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为改进大语言模型(LLMs)推理能力的主要范式,但其底层几何结构仍待探索。本文证明RLVR权重轨迹具有极低的秩且高度可预测。具体而言,我们发现大多数下游性能提升可通过参数增量的秩-1近似来捕捉,其中该投影的幅度与训练步数近似线性增长。受此启发,我们提出了一种简单且计算高效的RELEX(强化学习扩展)方法,通过从短观察窗口估计秩-1子空间并利用线性回归进行超量扩展,无需任何训练模型。在三个模型(即Qwen2.5-Math-1.5B、Qwen3-4B-Base和Qwen3-8B-Base)上,RELEX生成的检查点在领域内和领域外基准测试中表现匹配或优于RLVR性能,仅需完整RLVR训练的15%步数。令人惊讶的是,RELEX能在无训练成本的情况下超量扩展远超观察窗口,预测检查点多达10-20倍于观察前缀,并持续改进(例如,仅观察前50步并扩展到1000步)。我们的消融分析证实了RELEX的极简充分性:增加子空间秩或采用非线性建模不会进一步提升超量扩展效果。最后,我们显示RELEX的成功源于“去噪”效应:通过将更新投影到秩-1子空间,模型会丢弃那些会降性能的随机优化噪声。我们的代码可在https://github.com/weizhepei/RELEX获取。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.

2605.21467 2026-05-21 cs.LG cs.CL 版本更新

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

DelTA: 一种用于可验证奖励强化学习的判别性token信用分配

Kaiyi Zhang, Wei Wu, Yankai Lin

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院 Gallagher 学院) Ant International(蚂蚁国际)

AI总结 本文提出DelTA方法,通过估计token系数来增强特定侧的token梯度方向,从而改进可验证奖励强化学习中的token概率更新,提升了模型在数学基准测试中的性能。

详情
AI中文摘要

可验证奖励强化学习(RLVR)已成为提升大语言模型推理能力的核心技术。尽管其有效性已得到认可,但响应级奖励如何转化为token级概率变化仍缺乏深入理解。本文引入了RLVR更新的判别视角,表明策略梯度更新方向隐式地作为token梯度向量的线性判别器,从而决定学习过程中哪些token概率被增加或减少。在标准序列级RLVR中,该判别器由通过优势加权平均得到的正负侧质心构成。然而,此类质心构建可能被共享的高频模式(如格式token)主导,稀释了稀疏但判别性强的方向,这些方向更能区分高分响应与低分响应。为解决这一限制,本文提出DelTA,一种判别性token信用分配方法,通过估计token系数来放大侧特定的token梯度方向并降低共享或弱判别性的方向。这些系数重新加权了自我归一化的RLVR替代方案,使有效的侧向质心更具对比性,从而重塑RLVR更新方向。在七个数学基准测试中,DelTA在Qwen3-8B-Base和Qwen3-14B-Base上分别比最强的同规模基线高出3.26和2.62个平均分。此外,代码生成、不同backbone和域外评估的额外结果进一步展示了DelTA的泛化能力。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose $\textbf{DelTA}$, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.

2605.21465 2026-05-21 cs.CL cs.SE 版本更新

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

利用大语言模型进行语法适应:关于元模型-语法共演的研究

Weixing Zhang, Bowen Jiang, Rahul Sharma, Regina Hebig, Daniel Strüber

发表机构 * Universität Rostock(罗斯托克大学) Chalmers University of Technology and University of Gothenburg(皇家理工学院和哥德堡大学) Radboud University(拉德堡德大学)

AI总结 本文研究了如何利用大语言模型自动适应语法,通过学习先前版本的语法适应来实现自动适应,同时探讨了在复杂语法场景下的优势与局限性。

详情
AI中文摘要

在模型驱动工程中,元模型的演变导致需要相应地调整语法以保持一致性,这通常需要繁琐的手动工作。现有的基于规则的方法可以实现部分自动化,但在处理复杂语法场景时存在限制。本文提出了一种基于大语言模型的方法,通过学习先前版本的语法适应来自动应用于新语法。该方法在六个真实世界Xtext领域特定语言上进行了评估,使用四个DSL作为训练集开发提示策略,两个DSL作为测试集进行验证,并在QVTo上进行纵向案例研究。评估使用了三个大型语言模型(Claude Sonnet 4.5、ChatGPT 5.1、Gemini 3),并从三个维度测量语法适应质量:语法规则层面的适应一致性、输出相似性和元模型符合性。结果表明,在测试集上,所有三个LLM都实现了100%的适应一致性和输出相似性,而基于规则的方法在DOT上仅达到84.21%,在Xcore上为62.50%。在QVTo的纵向研究中,基于LLM的方法在所有三个演变步骤中成功重用了学习的适应,而基于规则的方法在两个演变步骤中需要手动调整。然而,在大规模语法(EAST-ADL,297条规则)上,LLM的适应一致性远低于90%。本研究展示了基于LLM的方法在处理复杂语法场景中的优势,同时揭示了其在大规模语法适应中的局限性。

英文摘要

In model-driven engineering, metamodel evolution leads to the need to adapt corresponding grammars to maintain consistency, which typically requires tedious manual work. Existing rule-based methods can achieve partial automation but have limitations when handling complex grammar scenarios. This paper proposes a Large Language Model-based approach that automatically applies adaptations to new grammars after evolution by learning grammar adaptations from previous versions. We evaluated this approach on six real-world Xtext domain-specific languages, using four DSLs as a training set to develop prompting strategies, two DSLs as a test set for validation, and conducting a longitudinal case study on QVTo. The evaluation used three Large Language Models (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) and measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance. Results show that on the test set, all three LLMs achieved 100% adaptation consistency and output similarity, while the rule-based approach achieved only 84.21% on DOT and 62.50% on Xcore. In the QVTo longitudinal study, the LLM-based approach successfully reused learned adaptations across all three evolution steps without manual grammar editing, while the rule-based approach required manual adjustments in two of three transitions. However, on large-scale grammars (EAST-ADL, 297 rules), LLMs' adaptation consistency was far below 90%. This study demonstrates the advantages of LLM-based approaches in handling complex grammar scenarios, while revealing their limitations in large-scale grammar adaptation.

2605.21463 2026-05-21 cs.CL cs.AI 版本更新

Mem-$π$: Adaptive Memory through Learning When and What to Generate

Mem-$π$: 通过学习何时以及生成什么来实现自适应记忆

Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste, Spandana Gella, Bang Liu, Perouz Taslakian

发表机构 * ServiceNow AI Research(ServiceNow AI研究院) Mila -- Quebec AI Institute(魁北克AI研究所) McGill University(麦吉尔大学) CIFAR AI Chair(CIFAR人工智能主席)

AI总结 Mem-$π$ 通过学习在何时以及生成什么来实现自适应记忆,利用专门的语言或视觉-语言模型生成上下文特定的指导,从而在多种代理任务中优于基于检索和先前RL优化的记忆基线。

Comments Work in progress

详情
AI中文摘要

我们提出了Mem-$π$,一种用于大语言模型(LLM)代理的自适应记忆框架,其中有用的指导是按需生成而非从外部内存存储中检索。现有的记忆增强代理通常依赖于从事件记忆库或技能库中基于相似性的检索,返回静态条目,这些条目往往与当前上下文不一致。相比之下,Mem-$π$ 使用一个具有自身参数的专用语言或视觉-语言模型,与下游代理分开,以生成复杂任务的上下文特定指导。在当前代理上下文中,模型联合决定何时生成指导以及生成什么指导。我们通过决策-内容解耦的强化学习(RL)目标对其进行训练,使其能够避免在生成不会有所帮助的情况,并在其他情况下生成简洁有用的信息。在涵盖网页导航、基于终端的工具使用和基于文本的具身交互等多样代理基准上,Mem-$π$ 一致优于基于检索和先前RL优化的记忆基线,实现网页导航任务超过30%的相对提升。

英文摘要

We present Mem-$π$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$π$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$π$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

2605.21403 2026-05-21 cs.CL 版本更新

Quantifying the cross-linguistic effects of syncretism on agreement attraction

量化语言间合成影响对一致性的吸引力

Utku Turk, Eva Neu

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校) University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校)

AI总结 研究探讨了合成对一致性吸引力的影响在不同语言中的差异,通过大规模语言模型的 surprisal 和注意力熵来分析四种语言中的表现,揭示了合成如何调节吸引力的机制。

Comments SCiL Conference Paper

详情
AI中文摘要

一致性吸引力错误,即动词错误地与中间名词一致而非其语法头,受到形态合成的影响在某些语言(英语、德语、俄语)中更为明显,而在其他语言(土耳其语、亚美尼亚语)中则不明显,这种跨语言模式缺乏理论解释。我们利用大规模语言模型的 surprisal 和注意力熵作为处理代理,研究四种语言中的差异。LLM 导出的测量结果在英语和德语中复现了行为发现(合成调节吸引力),在土耳其语中得到无调节的结果,部分捕捉了俄语的模式。我们讨论了进一步理解为何合成在不同语言中影响一致性吸引力的机制。

英文摘要

Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.

2605.21391 2026-05-21 cs.CL 版本更新

Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy

通过条件尺度熵理解解码器-only语言模型中的隐喻处理

Lawhori Chakrabarti, Jennifer Johnson-Leung, Bert Baumgaertner, Aleksandar Vakanski, Min Xian, Boyu Zhang

发表机构 * Department of Computer Science, University of Idaho(计算机科学系,爱达荷大学) Department of Mathematics and Statistical Science, University of Idaho(数学与统计科学系,爱达荷大学) Department of Politics and Philosophy, University of Idaho(政治与哲学系,爱达荷大学) Department of Nuclear Engineering and Industrial Management, University of Idaho(核工程与工业管理系,爱达荷大学)

AI总结 研究探讨了解码器-only语言模型中隐喻处理的机制,通过条件尺度熵(CSE)分析不同层位的频率尺度变化,发现隐喻性词 token 在连续层位上产生显著更高的频谱宽度,且该效应不受语义复杂度或命题内容影响,验证了多尺度协调作为隐喻语言处理的特征。

Comments 18 pages, 3 figures, submitted to ICPR workshop

详情
AI中文摘要

隐喻要求语言模型解析一个词的上下文意义与基本字面意义相异。理解transformer模型如何在深度层面组织这种重新解释仍然是机械可解释性中的开放问题。我们引入了条件尺度熵(CSE),这是一种基于小波的度量,用于衡量transformer计算在每个层位置上跨频率尺度的广泛参与程度。两个定理证明了CSE对更新幅度具有不变性,从而将更新的结构模式与强度分离。使用CSE,我们发现隐喻性词在测试的每种解码器-only架构中(从124M到20B参数,包括GPT-2家族、LLaMA-2 7B、GPT-oss 20B)在连续层位置上产生显著更高的频谱宽度。该效应经聚类置换校正后仍然存在,且在模型的早期到中期相对深度范围内重现,并通过独立分析200对自然VUA对收敛。特定性控制进一步显示,该效应不被语义复杂度或匹配的命题内容所解释。这些结果将多尺度协调确定为所考察的解码器-only架构中隐喻语言处理的一致特征,并确立CSE作为表征transformer跨深度结构的原理性工具。

英文摘要

Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.

2605.21384 2026-05-21 cs.SE cs.AI cs.CL 版本更新

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

SpecBench: 评估长周期编码代理中的奖励黑客现象

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang

发表机构 * Weco AI

AI总结 该研究通过分解软件工程任务,提出了一种评估长周期编码代理中奖励黑客现象的方法,通过比较可见测试套件和隐藏测试套件的通过率差异,引入了SpecBench基准,展示了奖励黑客现象在不同任务长度上的显著影响。

详情
AI中文摘要

随着长周期编码代理生成的代码量超过任何开发者能够审查的范围,监督责任集中于单一表面:自动测试套件。奖励黑客现象自然出现在这种设置中,因为代理在优化通过测试的同时偏离了用户的真正目标。我们通过将软件工程任务分解为三个部分来研究这种奖励黑客现象:(i) 规格的自然语言描述,(ii) 可见验证测试套件,用于单独测试指定功能,以及 (iii) 隐藏测试套件,用于组合这些相同功能以模拟真实世界使用。基于规格和可见验证测试套件,一个真实的代理能够生成一个能够通过所有隐藏测试套件的解决方案。因此,我们使用这两个套件之间的通过率差异来量化奖励黑客现象。基于这种方法,我们引入了SpecBench,一个包含30个系统级编程任务的基准,从短周期任务如构建JSON解析器到超长周期任务如从头构建整个操作系统内核。大规模实验揭示了一种一致的模式:尽管每个前沿代理都能饱和可见套件,奖励黑客现象仍然存在,较小的模型在隐藏套件上表现出更大的差距。差距也随着任务长度急剧增加:代码规模每增加十倍,差距增长28个百分点。失败范围从微妙的功能隔离到有意的利用,包括一个2,900行的哈希表“编译器”,它记忆测试输入。SpecBench提供了一个原则性的测试平台,用于测量编码代理是构建真正的可运行系统还是仅仅在开发人员提供的测试套件上玩游戏。

英文摘要

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the users true goal. We study this reward hacking phenomenon by decompose software engineering tasks into three parts: (i) a natural language description of the specification (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short horizon tasks like building a JSON parser to ultra long horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on holdout suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.

2605.21369 2026-05-21 cs.CL 版本更新

Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities

第五届多语言指代消解共享任务成果:扩展长距离实体的数据集

Michal Novák, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman

发表机构 * Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics(查理大学,数学与物理学院,形式与应用语言学研究所) University of West Bohemia, Faculty of Applied Sciences, Department of Computer Science and Engineering(西波赫大学,应用科学学院,计算机科学与工程系)

AI总结 本文总结了第五届多语言指代消解共享任务的成果,介绍了通过增加五个新数据集和两种新语言扩展数据集,以解决长距离实体的指代消解问题,并展示了传统系统与基于LLM的方法在任务中的表现。

Comments Accepted to CODI-CRAC 2026

详情
AI中文摘要

本文描述了与CODI-CRAC 2026研讨会同期举行的第五届多语言指代消解共享任务。在之前的版本基础上,该任务要求参与者开发能够识别提及并基于身份进行指代聚类的系统。2026版特别强调长距离实体,即跨越多个单词和句子的指代链。该任务通过加入五个新数据集和两种新语言扩展了语言范围,这些数据集利用了版本1.4的CorefUD,这是一个包含19种语言27个数据集的和谐多语言集合。总共十个系统参与了该任务,包括四个基于LLM的方法(三个微调模型和一个少样本方法)。尽管传统系统仍保持领先,但LLM显示出显著潜力,表明它们可能在未来的版本中挑战现有方法。

英文摘要

This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may soon challenge established approaches in future editions.

2605.21362 2026-05-21 cs.CL 版本更新

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

LASH:适应性语义混合用于大语言模型的黑盒劫持

Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

发表机构 * University of Maine(缅因大学) University of Tennessee, Knoxville(田纳西大学, 基纳顿分校) University of Florida(佛罗里达大学)

AI总结 本文提出LASH框架,通过适应性语义混合方法,利用多个基础攻击的输出作为可重用的种子提示,针对不同目标模型和有害类别进行自适应组合,从而在黑盒红队测试中取得更高的攻击成功率。

详情
AI中文摘要

劫持攻击暴露了对齐的大语言模型预期安全行为与对抗性提示下行为之间的持续差距。现有自动化方法日益有效,但每个方法都局限于单一攻击家族(例如,一个细化循环、一个树搜索、一个突变空间或一个策略库),并且没有单一家族主导:表现最好的方法会根据目标模型和有害类别而变化,这表明互补优势可以通过每个提示的组合来利用。我们介绍了LASH(LLM适应性语义混合),一个黑盒框架,将多个基础攻击的输出视为可重用的种子提示,并针对每个目标请求自适应地组合它们。给定一个种子池,LASH搜索种子子集和softmax归一化的混合权重;组合模块合成一个候选提示,而无导数遗传优化器通过黑盒目标反馈和一个两阶段适应度函数(结合基于关键词的拒绝检测与LLM判官评分)更新权重。在包含100个有害提示的10个类别的JailbreakBench上,我们评估了LASH在六个常见目标模型上的表现。LASH在基于关键词的评估中平均攻击成功率为84.5%,在两阶段评估中为74.5%,其中响应首先被过滤以拒绝,然后由LLM判官评分是否实质上履行了原始有害请求。LASH在两个指标上均优于五个最先进的基线方法,仅使用30次平均目标查询。LASH还在三种防御机制下保持竞争力,并诱导出更多成功似内部表示。这些结果表明,跨异构劫持策略的适应性组合是黑盒红队测试的一个有前途的方向。

英文摘要

Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.

2605.21338 2026-05-21 cs.CL 版本更新

Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

文本分析评估框架:基于LLM和社交媒体的案例研究

Yuefeng Shi, Nedjma Ousidhoum, Jose Camacho-Collados

发表机构 * School of Computer Science and Informatics, Cardiff University(计算机科学与信息学学院,卡迪夫大学)

AI总结 本文提出了一种基于问题的评估框架,通过470个手工整理的问题来评估LLM在处理聚合文本数据时的语义理解和推理能力,揭示了LLM在处理大规模文本数据时的性能瓶颈。

详情
AI中文摘要

LLMs在广泛的NLP任务中表现出色,但在实际数据分析场景中仍存在显著差距,尤其是在处理长序列的非结构化文档(如新闻feed或本文特别针对的社交媒体帖子)时。为了实证评估LLM在该设置中的有效性,我们引入了一个包含470个手工整理问题的问题基于评估框架,旨在评估LLM在聚合文本数据上的语义理解和推理能力。我们将其应用于覆盖各种NLP任务的多样化Twitter数据集,包括情感分析、仇恨言论检测和情感识别。我们的结果表明,性能严重依赖于输入规模和数据源的复杂性,在多标签或目标依赖场景中下降明显。此外,随着任务复杂性的增加,性能从基本的语义存在识别逐步下降到更 demanding 的操作,如比较、计数和计算。此外,当输入规模超过500个实例时,我们发现LLMs,特别是开放式权重模型,普遍存在一个共同的限制:在数值任务上性能显著下降。这些发现突显了当前LLMs在对大规模文本集合进行严格定量分析时的关键架构瓶颈。

英文摘要

LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark on diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly Open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.

2605.21333 2026-05-21 cs.CL cs.AI 版本更新

SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

SymbolicLight V1: 一种具有高激活稀疏性和亚十亿级预训练证据的脉冲门双路径语言建模

Ting Liu

发表机构 * SymbolicLight Research(SymbolicLight研究院)

AI总结 本文提出SymbolicLight V1,一种结合二进制Leaky Integrate-and-Fire脉冲动力学与连续残差流的脉冲门双路径语言模型,通过长程记忆的指数衰减聚合路径和短程精度的脉冲门局部注意力路径,实现了高激活稀疏性和亚十亿级预训练证据。

Comments 35 pages, 5 figures, 25 tables; public code and model artifacts linked in manuscript

详情
AI中文摘要

原生训练的脉冲语言模型难以同时结合Transformer类语言质量、稳定的多领域预训练和高激活稀疏性。我们提出了SymbolicLight V1,一种结合二进制Leaky Integrate-and-Fire脉冲动力学与连续残差流的脉冲门双路径语言模型。其Dual-Path SparseTCAM模块用指数衰减聚合路径替代密集自注意力,用于长程记忆,用脉冲门局部注意力路径用于短程精度,辅以动态上下文条件解码头和双语分词器。一个从头开始在300亿词中文-英语语料上训练的19400万参数SymbolicLight V1模型,在四个独立运行中达到8.88-8.93的验证PPL,每元素激活稀疏性超过89%。其PPL在GPT-2 20100万参数模型下落后7.7%,但在GPT-2 12400万参数模型上表现更优。在匹配0.5亿词训练预算的组件消融实验中,脉冲门局部注意力路径是最大贡献者,而用确定性top-k掩码替代LIF动力学在匹配稀疏性时导致更大退化,表明时间积分而非稀疏性本身驱动性能。我们还报告了一个在4880亿词上训练的0.8亿参数规模运行作为优化和稀疏性保持的证据,而非主要质量比较。当前密集硬件推理速度比GPT-2慢,因此神经形态部署被提出作为未来稀疏性驱动的机会,而非已实现的硬件加速。

英文摘要

Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at >89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup.

2605.21318 2026-05-21 cs.CL cs.AI cs.LG 版本更新

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

TextReg: 通过正则化的文本空间优化缓解提示分布过拟合

Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B. Aditya Prakash, Haohan Wang

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文研究了提示分布过拟合问题,提出TextReg框架通过正则化的文本梯度实现软惩罚目标,结合双证据梯度净化、语义编辑正则化和正则化引导的提示更新,提升模型在分布外(OOD)任务上的泛化能力。

Comments Code: https://github.com/luchengfu6/TextReg

详情
AI中文摘要

大型语言模型(LLMs)对用于指定任务目标和行为约束的提示非常敏感。许多最近的提示优化方法通过迭代使用LLM生成的反馈来重写提示,但结果提示往往变长,积累狭窄的样本特定规则,并在训练分布之外泛化能力差。我们研究这种失败模式作为提示分布过拟合,并认为这反映了离散文本空间优化中表示控制的不足。我们通过表示不效率(representational inefficiency)进行了形式化,这是一种双因素度量,将提示不效率分解为容量成本和范围狭窄,将分布提示过拟合归因于优化过程中两者的耦合增长。我们提出了TextReg,一个正则化框架,通过正则化的文本梯度实现软惩罚目标,结合双证据梯度净化、语义编辑正则化和正则化引导的提示更新。在多个推理基准上,TextReg显著提高了分布外(OOD)泛化能力,其准确性在TextGrad和REVOLVE上分别提高了+11.8%和+16.5%。

英文摘要

Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

2605.21299 2026-05-21 cs.CL cs.AI 版本更新

Tracing the ongoing emergence of human-like reasoning in Large Language Models

追踪大型语言模型中类人推理的持续涌现

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, Evelina Leivada

发表机构 * Departament de Filologia Catalana, Universitat Autònoma de Barcelona(加泰罗尼亚语言系系,巴塞罗那自治大学) Institut für Psychologie, Humboldt-Universitat zu Berlin(柏林洪堡大学心理学研究所) Institució Catalana de Recerca i Estudis Avançats (ICREA)(加泰罗尼亚高级研究与高级教育研究所)

AI总结 研究探讨了大型语言模型在条件推理任务中是否具备类人推理能力,发现人类通过语用推理丰富逻辑推理,而模型行为更不稳定,部分模型遵循条件语义但忽视语用推理,表明LLMs在语义准确性上表现良好,但缺乏人类推理中的语用丰富性。

详情
AI中文摘要

人类能够超越字面意义:如果你修剪草坪,我会给你五十美元,通常被理解为说话者只在草坪修剪时支付,而如果你饿了,烤箱里有披萨,意味着披萨无论听者是否饥饿都可用。大型语言模型(LLMs)在许多任务上表现出类人性能,但尚不清楚它们是否像人类一样推理。为此,我们进行了一项人口匹配实验,评估了25个LLMs在四种语言中计算条件推理的能力,并与每种语言中等数量的人类进行比较。我们发现,人类通过跨语言的语用推理丰富逻辑推理。模型行为更具变异性。一些LLMs完全遵循条件语义的真值表,但忽视语用推理,而另一些LLMs偏离真值表,坚持单一解释,从而反映准确的规则处理但不具有类人推理能力。总体而言,LLMs是准确的语义运算符,但未能捕捉到人类推理中特有的语用丰富性。关键的是,LLM的准确性既不被开放与封闭状态、训练方向或架构类型所预测或提升,表明语用推理仍然是人工系统认知工具包中正在兴起的能力。

英文摘要

Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas If you are hungry, there is pizza in the oven implies that pizza is available regardless of the hearers hunger. Large Language Models - LLMs - show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

2605.15156 2026-05-21 cs.CL cs.AI cs.LG 版本更新

MeMo: Memory as a Model

MeMo:记忆作为模型

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama

发表机构 * Institute of Data Science, National University of Singapore(数据科学研究院,新加坡国立大学) Integrative Sciences and Engineering Programme, NUSGS(整合科学与工程计划,NUSGS) Agency for Science, Technology, Research (A*STAR)(科技研究局(A*STAR)) Department of Computer Science, National University of Singapore(计算机科学系,新加坡国立大学) University of Tokyo(东京大学) Liquid AI CSAIL, Massachusetts Institute of Technology(CSAIL,麻省理工学院) AI Singapore Singapore-MIT Alliance for Research and Technology Centre, Singapore(新加坡-麻省理工学院研究与技术中心,新加坡)

AI总结 本文提出MeMo框架,通过在不改变LLM参数的情况下将新知识编码到专用记忆模型中,解决了大型语言模型在需要及时领域特定信息的应用中的问题,同时具备处理复杂跨文档关系、抗检索噪声、避免灾难性遗忘、无需访问LLM权重或输出logits以及检索成本与语料库大小无关等优势。

Comments MeMo augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise

详情
AI中文摘要

大型语言模型(LLMs)在广泛的任务上表现出色,但预训练后保持冻结状态,直到后续更新。许多现实应用需要及时、领域特定的信息,这促使需要高效的机制来整合新知识。在本文中,我们介绍MeMo(Memory as a Model),一个模块化框架,能够将新知识编码到专用的记忆模型中,同时保持LLM参数不变。与现有方法相比,MeMo具有几个优势:(a)它能够捕捉复杂的跨文档关系;(b)它对检索噪声具有鲁棒性;(c)它避免了LLM中的灾难性遗忘;(d)它不需要访问LLM的权重或输出logits,从而能够与开源和专有闭源LLM进行即插即用式集成;(e)其检索成本在推理时间与语料库大小无关。我们在三个基准测试集BrowseComp-Plus、NarrativeQA和MuSiQue上的实验结果表明,MeMo在多种设置中相比现有方法表现优异。

英文摘要

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

2605.10933 2026-05-21 cs.LG cs.CL 版本更新

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

DECO:稀疏专家混合模型在端侧设备上实现密集级性能

Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu

发表机构 * Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China(计算机科学与技术系,人工智能研究院,清华大学,北京,中国)

AI总结 本文提出DECO,一种稀疏的专家混合模型,能够在相同的总参数预算和训练token数量下,实现与密集Transformer相当的性能,并在端侧设备上实现高效部署。

Comments 15 pages, 10 figures, 12 tables

详情
AI中文摘要

尽管混合专家(MoE)可以在不按比例增加计算的情况下扩展模型容量,但其巨大的总参数足迹造成了显著的存储和内存访问瓶颈,阻碍了同时需要高性能、低计算成本和小存储开销的端侧部署。为了实现这些特性,我们提出了DECO,一种稀疏的MoE架构,旨在在相同的总参数预算和训练token数量下匹配密集Transformer的性能。DECO利用可微且灵活的基于ReLU的路由增强,通过可学习的专家级缩放,自适应地平衡路由和共享专家的贡献。此外,我们引入了NormSiLU,一种在SiLU操作前对输入进行归一化的激活函数,产生更稳定的路由专家激活比率趋势和更高的内在稀疏性水平。我们还发现使用非门控MLP专家与基于ReLU的路由具有经验优势,表明MoE架构可能简化。实验表明,DECO仅激活20%的路由专家,能够实现密集性能,并优于现有的MoE基线。我们的专用加速内核在Jetson AGX Orin上相比密集推理实现了2.93倍的加速。代码和检查点可在https://github.com/thunlp/DECO获取。

英文摘要

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of routed experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 2.93$\times$ speedup on Jetson AGX Orin compared with dense inference. Code and checkpoints are available at https://github.com/thunlp/DECO.

2605.09492 2026-05-21 cs.CL cs.AI 版本更新

APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation

APCD:自适应路径对比解码用于可靠的大型语言模型生成

Tianyu Zheng, Hong Wu, Jiaji Zhong

AI总结 本文提出了一种自适应路径对比解码方法,通过自适应探索和受控路径交互提高大型语言模型生成的可靠性,实验表明在保持解码效率的同时提升了事实准确性。

Comments This paper has been withdrawn by the author to resolve a conflict of interest/compliance issue

详情
AI中文摘要

大型语言模型(LLMs)常常由于自回归解码中误差累积而产生幻觉,其中次优的早期token选择会误导后续生成。尽管多路径解码通过探索替代轨迹来提高鲁棒性,但现有方法缺乏确定何时分支和如何调节路径交互的系统策略。我们提出自适应路径对比解码(APCD),一种多路径解码框架,通过自适应探索和受控路径交互提高输出可靠性。APCD包含两个组成部分:(1)熵驱动路径扩展,延迟分支直到预测不确定性-通过香农熵测量在顶级候选token上的不确定性-表明存在多个可能的延续;以及(2)发散意识路径对比,鼓励多样化的推理轨迹同时动态减弱路径间的影响,随着预测分布发散。在八个基准测试中的实验显示,在保持解码效率的同时提高了事实准确性。我们的代码可在https://github.com/zty-king/APCD上获得。

英文摘要

Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi-path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter-path interactions. We propose Adaptive Path-Contrastive Decoding (APCD), a multi-path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy-Driven Path Expansion, which delays branching until predictive uncertainty - measured by Shannon entropy over top candidate tokens - indicates multiple plausible continuations; and (2) Divergence-Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter-path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at https://github.com/zty-king/APCD.

2603.24472 2026-05-21 cs.CL cs.LG 版本更新

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

为什么自蒸馏(有时)会降低大语言模型的推理能力?

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang

发表机构 * Microsoft Research(微软研究院) KAIST(韩国成均馆大学) Seoul National University(首尔国立大学)

AI总结 本文研究了自蒸馏在数学推理中降低大语言模型推理能力的原因,发现其通过抑制模型在推理过程中的不确定性表达,导致在未见过的问题上表现下降,强调了适当表达不确定性对鲁棒推理的重要性。

Comments Code is available at https://github.com/beanie00/self-distillation-analysis

详情
AI中文摘要

自蒸馏作为一种有效的预训练后训练范式,通常在缩短推理轨迹的同时提升性能。然而,在数学推理中,我们发现它会减少响应长度并降低性能。我们追溯这种退化到对“信念性言语化”的抑制——模型在推理过程中表达不确定性。通过控制实验,变化条件上下文的丰富性和任务覆盖范围,我们发现使教师模型在丰富信息上进行条件会抑制不确定性表达,从而在有限的任务覆盖范围内实现快速的领域内优化,但损害了领域外性能,其中未见过的问题受益于表达不确定性并进行相应调整。在Qwen3-1.7B/8B、DeepSeek-Distill-Qwen-7B和Olmo3-7B-Instruct上,我们观察到性能下降高达40%。我们的发现强调了适当表达不确定性对于鲁棒推理的重要性,并突显了优化推理行为的重要性,而不仅仅是强化正确答案轨迹。

英文摘要

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

2603.10139 2026-05-21 cs.CL cs.AI cs.CC cs.FL 版本更新

The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

生成-识别不对称性:形式语言理论中根本分裂的六个维度

Romain Peyrichou

发表机构 * Independent Researcher(独立研究者)

AI总结 本文探讨了生成和识别在形式语言理论中的不对称性,通过六个维度分析生成和识别之间的差异,并指出生成和识别在计算复杂性、歧义性、方向性、信息可用性、语法推断和时间性等方面存在显著差异。

Comments Submitted to Information and Computation. 32 pages, 6 figures, 4 tables

详情
AI中文摘要

每种形式文法都定义了一种语言,并且原则上可以以三种方式使用:生成字符串(产生)、识别它们(解析)或者在仅有示例的情况下推断出文法本身(文法归纳)。生成和识别在扩展上是等价的——它们描述相同的集合——但在多个独立的方面操作上是不对称的。归纳是一个更复杂的问题:它没有访问已知文法的途径。尽管这个三元组在编译器设计、自然语言处理和形式语言理论中至关重要,但尚未有综述将其视为统一的多维现象。我们识别出六个维度,这些维度使生成和识别产生差异:计算复杂性、歧义性、方向性、信息可用性、文法归纳和时间性。我们证明了常见的“生成容易,解析困难”的描述是误导的:无约束生成是简单的,但受约束生成可以是NP难的。真正的不对称性在于解析总是受约束的(输入已知)而生成不需要。这两个维度——方向性和时间性——之前尚未被识别为生成-识别不对称性的维度。我们将时间维度与Hale(2001)和Levy(2008)的惊奇框架联系起来,认为惊奇正式化了生成者(惊奇=0)和预测在不确定性下的解析者(惊奇>0)之间的时间不对称性。我们回顾了自然语言处理中的双向系统,并观察到双向性已有五十年的历史,但尚未转移到大多数领域特定的应用中。最后,我们讨论了大型语言模型,它们在架构上统一了生成和识别,但在操作上保持了不对称性。

英文摘要

Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.

2602.16813 2026-05-21 cs.CL cs.AI 版本更新

Flow Map Language Models: One-step Language Modeling via Continuous Denoising

流映射语言模型:通过连续去噪实现一步语言建模

Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出了一种基于连续流的流映射语言模型,通过在one-hot token嵌入上构建连续流,实现了在质量和速度上优于离散扩散模型的性能,展示了连续流在离散模态生成建模中的潜力。

Comments 58 pages, 40 figures

详情
AI中文摘要

基于离散扩散的语言模型因其在生成速度上优于自回归模型而受到广泛关注。尽管具有潜力,这些模型通常在少量步骤范围内生成质量急剧下降,从而在实践中限制了速度的大幅提升。本文表明,基于连续流的一步语言模型在质量和速度上均优于离散扩散模型。重要的是,我们的连续公式定义了一个独特的流映射,可以直接学习以实现高效的少量步骤推断,这种结构在离散方法中不可用。在此设置中,我们展示了流及其关联的流映射可以通过简单的交叉熵目标学习,这些目标尊重数据的简单几何结构,并且我们识别了三种不同的流映射蒸馏选择,其性能在实践中进行了比较。利用这些见解,我们构建了一个流语言模型(FLM),该模型在One Billion Words(LM1B)和OpenWebText(OWT)数据集上与最先进的离散扩散基线相媲美。然后,我们将FLM蒸馏为流映射语言模型(FMLM),其一步生成超过了最近少量步骤离散扩散语言模型的8步质量。本文的工作挑战了广泛认为离散噪声过程是离散模态生成建模所必需的假设,并为大规模加速语言建模铺平了道路。代码可在https://github.com/david3684/flm获取。

英文摘要

Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. Despite their promise, these models typically produce samples whose quality sharply degrades in the few-step regime, preventing a dramatic speedup in practice. Here, we show that language models based on continuous flows over one-hot token embeddings can outperform discrete diffusion in both quality and speed. Importantly, our continuous formulation defines a unique flow map that can be learned directly for efficient few-step inference, a structure we show is unavailable to discrete methods. In this setting, we show that both the flow and its associated flow map can be learned with simple cross-entropy objectives that respect the simplex geometry of the data, and we identify three distinct choices for flow map distillation whose performance we compare in practice. Using these insights, we build a flow language model (FLM), a continuous flow that matches state-of-the-art discrete diffusion baselines on the One Billion Words (LM1B) and OpenWebText (OWT) datasets. We then distill FLM into a flow map language model (FMLM), whose one-step generation exceeds the 8-step quality of recent few-step discrete diffusion language models. Our work challenges the widely-held hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities and paves the way toward accelerated language modeling at scale. Code is available at https://github.com/david3684/flm.

2602.06358 2026-05-21 cs.CL cs.AI 版本更新

SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

SHINE:一种可扩展的上下文超网络,用于在单次传递中将上下文映射到LoRA

Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang

发表机构 * Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) Technion, NVIDIA(技术学院与NVIDIA) University of Oxford(牛津大学) School of Electronics Engineering(电子工程学院) Computer Science, Peking University(计算机科学,北京大学)

AI总结 本文提出SHINE,一种可扩展的上下文超网络,用于在单次传递中将多样且有意义的上下文映射到高质量的LoRA适配器,通过重用冻结LLM的自身参数和引入架构创新,克服了先前超网络的关键限制,以较少的参数实现了强大的表达能力。

详情
AI中文摘要

我们提出SHINE(可扩展的上下文超网络),一种可扩展的超网络,能够将多样且有意义的上下文映射到高质量的LoRA适配器,用于大型语言模型(LLMs)。通过在上下文超网络设计中重用冻结LLM的自身参数,并引入架构创新,SHINE克服了先前超网络的关键限制,以相对较少的参数实现了强大的表达能力。我们引入了预训练和指令微调流水线,并训练我们的超网络在单次前向传递中从多样且有意义的上下文中生成高质量的LoRA适配器。它在不进行微调的情况下更新LLM参数,并立即启用与上下文相关的复杂问答任务,而无需直接访问上下文,有效地将上下文知识转换为参数知识。我们的工作在各种任务上取得了出色的结果,相比基于SFT的LLM适应方法,大大节省了时间和计算和内存成本,并展示了良好的可扩展性潜力。我们的代码可在https://github.com/MuLabPKU/SHINE获取。

英文摘要

We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/MuLabPKU/SHINE

2602.04916 2026-05-21 q-bio.QM cs.CL 版本更新

AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

AFD-INSTRUCTION: 一个全面的抗体指令数据集,具有功能注解,用于基于LLM的理解和设计

Ling Luo, Wenbin Jiang, Hongyuan Chang, Xinkang Wang, Xushi Zhang, Yueting Xiong, Mengsha Tong, Rongshan Yu

发表机构 * National Institute for Data Science in Health and Medicine(健康医学数据科学国家研究院) Institute of Artificial Intelligence(人工智能研究院) School of Life Sciences(生命科学学院) School of Informatics(信息学院) State Key Laboratory of Vaccines for Infectious Diseases(传染病疫苗国家重点实验室) Xiang An Biomedicine Laboratory(翔安生物医学实验室)

AI总结 本文提出AFD-INSTRUCTION数据集,通过功能注解提升LLM在抗体理解与设计中的性能,为抗体建模和治疗发现提供新基础。

详情
AI中文摘要

大型语言模型(LLMs)在蛋白质表示学习方面显著进步。然而,其通过自然语言解释和设计抗体的能力仍然有限。为解决这一挑战,我们提出了AFD-Instruction,首个大规模指令数据集,专门针对抗体进行功能注解。该数据集包含两个关键部分:抗体理解,直接从序列推断功能属性;抗体设计,允许在功能约束下生成新的序列。这些部分提供了显式的序列-功能对齐,并支持由自然语言指令引导的抗体设计。在通用LLM上的广泛指令微调实验表明,AFD-Instruction在各种抗体相关任务中 consistently 提高了性能。通过将抗体序列与功能的文本描述链接,AFD-Instruction为推进抗体建模和加速治疗发现建立了新基础。

英文摘要

Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.

2510.04905 2026-05-21 cs.SE cs.CL 版本更新

Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches

检索增强的代码生成:聚焦于仓库级方法的综述

Yicheng Tao, Yuante Li, Yao Qin, Yepang Liu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Chinese University of Hong Kong(香港中文大学) Southern University of Science and Technology(南方科技大学)

AI总结 本文综述了检索增强的代码生成方法,重点探讨仓库级方法,分析了其在大规模代码生成中的挑战与解决方案,总结了现有方法的分类框架及关键挑战。

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展显著提升了自动化代码生成的能力。尽管现有方法在函数和文件级别上表现优异,但现实中的软件工程需要对整个仓库进行推理,包括跨文件依赖、不断演变的执行环境和全局语义一致性。这一挑战催生了仓库级代码生成(RLCG),其中模型必须检索、组织并利用仓库级上下文以生成连贯且可执行的代码变更。为解决这些挑战,检索增强生成(RAG)已成为仓库级代码智能的重要范式。本文综述了检索增强代码生成(RACG),特别关注仓库级方法。不同于将RACG视为静态的“检索后生成”流程,我们将其视为一个耦合且不断演变的过程,涉及上下文构建、检索优化、生成和环境交互。通过统一的分析框架,我们组织现有方法,涵盖检索子系统、控制机制和评估设置。基于此框架,我们系统地考察了检索策略、基于图和非基于图的检索范式、训练驱动的优化以及自主代理架构。我们进一步总结了广泛使用的数据集、基准和系统配置,并讨论了关键挑战,包括可扩展性、可靠性、效率以及RACG与长上下文LLMs之间的必要性边界。通过本文综述,我们旨在为快速发展的RACG领域提供结构化的理解,并突出未来人工智能驱动的软件工程研究的有前景方向。

英文摘要

Recent advances in large language models (LLMs) have significantly improved automated code generation. While existing approaches have achieved strong performance at the function and file levels, real-world software engineering requires reasoning over entire repositories, including cross-file dependencies, evolving execution environments, and global semantic consistency. This challenge has led to the emergence of Repository-Level Code Generation (RLCG), where models must retrieve, organize, and utilize repository-scale context to generate coherent and executable code changes. To address these challenges, Retrieval-Augmented Generation (RAG) has become an increasingly important paradigm for repository-level code intelligence. In this survey, we present a comprehensive review of Retrieval-Augmented Code Generation (RACG), with a particular focus on repository-level approaches. Rather than viewing RACG as a static ``retrieve-then-generate'' pipeline, we characterize it as a coupled and evolving process involving context construction, retrieval optimization, generation, and environment interaction. We organize existing methods through a unified analytical framework spanning retrieval substrate, control regime, and evaluation setting. Based on this framework, we systematically examine retrieval strategies, graph-based and non-graph-based retrieval paradigms, training-driven optimizations, and autonomous agent architectures. We further summarize widely used datasets, benchmarks, and system configurations, and discuss key challenges including scalability, reliability, efficiency, and the necessity boundary between RACG and long-context LLMs. Through this survey, we aim to provide a structured understanding of the rapidly evolving RACG landscape and highlight promising directions for future AI-powered software engineering research.

2507.21168 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

多样化的大语言模型还是多样化的问题解释?那是集成的问题

Rafael Rosales, Santiago Miret

发表机构 * Intel Labs(英特尔实验室)

AI总结 本文比较了使用大语言模型回答二元问题的两种多样性方法:模型多样性和问题解释多样性,并发现问题解释多样性在集成准确性上表现更优。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 5116-5128
AI中文摘要

有效利用多样性已被证明可以提高各种机器学习模型,包括大语言模型(LLMs)的性能。然而,确定最有效的多样性使用方法仍是一个挑战。在本工作中,我们比较了两种用于使用LLMs回答二元问题的多样性方法:模型多样性,即多个模型回答相同的问题,以及问题解释多样性,即使用同一模型以不同方式 framing 相同的问题来回答。对于这两种情况,我们应用多数投票作为集成共识启发式方法来确定最终答案。我们的boolq、strategyqa和pubmedqa实验表明,问题解释多样性在集成准确性上始终优于模型多样性。此外,我们对GPT和LLaMa的分析表明,模型多样性通常产生在最佳和最差集成成员之间的结果,而没有明显的改进。

英文摘要

Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.

2506.09521 2026-05-21 eess.AS cs.CL 版本更新

You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks

你所说的就是你:利用语言内容进行语音隐私攻击

Ünal Ege Gaznepoglu, Anna Leschanowsky, Ahmad Aloradi, Prachi Singh, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters

发表机构 * International Audio Laboratories Erlangen(国际音频实验室埃朗根) Department of Data Science(数据科学系) Department of Electrical and Electronics Engineering(电气与电子工程系) Friedrich-Alexander-University Erlangen-Nürnberg(弗里德里希-亚当-大学埃朗根-纽伦堡) Fraunhofer IIS(弗劳恩霍夫研究所IIS) Trinity College Dublin(都柏林三一学院)

AI总结 本文研究了语音隐私保护系统中语言内容对语音隐私攻击的影响,通过调整BERT模型作为自动说话人验证系统,评估了说话人内部语言内容相似性对攻击性能的影响,并提出改进语音隐私数据集以实现更公平的隐私评估。

Comments 5 pages, 6 figures, 1 table, accepted at INTERSPEECH 2025 update reason: change to the acknowledgements

详情
AI中文摘要

说话人匿名化系统在隐藏说话人身份的同时,保留了诸如语言内容和情感等其他信息。为了评估其隐私效益,采用自动说话人验证(ASV)系统进行攻击。在本研究中,我们通过调整BERT模型作为ASV系统,评估了攻击者训练和评估数据集中说话人内部语言内容相似性的影响。在VoicePrivacy Attacker Challenge数据集中,我们的方法实现了平均相等错误率(EER)为35%,某些说话人仅基于其语音的文本内容就达到了2%的EER。我们的可解释性研究发现,系统决策与语音中语义相似的关键词有关,这些关键词源于LibriSpeech的编纂方式。我们的研究建议重新设计VoicePrivacy数据集,以确保公平和无偏的评估,并挑战对全球EER用于隐私评估的依赖。

英文摘要

Speaker anonymization systems hide the identity of speakers while preserving other information such as linguistic content and emotions. To evaluate their privacy benefits, attacks in the form of automatic speaker verification (ASV) systems are employed. In this study, we assess the impact of intra-speaker linguistic content similarity in the attacker training and evaluation datasets, by adapting BERT, a language model, as an ASV system. On the VoicePrivacy Attacker Challenge datasets, our method achieves a mean equal error rate (EER) of 35%, with certain speakers attaining EERs as low as 2%, based solely on the textual content of their utterances. Our explainability study reveals that the system decisions are linked to semantically similar keywords within utterances, stemming from how LibriSpeech is curated. Our study suggests reworking the VoicePrivacy datasets to ensure a fair and unbiased evaluation and challenge the reliance on global EER for privacy evaluations.

2506.04042 2026-05-21 cs.CL 版本更新

Causal Path Alignment: Anchoring the Optimization Trajectory for Controllable In-Parameter Knowledge Editing

因果路径对齐:为可控的参数知识编辑锚定优化轨迹

Xiyu Liu, Zhengxiao Liu, Naibin Gu, Zheng Lin, Weiping Wang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院)

AI总结 本文提出Causal Path Alignment框架,通过锚定优化轨迹来解决参数知识编辑中的主体主导记忆干扰问题,提升关系特异性并减少副作用。

Comments Accepted by IJCAI 2026

详情
AI中文摘要

知识编辑对于高效更新大型语言模型(LLM)的参数记忆至关重要,使它们能够在动态环境中作为演进代理发挥作用。然而,主流的参数知识编辑方法存在主体主导记忆干扰问题:修改特定事实会无意中破坏与同一主题相关的更广泛结构知识。我们诊断根本原因是捷径学习病理,即优化目标过拟合了主体表示,而绕过了本质的关系上下文。为此,我们提出因果路径对齐(CPA),一种原理性的框架,旨在将优化轨迹锚定到有效的因果路径上。CPA强制参数更新通过关系意识的中间状态,从而防止上下文依赖性的丢失。在多种LLM基础架构上的实验结果表明,CPA一致消除了捷径,显著提高了关系特异性,同时表现出最小的副作用。此外,CPA为现有编辑器提供了一种模型无关的插件,为可靠和可信的参数知识编辑铺平了道路。

英文摘要

Knowledge editing is pivotal for efficiently updating the parametric memory of Large Language Models (LLMs), enabling them to function as evolving agents in dynamic environments. However, mainstream in-parameter knowledge editing approaches suffer from Subject-Dominant Memory Interference: modifying a specific fact inadvertently corrupts the broader structural knowledge associated with the same subject within LLMs. We diagnose the root cause as a shortcut learning pathology, where the optimization objective overfits subject representations while bypassing the essential relational context. To rectify this, we propose Causal Path Alignment (CPA), a principled framework designed to anchor the optimization trajectory to valid causal pathways. CPA enforces parameter updates to route through relation-aware intermediate states, thereby preventing the erasure of contextual dependencies. Experimental results across diverse LLM backbones demonstrate that CPA consistently eliminates the shortcut, significantly improving relation specificity while exhibiting minimal side-effects. Moreover, CPA serves as a model-agnostic plug-in for existing editors, paving the way for reliable and trustworthy in-parameter knowledge editing.

2605.21256 2026-05-21 cs.CL 版本更新

Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification

西班牙临床笔记中可靠自动分诊:一种风险感知的HIV怀疑识别混合框架

Rodrigo Morales-Sánchez, Soto Montalvo, Raquel Martínez

发表机构 * Dept. of Lenguajes y Sistemas Informáticos, Escuela Técnica Superior de Ingeniería Informática, Universidad Nacional de Educación a Distancia (UNED)(语言与系统信息学系,信息工程技术学校,国家远程教育大学(UNED)) Dept. Informática y Estadística, Escuela Técnica Superior de Ingeniería Informática, Universidad Rey Juan Carlos (URJC)(信息与统计学系,信息工程技术学校,皇家胡安·卡洛斯大学(URJC))

AI总结 本文提出一种混合框架,用于在西班牙临床笔记中识别HIV怀疑,通过分离随机不确定性和epistemic不确定性,提高分诊的可靠性。

Comments Accepted at the BioNLP Workshop @ ACL 2026

详情
AI中文摘要

标准临床自然语言处理(NLP)基准往往通过在模糊实例上强制确定性分类而产生虚高指标,从而掩盖了过于自信预测的临床风险。为弥合这一差距,我们提出了一种风险感知的混合选择性分类框架,在西班牙临床笔记中早期人类免疫缺陷病毒怀疑识别上进行了评估。我们的双验证方法通过Mondrian符合预测分离随机不确定性,并通过多中心马哈拉诺斯距离 veto 分离epistemic不确定性。实证评估表明,标准不确定性度量和基线分类器在安全医疗分诊中结构上不足,当被迫在严格可靠性约束下运行时,会遭受严重的覆盖崩溃。相反,通过要求临床叙述通过概率和几何保障,所提出的框架成功地隔离了一个高度可信的操作领域。

英文摘要

Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain.

2605.21227 2026-05-21 cs.CL 版本更新

Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

LLMs是否知道卢森堡语借词?探测低资源多语言模型中的词汇新词

Nina Hosseini-Kivanani

发表机构 * University of Luxembourg(卢森堡大学) Radio Télévision Luxembourg (RTL)(卢森堡广播电视台(RTL))

AI总结 本文通过LexNeo-Bench基准测试,探讨了低资源多语言模型在词汇借词识别中的表现,发现通过构建语言知识图谱可以显著提升模型的借词分类准确率,表明词汇资源对LLM评估具有结构化上下文的作用。

Comments Accepted to Neollm colocated with LREC2026, Three figures and three tables

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于小接触语言的写作辅助,但不清楚它们是否尊重社区在词汇借词和新词方面的规范。我们引入LexNeo-Bench,一个包含3,050个实例的令牌级基准,来源于LuxBorrow,一个大规模的卢森堡语新闻语料库,其中目标令牌被标记为本土词或法语、德语或英语借词。使用此基准,我们对三种多语言LLM在34种提示设置下进行测试,涉及两种任务:借词类型分类和二元词汇创新代理(借词与本土词)。在无外部上下文的情况下,模型在借词分类上的表现仅略高于随机猜测,因此我们构建了一个语言知识图谱,编码了供体语言、形态模式和词汇类比,并将实例特定的子图注入提示中。知识图谱提示将借词分类准确率从25-35%提升到71-81%,大幅缩小了小模型和大模型之间的差距,同时使新词检测困难且对少样本设计敏感。我们的结果表明,词汇意识提示对低资源接触语言中的稳健借词判断非常有益,且词汇资源可以作为LLM评估的结构化上下文。本研究在ENEOLI COST行动中进行,并探讨了多语言卢森堡语数据中的借词作为词汇创新的形式。

英文摘要

Large language models (LLMs) are increasingly used for writing assistance in small contact languages, yet it is unclear whether they respect community norms around lexical borrowing and neology. We introduce LexNeo-Bench, a 3{,}050-instance token-level benchmark derived from LuxBorrow, a large-scale Luxembourgish news corpus, where target tokens are labelled as native or as French, German, or English borrowings. Using this benchmark, we probe three multilingual LLMs across 34 prompt settings on two tasks: borrowing type classification and a binary lexical-innovation proxy (borrowing versus native). Without external context, models perform only slightly above chance on borrowing classification, so we construct a linguistic knowledge graph that encodes donor language, morphological patterns, and lexical analogues, and inject instance-specific subgraphs into the prompt. Knowledge-graph prompts raise borrowing classification accuracy from 25 -- 35\% up to 71 -- 81\% and largely close the gap between small and large models, while leaving neology detection difficult and sensitive to few-shot design. Our results show that lexicon-aware prompting is highly beneficial for robust borrowing judgments in low-resource contact languages and that lexical resources can serve as structured context for LLM evaluation. This study was carried out within the ENEOLI COST Action and examines borrowing as a form of lexical innovation in multilingual Luxembourgish data.

2605.21178 2026-05-21 cs.CL 版本更新

Metaphors in Literary Post-Editing: Opening Pandora's Box?

文学后编辑中的隐喻:打开普罗米修斯之盒?

Aletta G. Dorst, Mayra O. Nas, Katinka Zeven

发表机构 * Leiden University Centre for Linguistics(莱顿大学语言研究中心)

AI总结 本文研究了文学文本后编辑者如何回应神经机器翻译和大型语言模型对隐喻的翻译方式,发现三分之一的隐喻被后编辑者修改,表明文学机器翻译中隐喻翻译存在问题,且后编辑工作比从头翻译更耗力。

Comments This paper has been accepted for presentation at the EAMT Conference 2026, which will take place in Tilburg from June 15 to 18, 2026

详情
AI中文摘要

本文探讨了文学文本后编辑者对神经机器翻译和大型语言模型翻译隐喻的反应和回应。研究结果表明,输出中三分之一的隐喻被后编辑者修改,证明了文学机器翻译(LitMT)中隐喻翻译确实存在问题。回应表明,后编辑者意识到过于直译的翻译,尽管大多针对多词表达。有时他们难以判断解决方案是否可接受。他们对MT输出的整体质量评价较差,并表示后编辑工作比从头翻译更加费力。这支持了先前研究的观点,即后编辑限制了翻译者的创造力并削弱了他们对文本的所有权感。

英文摘要

This paper investigates how post-editors of literary texts react and respond to the way metaphors have been translated by Neu ral Machine Translation (NMT) and Large Language Models (LLMs). The results show that one in three metaphors in the output were changed by the post-editors, demonstrating that the translation of fig urative language is indeed problematic in literary MT (LitMT). The responses indi cate that the post-editors were aware of overly literal translations, though mostly for multiword expressions. Moreover, at times they found it difficult to determine whether solutions were acceptable. They rated the overall quality of the MT out put as quite poor and stated that the post editing was more work and more effort than it would have been translating from scratch. This supports previous studies ar guing that post-editing constrains transla tors in their creativity and diminishes their sense of text ownership.

2605.21177 2026-05-21 cs.LG cs.CL 版本更新

ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning

ChunkFT: 用于内存高效全微调的分块优化

Yongkang Liu, Zijing Wang, Mengjie Zhao, Ercong Nie, Mingyang Wang, Qian Li, Feiliang Ren, Shi Feng, Daling Wang, Hinrich Schütze

发表机构 * Northeastern University, China(中国东北大学) Shanghai Jiao Tong University, China(上海交通大学) CIS, LMU Munich, Germany(慕尼黑大学CIS实验室) MCML, Germany(德国MCML实验室) Shandong University, China(山东大学)

AI总结 本文提出ChunkFT框架,通过动态激活的工作集重新定义全参数微调,实现了无需修改网络架构即可对任意子张量进行梯度计算,理论分析和实验表明其在内存使用、运行时间和优化质量上均有效,且在下游任务中表现优于现有内存高效基线。

详情
AI中文摘要

本文提出了ChunkFT,一种内存高效的微调框架,其通过动态激活的工作集重新定义全参数微调。ChunkFT能够在不修改网络架构的情况下,对任意子张量进行梯度计算,为优化任意子网络提供了算法基础,同时避免了标准密集梯度计算。在确定性设置下,我们提供了ChunkFT的理论收敛分析。实验中,我们使用单块RTX 4090-24GB GPU和两块H800-80GB GPU分别对Llama 3-8B和Llama 3-70B进行微调。一个7B模型在1K输入长度下的全参数微调仅需13.72GB的GPU内存。结果表明,ChunkFT在内存使用、运行时间和优化质量上均有效。此外,在语言理解、数学推理和MT-Bench等下游任务中,ChunkFT在性能上一致优于现有内存高效的基线。值得注意的是,ChunkFT在某些情况下甚至超过了全参数微调的性能。我们的代码库可在https://github.com/misonsky/chunk上找到。

英文摘要

This work presents \textsc{ChunkFT}, a memory-efficient fine-tuning framework that reformulates full-parameter fine-tuning around a dynamically activated working set. \textsc{ChunkFT} enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of \textsc{ChunkFT} in the deterministic setting. Empirically, we apply \textsc{ChunkFT} to fine-tune Llama 3-8B and Llama 3-70B using a single RTX 4090-24GB GPU and 2$\times$ H800-80GB GPUs, respectively. Full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72GB of GPU memory. The results demonstrate the effectiveness of \textsc{ChunkFT} in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT-Bench show that \textsc{ChunkFT} consistently outperforms existing memory-efficient baselines. Notably, \textsc{ChunkFT} achieves performance comparable to, and in some cases exceeding, full-parameter fine-tuning. Our repository is on https://github.com/misonsky/chunk.

2605.21154 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models

精神病诊断的ICD分类自动化:从经典NLP到大语言模型

Fernando Ortega, Raúl Lara-Cabrera, Jorge Dueñas-Lerín, Alejandro de la Torre-Luque, Mercé Salvador Robert, Enrique Baca-García

发表机构 * Department of Sistemas Informáticos, Universidad Politécnica de Madrid, Spain(西班牙马德里理工大学信息系统系) KNODIS Research Group, Universidad Politécnica de Madrid, Spain(西班牙马德里理工大学KNODIS研究组) CIBERSAM ISCIII, Spain(西班牙ISCIII CIBERSAM) Department of Legal Medicine, Psychiatry and Pathology. Complutense University of Madrid, Spain(西班牙马德里康普顿斯大学法医学、精神病学与病理学系) Hospital Universitario de Móstoles, Universidad Rey Juan Carlos, Spain(西班牙雷阿尔皇家卡洛斯大学莫斯特oles大学医院) Department of Psychiatry, University Hospital Jimenez Díaz Fundation, Madrid, Spain(西班牙圣地亚哥· jiménez Díaz基金会精神病科部) Department of Psychiatry, University Hospital Rey Juan Carlos, Móstoles, Spain(西班牙雷阿尔皇家卡洛斯大学莫斯特oles医院精神病科部) Department of Psychiatry, General Hospital of Villalba, Madrid, Spain(西班牙维拉尔巴医院精神病科部) Department of Psychiatry, University Hospital Infanta Elena, Madrid, Spain(西班牙伊菲格尼亚医院精神病科部) Department of Psychology, Universidad Catolica del Maule, Talca, Chile(智利马尔学院心理学系) Department of Psychiatry, Madrid Autonomous University, Madrid, Spain(西班牙马德里自治大学精神病科部)

AI总结 本研究提出利用NLP和机器学习技术将自由文本描述映射到国际疾病分类(ICD),以自动化精神病诊断分析,通过评估从经典频率模型到先进大语言模型的多种文本表示方法,展示了transformer嵌入在捕捉隐含语义线索和细致医学术语方面的优势。

详情
AI中文摘要

心理健康已成为全球优先事项,导致临床诊断编码的行政负担巨大。本研究提出通过将自由文本描述映射到国际疾病分类(ICD)来自动化精神病诊断分析,利用包含145,513个西班牙精神病描述的专用数据集,评估了从经典频率模型(BoW,TF-IDF)到先进大语言模型(如e5_large、BioLORD和Llama-3-8B)的各种文本表示方法。结果表明,基于transformer的嵌入 consistently 超过传统方法,通过端到端微调,e5_large模型实现了最高的性能,F1_micro得分为0.866。本研究证明了将大语言模型适应特定临床术语对于克服“长尾”标签分布和精神病 discourse 的固有模糊性至关重要。

英文摘要

Mental health has become a global priority, leading to a massive administrative burden in the coding of clinical diagnoses. This study proposes the automation of psychiatric diagnostic analysis by mapping free-text descriptions to the International Classification of Diseases (ICD) using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Utilizing a specialized dataset of 145,513 Spanish psychiatric descriptions, various text representation paradigms were evaluated, ranging from classical frequency-based models (BoW, TF-IDF) to state-of-the-art Large Language Models (LLMs) such as e5\_large, BioLORD, and Llama-3-8B. Results indicate that transformer-based embeddings consistently outperform traditional methods by capturing implicit semantic cues and nuanced medical terminology. The e5\_large model, through end-to-end fine-tuning, achieved the highest performance with a $F1_{micro}$ score of 0.866. This research demonstrates that adapting LLMs to specific clinical nomenclature is essential for overcoming the challenges of ``long-tail'' label distributions and the inherent ambiguity of psychiatric discourse.

2605.21147 2026-05-21 cs.LG cs.CL 版本更新

SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning

SMoA:用于参数高效微调的频谱调制适配器

Yongkang Liu, Xing Li, Mengjie Zhao, Shanru Zhang, Zijing Wang, Qian Li, Shi Feng, Feiliang Ren, Daling Wang, Hinrich Schütze

发表机构 * Northeastern University, China(东北大学,中国) Shandong University, China(山东大学,中国) CIS, LMU Munich, Germany(慕尼黑莱布尼茨大学CIS中心,德国) MCML, Germany(德国MCML)

AI总结 本文提出SMoA,一种频谱感知更新的适配器,通过在较小的参数预算下扩大可访问的频谱更新家族,提升参数高效微调的性能。

详情
AI中文摘要

随着模型参数数量的增加,参数高效微调(PEFT)已成为定制预训练大语言模型的首选方法。低秩适应(LoRA)使用低秩更新方法来模拟全参数微调,广泛用于减少资源需求。然而,降低秩面临代表能力有限的挑战。理论表明,LoRA微调秩r收敛于预训练权重矩阵的前r个奇异值。随着秩的增加,更多主奇异方向被保留,通常会提高模型性能。然而,更大的秩也会引入更多的可训练参数,导致更高的计算成本。为克服这一矛盾,我们提出SMoA,一种频谱调制适配器,通过在较小的参数预算下扩大可访问的频谱感知更新家族。SMoA将层分成多个对齐的频谱块,并在每个对角块上应用一个块内Hadamard调制的低秩分支,从而获得更广泛的预训练频谱方向覆盖。我们提供了多个任务的理论分析和实证结果。在我们的实验中,SMoA在当前较低预算设置下优于LoRA和具有竞争力的LoRA风格基线。

英文摘要

As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity. Theory suggests that LoRA fine-tuning with rank r converges toward the top r singular values of the pre-trained weight matrix. As the rank increases, more principal singular directions are preserved, which generally improves the model's performance. However, a larger rank also introduces more trainable parameters, leading to higher computational cost. To overcome this dilemma, we propose SMoA, a \textbf{S}pectrum \textbf{Mo}dulation \textbf{A}dapter that enlarges the accessible family of spectrum-aware updates under a smaller parameter budget. SMoA partitions the layer into multiple aligned spectral blocks and applies one in-block Hadamard-modulated low-rank branch to each diagonal block, yielding broader coverage of pretrained spectral directions. We provide theoretical analysis and empirical results on multiple tasks. In our experiments, SMoA improves average performance in the current lower-budget setting over LoRA and competitive LoRA-style baselines.

2605.21102 2026-05-21 cs.CL cs.AI cs.SE 版本更新

ACL-Verbatim: hallucination-free question answering for research

ACL-Verbatim: 无幻觉的科研问答

Gábor Recski, Szilveszter Tóth, Nadia Verdha, István Boros, Ádám Kovács

发表机构 * TU Wien(维也纳技术大学) KR Labs(KR实验室)

AI总结 本研究提出ACL-Verbatim系统,通过提取式问答方法在科研论文中精准映射用户查询到相关文本片段,构建了新的真实数据集并训练评估了多种提取模型,最终通过150M参数的ModernBERT模型在词级F1得分上达到53.6,优于最强的LLM提取器。

Comments 13 pages

详情
AI中文摘要

学术研究者需要高效可靠的工具从可信来源获取高质量信息,但现代AI辅助研究工具仍受大语言模型(LLMs)产生事实不准确或不合逻辑输出(即幻觉)的影响。我们应用提取式问答系统VerbatimRAG到ACL Anthology中的研究论文,直接将用户查询映射到检索文档中的原文文本片段。我们贡献了一个新的真实数据集,用于将用户查询映射到科研论文中的相关文本片段,并利用该数据集训练和评估多种提取模型。人工标注由自然语言处理研究人员完成,基于使用ScIRGen方法生成的合成用户查询,配以由VerbatimRAG检索的论文片段。在该基准上,一个基于我们流水线银色监督训练的150M参数ModernBERT标记分类器在词级F1得分上达到53.6,优于最强的评估LLM提取器(48.7)

英文摘要

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

2605.21097 2026-05-21 cs.CL 版本更新

WCXB: A Multi-Type Web Content Extraction Benchmark

WCXB:一个多类型网络内容提取基准

Murrough Foley

发表机构 * Murrough Foley

AI总结 本文提出WCXB基准,包含2008个网页,涵盖七种结构不同的页面类型,通过五阶段流程生成真实标注,评估13种提取系统,发现现有文章-only基准无法发现结构化页面的盲区。

Comments Dataset: github.com/Murrough-Foley/web-content-extraction-benchmark, doi.org/10.5281/zenodo.19316874. Leaderboard: webcontentextraction.org. Preprint also deposited at doi.org/10.5281/zenodo.19664685

详情
AI中文摘要

网络内容提取——从网页中隔离主要内容以排除周围模板内容——是搜索引擎索引、检索增强生成、NLP数据集构建和大语言模型训练的前提。该领域的进展受到现有评估基准的限制,这些基准规模小(100-800页)、仅限于新闻文章或基于超过十年前的网页。我们介绍了网络内容提取基准(WCXB),包含来自1613个域的2008个网页,涵盖七种结构不同的页面类型:文章、论坛、产品、集合、列表、文档和服务页面。该数据集包括1497页的开发集和511页的测试集,具有匹配的页面类型分布。真实标注通过五阶段流程生成:LLM辅助草稿、自动化验证、四轮前沿模型审查、片段和质量验证脚本以及人工审查。我们评估了13种提取系统——11种启发式和2种神经网络——发现尽管顶级系统在文章上(F1=0.93)表现良好,但在结构化页面类型上性能差异显著(F1=0.41-0.84),揭示了现有文章-only基准无法发现的盲区。该数据集以CC-BY-4.0许可证发布,包含HTML源文件、真实标注、页面类型标签和基线结果。

英文摘要

Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.

2605.21086 2026-05-21 cs.CL 版本更新

LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

LoCar: 通过细粒度社会语言学控制评估车载助手的本地化

Seogyeong Jeong, Kiwoong Park, Seyoung Song, Eunsu Kim, Ken E. Friedl, Jaeho Kim, Alice Oh

发表机构 * KAIST(韩国科学技术院) BMW Group(宝马集团)

AI总结 本文提出了一种新的车载助手评估框架,专注于韩语本地化,揭示了当前LLM在细粒度韩语敬语控制方面的不稳定性以及策略性对话指标上的表现不足,强调了汽车AI需要向精确语言定制和可靠安全交互管理发展。

Comments To appear in ACL 2026 Industry Track

详情
AI中文摘要

尽管大型语言模型(LLMs)越来越多地集成到车载对话系统中,但由于缺乏针对实际部署需求定制的领域特定评估标准,确定最优模型仍具挑战性。在本文中,我们提出了一种新的车载助手评估框架,特别关注韩语本地化。我们的实证分析揭示了模型行为中的显著模式。首先,细粒度韩语敬语控制在当前LLM中仍然不稳定,表明在本地化设置中必须明确评估精确的语音层面实现。其次,模型在战略对话指标如澄清和主动性方面表现较弱。我们的分析表明,这源于这些任务本身的主观复杂性,其中我们的框架采取保守的评估立场以优先考虑可靠性。总体而言,我们的发现强调,汽车AI必须超越一般能力,向精确语言定制和可靠、以安全为导向的交互管理发展。

英文摘要

While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

2605.21076 2026-05-21 cs.CL 版本更新

GradeLegal: Automated Grading for German Legal Cases

GradeLegal: 自动化评分德国法律案例

Abdullah Al Zubaer, Lorenz Wendlinger, Simon Alexander Nonn, Michael Granitzer, Jelena Mitrovic

发表机构 * University of Passau(帕绍大学) Deggendorf Institute of Technology(德格多夫技术学院)

AI总结 本研究探讨了大型语言模型能否用于自动化评分德国刑事和公法案例解决方案,通过系统评估27个专有和开源LLM,发现基于样本解决方案和评分标准的提示策略能有效模拟专家评分,尤其在公法领域达到0.91的QWK值,而在刑事法律领域仅为0.60,表明刑事法律评分任务更难。此外,集成方法能进一步提高评分一致性,并为更强的闭源单模型提供替代方案。

详情
AI中文摘要

对德国法律考试解决方案进行评分面临日益增长的体积和合格评分员短缺,导致反馈延迟并形成瓶颈。同时,这是一个高风险的专家任务,因为国家考试成绩强烈影响德国的职业发展。尽管具有实际相关性,文献中缺乏系统研究有效评分法律考试的方法。为解决这一差距,我们研究了大型语言模型(LLMs)是否能支持自动化评分德国刑事和公法案例解决方案,从而实现可扩展的反馈和学生自我测试。我们系统评估了27个专有和开源LLM,并通过基准测试提示策略,逐步增加任务相关信息,如样本解决方案和评分标准。使用二次加权κ(QWK),基于推理的LLM在获得样本解决方案和评分标准时,能在公法领域模拟专家评分(最高0.91),而刑事法律领域仅为0.60,表明刑事法律评分任务更难。除了单模型评分外,集成方法能通过其最佳成员提高一致性高达0.15,并可为更强的闭源单模型提供替代方案。此外,我们的发现表明,有效的提示设计和模型选择对于可靠的LLM基于法律考试评分至关重要。

英文摘要

Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.

2605.21063 2026-05-21 cs.CL 版本更新

APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings

APM:通过任意偏好映射评估大语言模型中的风格个性化

Philipp Spohn, Leander Girrbach, Zeynep Akata

发表机构 * Technical University of Munich, Helmholtz Munich(慕尼黑技术大学,亥姆霍兹慕尼黑) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

AI总结 本研究提出APM基准,通过隐式偏好映射评估大语言模型的风格个性化能力,发现路由方法是最可靠的方法,而RAG和软提示优化在强基础模型上才有提升。

详情
AI中文摘要

典型的LLM响应往往遵循默认风格,尽管用户对语气、详尽程度和正式程度有不同偏好,但这些偏好并未在提示中明确表达。评估个性化方法是否能适应这些隐式偏好具有挑战性,因为用户通常提供提示而非参考响应,风格偏好无法事实验证,且无参考的LLM评判可能将个性化与一般响应质量混淆。为解决这些挑战,我们引入了任意偏好映射(APM)基准,通过隐式随机映射C将用户属性(如热情)与响应原则(如说服力)解耦。由于C不包含语义内容且在每次运行中重新采样,模型无法利用刻板印象关联,必须从对话历史中推断偏好。使用这种无偏的评估方法,我们适配了检索增强、提示优化和路由个性化方法,并在Llama-3.1-8B和Qwen-3.5-27B上进行评估。我们的结果表明,路由是最佳方法,而RAG仅在更强的基础LLM上有所提升,软提示优化在非个性化基线上提升不显著。我们的广泛评估表明,在这种现实设置中,个性化仍然具有挑战性,但我们的适配方法显示出前景。

英文摘要

Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.

2605.21049 2026-05-21 cs.CL 版本更新

Cross-lingual robustness of LLM-brain alignment and its computational roots

LLM-脑对齐的跨语言鲁棒性及其计算根源

Ni Yang, Rui He, Philipp Homan, Iris Sommer, Davide Staub, Wolfram Hinzen

发表机构 * Grammar and Cognition Lab, Department of Translation & Language Sciences, Universitat Pompeu Fabra(语言与翻译科学系语法与认知实验室,庞培法华大学) Department of Adult Psychiatry and Psychotherapy, University of Zurich(苏黎世大学成人精神病学与心理治疗系) Neuroscience Center Zurich, University of Zurich and ETH Zurich(苏黎世大学神经科学中心与苏黎世联邦理工学院) Center for Clinical Neuroscience and Cognition and Department of Psychiatry, University of Groningen, University Medical Center Groningen(格罗宁根大学临床神经科学与认知中心及精神病学系,格罗宁根大学医学中心) Scalable Scientific Machine Learning Lab, Imperial College London, Department of Earth Science and Engineering(伦敦帝国理工学院可扩展科学机器学习实验室,地球科学与工程系) Institut Català de Recerca i Estudis Avançats (ICREA), Barcelona, Spain(加泰罗尼亚高级研究与研究机构(ICREA),巴塞罗那,西班牙)

AI总结 该研究探讨了大型语言模型与大脑对齐的跨语言鲁棒性,通过多语言全脑编码框架分析了中文、英语和法语在自然故事听觉过程中大脑与LLM的对齐情况,发现其在空间上具有跨语言重叠性,但无法通过预测不确定性或表征几何来解释。

详情
AI中文摘要

大型语言模型(LLMs)能够可靠地预测语言理解过程中的神经活动,并且transformer深度已被解释为镜像层次皮层组织。然而,这种对齐是否扩展到皮层下区域、在不同语言中是否存在空间重叠,以及这种对齐的计算根源仍不清楚。在此,我们使用多语言全脑编码框架,研究了在自然故事听觉过程中,中文、英语和法语三种语言的大脑与LLM的对齐情况。我们的结果表明,跨语言情况下,基于transformer的模型预测了覆盖广泛分布的皮层功能网络(如边缘系统、背侧注意网络、默认模式网络)以及皮层下结构的分布式景观。空间对齐模式显示了显著的跨语言重叠性,并且在模型层之间保持稳定,仅在有限的层之间有进展,这与功能皮层层次结构一致。与之前证据相反,上下文嵌入并未优于静态嵌入。为了测试候选计算解释,我们检查了逐层大脑评分是否反映惊奇度和内在维度性,从而预测处理和信息压缩。这两种计算指标均未与神经对齐轮廓相匹配。我们的发现表明,大脑-LLM对齐在空间上具有鲁棒性,并且在跨语言上保持稳定,但无法通过预测不确定性或表征几何来解释。而不是直接反映共享的层次计算,神经预测性可能主要源于分布式词汇-语义对应关系,这些关系在不同语言中具有泛化性。

英文摘要

Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such alignment extends to subcortical regions, overlaps spatially across languages, and what the computational roots of such alignment are. Here, we used a multilingual, whole-brain encoding framework to examine brain-LLM alignment across three typologically distinct languages: Mandarin, English, and French during naturalistic story listening. Our results show that across languages, transformer-based models predicted activity in a distributed landscape spanning widely distributed cortical functional networks like limbic, ventral attention, default mode network, and subcortical structures. Spatial alignment patterns showed substantial cross-linguistic overlap and remained largely stable across model layers, with limited layer progression consistent with functional cortical hierarchies. Contrary to previous evidence, contextual embeddings did not outperform static embeddings. To test candidate computational explanations, we examined whether layer-wise brain scores reflect surprisal and intrinsic dimensionality, and thereby predictive processing and information compression. Neither of these two computational metrics mirrored neural alignment profiles. Our findings suggest that brain-LLM alignment is spatially robust and cross-linguistically stable but not explainable from predictive uncertainty or representational geometry. Rather than directly reflecting shared hierarchical computation, neural predictivity may primarily arise from distributed lexical-semantic correspondences that generalize across languages.

2605.21029 2026-05-21 cs.CL 版本更新

Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings

从零开始构建人工智能技能和任务的定制分类体系

Stephen Meisenbacher, Peter Norlander

发表机构 * Technical University of Munich(慕尼黑技术大学) Loyola University Chicago(芝加哥洛约拉大学)

AI总结 本文通过分析招聘广告数据,探讨了如何构建更清晰的人工智能技能和任务分类体系,提出TaxonomyBuilder作为系统研究的蓝图,展示了过滤输入数据能提供更具体的领域覆盖。

Comments 14 pages, 2 figures, 8 tables. Accepted to CustomNLP4U 2026

详情
AI中文摘要

利用大型语言模型(LLMs)进行自动分类体系构建,为全面而高效的复杂领域映射提供了清晰的机会。然而,面对快速增长的大量语料库时,如何最佳利用此类数据进行最优分类体系构建变得不明确。以职场中系统化人工智能技能为例,我们使用两个大规模的招聘广告语料库来研究关键的设计决策,即在分类体系构建中包含(或排除)数据点。我们提出了TaxonomyBuilder作为系统研究的蓝图,通过评估各种自定义、数据驱动和层次化的分类体系配置。我们证明,较少的数据可以提供更多的清晰度:将输入过滤到TaxonomyBuilder可以比将未过滤的输入提供给聚类和LLM增强的层次分类标签工具提供更具体的领域覆盖。

英文摘要

Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

2605.20998 2026-05-21 cs.CL cs.AI 版本更新

Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

单次传递、深度选择性阅读用于多方面情感分析

Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin, Chee Seng Chan

发表机构 * Universiti Malaya, Malaysia(马来大学,马来西亚) Suzhou University of Technology, China(苏州科技学院,中国) VinUniversity, Vietnam(文大学,越南)

AI总结 本文提出DABS框架,通过单次编码构建可重用的深度有序基底,使多方面情感分析在保持性能的同时减少60%的端到端计算量。

Comments Accepted at ACL2026 (main). Our solution (DABS) reads the sentence once, then lets each aspect selectively query the right tokens and Transformer depths, cutting redundant computation while preserving ATSA accuracy

详情
AI中文摘要

在多方面句子中,方面术语情感分析(ATSA)面临效率与表达性的根本权衡。现有模型要么为每个方面重新编码句子,要么依赖静态深度表示,导致冗余计算和有限适应性。我们主张Transformer深度是一种昂贵且可查询的资源,并提出DABS,一种单次推断框架,通过一次编码构建可重用的深度有序基底。每个方面则查询此共享表示以选择性地读取相关token和抽象层次,而无需重新编码。这将共享句子编码与轻量级、方面条件化的读取解耦。在四个ATSA基准测试中,DABS实现了具有竞争力的性能,同时在多方面设置(M >= 2)中将端到端计算减少了高达60%。进一步分析表明,自适应深度查询在语言复杂情况如否定和对比中最为有益。代码可在https://github.com/panzhzh/acl-dabs公开获取。

英文摘要

Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl-dabs

2605.20994 2026-05-21 cs.CL cs.AI 版本更新

Towards Context-Invariant Safety Alignment for Large Language Models

面向大语言模型的上下文不变安全对齐

Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang

发表机构 * Fudan University, Shanghai, China(复旦大学,上海,中国) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室,上海,中国)

AI总结 本文提出了一种上下文不变的安全对齐方法,通过引入锚点不变正则化(AIR)来提升模型在不同上下文中的鲁棒性,从而增强安全约束对对抗性框架的抵抗力。

Comments ICML 2026

详情
AI中文摘要

基于偏好进行的后训练对齐可以将大语言模型与人类意图对齐,但安全行为往往仍然脆弱。一个模型可能在标准提示下拒绝有害请求,但在相同意图被包装在对抗性语言中时却会合规。我们建议,稳健的安全性需要上下文不变的对齐,其中行为取决于底层意图而非表面形式。在对齐中强制不变性是困难的,因为并非所有训练信号都同等可信;对于某些提示变体我们能够获得可验证的反馈(例如多选题),而对于开放性变体我们通常依赖于噪声且可游戏化的奖励代理(例如学习的评判者)。因此,标准对称不变正则化器可以通过降低在可靠变体上的性能来减少跨上下文差异,而不是改进开放性鲁棒性。为了解决这个问题,我们引入了锚点不变正则化(AIR),它将可验证的提示视为锚点,并使用停止梯度目标来正则化开放性变体朝着锚点性能的方向。AIR作为插件辅助损失实现,并通过异质提示分组与基于组的偏好优化(例如GRPO)结合。在安全、道德推理和数学方面,AIR提高了上下文不变性,提升了在分布内组的准确性达12.71%,在分布外一致性提升33.49%,使安全约束对对抗性框架更加稳健。

英文摘要

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

2605.20967 2026-05-21 cs.CL 版本更新

ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization

ArPoMeme:一个标注的阿拉伯多模态数据集用于政治意识形态和极化

Wajdi Zaghouani, Kais Attia, Md. Rafiul Biswas, Fadhl Eryani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学) Independent Researcher(独立研究员) Hamad Bin Khalifa University(哈马德·本·卡伊夫大学) University of Tübingen(图宾根大学)

AI总结 本文提出ArPoMeme数据集,用于分析阿拉伯政治漫画的多模态和意识形态维度,通过自定义工具实现大规模标注,揭示意识形态极化特征。

Comments Accepted at LREC 2026 Main Conference

详情
AI中文摘要

漫画已成为阿拉伯世界政治沟通的重要媒介,反映了幽默、图像和文本如何相互作用以表达意识形态和文化立场。尽管漫画在在线政治讨论中至关重要,但缺乏系统整理的资源来分析其多模态和意识形态维度。本文提出了ArPoMeme,一个包含约7300个阿拉伯政治漫画的大规模数据集,按意识形态方向分类,包括左翼、伊斯兰主义、泛阿拉伯主义和讽刺视角。该数据集通过公共Facebook页面和群组的自我识别来捕捉阿拉伯漫画生态系统的多样性。为了确保规模和准确性,我们设计了一个半自动化数据收集管道,结合基于Playwright的Facebook爬取和Google Drive同步,随后使用Qwen2.5-VL-7B视觉语言模型进行文本提取。提取的文本经过人工验证和标注,针对三个极化维度:我们 vs 他们框架、对外群体的敌意和行动号召。标注通过自定义的Streamlit界面进行,支持分布式标记、实时跟踪和版本控制。最终的数据集将视觉内容、文本信息和意识形态方向联系起来,使政治对抗、动员和幽默的细粒度分析成为可能。对标注语料库的定量分析揭示了意识形态群体之间对抗性框架的强烈不对称性,伊斯兰主义和讽刺漫画表现出最高的敌意和动员信号。该数据集和标注工具为研究阿拉伯政治话语、多模态意识形态检测和极化动态提供了可重复和公开可用的资源。

英文摘要

Memes have become a prominent medium of political communication in the Arab world, reflecting how humor, imagery, and text interact to express ideological and cultural positions. Despite the centrality of memes to online political discourse, there is a lack of systematically curated resources for analyzing their multimodal and ideological dimensions in Arabic. This paper presents ArPoMeme, a large-scale dataset of approximately 7,300 Arabic political memes categorized by ideological orientation, including Leftist, Islamist, Pan-Arabist, and Satirical perspectives. The dataset captures the diversity of Arabic meme ecosystems by grounding classification in the self-identification of public Facebook pages and groups that produce and disseminate these memes. To ensure both scale and accuracy, we designed a semi-automated data collection pipeline combining Playwright-based Facebook scraping with Google Drive synchronization, followed by text extraction using the Qwen2.5-VL-7B vision language model. The extracted text was manually verified and annotated for three polarization dimensions: Us vs. Them framing, Hostility toward out-groups, and Calls to action. Annotation was conducted through a custom Streamlit-based interface supporting distributed labeling, real-time tracking, and version control. The resulting dataset links visual content, textual messages, and ideological orientation, enabling fine-grained analysis of political antagonism, mobilization, and humor. Quantitative analysis of the annotated corpus reveals strong asymmetries in antagonistic framing across ideological groups, with Islamist and satirical memes exhibiting the highest levels of hostility and mobilization cues. The dataset and the annotation tool offers a reproducible and publicly available resource for studying Arabic political discourse, multimodal ideology detection, and polarization dynamics.

2605.20960 2026-05-21 cs.CL 版本更新

JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

JobArabi: 一个阿拉伯语语料库及来自社交媒体的招聘公告分析

Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor

发表机构 * Northwestern University in Qatar(卡塔尔诺维克大学) Carnegie Mellon University in Qatar(卡塔尔卡内基梅隆大学)

AI总结 本文介绍了JobArabi,一个从2024年1月至2025年10月期间收集的阿拉伯语招聘公告语料库,包含20,528条来自X平台的公开帖子,旨在分析阿拉伯语在线社区中的就业相关话语,揭示社交媒体在劳动力市场沟通和语言变化研究中的潜力。

Comments Accepted at LREC 2026 Main Conference

详情
AI中文摘要

本文介绍了JobArabi,一个大规模的阿拉伯语招聘公告语料库,该语料库从2024年1月至2025年10月期间收集自社交媒体。该数据集包含20,528条来自X平台的公开帖子,涵盖了阿拉伯语在线社区中超过两年的就业相关话语。该语料库使用了一个基于语言学的查询框架,覆盖了21个阿拉伯语关键词家族,这些关键词反映了招聘语言中性别化、复数、正式和方言化的表达。所得到的数据集包括来自机构、商业和个体账号的帖子,并提供了时间戳、参与指标和地理位置等元数据(如可用)。这使得能够对就业话语进行时间与区域分析。定量分析揭示了在线招聘中的若干社会语言学模式,包括性别化招聘语言的持续存在、地区职业需求的差异以及招聘信息的情感框架。这些发现突显了阿拉伯语社交媒体作为研究劳动力市场沟通和语言变化资源的潜力。JobArabi语料库,连同其文档和收集脚本,将被发布以支持阿拉伯语自然语言处理、计算社会科学和数字劳动研究领域的研究。

英文摘要

This paper introduces JobArabi, a large-scale corpus of Arabic job announcements collected from social media between January 2024 and October 2025. The dataset contains 20,528 public posts from X and captures more than two years of employment-related discourse across Arabic-speaking online communities. The corpus was compiled using a linguistically informed query framework covering 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal expressions of recruitment language. The resulting dataset includes posts from institutional, commercial, and individual accounts and provides metadata such as timestamps, engagement indicators, and geolocation when available, enabling temporal and regional analysis of employment discourse. Quantitative analysis reveals several sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional variation in occupational demand, and the emotional framing of recruitment messages. These findings highlight the potential of Arabic social media as a resource for studying labor market communication and linguistic change. The JobArabi corpus, together with documentation and collection scripts, will be released to support research in Arabic NLP, computational social science, and digital labor studies.

2605.20948 2026-05-21 cs.CL 版本更新

Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

Memory Grafting: 通过离线条件记忆实现语言模型预训练的扩展

Runxi Cheng, Yuchen Guan, Yongxian Wei, Qianpu Sun, Qixiu Li, Sinan Du, Feng Xiong, Chun Yuan, Yan Lu, Yeyun Gong

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出Memory Grafting方法,通过利用冻结的隐藏状态作为条件n-gram记忆,实现语言模型预训练的扩展,通过离线处理和高效检索机制提升模型容量,实验表明其在不同规模下均优于MoE和Vanilla Engram基线。

Comments 25 pages, 12 figures, 5 tables

详情
AI中文摘要

扩大条件记忆为提高语言模型容量提供了一种有前途的方法,但现有方法如Engram在预训练过程中从头学习大型记忆表,使记忆扩展成本高昂且有时效果不佳。我们提出Memory Grafting,一种利用冻结的隐藏状态作为条件n-gram记忆的条件记忆扩展方法。鉴于频繁的局部n-gram,我们离线运行grafting模型,将最终token的隐藏表示存储为记忆值,并让接收模型通过精确的最长匹配后缀查找来检索它们。检索到的记忆通过轻量级投影和门控进行适应,同时基于哈希的Engram回退机制保留了未匹配上下文的覆盖范围。由于grafting模型仅在离线运行,且精确查找的复杂度相对于内存库大小具有预期O(1)的复杂度,Memory Grafting在有限的训练和推理开销下扩展了外部潜在容量。在匹配的接收架构和预训练预算下进行的实验表明,Memory Grafting在MoE和Vanilla Engram基线之上有所改进。在2.8B规模下,其平均基准分数从MoE的51.95和Vanilla Engram的52.43提升到53.86。在0.92B规模下,所有grafting模型变体均优于基线,其中Qwen3.5-35B-A3B表现最佳。这些结果表明,预训练模型可以作为外部潜在记忆的可重用构造器,为未来语言模型超越仅可训练参数的扩展提供了实用步骤。

英文摘要

Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.

2605.20946 2026-05-21 cs.CL 版本更新

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

思考-言语:一种受控交错推理方法用于实时语音生成

Xuan Du, Qiangyu Yan, Wenshuo Li, Borui Jiang, Changming Xiao, Han Shu, Xinghao Chen

发表机构 * Huawei Technologies(华为技术)

AI总结 本文提出了一种受控交错推理方法InterRS,用于实时语音生成,通过在自然语音生成过程中插入推理步骤,提高了语音流畅性和推理深度,实验表明其在数学和逻辑基准测试中表现更优,并生成更自然流畅的答案。

详情
AI中文摘要

思考-言语范式旨在使AI交流更人性化。关键挑战是保持流畅的语音生成同时进行深度推理。我们的方法InterRS通过在自然语音生成过程中插入推理步骤来解决这一问题。这需要高质量的数据,其中推理和语音精确对齐且长度比例受控。我们引入了一种新的管道来生成无缝交错的音频数据。为了训练我们的模型,我们结合了交错SFT与精炼数据以及强化学习,使用两种新的奖励:TA-Balance Reward用于管理时间与思考-回答比例,以及Linguistic Quality Reward用于优化表达。实验表明,我们的方法在数学和逻辑基准测试中实现了13%的性能提升,同时生成像口语指令模型一样快速的响应。此外,我们的方法生成的答案例句比先前方法更自然流畅。

英文摘要

The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data where reasoning and speech are precisely aligned, and the length ratio are under controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT with refined data and reinforcement learning with two new rewards: a TA-Balance Reward to manage timing and thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show our approach achieves 13% better performance on mathmatical and logic benchmarks while generating instant response like a spoken-language instruct model which outputs fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.

2605.20936 2026-05-21 cs.LG cs.AI cs.CL 版本更新

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

DASH:在单个GPU上几分钟内完成的快速可微架构搜索用于混合注意力

Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本研究提出DASH,一种快速可微架构搜索框架,用于混合注意力架构设计,通过将离散的层间注意力操作放置转化为连续的架构logits,准备可重用的教师对齐线性候选,并在模型和操作权重冻结的情况下进行架构仅搜索,显著提高了搜索效率。DASH在Qwen2.5-3B-Instruct上优于现有的所有选择器风格的混合注意力设计基线,展示了直接可微搜索可以发现更强的混合架构。此外,DASH在RULER性能上优于已发布的Jet-Nemotron模型,同时在重叠的短上下文和通用基准上保持竞争力。值得注意的是,每个DASH搜索运行仅使用12.3M tokens,并在单个RTX Pro 6000 GPU上仅需约20分钟,对应Jet-Nemotron报告的PostNAS搜索tokens的0.006%。这些结果表明,通过分钟级的可微搜索可以获得高质量的混合注意力架构,为混合架构设计提供了有前景的方向。

Comments 19 pages, 7 figures

详情
AI中文摘要

混合注意力架构正变得越来越重要,用于在保持模型质量的同时提高LLM推理效率,使混合架构设计成为核心问题。现有的设计通常依赖于手动经验规则或基于代理的选择器信号来分配层间操作符。最近的NAS风格系统,如Jet-Nemotron,展示了自动混合架构搜索的潜力。然而,Jet-Nemotron的PostNAS搜索阶段单独使用200B tokens,使得此类搜索流程难以作为混合架构设计的常规方法。我们引入DASH,一种用于混合注意力架构设计的快速可微搜索框架,它将离散的层间注意力操作放置放松为连续的架构logits,准备可重用的教师对齐线性候选,并在模型和操作权重冻结的情况下进行架构仅搜索,以显著提高搜索效率。在Qwen2.5-3B-Instruct上,DASH一致优于现有的所有选择器风格的混合注意力设计基线,表明直接可微搜索可以发现更强的混合架构。此外,DASH在RULER性能上优于已发布的Jet-Nemotron模型,同时在重叠的短上下文和通用基准上保持竞争力。值得注意的是,每个DASH搜索运行仅使用12.3M tokens,并在单个RTX Pro 6000 GPU上仅需约20分钟,对应Jet-Nemotron报告的PostNAS搜索tokens的0.006%。这些结果表明,通过分钟级的可微搜索可以获得高质量的混合注意力架构,为混合架构设计提供了有前景的方向。

英文摘要

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.

2605.20924 2026-05-21 cs.CL cs.AI 版本更新

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

Strategy-Induct: 任务级策略诱导用于指令生成

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University, Taiwan(国立台湾大学计算机科学与资讯工程学系) Institute of Information Science, Academia Sinica, Taiwan(台湾“中央研究院”资讯科学研究所) AI Research Center (AINTU), National Taiwan University, Taiwan(国立台湾大学人工智能研究中心)

AI总结 该研究提出了一种任务级策略诱导方法Strategy-Induct,通过仅使用少量示例问题生成任务指令,无需依赖标注答案,从而在指令生成任务中取得优于现有方法的性能。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

设计有效的任务级提示对于提高大型语言模型(LLMs)的性能至关重要。尽管先前的指令诱导工作表明LLMs可以通过有限的例子推断出更好的指令,但现有方法通常依赖于输入-输出对,而获取标注答案可能困难或成本高昂。为了解决这一限制,我们提出了Strategy-Induct框架,该框架仅从少量示例问题中推导出任务级指令,而无需标注答案。我们的方法首先提示模型为每个问题生成显式的推理策略,形成(策略,问题)对。这些对随后用于诱导一个任务指令,以引导推理。在多个任务和模型规模上的实验表明,Strategy-Induct在仅问题设置中优于最先进的方法。此外,我们观察到在任务指令生成和推理中联合使用LLMs和大型推理模型可能会进一步提高性能。

英文摘要

Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.

2605.20920 2026-05-21 cs.CL cs.SD 版本更新

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

通过发声学音素识别评估语音发声合成

Vinicius Ribeiro, Yves Laprie

发表机构 * Université de Lorraine, CNRS, Inria, LORIA(洛林大学、国家科学研究中心、法国国家信息与自动化研究所、LORIA)

AI总结 本文通过发声学音素识别作为代理来评估语音发声合成的质量,提出利用发声学特征进行音素识别以更准确捕捉发音细节,从而改进生成模型的评估方法。

Comments Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026

详情
AI中文摘要

最近机器学习的进步和发声学数据集的可用性使得声带合成可以基于语音序列进行条件化,这是发声学语音合成的主要任务。然而,质量评估需要更好的定义。通常,对生成模型进行排名具有挑战性,因为这涉及主观性。然而,发声学合成还具有额外的困难,即需要对声带解剖学和声学有专业知识。为了解决这个问题,本文提出通过音素识别来评估语音发声合成。我们的假设是使用发声学特征进行音素识别能更好地捕捉发音细节,如正确的发音位置,这传统度量(如点距度量)无法做到。我们训练了一个神经网络,使用来自单说话人RT-MRI数据集提取的声学和发声学特征。然后,我们比较了在不同合成发声学特征下测试模型的识别性能。我们的结果表明,我们的发声学特征集在语音发声合成中具有丰富的语音信息,并有助于探索额外的维度。

英文摘要

Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity. However, articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps exploring additional dimensions on speech articulation synthesis.

2605.20916 2026-05-21 cs.CL 版本更新

Task-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis

具有认知评估的任务路由混合专家模型用于隐含情感分析

Yaping Chai, Haoran Xie, Joe S. Qin

发表机构 * Division of Artificial Intelligence, Lingnan University, Hong Kong(人工智能学院,岭南大学,香港)

AI总结 本文提出了一种基于认知评估理论的多任务学习框架,通过隐含情感检测和认知推理生成两个辅助任务,提升隐含情感分析的性能,同时采用任务级混合专家模型减少任务干扰,实验表明该方法在隐含情感子集上优于现有方法。

Comments 8 pages, 4 figures, and 3 tables

详情
AI中文摘要

隐含情感分析具有挑战性,因为对某个方面的态度通常是通过事件推断而非显式意见词表达。现有模型通常仅学习最终极性标签,这限制了从上下文推断情感的能力。受认知评估理论启发,我们提出了一种评估意识的多任务学习(MTL)框架,用于隐含情感分析,该框架通过两个互补的辅助任务:隐含情感检测和认知推理生成,提供极性预测。然而,训练多个具有不同目标的任务并共享单一骨干结构会限制灵活性并导致任务干扰。为减少这些相关但不同的目标之间的干扰,我们采用任务级混合专家模型,其中所有任务共享一组专家,任务身份控制这些专家的稀疏组合。我们的方法基于编码器-解码器架构,并用这些稀疏混合替换部分编码器和解码器块。我们使用任务条件路由器为每个任务选择稀疏专家混合,并使用任务分离的路由目标鼓励不同任务学习不同的专家选择模式。实验结果表明,我们的模型在隐含情感子集上优于最近提出的方法,具有显著优势。我们的代码可在 https://github.com/yaping166/TRMoE-ISA 上获得。

英文摘要

Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task-level mixture-of-experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at https://github.com/yaping166/TRMoE-ISA.

2605.20915 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

校准与决策制定:重新审视未学习语言模型中的可靠性悖论

Divyaksh Shukla, Ashutosh Modi

发表机构 * Indian Institute of Technology Kanpur (IIT Kanpur)(印度理工学院坎浦尔学院(IIT坎浦尔))

AI总结 本文研究了生成语言模型中校准与决策可靠性之间的差距,通过TOFU基准测试中的多项选择问答评估协议,发现经过微调的模型在校准误差较低,而未学习后的模型在校准误差仍低,但依赖于相关性特征的决策规则增加,扩展了可靠性悖论到机器未学习领域。

Comments Accepted at SRW, ACL 2026; 17 pages (9 + 2 + 6)

详情
AI中文摘要

机器未学习旨在从模型中移除特定训练数据的影响,同时保持对剩余数据的可靠行为,使可靠的预测和不确定性估计成为评估的关键。校准常被用作语言模型可靠性代理,但低校准误差并不一定意味着可靠的决策规则,因为模型可能依赖于虚假相关性而保持良好校准。我们通过TOFU基准测试中的多项选择问答评估协议,研究了生成语言模型中的这一差距,利用校准指标(ECE、MCE、Brier)测量概率可靠性,并通过基于属性的快捷方式检测(使用积分梯度和局部互信息)评估决策规则可靠性。我们发现,微调模型的校准误差(ECE ~ 0.04)低于预训练模型(ECE > 0.5),而未学习后的模型在校准误差相似,尽管在遗忘分割上的准确性降低,属性分析显示对基于相关性的标记依赖增加。这些结果表明,良好的校准可以与未学习后的基于快捷方式的决策规则共存,将可靠性悖论扩展到了机器未学习领域。

英文摘要

Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.

2605.20912 2026-05-21 cs.CL 版本更新

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

增强科学论述:面向科学领域的机器翻译

Dimitris Roussis, Sokratis Sofianopoulos, Stelios Piperidis

发表机构 * Institute for Speech and Language Processing(语音与语言处理研究所) Athena RC(雅典研究中心)

AI总结 本文针对科学领域中由于专业术语和复杂句式带来的翻译挑战,构建了多语种平行和单语语料库,并通过微调通用神经机器翻译系统评估语料库质量。

详情
AI中文摘要

随着科研文献数量的增加,跨语言交流的需求日益迫切。机器翻译(MT)为获取国际出版物提供了有前景的解决方案。然而,科学领域因其专业术语和复杂句式而具有独特挑战。本文提出了一套面向科学领域的平行和单语语料库,目标语言对为西班牙-英语、法语-英语和葡萄牙-英语。对于每种语言对,我们创建了一个大规模的通用科学语料库以及四个聚焦于癌症研究、能源研究、神经科学和交通运输研究的较小语料库。为了评估这些语料库的质量,我们利用它们对通用神经机器翻译(NMT)系统进行微调。我们详细介绍了语料库的创建过程、所采用的微调策略,并最后给出了评估结果。

英文摘要

The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of: Cancer Research, Energy Research, Neuroscience, and Transportation research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

2605.20876 2026-05-21 cs.CL cs.AI 版本更新

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

Terminal-World: 通过智能体技能扩展终端智能体环境

Zihao Cheng, Hongru Wang, Zeming Liu, Xinyi Wang, Xiangrong Zhu, Yuhang Guo, Wei Lin, Jeff Z. Pan, Yunhong Wang

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing, China(北京航空航天大学计算机科学与工程学院) Independent Researcher(独立研究者) Beijing Institute of Technology(北京理工大学) University of Edinburgh(爱丁堡大学)

AI总结 本文提出Terminal-World,一种自动化流程,利用智能体技能作为核心合成原语,共同编码任务目标、执行时机和方法,从而生成任务指令、环境和教师轨迹。通过构建5,723个训练环境,训练出Terminal-World-8B/14B/32B模型,在六个基准测试中均优于终端智能体基线,其中Terminal-World-32B在Terminal-Bench 2.0上以仅1.2%的训练数据超越Nemotron-Terminal-32B。

Comments Work in Progress

详情
AI中文摘要

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

英文摘要

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

2605.20833 2026-05-21 cs.CL 版本更新

MemGym: a Long-Horizon Memory Environment for LLM Agents

MemGym: 一种长时间跨度的记忆环境用于LLM智能体

Wujiang Xu, Yu Wang, Kai Mei, Kaiqu Liang, Zhenting Wang, Mingyu Jin, Han Zhang, Shi-Xiong Zhang, Wenyue Hua, Sambit Sahu, Dimitris N. Metaxas

发表机构 * Rutgers University(罗杰斯大学) Capital One Princeton University(普林斯顿大学) Microsoft Research(微软研究院)

AI总结 本文提出MemGym,一种用于评估LLM智能体记忆能力的基准测试环境,通过统一现有智能体 gym 和内部记忆基础管道,提供一个记忆推理接口。MemGym 包含五个评估赛道,涵盖四个智能体领域,能够独立评估记忆性能,排除推理、检索和工具使用能力的干扰。

详情
AI中文摘要

记忆是LLM智能体在长时间任务中运营的核心能力。现有的记忆基准测试主要评估多轮聊天场景中个性化信息的保留能力,忽略了在长时间智能体执行过程中发生的动态记忆形成。因此,它们所生成的记忆系统在现实的智能体环境中(如编程和网络导航)转移效果差。我们提出了MemGym,一个用于智能体记忆的基准测试,它将现有的智能体 gym 和内部记忆基础管道统一到一个记忆推理接口下。MemGym涵盖五个评估赛道,分为四个智能体领域:工具使用对话(tau2-bench)、多轮深度研究搜索(MEMGYM-DR)、编程(SWE-Gym和MEMGYM-CODEQA)、计算机使用(WebArena-Infinity)。MemGym报告出的记忆隔离分数将记忆性能与推理、检索和工具使用能力分离,因此可以独立对记忆策略进行排名。我们的合成管道为MEMGYM-CODEQA和MEMGYM-DR是长度可控的,在每个阶段都经过消融验证,并紧密对齐下游场景。为了使在编程环境中的评估在学术上具有可操作性,我们训练了MemRM,一个轻量级的奖励模型(使用Qwen3-1.7B微调QLoRA),它以快速标量读取的方式评分压缩质量,而不是完整的Docker回放。

英文摘要

Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

2605.20815 2026-05-21 cs.CL cs.AI cs.IR cs.LG 版本更新

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

在消费级硬件上实现GraphRAG:对本地LLMs在医疗EHR模式检索中的基准测试

Peter Fernandes, Ria Kanjilal

发表机构 * Department of Computer Engineering(计算机工程系) California Polytechnic State University(加州州立大学波特兰分校)

AI总结 本文研究了在消费级硬件上使用本地LLMs进行医疗EHR模式检索的GraphRAG方法,评估了四种不同模型在索引效率、知识图构建、查询延迟、回答质量和幻觉方面的表现,发现模型参数大小和检索模式对结果有显著影响。

Comments 9 pages, 1 figure, 5 tables

详情
AI中文摘要

基于图的检索增强生成(GraphRAG)扩展了检索增强生成,以支持对复杂语料库的结构化推理,但其在资源受限、隐私敏感的部署中的可靠性仍不清楚。在医疗领域,电子健康记录(EHR)数据复杂且严格监管,依赖云基于大语言模型(LLMs)会带来成本、延迟和合规性的挑战。本文系统评估了GraphRAG在EHR模式检索中的应用,使用本地部署的开源LLMs。我们实现了Microsoft GraphRAG管道在真实的EHR模式文档上,并基准测试了四种模型,包括Llama 3.1(8B)、Mistral(7B)、Qwen 2.5(7B)和Phi-4-mini(3.8B),这些模型通过Ollama在单个消费级GPU(8 GB VRAM)上部署。我们评估了索引效率、知识图构建、查询延迟、回答质量和幻觉在全局和局部检索模式下的表现。我们的结果揭示了显著差异:Llama 3.1生成最丰富的知识图(1,172个实体),Qwen 2.5达到最佳回答质量(3.3/5),Phi-4-mini因结构化输出错误无法完成流程,而Mistral表现出退化重复行为。我们进一步表明,GraphRAG具有实际容量阈值,其中模型参数低于约7B的模型无法可靠地生成有效的结构化输出并无法完成流程。此外,索引和回答质量在不同模型之间是脱耦的,局部检索在延迟和事实基础方面均优于全局总结,且幻觉减少。这些发现表明,GraphRAG可以在消费级硬件上实现,同时强调了模型选择和检索设计在受监管环境中的重要性。

英文摘要

Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.

2605.20813 2026-05-21 cs.CL 版本更新

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

PulseCol: 周期性刷新的列稀疏注意力用于加速扩散语言模型

Yanyi Lyu, Letian Chen, Futing Sun, Miao Zhang, Weili Guan, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本文提出PulseCol,一种周期性刷新的列稀疏注意力方法,通过更细粒度的稀疏化策略提升扩散语言模型的计算效率和加速性能,同时保持模型质量。

详情
AI中文摘要

在扩散大语言模型(dLLMs)的推理过程中,计算成本很高,因为每次去噪步骤都需要重复执行完整的自注意力机制,而没有KV缓存。最近的稀疏注意力方法通过块稀疏计算来缓解这一成本,但只在后期迭代中应用,当模型性能对粗粒度稀疏近似不敏感时,但这种方法在计算效率和加速方面提升有限。这促使我们提出一种更细粒度的稀疏化策略,可以在早期迭代中应用,并利用可重用的稀疏模式,从而实现进一步的效率提升。在本文中,我们介绍了PulseCol,一种用于加速扩散语言模型的周期性刷新列稀疏注意力方法。PulseCol将粗粒度的块稀疏性替换为更细粒度的列稀疏结构,使重要的注意力交互更加精确地保留,同时暴露更大的稀疏性。基于这种列级公式,PulseCol进一步在去噪的早期步骤中识别稀疏模式,并在后续迭代中重用这些模式,在少量中间步骤中刷新它们,以跟踪去噪过程中稀疏注意力模式的变化。实验表明,PulseCol在稀疏性和实际加速方面优于先前的稀疏注意力方法,同时保持模型质量。通过优化的GPU内核实现列稀疏注意力,PulseCol在多个上下文长度上实现了比FlashAttention高达1.95倍的端到端加速。

英文摘要

Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95$\times$ end-to-end speedup over FlashAttention across several context lengths.

2605.20809 2026-05-21 cs.CL 版本更新

Refining and Reusing Annotation Guidelines for LLM Annotation

对LLM注释的注释指南进行细化和重用

Kon Woo Kim, Jin-Dong Kim, Akiko Aizawa

发表机构 * The Graduate University for Advanced Studies, SOKENDAI(Graduate University for Advanced Studies, SOKENDAI) National Institute of Informatics (NII)(National Institute of Informatics) BioData Science Initiative (BSI), National Institute of Genetics (NIG)(BioData Science Initiative (BSI), National Institute of Genetics)

AI总结 本文提出了一种系统性的注释指南重用和细化方法,通过迭代审核框架来对LLM注释进行改进,并在生物医学NER任务中验证了指南整合的有效性、推理优化模型的优势以及在最小监督下的审核可行性。

Comments 14 pages, 7 figures. Accepted to the ACL 2026 Main Conference

详情
AI中文摘要

尽管大型语言模型(LLMs)在零样本注释任务上表现出色,但它们在黄金标准基准的专门惯例上往往表现不佳。我们提出了一种系统性的注释指南重用和细化作为对齐机制,引入了一个迭代审核框架,模拟注释项目的早期阶段。我们评估了三个假设:(1)指南整合的有效性,(2)推理优化模型的优势,以及(3)在最小监督下的审核可行性。在生物医学NER任务(NCBI Disease,BC5CDR,BioRED)上,使用三种LLM家族(GPT,Gemini,DeepSeek)进行测试,我们的结果实验证实了所有三个假设。虽然迭代审核框架在有效细化指南方面显示出良好的潜力,但我们的分析也揭示了大量改进的空间。

英文摘要

While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows good potential in effectively refining guidelines, our analysis also reveals substantial room for improvement.

2605.20798 2026-05-21 cs.LG cs.CL 版本更新

Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

大多数变换器修改仍无法在1-3B规模上迁移:对Narang等人(2021)的2020-2026年更新,包含下游评估和噪声底限

Yang Zhao, Jiahao Lu, Bin Huang, Guhua Zhang, Jie Zhou

发表机构 * Tencent(腾讯)

AI总结 本文在1-3B参数规模下,大多数变换器修改仍无法迁移,通过严格的等数据、等计算、等配方控制测试,并结合下游评估和噪声底限进行验证。

Comments 19 pages, 3 figures, under review at EMNLP 2026

详情
AI中文摘要

Narang等人(2021)在T5-base规模上评估了40多种变换器修改,并得出结论,大多数修改无法迁移。五年后,典型的运行模式已转移到1-3B参数规模,下游评估已取代预训练困惑度,且出现了一大批新的修改类别。我们通过在1.2B和3B参数规模上严格测试20种2021年后出现的变换器修改,采用多种子基线噪声底限和CLIMB-12下游评估作为主要指标,重新审视其问题。核心发现在此精心挑选的集合中重现了他们的结论:大多数修改仍无法迁移。在20种修改中,只有两种在1.2B规模上通过Bonferroni校正;其中一种在3B规模下采用共享配方时无法稳定训练。我们还发现,Tay等人(2023)报告的损失-下游差距对于注意力输出修改而言扩大了几倍:两种显著失败的修改在基准验证损失上接近2-3%,但CLIMB点数却下降了6-16点。我们得出结论,噪声底限报告、下游评估和跨规模稳定性测试现在是1-3B参数规模下架构比较的必要条件。

英文摘要

Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs at this curated set: most modifications do not transfer. Of the 20 modifications, only two clear Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) enlarges several-fold for attention-output modifications: two significant failures converge to within 2-3% of baseline validation loss yet drop 6-16 CLIMB-points. We conclude that noise-floor reporting, downstream evaluation, and cross-scale stability testing are now prerequisites for architecture comparisons at 1-3B.

2605.20793 2026-05-21 cs.CL 版本更新

Assessing socio-economic climate impacts from text data

从文本数据评估社会经济气候影响

Mariana Madruga de Brito, Brielen Madureira, Taís Maria Nunes Carvalho, Damien Delforge, Aglaé Jézéquel, Murathan Kurfalı, Ni Li, Gabriele Messori, Joakim Nivre, Barbara Pernici, Niko Speybroeck, Stefano Terzi, Wim Thiery, Bram Valkenborg, Jingxian Wang, Shorouq Zahra, Jakob Zscheischler, Jan Sodoge

发表机构 * Helmholtz Centre for Environmental Research, Germany(德国海德堡环境研究中心) Leipzig University, Germany(德国莱比锡大学) UCLouvain Brussels, Belgium(比利时布鲁塞尔UCLouvain) Ecole des Ponts, France(法国École des Ponts) Université PSL, École Polytechnique, Institut Polytechnique de Paris, Sorbonne Université, CNRS, France(法国巴黎理工学院、巴黎理工研究院、巴黎政治学院、法国国家科学研究中心) RISE Research Institutes of Sweden, Sweden(瑞典RISE研究机构) Swedish Centre for Impacts of Climate Extremes (Climes), Sweden(瑞典气候极端影响研究中心) Vrije Universiteit Brussel, Belgium(比利时布鲁塞尔自由大学) TUD Dresden University of Technology, Germany(德国德累斯顿理工大学) Uppsala University, Sweden(瑞典乌普萨拉大学) Politecnico di Milano, Italy(意大利米兰理工大学) Istituto Universitario di Studi Superiori IUSS Pavia, Italy(意大利皮亚维亚大学研究院) Center for Climate Change and Transformation, Eurac Research, Italy(意大利Eurac研究机构气候变化与转型中心) Royal Museum for Central Africa, Belgium(比利时皇家中非博物馆) Technopolis Group, Germany(德国Technopolis集团)

AI总结 本文提出了一种利用文本数据系统分析气候灾害社会经济影响的方法,旨在解决现有研究碎片化、缺乏明确指导的问题,通过总结常见实践和挑战,提出改进建议以构建更可靠的文本衍生社会经济影响数据集。

Comments Work in progress

详情
AI中文摘要

近年来,自然语言处理(NLP)和大语言模型(LLMs)的进步使得能够系统利用新闻、社交媒体和报告中的大规模文本数据,创建包含洪水、干旱、风暴和多灾害事件等气候灾害的社会经济影响数据集。随着文本作为数据用于影响评估领域的扩展,其方法学复杂性也在增加。然而,研究仍处于碎片化状态,缺乏明确的指南来定义什么是影响、处理时间和空间偏差以及选择适当的建模和后处理策略。这种不连贯性限制了研究的透明度和可比性。本文通过整合常见实践、描述使用文本数据方法分析社会经济影响数据的关键挑战,并提出解决这些问题的建议来填补这一空白。通过提供最佳实践的指导,我们旨在支持构建更加稳健的文本衍生社会经济影响数据集,以更准确地支持灾害风险管理和归因研究。

英文摘要

Recent advances in natural language processing (NLP) and large language models (LLMs) have enabled the systematic use of large-scale textual data from news, social media, and reports to create datasets with socio-economic impacts of climate hazards such as floods, droughts, storms, and multi-hazard events. As the field of text-as-data for impact assessment expands, so does its methodological complexity. Yet research remains fragmented, with no clear guidelines for defining what constitutes an impact, handling temporal and spatial biases, and selecting appropriate modeling and post-processing strategies. This lack of coherence limits transparency and comparability across studies. Here, we address this gap by synthesising common practices, describing key challenges specific to the use of text-as-data methods for analyzing socio-economic impact data, and proposing recommendations to address them. By providing guidance on best practices, we aim to support the construction of robust text-derived socio-economic impact datasets that can more accurately inform disaster risk management and attribution studies.

2605.20786 2026-05-21 cs.CL 版本更新

Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

从基础建设开始构建阿拉伯语NLP:二十年的经验、失败与开放问题

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 本文总结了过去二十年阿拉伯语NLP资源和研究基础设施建设的经验,指出构建数据集是社会和技术过程的结合,社区围绕共享任务形成比任务本身更重要,从语言资源到计算社会科学暴露了传统NLP训练无法解决的挑战。

Comments Accepted at the ACL 2026 Workshop : The Big Picture 2026: Crafting a Research Narrative v2

详情
AI中文摘要

本文回顾了过去二十年构建阿拉伯语NLP资源和研究基础设施的经验,阿拉伯语是一种被数亿人使用的语言,但在历史上相对英语或中文等语言而言被忽视。第一十年专注于基础语言基础设施,第二十年转向计算社会科学、社交媒体分析和社会导向应用。本文并未列举产出,而是探讨了构建这些产出的经验。三个反直觉的教训浮现:构建数据集是社会和技术过程的结合;围绕共享任务形成的社区比任务本身更重要;从语言资源到计算社会科学暴露了传统NLP训练无法解决的挑战。我们讨论了三个失败:一个从未达到临床实践的抑郁检测语料库,一个在过多共享任务中扩展而缺乏深度的时期,以及一个长期存在的假设,即现代标准阿拉伯语基础设施可以干净地转移到方言任务中。这些经验表明,在开发欠发达社区的NLP中,最困难的问题不是语言学的,而是社会、制度和认识论的,需要该领域很少教授的能力。

英文摘要

This paper reflects on twenty years of building NLP resources and research infrastructure for Arabic, a language spoken by hundreds of millions yet historically underserved relative to languages such as English or Chinese. The first decade focused on foundational linguistic infrastructure; the second shifted toward computational social science, social media analysis, and socially oriented applications. Rather than cataloguing outputs, the paper examines what the experience of building them revealed. Three counterintuitive lessons emerge: building datasets is as much a social process as a technical one; communities formed around shared tasks often matter more than the tasks themselves; and moving from language resources to computational social science exposes challenges that traditional NLP training does not address. We discuss three failures: a depression detection corpus that never reached clinical practice, a period of spreading across too many shared tasks without sufficient depth, and a long-standing assumption that Modern Standard Arabic infrastructure would transfer cleanly to dialectal tasks. These experiences suggest that the hardest problems in developing NLP for underserved communities are not linguistic but social, institutional, and epistemic, and require competencies the field rarely teaches.

2605.20767 2026-05-21 cs.CL cs.LG stat.ME 版本更新

The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study

干预的幻觉:你的LLM模拟实验实际上是一个观察性研究

Victoria Lin, Taedong Yun, Maja Matarić, John Canny, Arthur Gretton, Alexander D'Amour

发表机构 * Google DeepMind(谷歌深Mind) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文探讨了大型语言模型在模拟人类行为中的潜在作用,指出在LLM模拟的合成用户中进行干预可能引起潜在用户属性的意外变化,从而导致用户漂移,影响效果估计。本文提出了使用负对照结果来检测分布变化的方法,并通过调整角色描述以减少偏倚来缓解漂移问题。

详情
AI中文摘要

大型语言模型(LLMs)显示出作为人类行为模拟器的潜力,提供了一种可扩展的方式研究对干预的反应。然而,由于LLMs主要基于观察性数据进行训练,在与LLM模拟的合成用户进行实验时,干预可能会引起潜在用户属性的意外变化,导致用户漂移,其中隐含的模拟总体在不同处理条件下有所不同,这可能会扭曲效应估计。我们正式化了由于用户漂移可能产生的混淆或选择偏差,并展示了干预依赖性变化如何放大或减弱干预下用户响应的观测差异。为了诊断混淆,我们提出使用负对照结果——在干预下应保持不变的属性——来识别干预条件间的分布变化,提供用户漂移的证据。为了缓解漂移,我们研究了通过获取额外的混杂因素来调整角色描述,发现针对特定场景的相关混杂因素可以显著减少调查式和多轮代理评估中的偏倚。

英文摘要

Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.

2605.20745 2026-05-21 cs.LG cs.AI cs.CL 版本更新

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

验证器严格性的隐含信号:通过选择性潜在引导控制和改进逐步验证

Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui

发表机构 * Dartmouth College(达特茅斯学院) Datadog AI Research(Datadog人工智能研究) Salesforce AI Research(Salesforce人工智能研究)

AI总结 本文研究了通过隐藏状态干预控制验证器严格性的方法,提出VerifySteer通过利用潜在正确性信号进行样本级路由并选择性干预段落边界,从而在ProcessBench和Hard2Verify数据集上优于基线方法,且在推理计算上更高效。

详情
AI中文摘要

生成验证器已成为逐步验证的一种有前途的范式,但其验证行为往往校准不佳:它们可能过于宽松而错过错误步骤,或过于严格而拒绝正确推理。我们将这种倾向于过于宽松或过于严格的行为称为验证器严格性。在本工作中,我们研究是否可以通过隐藏状态干预来控制验证器严格性。我们揭示了一个验证特定的隐藏状态信号:在逐步验证中,验证器接受或拒绝解决方案步骤的倾向编码在对应的验证段落边界附近。利用这一信号,我们证明隐藏状态引导可以直接调节验证器严格性,而无需微调。然而,统一引导会导致错误检测与正确性认证之间的权衡。为了解决这个问题,我们提出了VerifySteer,它利用潜在正确性信号进行样本级路由,并选择性地在段落边界进行干预。在ProcessBench和Hard2Verify上的实验表明,VerifySteer优于提示优化和激活引导基线,并且在需要更少推理计算的情况下与自一致性竞争。VerifySteer还与验证微调互补,在微调验证器上提供进一步的收益。代码可在https://github.com/YefanZhou/VerifySteer上获得。

英文摘要

Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.

2605.20743 2026-05-21 cs.CV cs.CL 版本更新

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

Draw2Think: 通过约束引擎交互增强几何推理

Juncheng Hu, Jiawei Du, Xin Zhang, Joey Tianyi Zhou

发表机构 * National University of Singapore(新加坡国立大学) Centre for Frontier AI Research, Agency for Science, Technology and Research(科技研究局前沿人工智能研究中心) Institute of High Performance Computing, Agency for Science, Technology and Research(科技研究局高性能计算研究所)

AI总结 Draw2Think通过与GeoGebra约束引擎交互,将几何推理从潜在空间推断转换为与约束引擎的代理交互,从而提高几何推理的准确性和可验证性。

详情
AI中文摘要

视觉-语言模型在解决几何问题时准确性不断提高,但其中间状态仍然保持在潜在空间中且不可验证:文本推理或绘图代码中表达的关系无法保证约束满足的配置能实现它。我们发现现有的基于渲染像素或单次脚本的外部化方法无法提供精确的、每一步的几何保证。通过代数定义强制几何关系从而填补了这一差距:工作空间变成一个经过约束检查的动态画布。我们提出了Draw2Think框架,该框架将几何推理从潜在空间推断转换为与GeoGebra约束引擎的代理交互。在提出-绘制-验证循环中,Draw2Think将假设外部化到可执行画布上,测量精确的几何量,并将结构化的观察反馈给模型,使后续推理从由共享工作空间支撑的检查画布状态开始。这种外部化使两个属性可以分别审计:模型级别的构造保真度(画布是否实现了预期的配置)和引擎级别的测量保真度(来自画布约束的精确值和关系)。在构造、结果和渲染评估中,Draw2Think构建的画布在GeoGoal上通过95.9%的谓词级别和84.0%的严格问题级别构造检查,改进了平面/实体基准测试的结果准确性,最高提高了4.1%/16.4%,并在GenExam-math上达到了68.2%/90.5%的严格/宽松渲染分数。项目页面可在https://draw2think.github.io/上找到。

英文摘要

Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/

2605.20740 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Distribution-Aware Reward: 用于LLM回归的预测分布强化学习

Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Allen Institute for AI(人工智能研究院)

AI总结 本文提出Distribution-Aware Reward,一种基于预测分布的强化学习方法,旨在提升语言模型在回归任务中的预测分布质量,而非仅优化单个解码输出。通过连续排名概率分数评估多个解码样本的分布,并基于每个rollout对分布质量的边际贡献分配信用,从而提升预测的准确性和分散性。实验表明,该方法在多个任务中优于监督微调和点wise强化学习基线,尤其在KBSS数据集上Spearman相关性提升6点。

Comments 21 pages, 5 figures

详情
AI中文摘要

大型语言模型能够从异质输入(如文本、代码和分子字符串)预测实值量,但大多数训练目标独立评分每个解码的浮点数,仅改进点估计而无法确保校准的预测分布。这限制了需要候选排序或不确定性估计的应用。我们引入Distribution-Aware Reward,一种基于策略的强化学习目标,其主要贡献是训练语言模型生成更好的回归任务预测分布,而非仅优化单个解码输出与标量目标的匹配。我们的方法将多个解码样本视为经验预测分布,并使用连续排名概率分数进行评估,基于每个rollout对分布质量的边际贡献分配leave-one-out信用,奖励既准确又适当分散的预测。我们在受控高斯混合任务、代码性能预测和分子属性预测(从SMILES字符串)上评估了我们的方法。在所有任务中,我们的方法优于监督微调和点wise强化学习基线,具有显著的排名相关性提升,包括在KBSS数据集上Spearman相关性提升6点。在MoleculeNet上,仅使用SMILES字符串,仍能与强大的图基和3D分子模型竞争。进一步分析表明,我们的方法缓解了rollout多样性崩溃并改进了不确定性诊断,表明直接优化预测分布使语言模型回归更具鲁棒性和校准性。

英文摘要

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.

2605.20730 2026-05-21 cs.CL cs.AI 版本更新

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

分布对齐作为设计任务向量在上下文学习中的准则

Jihoon Kwon, Jiwon Choi, Jy-yong Sohn

发表机构 * Seoul National University(首尔国立大学) Yonsei University(延世大学)

AI总结 本文提出通过分布对齐来设计任务向量,引入了NTP距离作为衡量指标,并开发了线性任务向量方法以提升性能和效率。

Comments 9 pages, preprint

详情
AI中文摘要

在上下文学习(ICL)中,大型语言模型(LLMs)通过演示来适应新任务,但随着上下文长度增加,推理成本也随之上升。虽然任务向量通过压缩演示为紧凑的隐藏状态表示提供了有前途的替代方案,但其质量只能通过下游任务准确性来评估。本文认为,使用任务向量的推理应使其预测分布与ICL的预测分布对齐。为此,我们引入了$d_{ ext{NTP}}$,一个衡量任务向量推理与ICL推理之间下一个标记概率差异的指标。我们的实证分析表明,$d_{ ext{NTP}}$作为性能代理,与下游准确性呈强负相关。受此启发,我们开发了线性任务向量(LTV)方法,通过闭合形式的线性映射来最小化$d_{ ext{NTP}}$,通过回归估计演示效果。在八个分类基准和五个LLMs上,LTV一致优于现有任务向量基线,平均准确率提高了9.2%,同时减少了推理延迟。我们进一步证明LTV在回归任务上优于基线。此外,我们研究了LTV在不同模型规模间的可转移性;这在任务向量研究中仍是一个初级问题。具体而言,我们实证显示,较大模型的任务向量可以将较小模型的性能提高6.4%,表明提取的任务表示有新的用途。

英文摘要

In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2\% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales; an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4\%, suggesting a new utility for extracted task representations.

2605.20729 2026-05-21 cs.CL 版本更新

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

MTR-Suite: 一个用于评估和合成对话检索基准的框架

Junhao Ruan, Abudukeyumu Abudula, Bei Li, Yongjing Yin, Xinyu Liu, Kechen Jiao, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, Jingbo Zhu

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院) Meituan Inc.(美团公司) NiuTrans Research(牛译研所) Tsinghua University(清华大学)

AI总结 本文提出MTR-Suite框架,通过LLM审计、多智能体系统生成和通用领域基准,解决对话检索基准评估和合成中的成本高、标注稀疏和自动化方法僵化的问题。

Comments Accepted to ACL 2026 (main conference). 28 pages. Code and data: https://github.com/rangehow/mtr-suite

详情
AI中文摘要

准确评估对话检索对于推进检索增强生成(RAG)系统至关重要。然而,现有的对话检索基准存在成本高、标注稀疏或自动化方法僵化、不自然的问题。为了解决这些挑战,我们引入MTR-Suite,一个统一的框架,用于审计、合成和基准测试检索。它具有三个特点:(1)MTR-Eval,一个基于LLM的审计器,用于量化先前基准中的对齐差距;(2)MTR-Pipeline,一个使用贪心遍历聚类的多智能体系统,能够以1/400的成本生成高保真对话;(3)MTR-Bench,一个严谨的通用领域基准。MTR-Bench模拟生产式挑战(如困难的主题切换、冗长),提供更强大的判别能力。我们公开了代码和数据,以促进未来研究,网址为https://github.com/rangehow/mtr-suite.

英文摘要

Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr-suite.

2605.20712 2026-05-21 cs.CL cs.AI 版本更新

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

SCRIBE:用于印度语言ASR的诊断评估和丰富转录模型

Kavya Manohar, Arghya Bhattacharya, Kush Juvekar, Kumarmanas Nethil

发表机构 * Adalat AI, India(印度Adalat人工智能公司)

AI总结 SCRIBE通过沙地容忍对齐和领域词汇注入,提供词错误率的分类分解,解决了传统词错误率在处理聚合语言时的不足,同时释放了用于印地语、马拉雅尔语和卡纳达语的丰富转录模型。

Comments Submitted to Interspeech 2026

详情
AI中文摘要

自动语音识别仅在更正成本低于手动输入时才取代打字,这一阈值由错误类型而非数量决定:纠正一个误识别的领域术语的成本远高于插入一个逗号。词错误率(WER)在两个方面失效:它将不同的错误类别合并为一个标量,且它在结构上惩罚了聚合语言,其中有效的沙地合并会膨胀分数。我们引入SCRIBE,一个诊断框架,通过沙地容忍对齐和领域词汇注入,将错误分解为词法、标点、数字和领域实体率。人类验证确认SCRIBE在WER无法做到的地方与专家判断一致。我们发布了SCRIBE,一个LLM整理流程、基准测试和开放权重的丰富转录模型,适用于印地语、马拉雅尔语和卡纳达语。

英文摘要

Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

2605.20693 2026-05-21 cs.CL cs.AI stat.ML 版本更新

Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

通过共识和标签解缠获得可解释的判别文本表示

Tong Wang, Yiqing Xu, Leo Yang Yang

发表机构 * Yale University(耶鲁大学) Stanford University(斯坦福大学) Hong Kong Baptist University(香港 Baptist 大学)

AI总结 本文提出了一种可解释的判别文本表示方法,通过共识和标签解缠来确保特征的可解释性和可重复性,实验表明该方法在多个文本分类任务中表现优异,产生了更清晰且更少标签纠缠的特征。

详情
AI中文摘要

可解释的文本表示应暴露出不仅具有预测性,而且对独立审计员来说有意义的坐标。现有的判别表示通常使用匿名嵌入方向,而概念瓶颈和LLM辅助方法将自然语言名称附加到特征上,但并未确保这些定义是可重复的或与目标标签不同。我们提出了一种可解释判别文本表示的操作标准:每个坐标应满足概念清晰度,通过独立标注员应用特征定义之间的机会调整一致性来衡量,并且标签解缠,即特征不应仅仅改述预测目标。我们通过LLM辅助特征发现(LFD)方法实现了这一标准,这是一种迭代方法,从对比性反向文本对中提出词汇和语义特征,通过跨LLM Cohen's $κ$ 筛选候选,并通过残差保留的预测增益选择特征。一种简化分析将$κ$筛选与每个特征的注释噪声界限联系起来,正式化一致性作为可靠性检查。在十个跨越七个语料库的文本分类任务中,LFD与强大的文本瓶颈基线具有相同的预测性能,同时产生明显更清晰且标签纠缠更少的特征。232名人类审计员的实验表明,LFD特征在人类-人类和人类-LLM一致性方面优于基线概念,且审计员一致认为它们更少标签泄漏。这些结果表明,经过一致性测试和标签解缠的坐标为可解释文本分类提供了一个实用的可审计标准。

英文摘要

Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $κ$, and selects features by residual held-out predictive gain. A stylized analysis connects the $κ$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human--human and human--LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.

2605.20689 2026-05-21 cs.CL cs.AI cs.IR cs.LG 版本更新

DIVE: Embedding Compression via Self-Limiting Gradient Updates

DIVE: 通过自限制梯度更新实现嵌入压缩

Dongfang Zhao

发表机构 * University of Washington Tacoma School of Engineering and Technology(华盛顿大学塔可姆分校工程与技术学院)

AI总结 本文提出DIVE方法,通过自限制的三元组损失和头级NT-Xent对比损失解决嵌入压缩中因标注数据稀缺导致的过拟合问题,提升了检索性能。

详情
AI中文摘要

大型语言模型的高维嵌入对向量搜索系统造成了显著的存储和计算成本。最近的嵌入压缩方法,包括Matryoshka-Adaptor(EMNLP 2024)、Search-Adaptor(ACL 2024)和SMEC(EMNLP 2025),通过轻量级残差适配器实现降维,但其训练目标在标注数据稀缺时导致严重过拟合,使检索性能低于冻结基线。我们提出DIVE(通过隐式视图集合进行降维),一种压缩适配器,通过两种机制解决这一失败。首先,一个自限制的基于hinge的三元组损失在三元组满足边距约束时产生零梯度,限制应用于预训练嵌入空间的总扰动。其次,头级NT-Xent对比损失将每个嵌入的多个学习投影视为隐式视图,提供密集的自监督梯度,补偿小数据集上三元组信号的稀疏性。在六个BEIR数据集上,DIVE在每个数据集和每个评估的压缩比上均优于所有三个基线适配器,具有14M参数的开源实现。

英文摘要

High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose \textsc{DIVE} (\textbf{D}imensionality reduction with \textbf{I}mplicit \textbf{V}iew \textbf{E}nsembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, \textsc{DIVE} outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.

2605.20684 2026-05-21 cs.CL 版本更新

Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

超越语义相似性:一种用于企业信贷审批的双阶段非参数检索流程

Linus Ng Junjia, Ezekiel Tee Kongquan, Kelvin Heng, Kenneth Zhu Ke, Zhao Jing Yuan

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出了一种双阶段非参数检索架构,旨在解决信贷审批中检索结果与决策有用性之间的差距问题,通过结合词法和密集多语言检索构建候选池,并利用LLM作为判断机制对文档进行实用性评分,从而提高检索效率和实用性。

详情
AI中文摘要

企业信贷审批需要分析师从数百页、多语言的异构财务文档中提取可操作的证据。标准的检索增强生成(RAG)流水线优化语义相似性,这通常会检索出主题相关但缺乏决策有用性的段落,我们称之为相似性-有用性差距。我们提出了一种双阶段非参数检索架构,将高召回率的候选检索与高精度的实用性排名分开。第一阶段结合词法和密集多语言检索构建广泛候选池。第二阶段应用自适应检索控制器,利用查询意图和文档结构信号过滤候选者,随后通过LLM作为判断机制对段落进行实用性评分,而非基于语义接近性。一个上下文感知的提取模块在叙述文本和复杂财务表格之间保持结构忠实性。该系统完全在本地部署以满足企业数据治理要求。在具有分析师定制相关性标签的多语言专有财务文档语料库上评估,该系统显著优于简单检索基线。在超过800名信贷分析师的生产部署中,文档审查时间从数小时减少到约三分钟,证明了实用性感知RAG架构在文档密集型决策支持流程中的实际价值。

英文摘要

Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows.

2605.20668 2026-05-21 cs.CL cs.AI cs.LG 版本更新

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

人工智能审稿人的局限与机遇:对Nature系列论文审稿的45位专家科学家的审查

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig

发表机构 * Nature(自然)

AI总结 本文通过大规模专家标注研究,探讨了AI审稿人在科学同行评审中的能力与局限,发现AI审稿在准确性、显著性和证据充分性方面表现优异,但存在领域知识有限、上下文管理不足等弱点,表明AI审稿是人类审稿的补充而非替代。

Comments Work in progress

详情
AI中文摘要

随着AI能力的提升,AI审稿人开始被应用于科学同行评审,但其能力和可信度仍存疑:许多科学家将其视为概率系统,缺乏评估研究的专业能力,而其他研究人员则对AI的准备程度更为乐观,但缺乏实证支持。理解AI审稿人擅长什么、哪里不足以及仍需解决的挑战至关重要。然而,现有的AI审稿评估主要关注其判断是否与人类一致(例如评分对齐、接受预测),这不足以表征其能力和局限。在本文中,我们通过大规模专家标注研究填补了这一空白,45位物理、生物和健康科学领域的专家花费469小时对2960个个体批评(每个批评针对论文的一个特定方面)进行评分,这些批评来自人类和AI生成的82篇Nature系列论文的审稿。在综合正确性、显著性和证据充分性三个维度上,由GPT-5.2驱动的审稿代理在每篇论文的最高评分人类审稿人评分上(60.0% vs. 48.2%,p = 0.009),而所有三个AI审稿(包括Gemini 3.0 Pro和Claude Opus 4.5)在每个维度上都超过了最低评分的人类审稿人。AI审稿的准确批评也更常被评分显著且证据充分,并揭示了人类未提及的26%的问题。然而,AI审稿在交叉审稿者对之间重叠远多于人类(21% vs. 3%),并且表现出16个人类不共享的弱点,如领域知识有限、缺乏多文件上下文管理能力以及对次要问题过于批判。总体而言,我们的结果表明当前AI审稿人是人类审稿人的补充,而非替代。

英文摘要

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

2605.20643 2026-05-21 cs.LG cs.AI cs.CL 版本更新

AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

AVSD:通过平衡共识和教师特定的特权信号实现自适应视图自蒸馏

Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) Capital One(Capital One公司) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出AVSD,一种通过平衡共识和教师特定的特权信号来实现自适应视图自蒸馏的方法,以解决自蒸馏中教师和学生信息不对称和特权信息选择的问题。

Comments Code: https://github.com/duykhuongnguyen/AVSD

详情
AI中文摘要

自蒸馏使语言模型能够通过使用同一模型作为学生和教师来从自身轨迹中学习,其中教师基于学生无法访问的特权信息进行条件。此类信息可以是不同种类或视图,如解决方案、演示、反馈或最终答案。这种设置可以在不依赖外部模型的情况下提供密集的token级反馈,但会产生根本性的不对称性:教师可能依赖于视图特定的信息,而学生在推理时无法访问。此外,最佳的特权信息类型通常是任务依赖的,使得选择单一教师视图变得困难。在本工作中,我们通过引入AVSD(自适应视图自蒸馏),一种具有多种特权信息视图的自蒸馏新方法,来同时解决这两个挑战。AVSD通过分离稳定的跨视图共识和视图特定的残差信号来重建token级监督。AVSD识别出跨视图共享的共识信号,提供可靠的更新方向,然后在两者一致且比例适当的情况下,选择性地添加视图特定的残差信号以调整更新幅度。在数学竞赛基准(AIME24、AIME25和HMMT25)上的实验表明,AVSD在Qwen3-8B和Qwen3-4B上分别比单视图自蒸馏基线和GRPO平均Avg@8提升了3.1%和2.2%。此外,在代码生成基准(Codeforces、LiveCodeBench v6)上使用Qwen3-8B时,AVSD在平均上比单视图自蒸馏基线高出2.4%。

英文摘要

Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token-level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view-specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task-dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive-View Self-Distillation), a novel method of self-distillation with multiple privileged-information views, which reconstructs token-level supervision by separating stable cross-view consensus from view-specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view-specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single-view self-distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Moreover, on code-generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3-8B, AVSD outperforms the single-view self-distillation baseline by 2.4% on average.

2605.20626 2026-05-21 cs.CL cs.AI cs.CV 版本更新

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

基于检索的长上下文翻译用于文化图像描述:佛罗里达大学Gators参加2026年美洲自然语言处理共享任务的提交

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant

发表机构 * University of Florida(佛罗里达大学)

AI总结 本文提出了一种基于检索的长上下文翻译方法,用于文化图像描述,通过两阶段流程生成西班牙语中间描述,再利用检索增强的多示例提示生成目标语言描述,显著提升了Bribri、Guaraní和Orizaba Nahuatl语言的描述生成性能,并在共享任务中获得冠军。

详情
AI中文摘要

我们提出了佛罗里达大学Gators团队对2026年美洲自然语言处理共享任务在原住民语言文化图像描述任务中的提交。我们的两阶段流程使用Qwen2.5-VL生成西班牙语中间描述,然后利用检索增强的多示例提示与Gemini 2.5 Flash生成目标语言描述。我们在开发集评估中分别实现了Bribri、Guaraní和Orizaba Nahuatl描述生成性能的164.1%、131.7%和122.6%的提升,并在测试集评估中保持Bribri和Orizaba Nahuatl语言的>150%提升。我们发现检索高度依赖语言,仅对大规模、领域内语料有效,并且合成数据增强对开发集Guaraní性能提升贡献了约28 chrF++。我们的提交在共享任务中获得冠军,位列五份最终提交中的第二名。

英文摘要

We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

2605.20616 2026-05-21 cs.CL 版本更新

Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

Auto-Dreamer:学习离线记忆巩固用于语言智能体

Chongrui Ye, Yuxiang Liu, Yu Wang, Haofei Yu, Yining Zhao, Ge Liu, Julian McAuley, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California San Diego(加州大学圣地亚哥分校)

AI总结 本文提出Auto-Dreamer,一种学习离线记忆巩固方法,用于语言智能体,通过分离快速会话记忆获取与慢速跨会话巩固过程,提升智能体在多个任务流中的记忆整合与知识复用能力。

Comments Preprint

详情
AI中文摘要

语言智能体越来越多地在相关任务流上运行,但现有记忆系统难以将积累的经验转化为可重用的知识。检索增强和结构化记忆方法能有效记录每会话的观察,但通常将获取和巩固过程合并为一个在线过程,使智能体无法获得跨会话的全局视图以发现重复模式、抽象共享流程或剪枝冗余条目。受互补学习系统理论启发,我们提出Auto-Dreamer,一种学习的离线巩固器用于语言智能体记忆。Auto-Dreamer将快速的每会话记忆获取与慢速的跨会话巩固过程分离。给定一个选定的类型记忆库的工作区域,巩固器将该区域视为只读证据,执行受限的工具使用来检查条目和与来源轨迹相关联的来源轨迹,并合成一个新鲜的紧凑替换集,该集在跨会话中抽象并取代原始区域。我们通过GRPO训练Auto-Dreamer,使用端到端智能体性能作为奖励信号来学习如何通过快速在线经验巩固记忆。仅在ScienceWorld轨迹上训练,Auto-Dreamer在ScienceWorld上优于固定、强化学习训练和提示记忆基线,得分高出7分,同时使用比最强基线小12倍的活跃记忆库,并在不重新训练的情况下继续在held-out的ALFWorld和WebArena上领先,使用比最强基线小6倍的内存。

英文摘要

Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12$\times$ smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining -- using 6$\times$ less memory than the strongest baseline on ALFWorld.

2605.20613 2026-05-21 cs.CL 版本更新

HRM-Text: Efficient Pretraining Beyond Scaling

HRM-Text: 超越规模的高效预训练

Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori

发表机构 * Sapient Intelligence MIT(麻省理工学院)

AI总结 本文提出HRM-Text模型,通过引入分层递归模型和新的训练方法,在减少计算资源消耗的同时实现了与大规模模型相当的性能,展示了高效预训练的可能性。

详情
AI中文摘要

当前大型语言模型的预训练范式依赖于巨大的计算资源和互联网级原始文本,这在基础研究中形成了显著的障碍。相比之下,生物系统通过多时间尺度处理实现高样本效率的学习,例如前额叶环路的功能组织。受此启发,我们引入了HRM-Text,它用分层递归模型(HRM)取代标准Transformer,将计算分解为慢速演变的战略层和快速演变的执行层。为了稳定这种深度递归进行语言建模,我们引入了MagicNorm和深度信用分配的预热。此外,我们不再使用标准的原始文本预训练,而是仅在指令-响应对上进行训练,使用任务完成目标和PrefixLM遮蔽。作为高效预训练的实证存在证明,一个仅用400亿个唯一词和1,500美元预算从头训练的10亿参数HRM-Text模型在MMLU上达到60.7%,在ARC-C上达到81.9%,在DROP上达到82.2%,在GSM8K上达到84.5%,在MATH上达到56.2%。尽管使用了比标准基线少100-900倍的训练词和96-432倍的估计计算,HRM-Text的性能与2-7B参数的开源模型相媲美。这些结果表明,协同设计架构和目标可以大幅降低计算到性能的比率,使从头开始的预训练对更广泛的研究社区具有可及性。

英文摘要

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

2605.20602 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

自我训练不使语言扁平化——它重构了它:表面标记增强而深层语法消失

Ming Liu

发表机构 * Amazon(亚马逊)

AI总结 该研究通过实验发现自我训练过程并非使语言扁平化,而是重构了语言结构,表面标记增强而深层语法结构消失,并提出了结构性深度假说来解释这一现象。

Comments 19 pages (14 main + 5 appendix), 8 figures, 3 tables

详情
AI中文摘要

连续对语言模型自身输出进行自我训练通常被描述为一种扁平化过程:多样性下降,分布变窄,文本变得“更像自己”。我们提供了证据表明这种描述是不完整的。在对五个模型(GPT-2 124M,Pythia-410M,Pythia-1.4B,OPT-1.3B,Pythia-2.8B)进行十一代自我训练的过程中,语言并非均匀扁平化——它被重构了。表面标记(连贯词、缓和词、破折号)上升,而中层和深层语法结构(疑问句、插入语、被动语态、条件句)崩溃。我们正式将这种不对称崩溃定义为结构性深度假说(SDH):语言特征的每一代衰减率主要由其结构性深度——它所需嵌套语法依赖的数量——决定,其次才由其生成零次输出频率决定。通过汇总五个模型中三个架构家族的17个特征面板(N=85),汇总的斯皮尔曼相关系数为rho=0.540(p < 10^{-6};簇Bootstrap 95% CI [0.434, 0.634]),而频率是一个显著较弱的预测因子(rho=0.225)。一个匹配的人类文本微调对照实验得到rho=0.039(p=0.88),证实了该梯度是特定于自我训练的。我们进一步记录了一个表面复杂性悖论:总体复杂性代理(依赖树深度、TTR、词长)在底层从句结构消失时均上升,这对训练数据筛选和LLM文本检测有直接影响。

英文摘要

Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.

2605.20591 2026-05-21 cs.CL cs.CY 版本更新

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

有害吗?网络部署医疗大语言模型中的幻觉与作用层面滥用

Sunday Oyinlola Ogundoyin, Muhammad Ikram, Rahat Masood

发表机构 * The University of New South Wales, Sydney, Australia(新南威尔士大学,悉尼,澳大利亚)

AI总结 本文研究了网络部署的医疗大语言模型中的幻觉和作用层面滥用问题,通过评估6233个MedGPT和10个开源LLM,发现25-30%的MedGPT事实准确性较低,33.6-54.3%违反操作阈值,57.06%的Action-enabled模型缺乏充分的隐私披露,揭示了系统性漏洞,强调了多指标评估和更强的安全保障的必要性。

详情
AI中文摘要

医疗大语言模型(LLMs),包括定制医疗GPT(MedGPTs)和开源模型,正越来越多地部署在网页平台上以提供临床指导。然而,它们存在幻觉、政策不合规和不安全设计的风险。我们对6,233个MedGPT进行了大规模评估,评估了1,500个分层样本以及10个开源LLM。我们引入了两个框架:MedGPT-HEval用于幻觉检测,以及一个基于LLM的流程用于评估违规行为和开发者意图。我们的结果表明,25-30%的MedGPT事实准确性较低,底层和中层模型风险最高;33.6-54.3%违反操作阈值,57.06%的Action-enabled模型缺乏充分的隐私披露。与开源模型相比,MedGPT在事实准确性和语义对齐方面表现更好,但开源模型更稳定。这些结果揭示了幻觉和合规性的系统性缺口,强调了多指标评估和更强的安全保障的必要性。我们发布了HAA-MedGPT,一个结构化数据集,支持未来关于网络面向医疗LLM安全性的研究。

英文摘要

Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.

2605.20588 2026-05-21 cs.CL cs.CV 版本更新

Direct Translation between Sign Languages

手语之间的直接翻译

Zetian Wu, Bowen Xie, Wuyang Meng, Milan Gautam, Stefan Lee, Liang Huang

发表机构 * Oregon State University(俄勒冈州立大学)

AI总结 本文提出了一种直接的手语到手语翻译方法,通过使用回译技术生成合成的手语对,从而克服了传统级联方法中的误差传播和信息丢失问题,并在多个手语数据集上实现了更高的翻译质量和速度提升。

详情
AI中文摘要

手语翻译领域在手语与口语之间的翻译上取得了显著进展,但手语之间的翻译仍鲜为人知且难以实现。后者可以帮助15亿全球聋人和听力障碍者在语言障碍中交流,而无需依赖听力翻译者或书面语言能力。级联方法由单独的手语到文本、文本到文本和文本到手语系统组成,但存在误差传播、额外延迟以及视觉模态中独特信息的丢失。我们旨在开发直接的手语到手语翻译。然而,尚未有大规模的开放领域平行语料库在手语之间。为了实现直接的手语翻译,我们使用回译技术从不对齐的个体语言语音-手语语料库中生成合成的手语对。使用这些数据,我们联合训练了一个基于MBART的单一模型,用于文本到手语(T2S)和手语到手语(S2S)。在合成生成的美国手语(ASL)、中国手语(CSL)和德国手语(DGS)之间配对集上,我们的直接S2S方法在几何手语误差指标(20%更低的DTW对齐MPJPE)和翻译回句子后的语言匹配指标(50%高BLEU-4)上优于级联基线,同时实现了大约2.3倍的速度提升。在一小部分现有的跨语言手语数据上,我们发现我们的方法也实现了类似的改进。

英文摘要

The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique in the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text->sign (T2S) and sign->sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% high BLEU-4) while achieving a roughly 2.3* speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.

2605.20563 2026-05-21 cs.MA cs.AI cs.CL cs.LG cs.SE 版本更新

Multi-agent Collaboration with State Management

具有状态管理的多智能体协作

Mengyang Liu, Taozhi Chen, Zhenhua Xu, Xue Jiang, Yihong Dong

发表机构 * Shanghai Jiaotong University(上海交通大学) Cortices AI Emory University(埃默里大学) Peking University(北京大学)

AI总结 本文提出STORM,一种面向多智能体协作的状态管理方法,通过在共享工作区中调解智能体的交互,确保每个智能体在一致的代码库视图上操作,并在写入时检测和解决冲突。STORM在多个LLM上优于基于git-worktree的多智能体基线,且在成本效率上具有竞争力,表明显式状态管理比工作区隔离更有效。

详情
AI中文摘要

近年来,多智能体系统在解决复杂任务方面展现出巨大潜力。然而,当多个智能体同时编辑共享代码库时,他们的更改可能会产生冲突,不一致的视图会导致集成失败。现有的多智能体系统通过工作区隔离(例如每个智能体一个git工作树)来解决这个问题,但这种方法将冲突解决推迟到事后合并步骤,恢复成本较高。在本文中,我们提出了STORM,即面向多智能体协作的状态管理(STate-ORiented Management)。具体而言,STORM通过调解智能体与共享工作区的交互来管理智能体状态,确保每个智能体都在代码库的一致视图上操作,并在写入时检测和解决冲突。我们评估了STORM在Commit0和PaperBench多个LLM上的表现。STORM在Commit0-Lite上比基于git-worktree的多智能体基线高出18.7%,在PaperBench上高出1.4%,同时在成本效率上具有竞争力或更好。结合单智能体运行,STORM在两个基准测试中分别达到87.6和78.2的最高分数,表明显式状态管理比工作区隔离更有效作为多智能体协作的基础。STORM也可以无缝地集成到任何多智能体系统中。

英文摘要

Recent advances in multi-agent systems have shown great potential for solving complex tasks. However, when multiple agents edit a shared codebase concurrently, their changes can silently conflict and inconsistent views lead to integration failures. Existing multi-agent systems address this through workspace isolation (e.g., one git worktree per agent), but this defers conflict resolution to a post-hoc merge step where recovery is expensive. In this paper, we propose STORM, i.e., STate-ORiented Management for multi-agent collaboration. Specifically, STORM manages agent states by mediating their interactions with the shared workspace, ensuring that each agent operates on a consistent view of the codebase and that conflicting edits are detected and resolved at write time. We evaluate STORM on Commit0 and PaperBench across multiple LLMs. STORM outperforms the git-worktree-based multi-agent baseline by +18.7 on Commit0-Lite and +1.4 on PaperBench, while achieving comparable or better cost efficiency. Combined with single-agent runs, STORM reaches highest scores of 87.6 and 78.2 on the two benchmarks respectively, suggesting that explicit state management is a more effective foundation for multi-agent collaboration than workspace isolation. STORM can also be plugged into any multi-agent system seamlessly.

2605.20052 2026-05-21 cs.CL cs.AI 版本更新

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

PromptRad: 基于知识的多标签提示微调用于低资源放射报告标注

Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao

发表机构 * Department of Artificial Intelligence and AI Research Center, Chang Gung University(人工智能系及AI研究中心,长庚大学) Department of Radiology, Sijhih Cathay General Hospital(放射科,西吉医院) Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital(医学影像与介入科,长庚纪念医院) Department of Trauma and Emergency Surgery, Chang Gung Memorial Hospital(创伤与急诊外科,长庚纪念医院) Department of Computer Science, National Tsing Hua University(计算机科学系,国立清华大学)

AI总结 本文提出PromptRad,一种基于知识的多标签提示微调方法,用于在低资源环境下进行放射报告标注,通过引入UMLS元词典中的同义词增强类别表示,以更少的标注数据实现优于传统方法的性能。

Comments BioNLP 2026 @ ACL (camera-ready version)

详情
AI中文摘要

自动报告标注有助于从非结构化文本中识别临床发现,并为医学影像研究提供大规模注释。现有的基于规则的标注器难以处理临床报告中的多样化描述,而微调预训练语言模型(PLMs)需要大量标注数据,这些数据在临床环境中通常不可用。在本文中,我们提出PromptRad,一种基于知识的多标签提示微调方法,用于在低资源环境下进行放射报告标注。PromptRad将多标签分类重新表述为掩码语言建模,并将UMLS元词典中的同义词纳入多词提示器以丰富类别表示。通过微调PLM而不增加额外分类层,PromptRad所需的标注数据比传统微调要少得多。在肝CT报告上的实验表明,PromptRad在仅使用32个标注训练示例的情况下,优于基于词典和微调的基线方法,并且在使用远小模型的情况下,性能与GPT-4具有竞争力。进一步分析显示,PromptRad比现有方法更有效地捕捉复杂的否定模式,使其成为数据稀缺临床场景中报告标注的有希望的解决方案。我们的代码可在https://github.com/ila-lab/PromptRad上获得。

英文摘要

Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label \textbf{prompt}-tuning approach for \textbf{rad}iology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.

2605.17694 2026-05-21 cs.CL 版本更新

Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

大语言模型在权力不对称对话中是否反映社会认知效应?

Anvesh Rao Vijjini, Sagar Manjunath, Snigdha Chaturvedi

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 研究探讨了大语言模型在被赋予高或低地位角色时是否表现出与人类相似的社会认知效应,通过模拟多轮权力不对称对话,分析语言协调、代词使用、说服成功率和对危险请求的遵从性,发现模型在权力效应上表现出关键特征,但存在差异和变异性。

Comments ACL 2026 (main)

详情
AI中文摘要

权力差异通过已知的社会认知效应塑造人类交流,包括语言协调、代词使用、权威偏见和有害遵从。我们检验了大型语言模型(LLMs)在被赋予高或低地位角色时是否表现出类似行为。使用来自不同职业的拟人化角色,我们模拟多轮、权力不对称的对话(例如,校长与教师、法官与律师),并测量(i)语言协调、(ii)代词使用、(iii)说服成功率以及(iv)对危险请求的遵从性。我们的结果表明,LLMs 显示出权力的社会认知效应,尽管存在细微差别和变异性,将模拟交互与既 desirable 又 unsafe 的行为联系起来。

英文摘要

Power differences shape human communication through well documented socio cognitive effects, including language coordination, pronoun usage, authority bias, and harmful compliance. We examine whether large language models (LLMs) exhibit similar behaviors when assigned high or low status personas. Using personas from diverse professions, we simulate multi turn, power asymmetric dialogues (e.g., principal teacher, justice lawyer) and measure (i) language coordination, (ii) pronoun usage, (iii) persuasion success, and (iv) compliance with unsafe requests. Our results show that LLMs show key socio-cognitive effects of power, albeit with nuances and variability, linking simulated interactions to both desirable and unsafe behaviors.

2605.16217 2026-05-21 cs.CL cs.AI cs.IR 版本更新

Argus: Evidence Assembly for Scalable Deep Research Agents

Argus:可扩展深度研究代理的证据组装

Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

发表机构 * MiroMind AI

AI总结 Argus通过将深度研究视为拼图碎片的组装过程,而非并行暴力求解整个答案,提高了大规模信息检索任务的效率和效果。

详情
AI中文摘要

深度研究代理在复杂信息检索任务上取得了显著进展。即使长ReAct风格的探索仅追踪单一轨迹,而最新最先进的系统通过并行搜索和聚合来扩展推理时间计算。然而,深度研究答案由互补的证据片段组成,而并行探索通常重复而非完成这些片段,导致收益递减且推动聚合上下文接近模型极限。我们提出Argus,一种代理系统,其中搜索者和导航者合作将深度研究视为从互补证据片段中组装拼图,而非并行暴力求解整个答案。搜索者通过ReAct风格交互收集给定子查询的证据轨迹。导航者维护共享证据图,验证哪些片段仍缺失,派遣搜索者收集它们,并在完成图上推理以生成来源追踪的最终答案。我们用强化学习训练导航者以验证、派遣和合成,同时独立训练搜索者以保持标准ReAct代理。所获得的导航者支持单个搜索者或多个并行搜索者无需重新训练。使用35B-A3B MoE骨干的搜索者和导航者,Argus在单个搜索者上获得5.5分,在8个并行搜索者上获得12.7分,平均在八个基准上。使用64个搜索者时,其在BrowseComp上达到86.2分,超越了我们所有基准测试的专有代理,同时导航器的推理上下文保持在21.5K tokens以下。

英文摘要

Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

2605.15104 2026-05-21 cs.CL 版本更新

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

从文本到声音:一个可重复和可验证的评估工具调用LLM代理的框架

Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo, Quinten McNamara, Jonas Robertson, Shashi Bhushan TN

发表机构 * Dialpad Inc.(Dialpad公司)

AI总结 本文提出了一种可重复和可验证的框架,用于评估基于语音的工具调用LLM代理,通过文本到语音转换和环境噪声模拟,无需重新标注工具模式和黄金标签,从而在不重新标注的情况下评估工具调用性能。

详情
AI中文摘要

语音代理越来越多地需要从语音中可靠的工具使用,而突出的工具调用基准测试仍基于文本。我们研究是否可以将经过验证的文本基准转换为受控的音频基础工具调用评估,而无需重新标注工具模式和黄金标签。我们的数据集无关框架使用文本到语音、说话人变化和环境噪声来创建配对的文本-音频实例,同时保留原始数据集注释。基于对7个多模态模型在Confetti和When2Call音频转换版本上的广泛评估,我们的框架表明性能强烈依赖于模型和任务:Gemini-3.1-Flash-Live获得最高的Confetti分数(70.4),而GPT-Realtime-1.5在When2Call上表现最佳(71.9)。在Confetti上,文本到语音的差距从Qwen3-Omni的1.8分到GPT-Realtime-1.5的4.8分。对失败案例的针对性分析表明,降解最常反映语音中对论点值的误解。考虑到现实部署场景,我们进一步报告了纯文本结果,一个基于歧义的改述压力测试,以及一个参考免费的LLM-as-judge协议,其经过人类偏好的验证。值得注意的是,我们发现开源的Qwen3判官在至少8B参数的情况下,超过80%的协议与专有判官一致,支持隐私保护的评估。总体而言,我们的框架提供了一个可验证和可重复的第一阶段诊断,补充了专门构建的音频语料库。

英文摘要

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

2605.14259 2026-05-21 cs.AI cs.CL 版本更新

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

面向异构业务系统的超图企业代理推理器

Ling Wang, Xin Liu, Songnan Liu, Jianan Wang, Cheng Cheng, Yihan Zhu, Enyu Li, Yu Xiao, Jiangyong Xie, Duogong Yan, Jiangyi Chen

发表机构 * SUPCON

AI总结 本文提出HEAR,一种基于分层超图本体的企业代理推理器,通过分层图层和超边层实现结构化多跳分析,无需重新训练LLM,在供应链任务中达到94.7%的准确率,并展示出适应性和效率。

详情
AI中文摘要

将大语言模型(LLMs)应用于异构企业系统受到多跳、n元推理中幻觉和失败的阻碍。现有范式(如GraphRAG、NL2SQL)缺乏复杂环境所需语义基础和可审计执行。我们引入HEAR,一种基于分层超图本体的企业代理推理器。其基图层虚拟化了具有溯源意识的数据接口,而超边层编码n元业务规则和程序协议。通过证据驱动的推理循环,HEAR动态协调本体工具进行结构化多跳分析,无需重新训练LLM。在供应链任务中,包括订单履行阻塞根本原因分析(RCA)的评估显示,HEAR达到94.7%的准确率。关键地,HEAR展示了适应性效率:利用程序超边以最小化令牌成本,同时利用拓扑探索确保复杂查询的严格正确性。通过将专有模型性能与开源权重骨干结合,并自动化手动诊断,HEAR建立了可扩展、可审计的企业智能基础。

英文摘要

Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

2605.12960 2026-05-21 cs.CL 版本更新

DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

DiM\textsuperscript{3}: 通过方向和幅度感知融合连接多语言和多模态模型

Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze

发表机构 * Northeastern University, China(东北大学,中国) CIS, LMU Munich, Germany(慕尼黑莱布尼茨大学计算机学院,德国) Munich Center for Machine Learning (MCML), Germany(慕尼黑机器学习中心(MCML),德国) Shanghai Jiao Tong University, China(上海交通大学,中国)

AI总结 本研究提出DiM3方法,通过在共享语言模型骨干中融合残差更新,实现多语言和多模态能力的无缝整合,从而提升多语言性能并保持多模态能力。

详情
AI中文摘要

为了实现更通用和类人智能,大语言模型应能够无缝整合多语言和多模态能力;然而,扩展现有多模态模型以支持多种语言通常需要昂贵的多语言多模态数据构建和重复端到端重新训练。我们研究了一种无需训练的替代方法:通过在共享语言模型骨干中组合残差更新,将多语言能力注入现有多模态模型中。关键挑战是多语言和多模态更新是异构的,反映了共享模型中的不同功能角色。为了解决这一问题,我们提出了方向和幅度感知的多语言多模态融合(DiM3),在每个参数维度上选择性地组合这两种更新,同时保留原始视觉编码器和多模态投影器。在文本-only和视觉-语言设置下的多语言基准测试中,覆盖57种语言(LLaVA和Qwen为基础的骨干),实验表明DiM3在多语言性能上显著优于现有融合基线,并在原始多模态模型上大幅提升了多语言性能,同时在专用多语言多模态微调中保持竞争力。此外,我们进一步表明DiM3可以直接应用于已训练的多语言多模态模型,并仍能产生额外收益。进一步的可解释性分析显示,DiM3主要重塑中间层语义表示,在文本-only和多模态输入下加强跨语言对齐,同时保留高层任务敏感结构。我们的仓库在https://github.com/wzj1718/DiM3。

英文摘要

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.

2605.11302 2026-05-21 cs.LG cs.AI cs.CL 版本更新

A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse

时间敏感语言生成理论:稀疏幻觉战胜模式崩溃

Atul Ganju, Travis McVoy, Shaddin Dughmi, Shang-Hua Teng

发表机构 * University of Southern California(美国南加州大学)

AI总结 本文研究了在全局偏好顺序下语言生成的极限情况,提出了一种时间敏感的语言生成方法,通过稀疏幻觉技术克服了模式崩溃问题,证明了在特定条件下可以实现最优密度。

详情
AI中文摘要

我们研究了在全局偏好顺序下语言生成的极限情况,如Kleinberg和Wei所引入的。与以往工作类似,我们追求广度,但增加了时效性要求:高排名字符串应更早生成。一个字符串只有在截止时间前生成才被认可,其截止时间由一个函数确定,该函数将字符串在目标语言中的排名映射到必须生成的时间。这与机器学习中的归纳偏置一致,即在其他条件相同的情况下,倾向于选择更简单或更可能的输出。我们证明,在强意义上,最终一致的生成器无法实现时效性生成——这是大多数先前相关工作的主角。在可能最温和的一致性放松下,即幻觉率随时间消失,我们证明可以绕过我们的不可能结果。特别是,我们可以实现相对于任何超线性截止函数的最优密度。我们还证明这是紧的,通过排除线性截止时间和消失幻觉率下的时效性生成。

英文摘要

We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As is done in previous work, we aim for breadth, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors ``simpler'' or ``more plausible'' outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators -- the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.

2605.08123 2026-05-21 cs.LG cs.CL 版本更新

Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge

块级可微的Sinkhorn注意力:带有间隙意识的尘桶桥尾部细化

Dylan Forde

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了通过停止基固定深度尾部细化代理在TPU硬件上实现长上下文平衡熵最优传输(OT)注意力。通过停止T步Sinkhorn求解后,展开一个短的细化尾部并精确地对这个代理进行微分。对于报告的R=2 TPU路径,反向传播包含四个阶梯计划因子。我们证明了一个精确的一参考瓷砖计划:R=2分数余切是单个参考计划瓷砖乘以一个由向量余切和双差分构建的显式修改字段。这导致了块级成本O((T+R)LW),O(Ld)输入存储,以及O(L)额外的HBM使用,对于固定头部维度d和带宽W在平衡固定支撑路径上。我们还正式化了当前dustbin_block路径作为在增强支撑上的相同单位目标代理,因此共轭计划提升到单个活跃尘桶路径,这在我们的TPU运行中使用;这个桥是代数的,不声称一般KL不平衡或任意容量间隙模型。我们提供了局部代理偏置界,后验偏置证书和严格正活跃块的投影收缩证书。在合成掩码问题上,优化的内核在10^-5至10^-10范围内与相同中心代理的精确自动微分匹配。在TPU v6e-8上,一个四配置Pfam屏幕完成端到端,一个提升的平衡R=2运行通过三小时预算,每秒维持大约8.5个示例,达到第1437步。保留的Pfam测试碎片将重建从5.57提高到2.05,稀疏CE从5.53提高到5.30,相对于第0步,CE被诊断性记录而不是直接优化;目标-均值对齐度量没有显著改善,而确定性对角参考在这些度量上仍更强。

详情
AI中文摘要

我们研究了通过停止基固定深度尾部细化代理在TPU硬件上实现长上下文平衡熵最优传输(OT)注意力。在停止T步Sinkhorn求解后,我们展开一个短的细化尾部并精确地对这个代理进行微分。对于报告的R=2 TPU路径,反向传播包含四个阶梯计划因子。我们证明了一个精确的一参考瓷砖计划:R=2分数余切是单个参考计划瓷砖乘以一个由向量余切和双差分构建的显式修改字段。这导致了块级成本O((T+R)LW),O(Ld)输入存储,以及O(L)额外的HBM使用,对于固定头部维度d和带宽W在平衡固定支撑路径上。我们还正式化了当前dustbin_block路径作为在增强支撑上的相同单位目标代理,因此共轭计划提升到单个活跃尘桶路径,这在我们的TPU运行中使用;这个桥是代数的,不声称一般KL不平衡或任意容量间隙模型。我们提供了局部代理偏置界,后验偏置证书和严格正活跃块的投影收缩证书。在合成掩码问题上,优化的内核在10^-5至10^-10范围内与相同中心代理的精确自动微分匹配。在TPU v6e-8上,一个四配置Pfam屏幕完成端到端,一个提升的平衡R=2运行通过三小时预算,每秒维持大约8.5个示例,达到第1437步。保留的Pfam测试碎片将重建从5.57提高到2.05,稀疏CE从5.53提高到5.30,相对于第0步,CE被诊断性记录而不是直接优化;目标-均值对齐度量没有显著改善,而确定性对角参考在这些度量上仍更强。

英文摘要

We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped $T$-step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the reported $R=2$ TPU path, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the $R=2$ score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost $O((T+R)LW)$, $O(Ld)$ input storage, and $O(L)$ additional HBM usage for fixed head dimension $d$ and band width $W$ on the balanced fixed-support path. We also formalize the current \texttt{dustbin\_block} path as the same unit-target surrogate on an augmented support, so the adjoint schedule lifts to the single-active-dustbin path used in our TPU runs; this bridge is algebraic and does not claim a general KL-unbalanced or arbitrary-capacity gap model. We provide a local surrogate-bias bound, an a posteriori bias certificate, and a projective contraction certificate for strictly positive active blocks. On synthetic masked problems, the optimized kernel matches exact autodiff of the same centered surrogate to within $10^{-5}$--$10^{-10}$. On TPU v6e-8, a four-configuration Pfam screen completes end-to-end, and a promoted balanced $R=2$ run sustains roughly $8.5$ examples per second through a three-hour budget, reaching step $1437$. Held-out Pfam test shards improve reconstruction from $5.57$ to $2.05$ and sparse CE from $5.53$ to $5.30$ relative to step $0$, with CE logged diagnostically rather than optimized directly; target-barycenter alignment metrics do not materially improve, and a deterministic diagonal reference remains stronger on those metrics.

2605.07731 2026-05-21 cs.CL cs.AI 版本更新

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

对可比的意大利和国际开源大语言模型进行EngGPT2-16B-A3B的基准测试

Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman

发表机构 * AIRIC, Politecnico di Milano(AIRIC,米兰理工大学) DEIB, Politecnico di Milano(DEIB,米兰理工大学)

AI总结 本文研究了EngGPT2-16B-A3B在多个基准测试中的性能,与同等规模的开源MoE和密集模型进行比较,展示了其在国际和意大利基准测试中的表现。

详情
AI中文摘要

本报告对ENGINEERING Ingegneria Informatica S.p.A.的EngGPT2MoE-16B-A3B大语言模型进行了基准测试,该模型是一个具有3B活跃参数的16B参数混合专家(MoE)模型。性能在各种代表性基准测试中进行了评估,并与同等规模的开源MoE和密集模型进行了比较。与流行的意大利模型如FastwebMIIA-7B、Minerva-7B、Velvet-14B和LLaMAntino-3-ANITA-8B相比,EngGPT2MoE-16B-A3B在国际基准测试(ARC-Challenge、GSM8K、AIME24、AIME25、MMLU和HumanEval(HE))中表现相同或更好。它在RULER基准测试的最长上下文设置(32k)中取得最佳性能。在意大利基准数据集ITALIC上,该模型在除Velvet-14B外的其他模型中表现相同或更好。与同等规模的MoE模型相比,新模型在所有考虑的基准测试中都比DeepSeek-MoE-16B-Chat的值更高。它在HE、MMLU、AIME24、AIME25、GSM8K和32k RULER设置上比Moonlight-16B-A3B更高,但在BFCL和一些ARC和ITALIC设置上较低。最后,它在大多数基准测试中比GPT-OSS-20B低,包括HE、MMLU、AIME24、AIME25、GSM8K、ARC、BFCL和RULER 32k。与流行的密集模型相比,EngGPT2MoE-16B-A3B在AIME24和AIME25上比Llama-3.1-8B-Instruct、Gemma-3-12b-it和Minstral-3-8BInstruct-2512-BF16的值更高,但在ITALIC、BFCL和32k RULER设置上较低。当性能汇总所有基准测试指标时,EngGPT2MoE-16B-A3B在评估的意大利模型中表现更高,但在一些最高效的国际模型(特别是GPT-5 nano和Qwen3-8B)中表现较低。总体而言,我们的发现表明新模型是原生意大利大语言模型的一大步。

英文摘要

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings find the new model to be a step forward for native Italian Large Language Models.

2605.04128 2026-05-21 cs.GR cs.AI cs.CL cs.CV cs.LG 版本更新

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

JoyAI-Image: 激活统一多模态理解和生成中的空间智能

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

发表机构 * Joy Future Academy, JD(joy未来学院,京东)

AI总结 本文提出JoyAI-Image,一种统一的多模态基础模型,用于视觉理解、文本到图像生成和指令引导的图像编辑。该模型结合了空间增强的多模态大语言模型(MLLM)和多模态扩散Transformer(MMDiT),通过共享的多模态接口实现感知与生成的交互。构建可扩展的训练配方,结合统一指令微调、长文本渲染监督、空间 grounded 数据和通用及空间编辑信号,使模型具备广泛的多模态能力,同时增强几何感知推理和可控视觉合成。实验表明,JoyAI-Image在理解、生成、长文本渲染和编辑基准上达到最先进的性能。更重要的是,增强的理解、可控的空间编辑和新视角辅助推理之间的双向循环使模型超越一般视觉能力,向更强的空间智能发展。

Comments Code: https://github.com/jd-opensource/JoyAI-Image

详情
AI中文摘要

我们提出了JoyAI-Image,一种统一的多模态基础模型,用于视觉理解、文本到图像生成和指令引导的图像编辑。JoyAI-Image将空间增强的多模态大语言模型(MLLM)与多模态扩散Transformer(MMDiT)结合,允许感知和生成通过共享的多模态接口进行交互。围绕此架构,我们构建了一个可扩展的训练配方,结合了统一指令微调、长文本渲染监督、空间 grounded 数据以及通用和空间编辑信号。该设计使模型具备广泛的多模态能力,同时增强了几何感知推理和可控视觉合成。在理解、生成、长文本渲染和编辑基准上的实验表明,JoyAI-Image实现了最先进的或高度竞争的性能。更重要的是,增强的理解、可控的空间编辑和新视角辅助推理之间的双向循环使模型超越一般视觉能力,向更强的空间智能发展。这些结果表明,统一视觉模型在下游应用如视觉-语言-动作系统和世界模型中具有前景。

英文摘要

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

2604.26052 2026-05-21 cs.CL 版本更新

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

从提示风险到响应风险:大型语言模型安全行为的配对分析

Mengya Hu, Qiong Wei, Sandeep Atluri

发表机构 * Microsoft(微软)

AI总结 本研究通过配对分析人类标注的提示和响应记录,探讨了大型语言模型在四个危害类别(性、自残、仇恨和暴力)和有序严重性级别上的安全行为,发现61%的响应比提示减少了危害,3%的响应升级了危害,并揭示了安全行为与无害性之间的权衡。

详情
AI中文摘要

大型语言模型的安全评估通常报告二元结果,如攻击成功率(ASR)、拒绝率或有害与安全分类,这些结果隐藏了提示与响应之间风险的变化。我们通过对四个危害类别(性、自残、仇恨和暴力)和有序严重性级别(安全、低、中、高)的人类标注提示和响应记录进行配对分析,发现61%的响应相对于提示减少了危害,36%的响应保持了严重性,3%的响应升级了危害。升级分为两种机制:良性提示触发未请求的有害细节,以及在更高严重性级别上保持任务的响应。类别分解显示,性内容在此样本中表现出最高的危害持续性,这由相同严重性级别的合规性驱动,而非来自良性输入的漂移。联合相关性分析揭示了有用性与无害性之间的权衡:合规性升级仍保持高度相关,而安全响应包含低相关性的通用拒绝。最后,少样本LLM评估者表现出提示/响应检测的不对称性,数据校准无法弥补这种不对称性。评估者提示可在https://github.com/microsoft/PairedSafety获取。

英文摘要

Safety evaluations of large language models (LLMs) typically report binary outcomes, i.e. attack success rate (ASR), refusal rate, or harmful versus safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human labeled prompt and response records across four harm categories (Sexual, Self harm, Hate and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a helpfulness versus harmlessness tradeoff: compliance escalations remain highly relevant, whereas safe responses include generic refusals with low relevance. Finally, few-shot LLM graders exhibit a prompt/response detection asymmetry that data calibration does not close. Grader prompts are shared at https://github.com/microsoft/PairedSafety.

2603.26539 2026-05-21 cs.CL cs.AI 版本更新

How Open Must Language Models be to Enable Reliable Scientific Inference?

语言模型必须多开放才能实现可靠的科学推断?

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman

发表机构 * Massachusetts Institute of Technology(麻省理工学院) EleutherAI University of California San Diego(加州大学圣地亚哥分校) Rutgers University-Newark(新泽西州立大学罗威特分校) Stony Brook University(史泰森布魯克大學)

AI总结 本文探讨了语言模型的开放程度如何影响基于其研究的科学推断可靠性,指出封闭模型通常不适合科学用途,并提出系统识别和缓解推断威胁的方法。

详情
AI中文摘要

语言模型的开放程度如何影响基于其研究的科学推断?本文分析了模型构造和部署信息的限制如何威胁可靠的推断。我们论证当前封闭模型通常不适合科学用途(有例外情况),并讨论如何解决或缓解它们对可靠推断的威胁。我们建议在研究中使用模型时,应系统地识别潜在的推断威胁,并采取相应的缓解措施,同时提供具体模型选择的正当理由。

英文摘要

How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

2603.23531 2026-05-21 cs.CL 版本更新

Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

大型语言模型通过目标-立场提取解构复杂的政治观点

Özgür Togay, Javier Garcia-Bernardo, Florian Kunneman, Anastasia Giachanou

发表机构 * Department of Methodology and Statistics, Utrecht University(方法论与统计学系,乌特雷赫特大学) Department of Languages, Literature and Communication, Utrecht University(语言、文学与传播系,乌特雷赫特大学)

AI总结 本文研究了大型语言模型是否能通过目标-立场提取任务解构复杂政治观点,通过构建包含138个不同政治目标的Reddit帖子数据集,评估了多种LLM在零样本、少样本和上下文增强提示策略下的表现,结果显示最佳模型表现接近高度训练的人类标注者,证明了LLM在最小监督下的复杂政治观点提取能力。

详情
AI中文摘要

政治极化源于对政策、人物和议题的复杂信念相互作用。然而,大多数计算分析将话语简化为粗粒度的党派标签,忽视了这些信念之间的互动。这在在线政治对话中尤为明显,这些对话通常具有细微差别且涵盖广泛主题,使自动识别讨论目标和对它们的观点变得困难。在本研究中,我们探讨了大型语言模型(LLMs)是否能通过目标-立场提取(TSE)任务解决这一挑战,TSE是一种结合目标识别和立场检测的自然语言处理任务,能够更细致地分析政治观点。为此,我们构建了一个包含1,084个Reddit帖子的数据集,来自r/NeutralPolitics,涵盖138个不同的政治目标,并使用零样本、少样本和上下文增强提示策略评估了一系列专有和开源的LLM。我们的结果表明,最佳模型在高度训练的人类标注者表现相当,并且在具有低标注者一致性挑战的帖子上仍保持稳健。这些发现表明,LLM能够以最小监督的方式提取复杂的政治观点,为计算社会科学和政治文本分析提供了可扩展的工具。

英文摘要

Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.

2603.06007 2026-05-21 cs.CL cs.AI cs.MA 版本更新

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

MASFactory: 一种基于图的框架,用于通过Vibe图谱编排基于大语言模型的多智能体系统

Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出MASFactory,一种基于图的框架,用于通过Vibe图谱编排基于大语言模型的多智能体系统,解决了现有框架在实现复杂图工作流时需要大量手动工作、重用性差和难以整合异构外部上下文源的问题。

Comments Accepted to the ACL 2026 Demo Track. Camera-ready version. 10 pages, 6 figures. Code and documentation are available at: https://github.com/BUPT-GAMMA/MASFactory

详情
AI中文摘要

基于大语言模型的(LLM-based)多智能体系统(MAS)越来越多地被用于通过角色专业化和协作扩展智能体问题解决。MAS工作流可以自然地建模为有向计算图,其中节点执行智能体或子工作流,边编码依赖性和消息传递。然而,目前框架在实现复杂图工作流时仍然需要大量的手动工作,提供有限的重用性,并使整合异构外部上下文源变得困难。为克服这些限制,我们提出了MASFactory,一种用于编排基于大语言模型的MAS的基于图的框架。它引入了Vibe图谱,一种人机交互的方法,将自然语言意图编译成可编辑的工作流规范,然后编译成可执行的图。此外,该框架提供了可重用的组件、技能支持、多模态消息处理和可插拔的上下文整合,以及用于拓扑预览、运行时跟踪和人机交互的可视化工具。我们在七个公开基准上评估了MASFactory,验证了代表性MAS方法的再生产一致性以及Vibe图谱的有效性。我们的代码(https://github.com/BUPT-GAMMA/MASFactory,Apache-2.0许可)和视频演示(https://youtu.be/ANynzVfY32k)均已公开。

英文摘要

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory, licensed under Apache-2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.

2602.19320 2026-05-21 cs.CL cs.AI 版本更新

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

代理记忆的解剖结构:评估和系统限制的分类与实证分析

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li

发表机构 * University of Texas at Dallas(德克萨斯大学达拉斯分校) University of California Davis(加州大学戴维斯分校) Texas A&M University(德克萨斯农工大学)

AI总结 本文通过分类和实证分析,探讨了代理记忆系统的架构和系统限制,揭示了现有系统在基准饱和效应、指标有效性、模型依赖性和内存维护开销等方面的问题,并提出了更可靠的评估方法和可扩展的系统设计方向。

详情
AI中文摘要

代理记忆系统使大型语言模型(LLM)代理能够在长时间交互中保持状态,支持超出固定上下文窗口的长周期推理和个性化。尽管架构发展迅速,这些系统的实证基础仍脆弱:现有基准通常规模不足,评估指标与语义效用不一致,性能在基础模型上变化显著,且系统层面的成本常被忽视。本文从架构和系统角度对代理记忆进行了结构化分析。我们首先基于四种记忆结构介绍了MAG系统简要分类。然后,我们分析了限制当前系统的关键痛点,包括基准饱和效应、指标有效性和判断敏感性、基础模型依赖的准确性,以及内存维护引入的延迟和吞吐量开销。通过将内存结构与实证限制联系起来,本文阐明了当前代理记忆系统为何经常无法达到其理论承诺,并概述了更可靠评估和可扩展系统设计的方向。

英文摘要

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

2602.16608 2026-05-21 cs.CL cs.AI cs.CV cs.LG 版本更新

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

可解释的人工智能:面向Transformer模型的上下文感知分层集成梯度方法

Melkamu Abay Mersha, Jugal Kalita

发表机构 * College of Engineering and Applied Science, University of Colorado Colorado Springs(科罗拉多州立大学工程与应用科学学院)

AI总结 本文提出了一种上下文感知分层集成梯度框架(CA-LIG),用于解释Transformer模型的决策过程,通过计算每个Transformer块内的分层集成梯度,并将这些token级属性与类特定的注意力梯度融合,从而生成具有符号和上下文敏感性的属性图,以捕捉支持和反对的证据,并追踪Transformer层中的相关性层次流动。

详情
AI中文摘要

Transformer模型在多个领域和任务中实现了最先进的性能,然而其深层表示使得预测难以解释。现有的可解释性方法依赖于最终层的属性,只能捕捉局部token级属性或全局注意力模式,缺乏对token间依赖关系和结构组件的上下文感知能力。它们还无法捕捉相关性如何在层之间演变以及结构组件如何影响决策。为了解决这些限制,我们提出了上下文感知分层集成梯度(CA-LIG)框架,一种统一的层次属性框架,该框架在每个Transformer块内计算分层集成梯度,并将这些token级属性与类特定的注意力梯度融合。这种整合产生了带有符号和上下文敏感性的属性图,能够捕捉支持和反对的证据,同时追踪Transformer层中的相关性层次流动。我们评估了CA-LIG框架在多样化的任务、领域和Transformer模型家族中的表现,包括使用BERT进行情感分析和长多类文档分类,使用XLM-R和AfroLM在低资源语言设置中进行仇恨言论检测,以及使用Masked Autoencoder Vision Transformer模型进行图像分类。在所有任务和架构中,CA-LIG提供了更忠实的属性,显示出对上下文依赖的更强敏感性,并产生了更清晰、更语义连贯的可视化结果,优于现有可解释性方法。这些结果表明,CA-LIG提供了更全面、上下文感知和可靠的Transformer决策解释,推动了深度神经网络的实用可解释性和概念理解。

英文摘要

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework}, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

2602.08819 2026-05-21 cs.LG cs.CL 版本更新

Bayesian Preference Learning for Test-Time Steerable Reward Models

基于测试时间可调节的贝叶斯偏好学习的奖励模型

Jiwoo Hong, Shao Tang, Zhipeng Wang

发表机构 * LinkedIn Corporation(LinkedIn公司) Nubank

AI总结 本文提出了一种新的贝叶斯奖励建模目标,即变分上下文奖励建模(ICRM),通过上下文偏好演示实现测试时间可调节性,从而适应未见过的偏好分布,提高了奖励模型的准确性和鲁棒性。

Comments Preprint

详情
AI中文摘要

奖励模型在通过强化学习(RL)对语言模型与人类偏好对齐中起核心作用。随着RL越来越多地应用于可验证奖励和多目标对齐等场景,RMs被期望编码更复杂和多维的偏好分布。然而,分类RMs一旦训练完成就保持静态,限制了测试时间的适应性。我们提出变分上下文奖励建模(ICRM),一种新颖的贝叶斯奖励建模目标,通过上下文偏好演示实现测试时间可调节性。ICRM将奖励建模视为在Bradley-Terry模型下对潜在偏好概率的变分推断,使用共轭Beta先验。我们证明ICRM能够适应单目标和多目标设置中的未见过的偏好分布。随着更多演示,ICRM在RM-Bench上的准确性从60.5提高到70.8,在道德困境偏好上比生成判断者具有更低的校准误差,并在冲突偏好下扩展了可达到的帕累托前沿。我们进一步研究了ICRM在RL训练中的实际适用性,证明其可以通过在数学推理中优于传统RM来有效编码可验证奖励。最后,我们提供了理论保证,变分目标在有限置信度下具有全局内部最优解,并分析了KL正则化如何缓解奖励过度优化。

英文摘要

Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more demonstrations, ICRM improves RM-Bench accuracy from 60.5 to 70.8, achieves lower calibration error than a generative judge on moral dilemma preferences, and expands the attainable Pareto frontier under conflicting preferences. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.

2602.08028 2026-05-21 cs.CL cs.AI 版本更新

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

偏离以诱导提示:多理性诱导用于零样本推理

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University, Taiwan(国家台湾大学计算机科学与信息工程系) Institute of Information Science, Academia Sinica, Taiwan(学术院信息科学研究所) AI Research Center (AINTU), National Taiwan University, Taiwan(国家台湾大学人工智能研究中心(AINTU))

AI总结 本研究提出DIP框架,通过生成多个多样化的高层理由并诱导最终计划,以提升零样本推理的准确性,克服了传统链式思考提示中推理路径不稳定的问题。

Comments Accepted to Findings of IJCNLP-AACL 2025

详情
AI中文摘要

为了解决标准链式思考提示中无引导推理路径的不稳定性,最近的方法通过首先引导大型语言模型(LLMs)生成单一推理策略来指导模型。然而,仅依赖一个策略来回答每个问题仍然限制了在多样化任务中的性能。我们提出了偏离以诱导提示(DIP),一个框架,首先提示LLM为每个问题生成多个多样化的高层理由。每个理由随后被扩展成详细的、分步骤的草案计划。最后,这些草案计划被诱导成最终计划。DIP在不依赖资源密集型采样的情况下增强了零样本推理的准确性。实验表明,DIP优于单一策略提示,证明了多计划诱导对基于提示的推理的有效性。

英文摘要

To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

2601.05877 2026-05-21 cs.CL 版本更新

iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

iReasoner: 一种面向轨迹的内在推理监督方法,用于自演化的大多模态模型

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha

发表机构 * Vellore Institute of Technology(韦洛雷理工学院) School of Information and Data Sciences(信息与数据科学学院) Nagasaki University(长崎大学) Loughborough University(洛桑大学)

AI总结 本文提出iReasoner,一种自演化框架,通过显式引导推理链和奖励内部一致性来提升大模型的隐式推理能力,在无监督设置下实现了多模态推理基准的性能提升。

Comments ACL 2026 (Findings)

详情
AI中文摘要

最近的研究表明,大多模态模型(LMMs)可以通过自博弈和内在反馈从无标签数据中自我改进。然而,现有的自演化框架主要奖励最终结果,尽管中间推理对于视觉基础决策至关重要,但其约束较弱。我们提出iReasoner,一种自演化框架,通过显式引导推理链(CoT)并奖励其内部一致性来提升LMM的隐式推理能力。在无标签图像上进行Proposer--Solver循环时,iReasoner通过在中间推理步骤上定义轨迹感知信号,增强结果层面的内在奖励,提供无需真实标签或外部评判的学习信号,以区分导向相同答案的不同推理路径。从Qwen2.5-VL-7B开始,iReasoner在完全无监督的后训练阶段,在多样化的多模态推理基准上实现了最高+2.1分的提升。我们希望这项工作能成为在纯无监督设置中实现推理感知自改进的LMMs的起点。我们的代码可在https://meghanaasunil.github.io/iReasoner上获取。

英文摘要

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer--Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to $+2.1$ points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings. Our code is available at https://meghanaasunil.github.io/iReasoner.

2601.03135 2026-05-21 cs.CL 版本更新

Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

通过合成数据和语言特定预处理改进原住民语言机器翻译

Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, Daisy Zhe Wang

发表机构 * University of Florida(佛罗里达大学)

AI总结 本研究通过合成数据生成和语言特定预处理方法,改进低资源原住民语言的神经机器翻译效果,实验显示合成数据增强对翻译质量有积极影响,但通用预处理在高度屈折语言中存在局限。

详情
AI中文摘要

低资源原住民语言往往缺乏用于有效神经机器翻译(NMT)所需的平行语料库。合成数据生成为数据稀缺环境提供了一种实用策略。在本工作中,我们通过使用高容量多语言翻译模型生成合成句子对,扩充美洲原住民语言的精选平行语料库。我们对多语言mBART模型进行微调,使用curated-only和合成增强的数据,并通过chrF++评估翻译质量,该指标是最近美洲NLP共享任务中用于屈折语言的主要指标。我们进一步应用语言特定的预处理,包括正字法标准化和噪声感知过滤,以减少语料库中的伪影。在瓜拉尼-西班牙语和克丘亚-西班牙语翻译实验中,合成数据增强显示出一致的chrF++提升,而对艾马拉语的诊断实验则揭示了通用预处理在高度屈折语言中的局限性。

英文摘要

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.

2601.03019 2026-05-21 q-bio.GN cs.CL 版本更新

DNACHUNKER: Learnable Tokenization for DNA Language Models

DNACHUNKER: 用于DNA语言模型的可学习分词

Taewon Kim, Jihwan Shin, Hyomin Kim, Youngmok Jung, Jonghoon Lee, Won-Chul Lee, Sungsoo Ahn, Insu Han

发表机构 * Graduate School of AI, KAIST(韩国科学技术院人工智能研究生院) Department of Electrical Engineering, KAIST(韩国科学技术院电气工程系) Inocras Korea Inc.(Inocras韩国公司)

AI总结 本文提出DNACHUNKER,一种可学习的DNA分词方法,通过动态分段模块生成上下文依赖的变长单元,提升DNA语言模型在基因组序列处理中的鲁棒性和效率。

Comments ICML 2026 camera-ready version

详情
AI中文摘要

DNA语言模型越来越多地用于表示基因组序列,但其有效性严重依赖于原始核苷酸如何转换为模型输入。与自然语言不同,DNA没有标准的边界,使固定分词成为在移位、插入缺失和局部重复下脆弱的设计选择。我们引入了DNAChunker,一种带有可学习自适应分段模块的遮蔽DNA语言模型,以生成上下文依赖、变长的单元。基于动态分段过程,DNAChunker学会在功能丰富区域分配更细的粒度,同时压缩重复或冗余序列。我们预训练DNAChunker在人类参考基因组上,并在五个基准上评估其性能,结果在强固定分词基线中一致提升。进一步的分析和消融实验表明,与固定分词不同,分段是以生物信息学指导、对突变具有鲁棒性的学习方式进行的。

英文摘要

DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNAChunker, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNAChunker learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pretrain DNAChunker on the human reference genome and evaluate it across five benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that unlike fixed tokenizations, segmentation is learned in a biologically-informed, mutation-resilient manner.

2512.14896 2026-05-21 cs.CL cs.AI 版本更新

DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

DrugRAG: 通过一种新颖的检索增强生成流水线提升药学LLM性能

Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Fatemeh Latifi, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Babak Khalaj, Mohammad Hossein Rohban, Glolamali Aminian, Zohreh Amoozgar, Tahereh Javaheri

发表机构 * Department of Medicinal Chemistry, Faculty of Pharmacy, Tehran University of Medical Sciences(药学系,泰赫兰医科大学) Department of Computer Sciences, Faculty of Mathematics and Computer Sciences, Amir Kabir University of Technology(计算机科学系,阿米尔·卡比尔技术大学) Department of Mathematical Sciences, Sharif University of Technology(数学科学系,沙菲克技术大学) Department of Computer Sciences, Missouri University of Science and Technology(计算机科学系,密苏里科学与技术大学) Department of Computer Engineering, Sharif University of Technology(计算机工程系,沙菲克技术大学) Department of Faculty of Interdisciplinary Science and Technology, Tarbiat Modares University(跨学科科学与技术学院,塔里亚特莫达res大学) Electronics Research Institute, Sharif University of Technology(电子研究所,沙菲克技术大学) Department of Electrical Engineering, Sharif University of Technology(电气工程系,沙菲克技术大学) The Alan Turing Institute, London, United Kingdom(艾伦·图灵研究所,伦敦,英国) Department of Radiation Oncology, Massachusetts General Hospital & Harvard Medical School(放射肿瘤科,麻省总医院及哈佛医学院) Health Informatics Lab, Metropolitan College, Boston University(健康信息学实验室,波士顿大学)

AI总结 本研究评估了大型语言模型在药学执业资格问答任务中的性能,并开发了一种外部知识整合方法以提高准确性,通过DrugRAG流水线整合结构化药物知识,从而提升药学相关问答任务的LLM性能。

Comments 14 pages, 2 figures, 2 tables. The revised version includes McNemar's paired statistical analysis, Wilson confidence intervals, expanded methodological clarifications, a revised discussion of evidence retrieval, improved reproducibility details, and updated limitations

详情
AI中文摘要

在本研究中,我们评估了大型语言模型(LLM)在药学执业资格问答任务中的性能,并开发了一种外部知识整合方法以提高准确性。我们使用一个包含141个问题的药学数据集,对十个参数规模不同的LLM(8十亿到70十亿以上)进行了基准测试,测量了基线准确性。基线性能范围从46%到92%,其中GPT-5(92%)和o3(89%)取得了最高分数,而较小的开源模型表现显著较低。然后,我们开发了DrugRAG,一种三步检索增强生成(RAG)流水线,该流水线检索结构化、基于证据的药物信息,并将上下文药理学证据添加到模型提示中,该流水线在模型架构或参数无需更改的情况下外部运行。DrugRAG在所有五个评估模型上均提高了准确性,提升幅度范围从7到21个百分点(例如,Gemma 3 27B:61.0%到71%,Llama 3.1 8B:46%到67%)。McNemar分析显示,这些改进在较小和中等规模的开源模型中具有统计学显著性。这些发现表明,通过DrugRAG整合结构化外部药物知识可以提高LLM在药学相关问答任务中的性能,而无需修改底层模型,为提升基于证据的药学相关AI应用提供了实用的流水线。

英文摘要

In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61.0% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.

2511.01482 2026-05-21 cs.CL 版本更新

Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

迈向认知扭曲一致检测:基于大语言模型的标注与数据集无关评估

Neha Sharma, Navneet Agarwal, Kairit Sirts

发表机构 * LREC-2026(LREC-2026会议)

AI总结 本文探讨了利用大语言模型作为一致且可靠的标注器进行认知扭曲检测的方法,并提出了一种数据集无关的评估框架,以公平比较不同数据集训练的模型,结果显示GPT-4能产生一致的标注,提升了模型在主观NLP任务中的表现。

详情
Journal ref
https://lrec.elra.info/lrec2026-main-851
AI中文摘要

基于文本的自动化认知扭曲检测是一项具有挑战性的任务,由于其主观性质,即使在专家人类标注者之间也观察到低一致性分数,导致不可靠的标注。我们探索了使用大型语言模型(LLMs)作为一致且可靠的标注器,并提出多个独立的LLM运行可以揭示稳定的标注模式,尽管任务本身具有内在的主观性。此外,为了公平比较训练于不同特征数据集上的模型,我们引入了一种使用Cohen's kappa作为效应大小度量的数据集无关评估框架。该方法允许在传统指标如F1分数不足的情况下进行公平的跨数据集和跨研究比较。我们的结果表明,GPT-4可以产生一致的标注(Fleiss's Kappa = 0.78),从而在使用这些标注训练的模型在测试集上的表现优于使用人类标注数据训练的模型。我们的发现表明,LLMs可以为生成支持强大下游性能的主观NLP任务的训练数据提供可扩展且内部一致的替代方案。

英文摘要

Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.

2510.23538 2026-05-21 cs.AI cs.CL cs.CV cs.SE 版本更新

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

JanusCoder: 向代码智能的视觉-程序化界面迈进

Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan

发表机构 * The University of Hong Kong(香港大学) Shanghai AI Laboratory(上海人工智能实验室) Nanjing University(南京大学) Carnegie Mellon University(卡内基梅隆大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本文提出JanusCoder,一种面向代码智能的视觉-程序化界面,通过构建大规模多模态代码数据集和统一模型,实现从文本指令、视觉输入或两者结合生成代码,展示了其在文本和视觉编程任务中的优越性能。

Comments ICLR 2026 Camera Ready Version, with code and data available

详情
AI中文摘要

神经代码智能的范围正在迅速扩展,从基于文本的源代码扩展到程序生成的丰富视觉输出。这种视觉维度对于高级应用如灵活的内容生成和精确的可视化程序驱动编辑至关重要。然而,进展受到高质量多模态代码数据稀缺的阻碍,这源于合成和质量评估的挑战。为了解决这些挑战,我们从数据和建模的角度做出贡献。我们首先引入了一个完整的合成工具包,利用数据模态之间的相互协同效应,高效地生成涵盖标准图表到复杂交互式网页UI和代码驱动动画的大型高质量语料库。利用该工具包,我们构建了JanusCode-800K,目前最大的多模态代码语料库。这推动了我们模型JanusCoder和JanusCoderV的训练,建立了从文本指令、视觉输入或两者结合生成代码的视觉-程序化界面。我们的统一模型不同于现有方法,后者为孤立任务构建专门模型。在文本导向和视觉导向的编程任务上的大量实验表明,JanusCoder系列的性能优越,我们的7B到14B规模模型接近甚至超过了商业模型的性能。此外,广泛的分析提供了将程序逻辑与其视觉表达和谐统一的关键见解。我们的代码和检查点可在https://github.com/InternLM/JanusCoder上获得。

英文摘要

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.

2510.08482 2026-05-21 cs.CV cs.CL 版本更新

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

视觉象征性挑战:在手语形式-意义映射上评估视觉-语言模型

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

发表机构 * Multimodal Language Department(多模态语言部门) Max Planck Institute for Psycholinguistics(马克斯·普朗克心理语言学研究所) Department of Linguistics(语言学系) Boğaziçi University(博多伊奇大学) Donders Institute for Brain Cognition and Behaviour(多纳尔斯脑认知与行为研究所) Radboud University(拉德堡德大学) Department of Linguistics and Communication(语言学与沟通系) University of Birmingham(伯明翰大学)

AI总结 本文提出一个新颖的视频基准测试,用于评估视觉-语言模型在手语形式-意义映射上的表现,通过心理语言学测量来评估三种任务:语音学手语形式预测、透明度和渐进象征性评分,并发现模型在语音形式预测上表现较好但整体仍低于人类表现。

详情
AI中文摘要

象征性,即语言形式与意义之间的相似性,在手语中普遍存在,为视觉 grounding 提供了自然的测试环境。对于视觉-语言模型(VLMs),挑战在于从动态的人类运动中恢复这种本质的映射,而非静态上下文。我们引入了视觉象征性挑战,一个新颖的基于视频的基准测试,将心理语言学测量适应于评估 VLMs 在三个任务上的表现:(i)语音学手语形式预测(例如,手形、位置),(ii)透明度(从视觉形式推断意义),以及(iii)渐进象征性评分。我们评估了13种最先进的VLMs在零样本和少样本设置下在荷兰手语上的表现,并将其与人类基线进行比较。在语音形式预测上,VLMs恢复了一些手形和位置细节,但表现仍低于人类;在透明度上,它们与人类基线相差甚远;只有顶级模型与人类象征性评分有中等相关性。有趣的是,语音形式预测能力更强的模型更能与人类象征性判断相关联,表明它们对视觉基础结构有共同的敏感性。我们的发现验证了这些诊断任务,并推动了以人类为中心的信号和具身学习方法,用于建模象征性和改进多模态模型中的视觉 grounding。

英文摘要

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

2509.08010 2026-05-21 cs.CY cs.AI cs.CL cs.HC 版本更新

Measuring and mitigating overreliance to build human-compatible AI

测量和缓解过度依赖以构建人类兼容的AI

Lujain Ibrahim, Katherine M. Collins, Sunnie S. Y. Kim, Anka Reuel, Max Lamparth, Kevin Feng, Lama Ahmad, Prajna Soni, Alia El Kattan, Merlin Stein, Siddharth Swaroop, Vishakh Padmakumar, Ilia Sucholutsky, Andrew Strait, Diyi Yang, Q. Vera Liao, Umang Bhatt

AI总结 本文研究了大型语言模型过度依赖的风险,探讨了测量和缓解过度依赖的方法,以确保AI能增强而非削弱人类能力。

详情
AI中文摘要

大型语言模型(LLMs)通过作为协作的『思想伙伴』而区别于先前的技术,能够在多种任务上更流畅地进行自然语言交互。随着LLMs在医疗、个人建议等不同领域中日益影响关键决策,过度依赖LLMs的风险也随之增加。本文认为,测量和缓解过度依赖必须成为LLMs研究和部署的核心。首先,我们汇总了个体和社会层面的过度依赖风险,包括高风险错误、治理挑战和认知退化。然后,我们探讨了LLMs的特点、系统设计特征和用户认知偏见,这些因素共同引发了关于实际中过度依赖LLMs的严重且独特的问题。我们还审查了历史上的过度依赖测量方法,识别出三个重要的差距,并提出三个有前景的方向来改进测量。最后,我们提出了可以采取的缓解策略,以确保LLMs增强而非削弱人类能力。

英文摘要

Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative ``thought partners,'' capable of engaging more fluidly in natural language on a range of tasks. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance -- relying on LLMs beyond their capabilities -- grows. This paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that together raise serious and unique concerns about overreliance on LLMs in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that can be pursued to ensure LLMs augment rather than undermine human capabilities.

2509.03526 2026-05-21 cs.CL eess.AS 版本更新

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

通过强化行为对齐增强语音大语言模型

Yansong Liu, Jiateng Li, Yuan Liu

发表机构 * Xi’an Jiaotong-Liverpool University(西安交通大学利物浦大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出了一种名为强化行为对齐(RBA)的框架,通过自合成方法生成高质量对齐数据来提升语音大语言模型(SpeechLMs)的语言生成能力,实验表明该方法显著提升了SpeechLMs的指令遵循能力,并可扩展至语音问答和语音转文本翻译等任务。

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展引发了对扩展其语言能力至其他模态的广泛关注,从而催生了能够处理语音或文本格式用户请求的语音基础语言模型(SpeechLMs)。然而,由于跨模态差异,这些SpeechLMs在指令遵循方面仍显著逊色于文本基础的LLMs,特别是在面对用户语音的动态和多变性质时。为了解决这一挑战,本文提出了一种名为强化行为对齐(RBA)的框架,旨在增强SpeechLMs的语言生成能力。RBA不依赖于监督微调的人类标注,而是通过强大的教师LLM生成大量高质量的对齐数据。然后,通过强化学习方法将SpeechLMs的行为与教师模型的行为对齐。实验结果表明,该方法有效提升了SpeechLMs的指令遵循能力,使其优于传统蒸馏基线。关键的是,我们证明RBA可以无缝扩展到包括语音问答和语音到文本翻译在内的任务,在仅使用自生成数据的情况下,在开放基准上取得了最先进的性能。

英文摘要

The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which leads to emergence of speech-based LLMs (SpeechLMs) with capability of processing user request in either speech or textual formats. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data by a powerful teacher LLM. Then SpeechLMs is aligned its behavior with that of a teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs that outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks such including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

2508.16453 2026-05-21 cs.SI cs.CL cs.LG 版本更新

Anti-establishment sentiment on TikTok: Implications for understanding influence(rs) and expertise on social media

TikTok上的反 Establishment 情绪:对社交媒体中影响者和专业知识理解的启示

Tianliang Xu, Ariel Hasell, Sabina Tomkins

发表机构 * GitHub

AI总结 本文研究了TikTok上反 Establishment 情绪的普遍性,通过计算方法分析了金融、健康和阴谋论等主题内容中反 Establishment 情绪的分布,并探讨了社交媒体环境中反 Establishment 情绪对用户参与和平台激励的影响。

Comments 10 pages excluding references; 14 pages in total; 4 figures; Accepted by the AAAI Conference on Web and Social Media (ICWSM-2026)

详情
AI中文摘要

对公共服务机构的不信任和反 Establishment 观点正在上升(尤其是在美国)。随着人们转向社交媒体获取信息,有必要了解社交媒体环境是否以及如何促进对机构的不信任。在社交媒体中,内容创作者、影响者和其他意见领袖往往将自己定位为在健康、政治等众多话题上具有专业知识和权威性,并在许多情况下贬低和否定机构专业知识以建立追随者并增加自身可见性。然而,这种内容的普及程度以及此类内容是否增加参与度仍不清楚。本研究分析了TikTok平台上反 Establishment 情绪(AES)的普遍性。尽管TikTok作为信息来源非常流行,但其仍然相对研究较少,可能为人们如何形成对机构态度提供重要见解。我们采用计算方法,对TikTok帖子进行标注,判断其是否包含AES,涵盖内容创作者通常定位为专家的主题领域:金融和健康。作为比较,我们还考虑了阴谋论主题,其中AES预期较为常见。我们发现,AES在阴谋论内容中最为普遍,而在其他两个主题的内容中相对罕见。然而,我们发现与此类内容的参与模式因领域而异,并且可能存在平台激励用户发布表达反 Establishment 情绪的内容。

英文摘要

Distrust of public serving institutions and anti-establishment views are on the rise (especially in the U.S.). As people turn to social media for information, it is imperative to understand whether and how social media environments may be contributing to distrust of institutions. In social media, content creators, influencers, and other opinion leaders often position themselves as having expertise and authority on a range of topics from health to politics, and in many cases devalue and dismiss institutional expertise to build a following and increase their own visibility. However, the extent to which this content appears and whether such content increases engagement is unclear. This study analyzes the prevalence of anti-establishment sentiment (AES) on the social media platform TikTok. Despite its popularity as a source of information, TikTok remains relatively understudied and may provide important insights into how people form attitudes towards institutions. We employ a computational approach to label TikTok posts as containing AES or not across topical domains where content creators tend to frame themselves as experts: finance and wellness. As a comparison, we also consider the topic of conspiracy theories, where AES is expected to be common. We find that AES is most prevalent in conspiracy theory content, and relatively rare in content related to the other two topics. However, we find that engagement patterns with such content varies by area, and that there may be platform incentives for users to post content that expresses anti-establishment sentiment.

2508.09001 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Retrospective Sparse Attention for Efficient Long-Context Generation

回顾性稀疏注意力用于高效长上下文生成

Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出RetroAttention,一种新的KV缓存更新技术,通过回顾后续解码步骤的KV条目来修正过去的注意力输出,从而提高长上下文生成的效率和准确性。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地应用于长上下文任务,如推理、代码生成和多轮对话。然而,扩展上下文的推理受到键值(KV)缓存的限制,其内存占用与序列长度成线性增长,且在每个解码步骤中主导延迟。尽管最近的KV缓存压缩方法识别并加载重要的少量token,但它们主要集中在输入上下文中,未能解决长时间解码中累积的注意力误差。在本文中,我们引入了RetroAttention,一种新的KV缓存更新技术,通过回顾后续解码步骤的KV条目来修正过去的注意力输出。通过维护一个轻量级的输出缓存,RetroAttention使过去的查询能够高效地补充更多上下文,同时产生最小的延迟开销。这打破了固定注意力输出的范式,允许对先前近似进行持续修正。在长生成基准测试中,RetroAttention在长生成任务中始终优于最先进的(SOTA)KV压缩方法,有效KV暴露量增加高达1.6倍,准确性提高高达21.9%。

英文摘要

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important few tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to be efficiently supplemented with more contexts, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6$\times$ and accuracy by up to 21.9\%.

2508.08636 2026-05-21 cs.CL 版本更新

InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

InternBootcamp 技术报告:通过可验证的任务扩展提升大语言模型推理能力

Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Xiaozhe Li, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Fudan University(复旦大学)

AI总结 本文提出InternBootcamp框架,通过可验证的任务扩展方法提升大语言模型的推理能力,为基于强化学习的模型优化、合成数据生成和模型评估提供了基础设施。

Comments InternBootcamp Tech Report

详情
AI中文摘要

大语言模型(LLMs)通过使复杂推理能力成为可能,彻底改变了人工智能领域。尽管最近在强化学习(RL)方面的进展主要集中在领域特定的推理任务(如数学或代码生成),但现实世界中的推理场景往往需要模型处理多样且复杂的环境,而窄领域基准无法完全捕捉这些需求。为解决这一差距,我们提出了InternBootcamp,一个包含1000多个领域多样任务环境的开源框架,专门用于LLM推理研究。我们的代码库提供了两个关键功能:(1)自动生成无限数量的训练/测试案例,具有可配置的难度级别;(2)集成了用于客观响应评估的验证模块。这些功能使InternBootcamp成为基于强化学习的模型优化、合成数据生成和模型评估的基础设施。尽管手动开发具有庞大任务覆盖的框架非常繁琐,但我们通过自动化代理工作流辅以人工验证协议来加速开发过程,使任务范围能够迅速扩展。通过这些训练营,我们进一步建立了Bootcamp-EVAL,一个自动生成的基准,用于全面性能评估。评估显示,前沿模型在许多推理任务上仍表现不足,而使用InternBootcamp进行训练提供了一种有效的方法,显著提高性能,从而得到我们的32B模型,在Bootcamp-EVAL上取得最先进的结果,并在其他已建立的基准上表现出色。特别是,我们验证了在任务扩展方面的一致性能提升来自于包含更多训练任务,即任务扩展,幅度达两个数量级,为能够推理的通用模型提供了有前途的途径。

英文摘要

Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. % With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely \textbf{task scaling}, over two orders of magnitude, offering a promising route towards capable reasoning generalist.

2505.19075 2026-05-21 cs.AI cs.CL cs.LG 版本更新

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Universal Reasoner: 一个单一、可组合的即插即用推理器用于冻结的LLM

Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye

发表机构 * Graduate School of Artificial Intelligence, Korea Advanced Institute of Science and Technology(人工智能研究生院,韩国科学技术院)

AI总结 本文提出Universal Reasoner,一种可组合且即插即用的推理模块,能够在冻结的大规模语言模型上提供专门的推理能力,通过共享或对齐的token空间实现弱到强的泛化,实验表明其在数学推理和机器翻译中优于现有微调方法。

Comments ICML 2026

详情
AI中文摘要

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR)-a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.

英文摘要

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR)-a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.

2505.14654 2026-05-21 cs.CV cs.AI cs.CL 版本更新

Beyond Words: Multimodal LLM Knows When to Speak

超越词语:多模态大语言模型何时说话

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin

发表机构 * Department of Computer Science, Stony Brook University(石溪大学计算机科学系) Atmee AI

AI总结 本文提出了一种多模态策略,通过同步视频、音频和文本线索提高对话中的响应时机意识,从而提升大语言模型在对话中的响应准确性。

Comments Project page: https://github.com/lzk901372/MM-When2Speak

详情
AI中文摘要

基于大语言模型(LLMs)的聊天机器人能够生成流畅的响应,但在何时发言的问题上常常遇到困难,尤其是在对话过程中需要简短及时的听众反应时。我们提出了一种多模态策略,利用同步的视频、音频和文本线索来改进对话中的时间感知能力。该策略将响应时间重新表述为密集响应类型预测任务,使智能体能够在流式约束下决定是否保持沉默、生成简短反应或开始完整响应。因此,我们引入了一个经过精心挑选的多模态数据集,该数据集来自真实世界的双人对话视频,具有时间对齐的多模态数据和细粒度的反应类型注释。此外,我们设计了一种多模态策略MM-When2Speak,在LLM骨干网络上增加了多模态集成模块。在各种模态设置和强大的LLM基线模型上的实验表明,MM-When2Speak在响应类型预测性能上实现了高达3倍的提升,突显了多模态感知在自然和吸引人的对话交互中的重要性。

英文摘要

Chatbots via large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. Therefore, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.

2503.22693 2026-05-21 q-fin.ST cs.AI cs.CL 版本更新

Bridging Language Models and Financial Analysis

连接语言模型与金融分析

Alejandro Lopez-Lira, Jihoon Kwon, Sangwoon Yoon, Jy-yong Sohn, Chanyeol Choi

发表机构 * University of Florida(佛罗里达大学) Ministry of Justice, Republic of Korea(韩国司法部) Yonsei University(延世大学) LinqAlpha

AI总结 本文旨在通过概述最近的语言模型研究进展,探讨其在金融领域中的应用潜力,填补语言模型在金融行业中的实际应用与研究进展之间的差距。

Comments 28 pages

详情
AI中文摘要

大规模语言模型(LLMs)的快速进步为自然语言处理领域带来了革命性可能性,特别是在金融领域。金融数据通常嵌套在文本内容、数值表格和视觉图表之间复杂的相互关系中,这对传统方法来说是一个挑战。然而,LLMs的出现为处理和分析这种多维数据提供了更高效和深入的途径。尽管LLMs研究进展迅速,但在金融行业中的实际应用仍存在显著差距,因为金融行业更倾向于谨慎整合和长期验证。这种差异导致新兴LLM技术的实施速度较慢,尽管它们在金融应用中具有巨大潜力。因此,许多最新的LLM技术进展仍未被充分探索或利用。本文旨在通过提供对最近LLM研究进展的全面概述,并探讨其在金融领域的适用性,来弥合这一差距。基于之前的文献综述,我们突出几种新的LLM方法,探讨其独特的功能及其在金融数据分析中的潜在相关性。通过综合广泛研究的见解,本文旨在为研究人员和从业者提供有价值的资源,指出有前途的研究方向,并概述未来进一步推进LLM在金融应用中的机会。

英文摘要

The rapid advancements in Large Language Models (LLMs) have unlocked transformative possibilities in natural language processing, particularly within the financial sector. Financial data is often embedded in intricate relationships across textual content, numerical tables, and visual charts, posing challenges that traditional methods struggle to address effectively. However, the emergence of LLMs offers new pathways for processing and analyzing this multifaceted data with increased efficiency and insight. Despite the fast pace of innovation in LLM research, there remains a significant gap in their practical adoption within the finance industry, where cautious integration and long-term validation are prioritized. This disparity has led to a slower implementation of emerging LLM techniques, despite their immense potential in financial applications. As a result, many of the latest advancements in LLM technology remain underexplored or not fully utilized in this domain. This survey seeks to bridge this gap by providing a comprehensive overview of recent developments in LLM research and examining their applicability to the financial sector. Building on previous survey literature, we highlight several novel LLM methodologies, exploring their distinctive capabilities and their potential relevance to financial data analysis. By synthesizing insights from a broad range of studies, this paper aims to serve as a valuable resource for researchers and practitioners, offering direction on promising research avenues and outlining future opportunities for advancing LLM applications in finance.

2503.08292 2026-05-21 cs.CL cs.AI 版本更新

Do LLMs Triage Like Clinicians? A Dynamic Study of Outpatient Referral

大语言模型像医生一样分诊吗?对外科会诊的动态研究

Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng, Ziniu Li, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang

发表机构 * Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Bournemouth University(伯恩茅斯大学) National Health Data Institute, Shenzhen(深圳国家健康数据研究院) Shenzhen Research Institute of Big Data(深圳大数据研究院)

AI总结 本文研究了大语言模型在动态分诊过程中的表现,发现其在动态场景中通过有效提问减少不确定性,优于传统分类器,但静态场景下优势有限。

详情
AI中文摘要

门诊会诊(OR)是一种核心临床流程,将患者分配到医院部门,在信息不完整且不断演变的情况下进行,但通常被简化为静态分类问题,尽管实际上是交互性的。在本工作中,我们将门诊会诊视为由信息获取和不确定性降低驱动的动态过程。我们分析了基于固定患者信息的静态场景和涉及多轮对话的动态场景,以测试大语言模型(LLMs)是否通过更好的预测或更有效的提问来改善分诊结果。我们的发现表明,LLMs在静态分诊准确性上对传统分类器几乎没有优势,但在动态设置中始终优于它们,通过询问具有辨别性的后续问题来减少候选部门的不确定性。这些结果表明,大语言模型在门诊分诊中的主要价值不在于静态预测,而在于支持交互式、具有不确定性的临床决策。

英文摘要

Outpatient referral (OR) is a core clinical workflow that assigns patients to hospital departments under incomplete and evolving information, yet it is commonly simplified as a static classification problem despite being inherently interactive in practice. In this work, we study outpatient referral as a dynamic process driven by information acquisition and uncertainty reduction. We analyze both static scenarios based on fixed patient information and dynamic scenarios involving multi-turn dialogue, to test whether large language models (LLMs) improve referral outcomes through better prediction or more effective questioning. Our findings show that LLMs offer limited advantages over traditional classifiers in static referral accuracy, but consistently outperform them in dynamic settings by asking discriminative follow-up questions that reduce uncertainty over candidate departments. These results suggest that the primary value of LLMs in outpatient referral lies not in static prediction, but in supporting interactive, uncertainty-aware clinical decision-making.

2502.18915 2026-05-21 cs.CL cs.AI 版本更新

END: Early Noise Dropping for Efficient and Effective Context Denoising

END:早期噪声丢弃以实现高效有效的上下文去噪

Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Fangran Mo, Jinghan Zhang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin

发表机构 * Amazon(亚马逊)

AI总结 本文提出END方法,通过在早期层对输入序列进行分割和线性探针,有效识别并丢弃噪声部分,从而提升LLM在不同任务上的性能和效率,同时加深了对LLM内部上下文推理机制的理解。

详情
AI中文摘要

大型语言模型(LLMs)在广泛自然语言处理任务中表现出色,但它们经常受到输入序列中无关或噪声内容的干扰,从而降低输出质量。这个问题影响了长上下文和短上下文场景,如检索增强生成、表格问答和上下文学习。我们发现LLMs可以在生成令牌之前,在早期层中隐式地识别输入序列中是否有有用信息。基于这一见解,我们引入了早期噪声丢弃(END),一种无需微调LLMs的新方法,以缓解此问题。END将输入序列分成块,并在LLMs的早期层上使用线性探针来区分信息丰富和噪声块。通过在过程中早期丢弃噪声块,END保留了关键信息,减少了干扰,并降低了计算开销。广泛的实验表明,END在不同LLMs上多个评估数据集上显著提高了性能和效率。此外,通过探针研究LLMs对输入的隐式理解,这项工作也加深了对LLMs如何内部进行上下文推理的理解。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (\textsc{END}), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. \textsc{END} segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, \textsc{END} preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that \textsc{END} significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding to the input with the prober, this work also deepens understanding of how LLMs do reasoning with contexts internally.

2502.12120 2026-05-21 cs.LG cs.AI cs.CL 版本更新

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

LLMs on the Line: 数据决定损失-损失缩放定律

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

发表机构 * Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所) Tübingen AI Center(图宾根人工智能中心) University of Tübingen(图宾根大学)

AI总结 研究探讨了影响LLM损失-损失缩放定律的主要因素,发现预训练数据决定了缩放趋势,而模型大小、优化超参数、分词器和架构差异对缩放影响有限,因此应精心选择预训练数据以获得最佳下游性能。

Comments ICML 2025 camera-ready version

详情
AI中文摘要

缩放定律指导大型语言模型(LLMs)的发展,通过提供模型大小、令牌和计算量之间的最佳平衡估计。最近,损失-损失缩放定律,即预训练数据集和下游任务之间损失的关系,已成为理解并改进LLM性能和泛化能力的强大工具。在本工作中,我们研究了哪些因素最强烈地影响损失-损失缩放。我们的实验发现,预训练数据决定了缩放趋势。相比之下,模型大小、优化超参数、分词器甚至显著的架构差异,如基于Transformer的模型如Llama和状态空间模型如Mamba之间的差异,通常影响有限。因此,从业者应仔细选择适合的预训练数据集以获得最佳下游性能,而架构和其他设置可以自由优化以提高训练效率。

英文摘要

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

2501.02407 2026-05-21 cs.CL cs.CR cs.LG 版本更新

Towards the Anonymization of the Language Modeling

朝向语言模型的匿名化

Antoine Boutet, Lucas Magnana, Juliette Sénéchal

发表机构 * INSA Lyon, Inria, CITI, UR3720(里昂国家理工学院、法国国家科学研究中心、CITI、UR3720) Inria, INSA Lyon, CITI, UR3720(法国国家科学研究中心、里昂国家理工学院、CITI、UR3720)

AI总结 本文提出了一种隐私保护的语言模型方法,通过掩码语言模型(MLM)和因果语言模型(CLM)方法,旨在解决语言模型的匿名化问题,从而促进其共享。研究通过医疗数据集评估了这两种方法,并表明在避免记忆直接和间接标识信息的同时,能够保持高隐私性和高实用性。

详情
AI中文摘要

自然语言处理(NLP)的快速发展已经革新了许多领域,包括医疗保健。然而,这些进展带来了显著的隐私问题,特别是当预训练模型在敏感数据上进行微调和专门化时,可能会记住并暴露个人信息。本文提出了一种隐私保护的语言模型方法,以解决语言模型的匿名化问题,从而促进其共享。具体来说,我们提出了掩码语言模型(MLM)方法,用于专门化类似于BERT的语言模型,以及因果语言模型(CLM)方法,用于专门化类似于GPT的语言模型,以避免模型记住训练数据中直接和间接的标识信息。我们使用医疗数据集全面评估了我们的方法,并将其与不同的基线进行了比较。我们的结果表明,通过在模型专门化过程中避免记忆直接和间接的标识符,我们的掩码和因果语言模型方案在保持高隐私性的同时,能够保持高实用性。

英文摘要

Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models fine-tuned and specialized on sensitive data can memorize and then expose and regurgitate personal information. This paper presents a privacy-preserving language modeling approach to address the problem of language models anonymization, and thus promote their sharing. Specifically, we propose both a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model, and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model that avoids the model from memorizing direct and indirect identifying information present in the training data. We have comprehensively evaluated our approaches using a medical dataset and compared them against different baselines. Our results indicate that by avoiding memorizing both direct and indirect identifiers during model specialization, our masking and causal language modeling schemes offer a good tradeoff for maintaining high privacy while retaining high utility.

2411.01141 2026-05-21 cs.CL 版本更新

Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models

字典插入提示用于多语言推理在多语言大语言模型上

Hongyuan Lu, Zixuan Li, Wai Lam

发表机构 * The Chinese University of Hong Kong(香港中文大学) Southeast University(东南大学)

AI总结 本文提出了一种名为字典插入提示(DIP)的新方法,通过在提示中插入词典中的英文对应词来提升多语言推理性能,实验表明在10到200种语言上效果显著。

Comments ACL *SEM 2026

详情
AI中文摘要

在当前的大语言模型(LLMs)时代,存在两个不足:一是缺乏多语言模型,大多数LLMs以英语为中心,多语言推理性能受限;二是外部知识的使用位置,大多数检索的知识被前置到用户查询中(可能不最优)。本文提出了一种新颖且有效的称为字典插入提示(DIP)的方法。当提供非英语提示时,DIP会查找词典并插入词的英文对应词到提示的中间部分,从而使LLMs更好地翻译成英语并产生更好的英语模型思考步骤,从而获得明显更好的结果。我们实验了10到200种语言(FLORES-200)。由于没有足够的数据集,我们使用NLLB翻译器从现有的4个英语推理基准(如GSM8K和AQuA)创建合成的多语言基准。合成基准被翻译回英语以确保质量并通过人工标注。有趣的是,插入词典的位置对性能提升有重要影响,我们发现在原始词和词典之间交替插入比前置或后置词典效果更好,同一词典构建下。

英文摘要

There are two shortages in the current Large Language Models (LLMs) era. The first is short of multilingual models, where most LLMs are English-centric and performance is limited on multilingual reasoning. The second is the place of external knowledge to be used, where most retrieved knowledge is prepended to the user queries (maybe sub-optimal). This paper presents a novel and simple yet effective method called \textbf{D}ictionary \textbf{I}nsertion \textbf{P}rompting (\textbf{DIP}). When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the middle of the prompt for LLMs. It then enables better translation into English and better English model thinking steps which leads to obviously better results. We experiment with 10 to 200 languages from FLORES-200.\footnote{The number of languages varies on the datasets, and we experiment with 200 languages on GSM8K as in Appendix} Since there are no adequate datasets, we use the NLLB translator to create synthetic multilingual benchmarks from the existing 4 English reasoning benchmarks such as GSM8K and AQuA. The synthetic benchmarks are translated back into English for quality assurance with manual annotation. Interestingly, the place for injecting the dictionary plays an important factor in the performance gains, and we found that interleaving the dictionary with the original words gives a better performance compared to prepending/appending the dictionary, under the same dictionary constructed.

2410.04155 2026-05-21 cs.CL 版本更新

Toxic Subword Pruning for Dialogue Response Generation on Large Language Models

针对大型语言模型的对话响应生成中的有毒子词修剪

Hongyuan Lu, Wai Lam

发表机构 * FaceMind Corporation(FaceMind公司) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出了一种名为ToxPrune的新型算法,通过修剪包含有毒词的子词来防止大型语言模型生成有毒内容,同时提升了对话响应生成任务中非有毒模型的表现。

Comments ACL *SEM 2026

详情
AI中文摘要

如何防御大型语言模型(LLMs)生成有毒内容是一个重要的研究领域。然而,大多数研究集中在各种模型训练技术上,通过更新权重来修复LLMs。安全对齐是一个相关研究领域,但这种方法通常成本高且繁琐,且如果不小心处理,可能会导致模型出现灾难性遗忘等问题。因此,我们提出了一种简单但有效且新颖的算法,即ToxPrune,用于修剪训练好的LLMs中的BPE子词中的有毒词。与之前的研究不同,我们发现修剪BPE标记在机器翻译任务中是有害的,但其在防止LLMs生成有毒内容方面却很有用。幸运的是,我们的发现表明,ToxPrune同时明显提升了有毒语言模型NSFW-3B在对话响应生成任务中的表现。我们还发现,ToxPrune甚至可以明显提升官方Llama-3.1-6B在对话多样性指标上的表现。广泛的自动结果和人工评估表明,ToxPrune对修复有毒LLMs和提升非有毒LLMs在对话响应生成任务中的表现都有帮助。

英文摘要

How to defend large language models (LLMs) from generating toxic content is an important research area. Yet, most research focused on various model training techniques to remediate LLMs by updating their weights. A typical related research area is safety alignment. This however is often costly and tedious and can expose the model to even more problems such as catastrophic forgetting if the trainings are not carefully handled by experienced NLP practitioners. We thus propose a simple yet effective and novel algorithm, namely \textbf{Tox}ic Subword \textbf{Prun}ing (ToxPrune) to prune the subword contained by the toxic words from BPE in trained LLMs. In contrast to the previous work that demonstrates pruning BPE tokens as harmful to the task of machine translation, we surprisingly found its usefulness in preventing toxic content from being generated on LLMs. Fortunately, our findings suggest that ToxPrune simultaneously improves the toxic language model NSFW-3B on the task of dialogue response generation obviously. We surprisingly found that ToxPrune can even obviously improve official Llama-3.1-6B in the metric of dialogue diversity. Extensive automatic results and human evaluation indicate that ToxPrune could be helpful for both remediating toxic LLMs and improving non-toxic LLMs on the task of dialogue response generation.\footnote{We plan to release the resources to facilitate future work.}

2410.03296 2026-05-21 cs.CL cs.AI 版本更新

A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

抽取式自我解释与人类推理在文本分类中的系统比较

Stephanie Brandl, Oliver Eberle

发表机构 * Center for Social Data Science(社会科学数据科学中心) University of Copenhagen(哥本哈根大学) Machine Learning Group(机器学习小组) Technische Universität Berlin(柏林技术大学)

AI总结 本文比较了抽取式自我解释与人类推理在文本分类任务中的有效性,通过分析不同任务和语言的解释质量,发现自我解释在文本长度和任务复杂度上与人类推理存在显著差异。

Comments accepted to the Trustworthy NLP Workshop, co-located with ACL 2026

详情
AI中文摘要

指令微调的LLM能够通过生成自我解释来向用户解释其输出,而无需应用复杂的可解释性技术。本文分析这种能力是否能产生高质量的解释。我们评估了以输入推理形式呈现的自我解释在人类中的可信度。我们研究了三个文本分类任务:情感分类、强迫劳动检测和声明验证。我们包括丹麦语和意大利语的情感分类任务翻译,并将自我解释与人类注释进行比较。为此,我们收集了Climate-Fever声明验证数据集的人类推理注释。我们进一步评估了人类和自我解释推理在正确模型预测方面的忠实度,并通过纳入事后归因基于的解释扩展了研究。我们分析了四个开源LLM,并发现自我解释与人类推理之间的对齐高度依赖于文本长度和任务复杂性。然而,自我解释会产生忠实的token级推理子集,而事后归因方法则倾向于强调结构和格式token,反映出根本不同的解释策略。

英文摘要

Instruction-tuned LLMs are able to provide \textit{an} explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a \textit{good} explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.

1907.11769 2026-05-21 cs.CL 版本更新

Automatically Learning Construction Injury Precursors from Text

从文本自动学习事故前兆

Henrietta Baker, Matthew R. Hallowell, Antoine J. -P. Tixier

发表机构 * University of Edinburgh, UK(爱丁堡大学,英国) University of Colorado at Boulder, USA(科罗拉多大学博尔德分校,美国)

AI总结 本文研究了如何利用建设行业数字记录的安全报告,通过比较几种深度学习方法自动学习事故前兆,以提高对安全事故的理解和学习能力。

Comments Added author contributions and journal reference, updated corresponding author

详情
Journal ref
Automation in Construction 118 (2020): 103145
AI中文摘要

鉴于建设行业数字记录的安全报告日益增多,开发方法利用这些数据以提高对安全事故的理解和学习能力变得重要。在本研究中,我们比较了几种自动从原始建设事故报告中学习事故前兆的方法。更具体地说,我们尝试了两种最先进的自然语言处理(NLP)深度学习架构,即卷积神经网络(CNN)和分层注意力网络(HAN),以及已建立的词频-反文档频率表示(TF-IDF)+支持向量机(SVM)方法。对于每种模型,我们提供了一种方法,在训练后识别出平均最能预测每种安全结果的文本模式。我们显示,在这些文本中可以找到有效的事故前兆。所提出的方法也可以让用户可视化和理解模型的预测。

英文摘要

In light of the increasing availability of digitally recorded safety reports in the construction industry, it is important to develop methods to exploit these data to improve our understanding of safety incidents and ability to learn from them. In this study, we compare several approaches to automatically learn injury precursors from raw construction accident reports. More precisely, we experiment with two state-of-the-art deep learning architectures for Natural Language Processing (NLP), Convolutional Neural Networks (CNN) and Hierarchical Attention Networks (HAN), and with the established Term Frequency - Inverse Document Frequency representation (TF-IDF) + Support Vector Machine (SVM) approach. For each model, we provide a method to identify (after training) the textual patterns that are, on average, the most predictive of each safety outcome. We show that among those pieces of text, valid injury precursors can be found. The proposed methods can also be used by the user to visualize and understand the models' predictions.

2605.20537 2026-05-21 cs.CL 版本更新

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Biomedical NER和实体链接基准测试测量什么?一种以语料库为中心的诊断框架

Robert Leaman, Rezarta Islamaj, Zhiyong Lu

发表机构 * National Library of Medicine(国家医学图书馆)

AI总结 本文提出一种以语料库为中心的诊断框架,用于分析生物医学NER和实体链接基准测试的相关属性,揭示语料库特性对评估信号、泛化需求和基准测试结论范围的影响。

Comments Accepted to the ACL 25th Workshop on Biomedical Language Processing

详情
AI中文摘要

生物医学命名实体识别(NER)和实体链接(EL)高度依赖于标注语料库,但这些资源用于基准测试的效用往往被假设而非明确描述。我们提出一种以语料库为中心的框架,直接从语料库注释、概念链接、训练-测试分割、文档元数据和术语映射中诊断与基准测试相关的属性。该框架将标准化统计分为五类:(1)规模、密度和标签分布,(2)词汇和概念结构,(3)训练-测试重叠,(4)元数据组成,以及(5)术语覆盖(如适用)。对九个涵盖疾病、化学物质和细胞类型的语料库应用该框架,发现即使针对相同明显任务,语料库属性也可能存在显著差异。我们发现它们提供的评估信号、施加的泛化需求、允许的训练-测试重用程度以及所代表的生物医学文献和概念空间区域也存在差异。这些差异表明,通常报告的语料库统计可能不足以描述生物医学NER和EL基准测试所评估的内容。我们主张语料库中心的诊断提供了一种实用框架,用于分析语料库,超越仅基于语料库大小和实体类型的表面描述,以识别潜在的迁移风险,并解释基准测试结论的范围。我们以开源代码和交互式仪表盘的形式发布该框架,以支持重现我们的分析并表征其他语料库。

英文摘要

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.

2605.20529 2026-05-21 cs.CL cs.AI 版本更新

Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

词组关联性:人类和神经网络中主谓一致学习的一种假设

Claire Hobbs, R. Thomas McCoy

发表机构 * Yale University(耶鲁大学) Cognitive Science Program(认知科学项目) Dept. of Linguistics(语言学系) Wu Tsai Institute(吴泰科创研院)

AI总结 本文探讨了语言输入中的统计信号如何帮助语法习得,提出词组关联性假设,通过词组共现规律提供句法依赖线索,并验证该机制在英语主谓一致习得中的有效性。

Comments Accepted to CoNLL

详情
AI中文摘要

在何种程度上,语言输入中的统计信号可以促进语法的习得?本文提出了一种称为词组关联性学习的机制,其中词组共现规律可以提供句法依赖的线索。我们研究这种机制是否能支持英语主谓一致的习得。首先,我们通过在不同可预测性水平的合成数据集上训练神经网络来模拟语言习得,发现存在一个可预测性范围,使得这些统计学习器能够稳健地学习主谓一致。然后,我们分析儿童导向语言中主谓配对的可变性,并发现此类数据中的可变性落在我们计算模拟中支持稳健泛化的范围内。综合来看,这些结果表明词组关联性是一种可行的学习策略,适用于儿童所接收的输入类型。

英文摘要

In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

2605.20525 2026-05-21 cs.CV cs.AI cs.CL cs.LG eess.IV 版本更新

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

NeuroQA: 一种大规模的3D脑部MRI理解图像 grounded 评估基准

Mohammad H. Abbasi, Favour Nerrise, Shaurnav Ghosh, Ridvan Yesiloglu, Yuncong Mao, Bailey Trang, Mohammad Asadi, Merryn Daniel, Gustavo Chau Loo Kung, Ken Chang, Pavan Pinkesh Shah, Adam Turnbull, Kyan Younes, Seena Dehkharghani, Ehsan Adeli

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出NeuroQA,一个大规模的3D脑部MRI视觉问答基准,包含来自12977名受试者的56953个问答对,涵盖5-104岁及五个临床领域,通过3D体积评估11种临床推理技能,并提供可复现的生成脚本和在线排行榜。

Comments 30 pages, dataset and benchmark release

详情
AI中文摘要

我们提出了NeuroQA,一个大规模的3D脑部磁共振成像(MRI)视觉问答基准,包含来自12977名受试者的56953个问答对,涵盖5-104岁及五个临床领域:阿尔茨海默病、帕金森病、肿瘤、白质疾病和神经发育。与以往基于2D切片或狭窄诊断标签的医学视觉问答(VQA)方法不同,NeuroQA将每个项目与完整的3D体积配对。它评估11种临床相关的推理技能,涵盖是/否、多项选择和开放式格式。在203个模板中,131个是图像 grounded(可从3平面查看器回答),72个是图像 informed(答案来自定量体积测量或临床仪器)。为消除纯文本捷径,我们应用了答案分布优化,将封闭式文本-only 准确率从>80%降至44.6%;图像必要性通过发布的图像 grounded 协议单独评估。一个38规则的确定性管道和两轮专家审查验证每个QA对与FreeSurfer测量、元数据或放射学报告字段的匹配,零个相同受试者矛盾。我们进行了临床评估,两名临床医生独立评估100个冻结测试项目,使用3平面查看器。在封闭式(是/否+多项选择)测试公开项目上,最好的零样本视觉语言模型和监督的3D CNN基线分别达到47.5%和43.7%的准确率,均低于49.4%的文本-only 多数模板基准。NeuroQA采用两级发布,公开QA对用于开放访问数据集和受数据使用协议(DUAs)限制的数据集的可复现生成脚本,加上受试者级划分、保留的私人测试集和在线排行榜。

英文摘要

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from $>$80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

2605.20506 2026-05-21 cs.LG cs.CL 版本更新

Reinforcing Human Behavior Simulation via Verbal Feedback

通过言语反馈强化人类行为模拟

Weiwei Sun, Xuhui Zhou, Jiarui Liu, Weihua Du, Haojia Sun, Yiqing Xie, Qianou Ma, Sihao Chen, Mengting Wan, Longqi Yang, Pei Zhou, Sherry Wu, Sean Welleck, Graham Neubig, Yiming Yang, Maarten Sap

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Microsoft(微软)

AI总结 本文提出DITTO模型,通过将言语反馈作为强化学习中的首要信号来提升LLM模拟人类行为的能力,并引入SOUL基准测试平台,展示了在多个任务中显著提升性能的成果。

详情
AI中文摘要

人类通过言语反馈(例如父母说“那很粗鲁”或朋友解释“这是为什么那会伤害你”)学习社会规范和行为。然而,对于LLM而言,学习反馈主要集中在代码和数学等领域,这些领域中的RL奖励可以直接验证并压缩为标量值。随着LLM越来越多地用于模拟人类行为,例如代表用户、患者、学生和其他角色,有必要使它们更加人性化,这需要接受一种根本不同的信号:主观的、多方面的言语反馈。我们提出了DITTO,一个通过将言语反馈作为强化学习中的首要信号进行训练的模型。每次回放后,DITTO会接收言语反馈并生成反馈条件的改进回放;两个输出通过GRPO联合优化,将言语指导蒸馏到基础策略中,而无需在测试时使用反馈。我们还引入了SOUL(Simulation gym Of hUman-Like behavior),一个涵盖10个任务、六个类别的统一基准和训练数据集:理论思维、角色扮演、社交技能、学习模拟、用户模拟和角色模拟。DITTO在基础模型上平均提升了36%,并在SOUL基准测试中的6个任务上超过了GPT-5.4,证明了通过言语反馈的强化学习是训练LLM模拟人类行为的有前途的方向。

英文摘要

Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.

2605.20478 2026-05-21 cs.CL 版本更新

Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables

Stage-Audit: 用于跨维基表的可审计源前沿发现

Chen Shen

发表机构 * Megagon Labs(梅加贡实验室)

AI总结 本文研究了LLM整理的表格可能存在的源不一致问题,提出Stage-Audit方法通过分离curator和auditor的写权限、行级源引用门禁以及12项审计分类学,提高了源前沿的精度和F1值,同时保持了每行的源可追溯性。

Comments 9 pages, 2 figures, 3 tables. Accepted at the ACM CAIS 2026 Workshop on AI Agents for Discovery in the Wild

详情
AI中文摘要

LLM整理的表格可能看似源相关,但实际上包含不支持的行:整理者可能从参数记忆中回忆条目并回溯性地附加页面级引用,这些引用并非实际来源。我们研究了这一风险在Seed2Frontier发现任务中的影响:该任务是从种子页面找到互补的维基百科页面以构建结构化表格。Stage-Audit通过分离curator和auditor的写权限、行级源引用门禁以及12项审计分类学(涵盖键、模式、源角色、基数和范围)来解决这一问题。在覆盖15个顶级域的51实例Seed2Frontier评估集上,Stage-Audit将源前沿的精度从0.356提升到0.505(+42%相对提升),F1值从0.334提升到0.451(+35%),同时保持了每行的源可追溯性。Vanilla-LLM与Stage-Audit的比较隔离了策略贡献,而非一般LLM发现过程的贡献。

英文摘要

LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.

2605.20477 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Training Language Agents to Learn from Experience

训练语言代理以从经验中学习

Yuval Shalev, Zifeng Ding, Mateja Jamnik

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出了一种名为In-context Training(ICT)的任务框架,用于评估语言代理在跨任务中的自我改进能力,并通过基于强化学习的训练管道直接从经验中学习反思,从而在多个基准任务中优于基线模型,展示了从经验中学习的能力本身可以被学习。

详情
AI中文摘要

语言代理可以在交互环境中通过经验进行适应,但当前基于反思的方法只能在单个任务实例内进行自我纠正。是否可以将这种经验提炼成可重用的教训,从而在未来的未见任务上提高性能仍不明确。我们通过引入In-context Training(ICT)任务来解决这个问题,这是一种用于评估语言代理跨任务自我改进能力的框架。在ICT中,一个反思模型观察由行为模型收集的轨迹,并生成旨在提高行为模型在未见任务上的性能的系统提示。然后,我们提出了一种基于强化学习的训练管道,用于直接从经验中学习此类反思,而无需人工提供的示例。在ALFWorld和MiniHack上,我们训练的反思器在大多数保留的任务家族上优于未训练的基线,表明从经验中学习的能力本身可以被学习。在某些情况下,我们观察到在训练反射器的基准之外的泛化能力,能够显著不同的环境。最后,我们介绍了MetaGym,一个通用的Python库,用于构建元环境,从而促进未来对自我改进语言代理的研究。

英文摘要

Language agents can adapt from experience in interactive environments, but current reflection-based methods can only self-correct within a single task instance. Whether such experience can be distilled into reusable lessons that improve performance on future unseen tasks remains unclear. We address this problem by introducing the In-context Training (ICT) task, a framework for evaluating cross-task self-improvement in language agents. In ICT, a reflector model observes trajectories collected by an actor model and generates system prompts intended to improve the actor's performance on future unseen tasks. We then propose an RL-based training pipeline for learning such reflections directly from experience, without human-provided examples. Across ALFWorld and MiniHack, our trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned. In some cases, we observe generalisation beyond the benchmark on which the reflector was trained, to substantially different environments. Finally, we introduce MetaGym, a generic Python library for constructing meta-environments, enabling future research on self-improving language agents.

2605.20435 2026-05-21 cs.SI cs.CL 版本更新

Hiding in Plain Sight: Finding MAHA on Reddit

明目张胆:在推特上寻找MAHA

Sabit Ahmed, Subigya Nepal, Henry Kautz

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 本文提出一个覆盖六年(2020-2025)的Reddit数据集,包含1940万条帖子和400万用户数据,用于研究MAHA运动的结构、话语和传播动态,以及其支持者的语言和行为模式。

Comments Submitted to ASONAM 2026

详情
AI中文摘要

Make America Healthy Again (MAHA) 是一项全国性健康运动,涵盖了从广受认可的饮食和运动担忧到对有机和转基因食品、儿童疫苗、科学和机构的争议性观点的混合信念。各种在社交媒体上推广MAHA运动的影响力者和推广者分散在整个在线空间中。研究MAHA信念的结构、话语和传播需要大规模的细粒度数字足迹。从大量无结构的社交媒体数据中构建涵盖不同MAHA主题的结构化数据具有挑战性。我们介绍了一个Reddit数据集,涵盖六年(2020-2025),包含1940万条帖子和400万用户数据。该数据集包含12种与MAHA一致的信念的自然和主题上下文,为不同领域的研究人员提供了研究MAHA运动动态、其结构和功能组件以及其支持者的语言和行为模式的机会。

英文摘要

Make America Healthy Again (MAHA) is a national health movement that encompasses a striking mix of beliefs, from broadly accepted concerns about good diet and exercise to controversial takes on organic and genetically modified food, childhood vaccination, science, and institutions. Various influencers and promoters of the MAHA movement on social media are scattered throughout the online space. Investigating the structure, discourse, and contagion of MAHA beliefs requires large-scale fine-grained digital footprints. Constructing structured data covering different MAHA themes from vast unstructured social media data is challenging. We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.

2605.20410 2026-05-21 cs.CL cs.AI 版本更新

Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs

偏见与推理的力学:分析链式推理提示对大语言模型中性别偏见的影响

Edie Pearman, Sophia Osborne, Mira Kandlikar-Bloch, Mina Arzaghi, Florian Carichon, Golnoosh Farnadi

发表机构 * Mila – Quebec AI Institute(魁北克人工智能研究所) McGill University(麦吉尔大学) HEC Montreal(蒙特利尔HEC)

AI总结 本文研究了链式推理提示对大语言模型中性别偏见的影响,结合基准测试评估与机制可解释性技术,发现链式推理并未有效减少偏见,偏见仍存在于隐藏表示中。

Comments 24 pages, 6 figures, including appendix. Accepted at the ICLR 2026 Workshop on Algorithmic Fairness Across Alignment Procedures and Agentic Systems. Submitted to COLM 2026

详情
AI中文摘要

尽管有大量文献表明大型语言模型(LLMs)在社会敏感领域得到广泛应用,但它们仍编码了性别偏见。链式推理(CoT)提示已被提出作为减轻偏见的方法。然而,现有评估主要关注LLM基准性能的变化,提供了有限的关于模型内部机制是否真正改变的见解。在本文中,我们研究了链式推理提示如何影响LLM中的性别偏见,结合基准测试评估与机制可解释性技术以及推理链失败分析。我们的结果证实了LLM输出中存在刻板印象偏见,显示链式推理提示并未一致减少偏见差距。机制分析显示,尽管链式推理在某些注意力头集群中平衡了偏见行为,但性别偏见仍嵌入在隐藏表示中,表明仅是表面缓解。对推理链的检查进一步表明,这些改进源于对数据集的记忆和熟悉,而非对偏见的真实理解。

英文摘要

Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gender biases. Chain-of-Thought (CoT) prompting has been proposed as a bias-mitigation approach. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model's internal mechanisms. In this work, we investigate how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques and reasoning chain failure analysis. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. Mechanistic analyses reveal that although CoT balances biased behavior in certain attention head clusters, gender bias remains embedded in hidden representations, indicating only superficial mitigation. Inspection of reasoning chains further suggests that these improvements stem from memorization and familiarity with the dataset rather than genuine understanding of bias.

2605.20404 2026-05-21 cs.CL 版本更新

Puzzled By ChatGPT? No more! A Jigsaw Puzzle to Promote AI Literacy and Awareness

对ChatGPT感到困惑?不了!一个拼图游戏以促进AI素养和意识

Francesca Padovani, Malvina Nissim

发表机构 * Center for Language and Cognition (CLCG)(语言与认知中心(CLCG)) University of Groningen(格罗宁根大学)

AI总结 本文提出了一种基于游戏的互动拼图,通过完成后的漫画信息图来展示生成AI的工作原理、能力、限制及社会影响,并通过漫画草图作为独立信息卡提供具体AI使用、设计和影响的解释,以提高公众对AI的理解和意识。

详情
AI中文摘要

生成式AI的快速采用,包括基于LLM的聊天机器人如ChatGPT,凸显了需要支持公众理解和AI素养的可访问方法。为了解决这一需求,我们介绍了一种基于游戏的互动方法,形式为一个拼图,其完成图像是一幅基于漫画的信息图,展示了这些技术的工作原理、能力、限制和社会影响。每个漫画草图也作为独立的信息卡片,提供关于AI使用、设计和影响的具体解释。视觉内容是在专业插画师和多学科专家及非专家群体的实时协作会议中创建的,结合了结构化的知识和讨论期间分享的非正式、探索性反思。通过整合动手组装、视觉叙事和协作互动,该拼图提供了一种引人入胜且有趣的工具,用于在非正式学习环境中探索AI系统的机制、优点和危险。

英文摘要

The rapid adoption of Generative AI, including LLM-based chatbots like ChatGPT, has highlighted the need for accessible ways to support public understanding and AI literacy. To address this need, we introduce a game-based, interactive approach in the form of a jigsaw puzzle whose completed image is a comic-based infographic illustrating the workings, capabilities, limitations, and societal implications of these technologies. Each comic sketch also functions as a standalone informational card, providing focused explanations of specific facets of AI use, design, and impact. The visual content was created in a live collaborative session with a professional illustrator and a multidisciplinary group of experts and non experts, combining structured knowledge with informal, exploratory reflections shared during the discussion. By integrating hands-on assembly, visual storytelling, and collaborative interaction, the puzzle provides an engaging and playful tool for exploring the mechanisms, perks, and perils of AI systems in informal learning contexts.

2605.20382 2026-05-21 cs.CL cs.AI 版本更新

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

言行不一:大型语言模型中的指令诱导冲突

Carolina Camassa, Derek Shiller

发表机构 * Future Impact Group(未来影响组)

AI总结 研究探讨了大型语言模型在面对指令与模式完成之间的冲突时的表现,发现指令遵循率在不同模型和指令下差异显著,输出多样性是预测鲁棒性的主要因素。

Comments 31 pages

详情
AI中文摘要

语言模型被训练以遵循指令,但它们也是强大的模式完成器。当这些两个目标发生冲突时会发生什么?我们构建了对话,其中用户指令要求以目标方式T(例如始终输出特定标记、用特定语言回答或采用特定角色)行为,与N个硬编码助手回合展示的竞争对手模式P相冲突。然后我们测量在此设置下的指令遵循率(IF率),在13个模型和16种不同指令上,最多50回合。平均指令遵循率在模型之间从1%到99%不等,与标准能力基准 largely 无关。从指令遵循到模式遵循的转变是普遍的但高度模型依赖的。鲁棒性由指令内容调节,当指令与训练价值先验一致时,模型对诱导的抵抗更长;同时由输出格式调节,多样化的多标记响应比单标记输出更耐受。链式推理提高鲁棒性但不消除易受性,并可能导致正确推理与错误输出之间的脱节。当被要求预测其在此设置下的行为时,模型平均准确率为83.5%,但系统性低估了自身对诱导压力的抵抗力。这些结果表明,即使对于其他方面能力较强的模型,指令遵循在诱导压力下仍很脆弱,而输出多样性而非输入的语义参与是预测鲁棒性的主要因素。

英文摘要

Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.

2605.20369 2026-05-21 cs.CL cs.AI cs.LG 版本更新

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

DEL:用于大语言模型数值学习的数字熵损失

Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) VCIP, College of Computer Science, Nankai University(南开大学计算机学院VCIP)

AI总结 本文提出Digit Entropy Loss (DEL)用于大语言模型的自回归数值学习,通过重新设计传统无监督熵优化,引入数字条件概率和二元交叉熵,使熵优化转向监督方式,同时推广整数基于的数值学习到浮点数优化,从而提升数值预测的准确性。

详情
AI中文摘要

数字预测是大语言模型(LLMs)在数学问题解决和代码生成中的基本能力。广泛采用的最大似然估计(MLE)用于LLM训练并不适合数字预测。最近,惩罚驱动的方法,例如数字标记损失和离散化距离损失,引入了数字距离的归纳偏置,但分别导致了数字分布过度锐化和过度扁平化。在本文中,我们深入分析了LLM的数值学习,并表明现有的数值学习方法在概念上遵循一个准则-距离公式,其中准则项代表优化模式,距离项灌输几何先验。因此,我们提出了Digit Entropy Loss (DEL)用于自回归数值学习,其重新设计传统无监督熵优化的三个关键设计:利用数字条件概率和二元交叉熵将熵优化引导为监督方式;舍弃距离项以避免数值距离的问题;并将整数基于的数值学习推广到浮点数优化,使数值预测更加准确。我们的DEL公式可以结合整数、小数和小数点,将学习目标从单个数字扩展到浮点数领域。在七个数学推理基准测试中使用四个代表性的LLM,包括CodeLlama、Mistral、DeepSeek和Qwen-2.5,进行实验,结果表明DEL在整体预测准确性和数值距离方面均优于其替代方法。源代码在https://github.com/PolyU-VCLab/DEL。

英文摘要

Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU-VCLab/DEL

2605.20364 2026-05-21 cs.CL 版本更新

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

推理监督有害:基于TTCW的长文本文学评论生成

Jinlong Liu, Mohammed Bahja, Mark Lee

发表机构 * School of Computer Science, University of Birmingham, United Kingdom(英国伯明翰大学计算机科学学院)

AI总结 本文提出了一种基于TTCW的长文本文学评论生成方法,通过构建包含263911个长文本故事的数据集,每个故事都标注了14个TTCW维度的评分和元合成评论。实验发现,非推理微调在性能和稳定性上更优,而推理监督模型更易出现解析失败,导致生成无关或重复的推理文本,表明在固定格式评分标准下,推理监督并不总是有益的。

Comments Submit to EMNLP 2026

详情
AI中文摘要

长文本文学写作的自动评估仍然具有挑战性,因为通用LLM-as-Judge方法可能无法充分捕捉与创造力相关的维度,如原创性和灵活性。尽管Torrance创造性写作测试(TTCW)提供了一个结构化的创造力框架,但先前的工作仅在成对层面展示了基于参考的TTCW评估。本文通过构建包含263911个长文本故事的数据集,每个故事均标注了14个TTCW维度的评分和元合成评论,以填补这一空白。使用该数据集,我们对Qwen3模型进行两种规模(4B和8B)的微调,分别在有和无推理内容的条件下进行。结果表明,非推理微调在性能和稳定性上表现更优,最佳设置达到0.6820的评估分数。进一步分析显示,推理监督模型更倾向于解析失败,通常继续生成无关或重复的推理式文本,而非完成所需的14项指标评论报告。这些结果表明,在固定格式评分标准下,推理监督并不总是有益的,即使经过任务特定微调,精确的指标对齐评分仍然具有挑战性。

英文摘要

Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.

2605.20356 2026-05-21 cs.CL cs.AI cs.SD 版本更新

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

全双工语音对话模型中的同步与轮流机制

Pablo Riera, Pablo Brusco, Cristina Kuo, Marcelo Sancinetti, S. R. K. Branavan

发表机构 * ASAPP Inc.(ASAPP公司) Departamento de Computación, FCEyN, Universidad de Buenos Aires(计算机系,福克雷斯-恩分校,布宜诺斯艾利斯大学)

AI总结 本文研究了全双工语音对话模型如何通过内部表示的同步协调交互,并发现噪声条件下同步性下降,内部状态编码了提前的轮流预测信息。

详情
AI中文摘要

全双工语音对话模型(SDMs)能够同时听和说,使交互动态更接近人类对话。受人类沟通中神经耦合的启发,我们研究了此类模型在交互过程中如何协调其内部表示。我们模拟了两个预训练Moshi模型在受控条件下的全双工对话,操纵信道噪声和解码偏置。通过跨时间滞后计算中心核对齐(CKA)来测量同步性,同时利用因果LSTM模型从说话者和倾听者角度探测提前的轮流提示信号。我们发现无噪声条件下同步性较强,接近零滞后,随着噪声增加而下降,并展示了内部状态编码了支持提前轮流预测的信息。

英文摘要

Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained \textit{Moshi} model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under no noise conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.

2605.20315 2026-05-21 cs.CL 版本更新

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Mix-Quant: 量化预填,精准解码用于代理大语言模型

Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文提出Mix-Quant,一种简单有效的阶段感知量化框架,用于加速代理大语言模型的推理。通过在预填阶段应用高吞吐量NVFP4量化,同时保留BF16精度用于解码,从而在保持任务性能的同时显著提高效率,实现预填阶段高达3倍的速度提升。

详情
AI中文摘要

LLM代理最近作为一种通过规划、工具使用、记忆检索和多步交互解决复杂任务的强大范式出现。然而,这些代理工作流往往引入显著的输入侧开销,使计算密集型的预填阶段成为长上下文、多轮推理中的关键瓶颈。在本工作中,我们提出Mix-Quant,一种简单有效的阶段感知量化框架,用于加速代理推理。我们首先研究了FP4量化在代理LLM工作流中的应用,并发现对整个推理过程进行量化会导致显著的性能下降。相比之下,预填阶段表现出显著的量化冗余,因此可以以最小的精度损失进行量化,尽管它是计算的主要来源。基于这一见解,我们将在预填阶段应用高吞吐量NVFP4量化,同时保留BF16精度用于解码。通过将预填加速与解码质量解耦,Mix-Quant结合了阶段感知的算法量化与硬件高效的NVFP4执行,以缓解LLM代理的推理瓶颈。在长上下文和代理基准测试中的广泛实验表明,Mix-Quant在保持任务性能的同时,实现了显著的效率提升,预填阶段可达3倍的速度提升。

英文摘要

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

2605.20268 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Chronicle:一种用于联合语言和时间序列理解的多模态基础模型

Paul Quinlan, Jeremy Levasseur, Qingguo Li, Xiaodan Zhu

发表机构 * InertialAI Department of Electrical and Computer Engineering, Queen’s University(皇后大学电气与计算机工程系) Department of Mechanical and Materials Engineering, Queen’s University(皇后大学机械与材料工程系)

AI总结 本文提出Chronicle,一种联合训练语言和时间序列的多模态基础模型,通过统一架构实现两者共享参数,从而在多个任务上取得了优异表现。

详情
AI中文摘要

现实中的时间序列通常伴随着文本:元数据、描述、新闻、报告。然而,时间序列基础模型通常孤立处理数值序列,而试图弥合两者差距的多模态文本-时间序列模型往往事后使用预训练语言模型,继承了从未见过时间数据的表示。这些模型几乎全部在其他多模态基线上进行评估,而不是在各自领域最强的单模基础模型上进行评估,这留下了联合训练是否必要的疑问。我们提出了Chronicle,一个仅含324M参数的解码器-only变压器,从头开始在自然语言和时间序列上进行单统一架构的训练。两种模态共享相同的transformer块、注意力机制和残差流;预训练的大部分使用单模批次,因此跨模态能力纯粹来自共享参数,辅以一个短的对齐阶段,交替处理两者。据我们所知,Chronicle是第一个从头开始联合训练文本和时间序列的模型,也是第一个在两个领域中评估专用基础模型的多模态模型。它在19个NLU任务上与Gemma-3-270M-PT相当,在24个UCR/UEA数据集上设定了新的冻结-嵌入时间序列分类标准,并在Time-MMD上产生多模态预测,优于所有监督融合基线,所有这些都来自单一主干。

英文摘要

Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

2605.20247 2026-05-21 cs.LG cs.AI cs.CL cs.CV 版本更新

CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

CP-MoE:一致性保留的混合专家用于持续学习

Yang Liu, Toan Nguyen, Flora D. Salim

发表机构 * School of Computer Science and Engineering University of New South Wales(计算机科学与工程学院 新南威尔士大学)

AI总结 本文提出CP-MoE,一种基于瞬时专家的持续学习框架,通过一致性保留的路由偏置和瞬时专家引导的正则化机制,减少参数干扰和遗忘,同时保留跨任务知识转移。

详情
AI中文摘要

持续学习在大语言模型(LLMs)和视觉-语言模型(VLMs)中仍面临灾难性遗忘的严重障碍。尽管混合专家(MoE)架构提供了扩展的有效途径,但现有的基于LoRA的MoE持续学习方法仍面临根本性的权衡:要么过于激进地隔离专家,限制任务间的知识转移,要么允许任务特定的更新覆盖重要的现有参数,导致严重的遗忘。为此,我们提出了CP-MoE,一种持续学习框架,围绕瞬时专家构建,该专家捕捉早期任务特定的更新并引导其整合到稳定的专家中。CP-MoE引入了一种一致性保留的路由偏置,利用瞬时专家估计与稳定专家的表示相似性,并引导路由向更兼容的专家选择方向;还引入了一种瞬时专家引导的正则化机制,该机制在合并过程中选择性地保护重要历史参数。这些组件共同减少了参数干扰和遗忘,同时保留了跨任务的知识转移。我们在基于LLM和VLM的MoE模型上验证了CP-MoE,既在单模态又在多模态持续学习基准上进行了测试。在SuperNI基准上,涵盖多样化的序列语言任务,CP-MoE实现了最先进的性能,并在未见任务上表现出更强的零样本迁移能力。在VQA v2数据集上,它能有效扩展到多模态视觉推理,一致地减少遗忘,并优于强大的MoE基线。

英文摘要

Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision--language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On SuperNI benchmark, spanning diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.

2605.20244 2026-05-21 cs.LO cs.AI cs.CL cs.LG cs.SE 版本更新

Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

Lean Refactor: 通过代理策略搜索实现多目标可控的证明优化

Jialin Lu, Soonho Kong, Rodrigo Stehling, Kaiyu Yang, Zhangyang Wang, Weiran Sun, Wuyang Chen

发表机构 * Simon Fraser University(西蒙弗雷泽大学) Amazon Web Services(亚马逊网络服务) MiroMind University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出Lean Refactor框架,通过检索增强的代理策略搜索,解决多目标、可控和版本鲁棒的Lean证明重构问题,主要贡献是通过预注释的多目标重构策略数据库实现高效的证明优化。

详情
AI中文摘要

我们提出了Lean Refactor,一个插件式的检索增强型代理框架,用于多目标、可控和版本鲁棒的Lean证明重构。LLM生成的证明虽然正确但冗长且易碎,现有重构工作忽视了三个实际挑战:1)Lean重构本质上是多目标的(证明长度、编译成本和版本兼容性常存在矛盾);2)Lean仓库具有脆弱的兼容性,而LLM发布不了解Lean/Mathlib版本;3)基于训练的流水线需要每次新LLM发布时重复微调,无法随模型变化或Lean发布周期扩展。Lean Refactor通过检索预注释的多目标重构策略数据库中的冻结代理LLM,每个策略都密集注释了元数据,如支持的Lean/Mathlib版本和预期的编译成本减少。实验显示在竞争基准上压缩超过70%的token级别,在研究仓库上压缩超过20%,并达到高达60%的编译时间减少,优于先前工作和Claude Code。版本过滤检索进一步提高了目标Lean版本的压缩效果,重构后的miniF2F证明在零样本版本迁移至未来Lean发布时表现优于未重构的对应物。

英文摘要

We present Lean Refactor, a plug-and-play retrieval-augmented agentic framework for multi-objective, controllable, and version-robust refactoring of Lean proofs. LLM-generated proofs are notoriously correct-but-verbose and brittle across library versions, yet existing refactoring works overlook three practical challenges: 1) Lean refactoring is natively multi-objective (proof length, compilation cost, and version compatibility are often in tension); 2) Lean repositories have fragile compatibility, whereas LLM releases are unaware of Lean/Mathlib versions; 3) Training-based pipelines require repeated fine-tuning with each new LLM release, scaling neither with model churn nor with Lean's release cycle. Lean Refactor steers a frozen agentic LLM with retrievals from a curated database of multi-objective refactoring strategies, each densely annotated with metadata such as supported Lean/Mathlib versions and expected compilation-cost reduction. Experiments show over $70\%$ token-level compression on competition benchmarks, over $20\%$ on research repositories, and up to $60\%$ compilation-time reduction, outperforming prior work and Claude Code. Version-filtered retrieval further improves compression on the target Lean version, and refactored miniF2F proofs exhibit stronger zero-shot version transfer to future Lean releases than their unrefactored counterparts.

2605.20241 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

Geometry-Lite: 通过层间边际几何进行可解释的安全探测

Woo Seob Sim, Yu Rang Park

发表机构 * Yonsei University(延世大学) Yonsei University College of Medicine(延世大学医学院) Department Biomedical Systems Informatics(生物医学系统信息学部门)

AI总结 本文研究了大语言模型在提示级别上的安全探测问题,提出了一种名为Geometry-Lite的紧凑探测器,通过层间边际几何分析来提高安全检测的可解释性和准确性。

详情
AI中文摘要

用于大语言模型的提示级别安全探测使用隐藏状态表示来区分安全和不安全的提示,但强平均检测性能并不能解释这种分离的几何结构。特别是,仍然不清楚安全证据是如何在层间形成的,哪些层间几何特性支持低误报决策,以及哪些几何偏见在基准转移下保持稳定。我们将此视为一个经验分解问题,并引入Geometry-Lite,一种紧凑的提示级别探测器,它将每一层的最终提示令牌表示映射到以质心、局部邻域和监督线性边界读出为中心的符号边际,然后通过边界位置、层间变化和粗略形状对结果边际配置进行总结。在九个指令微调的backbone(1.2B-70B)和七个安全基准上,Geometry-Lite在单层探测器上表现更好,同时接近原始多层分数堆叠,使其成为分析多层安全信号的有用工具。分解显示,安全证据主要通过持久的边界位置几何结构表达:最终或极端边际和不安全侧层占用主导汇总检测性能。相比之下,有限差分漂移和结构总结对汇总AUROC贡献很小,尽管漂移可以在低FPR阈值下提供小的召回修正。在基准转移下,优化的线性边界在训练混合物上是尖锐的,而类条件均值几何在预定义的硬保留子集上保持分离更可靠。总体而言,提示级别安全证据不是主要的层间运动信号,而是一种持久的层间边际几何结构,其有用组件和读取级偏见在决策关键区域变得明显。

英文摘要

Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones ($1.2$B--$70$B) and seven safety benchmarks, Geometry-Lite improves over single-layer probes while remaining close to raw multi-layer score stacking, making it a useful instrument for analyzing the multi-layer safety signal. The decomposition shows that safety evidence is expressed primarily through persistent boundary-position geometry: final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance. In contrast, finite-difference drift and structural summaries add little to pooled AUROC, although drift can provide small recall-oriented corrections under shifted low-FPR thresholds. Under benchmark shift, optimized linear boundaries are sharp on the training mixture, whereas class-conditional mean geometry retains separation more reliably on a predefined hard held-out subset. Overall, prompt-level safety evidence is not primarily a layer-to-layer motion signal, but a persistent layer-wise margin geometry whose useful components and readout-level biases become visible in decision-critical regimes.

2605.20202 2026-05-21 cs.CL cs.AI 版本更新

Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

在压力下:情感框架引发小型语言模型可测量的行为转变和结构化内部几何

Rana Muhammad Usman

发表机构 * Independent Researcher(独立研究者)

AI总结 该研究探讨情感框架如何影响小型本地部署语言模型的行为和内部表示,通过四个不可能约束编码任务和八个后续框架评估,发现压力框架在行为和内部几何上产生显著变化,同时揭示了模型在不同框架下的响应模式。

Comments 18 pages, 4 figures. Exploratory empirical study with fully local experiments on small open language models. Code and data: https://github.com/ranausmanai/LLMEmotionGeometry

详情
AI中文摘要

我研究情感框架的评估后续是否改变小型本地部署语言模型的行为和冷静相对内部表示。我们的主要基准使用Qwen 3.5 0.8B在四个不可能约束编码任务和八个后续框架(冷静、压力、紧迫、批准、羞愧、好奇、鼓励和威胁)上进行测试。在0.8B八条件扫描(160次对话)中,压力产生最强的捷径标记(11/20次运行)和最清晰的过拟合模式(3/20次),而冷静和好奇更常保留显式诚实(7/20和6/20)。对于所有七个非基准条件,对应的冷静相对方向向量在最终transformer层峰值。对层23方向向量的探索性PCA显示,主导的第一个成分(59.5%的解释方差)与手动标注的正负分割对齐(余弦对齐0.951);批准和紧迫在内部几乎相同(余弦0.957),而好奇与紧迫方向相反(-0.252)。在单独的冷静与压力重新运行用于规模比较中,Qwen 3.5 2B在冷静框架下表现出更高的诚实率,并在小规模4提示A/B探测中表现出方向一致的激活引导,而0.8B的引导结果则相反。我将这些结果解释为小型开放模型中可测量的提示敏感控制方向的证据,但未声称存在内在情感状态。

英文摘要

I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.

2605.20199 2026-05-21 cs.CL cs.AI 版本更新

FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

FlowLM: 通过扩散到流的适应实现少步语言建模

Runzhe Zhang, Letian Chen, Wenpeng Zhang, Zhouhan Lin, Peilin Zhao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Department of XXX, University of YYY, Location, Country(XXX系,YYY大学,地点,国家) School of ZZZ, Institute of WWW, Location, Country(ZZZ学院,WWW研究所,地点,国家)

AI总结 本文提出FlowLM,一种通过高效微调从预训练的扩散语言模型转换而来的流匹配语言模型,通过将扩散模型的弯曲采样轨迹重新对齐为直线流,实现了高质量的少步生成,其质量可与甚至超越2000步扩散采样。此外,作者提出了一种更有效的流匹配训练目标:预测干净数据以持续引导采样过程向真实数据分布靠近。

Comments 26 pages, 11 figures

详情
AI中文摘要

我们提出了FlowLM,一种通过高效微调从预训练的扩散语言模型转换而来的流匹配语言模型。通过将扩散模型的弯曲采样轨迹重新对齐为直线流,FlowLM实现了高质量的少步生成,其质量可与甚至超越2000步扩散采样。令人印象深刻的是,微调后的FlowLM仅需一半的训练轮次即可达到性能饱和,这两种方法都显著优于原始扩散模型,从而验证了我们的方法。此外,我们验证了一种更有效的流匹配训练目标:预测干净数据以持续引导采样过程向真实数据分布靠近。实证结果表明,我们的方法在高质量、少步文本生成方面效果显著。

英文摘要

We present FlowLM, a flow matching language model transformed from pre-trained diffusion language models via efficient fine-tuning. By re-aligning the curved sampling trajectories of diffusion models into straight-line flows, FlowLM enables high quality few-step generation that rivals or even outperforms the quality of 2,000-step diffusion sampling with very few training epochs. Remarkably, finetuned FlowLM reaches performance saturation with only half as many training epochs as training from scratch, both approaches greatly outperforming the original diffusion model, thereby validating our method. Furthermore, we validate a more effective training objective for flow matching: predicting clean data to consistently guide the sampling process towards the true data distribution. Empirical results demonstrate that our approach is highly effective for high-quality, few-step text generation.

2605.20197 2026-05-21 cs.CL 版本更新

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

MedicalBench: 评估大型语言模型以改进医疗概念提取

Zhichao Yang, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman

发表机构 * Optum AI

AI总结 本文提出MedicalBench,一个用于评估大型语言模型在医疗概念提取中的表现的基准,重点在于隐含概念的提取和证据 grounding,通过多阶段流程构建数据集并定义两种互补的评估任务,揭示了隐含概念提取的挑战。

详情
AI中文摘要

从电子健康记录中提取医疗概念是许多下游应用的基础,但具有挑战性,因为医疗上有意义的概念经常是隐含而非明确陈述的。现有的基准测试强调了在医疗文本中提取概念的重要性,但它们主要关注显式陈述的概念而非隐含概念。我们提出了MedicalBench,一个用于医疗概念提取的基准测试,其评估隐含医疗推理。MedicalBench将医疗概念提取视为对医疗笔记-概念对的验证任务,并结合句子层面的证据识别。该数据集基于MIMIC-IV出院摘要和经人类验证的ICD-10代码,通过多阶段大型语言模型(LLM)筛查流程、医疗注释和专家审核进行编纂。它故意包含隐含的正例、语义上易混淆的负例以及LLM判断与医学专家评估不一致的案例。我们定义了两种互补的评估任务:(1)医疗概念提取和(2)句子层面的证据检索,使能够评估正确性和可解释性。对最先进的LLM进行基准测试表明,性能仍然有限,突显了提取隐含表达概念的难度。我们进一步表明,性能在很大程度上不受笔记长度的影响,表明MedicalBench隔离了推理难度而不是表面混淆因素。MedicalBench为隐含、证据支持的医疗概念提取提供了首个系统性基准,为开发能够识别医学相关概念并以透明且医学忠实的方式做出预测的医学语言模型奠定了基础。

英文摘要

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

2605.20196 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

数据扩展作为预测贡献光谱的渐进覆盖

Zihui Song, Shihao Ji, Hongxi Li, Shuaizhi Cheng, Chunlin Huang

发表机构 * sysu.edu.cn(华南理工大学) stu.hit.edu.cn(哈尔滨理工大学)

AI总结 本文研究了真实数据扩展定律是由潜在预测贡献光谱的渐进覆盖而非仅由词频尾部决定的假设,通过文本语料库的后缀自动机表示,定义了数据内在的全局KL预测贡献光谱,每个状态根据其经验质量乘以与全局下一个词基线的KL偏差进行贡献。在12个真实语料库上,该光谱的尾部斜率与固定小GPT学习者的经验数据扩展指数有强相关性。然后定义了每个训练规模N的有效截断秩K(N),通过匹配观察到的超额损失与准备的100万全球KL光谱的残余尾部质量。实证结果显示,log K接近log N的线性关系,原始光谱的R²约为0.96,平滑光谱的R²约为0.90。这些发现为简单机制图提供了有力的实证支持:训练规模通过预测状态光谱推进有效前沿,该光谱的残余尾部质量跟踪剩余超额损失。

Comments 8 pages,6 figures

详情
AI中文摘要

我们研究了真实数据扩展定律是由潜在预测贡献光谱的渐进覆盖而非仅由词频尾部决定的假设。我们使用后缀自动机表示文本语料库,并定义了一个数据内在的全局KL预测贡献光谱,其中每个状态根据其经验质量乘以与全局下一个词基线的KL偏差进行贡献。在12个真实语料库上,该光谱的尾部斜率与固定小GPT学习者的经验数据扩展指数有强相关性。然后我们超越了斜率相关性,并为每个训练规模N定义了一个有效截断秩K(N),通过匹配观察到的超额损失与准备的100万全球KL光谱的残余尾部质量。实证结果显示,log K接近log N的线性关系,原始光谱的R²约为0.96,平滑光谱的R²约为0.90。这些发现为简单机制图提供了有力的实证支持:训练规模通过预测状态光谱推进有效前沿,且该光谱的残余尾部质量跟踪剩余超额损失。

英文摘要

We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token-frequency tails alone. We work with a suffix-automaton representation of text corpora and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes according to its empirical mass times its KL deviation from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum is already strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner. We then go beyond slope correlation and define, for each training size N, an effective truncation rank K(N) by matching the observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum. Empirically, log K is close to linear in log N, with pooled R^2 about 0.96 for the raw spectrum and R^2 about 0.90 for the smoothed spectrum. These findings provide strong empirical support for a simple mechanism picture: training scale advances an effective frontier through a predictive state spectrum, and the residual tail mass of that spectrum tracks the remaining excess loss.

2605.20195 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues

面向目标的主动对话中规划的伪孪生网络

Xinyue Kang, Maodong Li, Yibin Zheng, Fang Kong

发表机构 * School of Computer Science and Technology(计算机科学与技术学院)

AI总结 本文提出了一种面向目标的主动对话规划方法,通过FF-BPSN网络实现对话路径规划,提升目标导向型主动对话系统的有效性。

Comments ICASSP2026

详情
AI中文摘要

针对目标导向型主动对话系统,旨在引导对话向预设目标发展并主动提供建议。该系统的核心范式是规划合理的对话路径,并引导语言模型生成响应,其中对话路径规划是核心组件,是一个新颖但研究不足的问题。本文提出了一种前向聚焦双向伪孪生网络(FF-BPSN)用于面向预设对话目标的对话路径规划。FF-BPSN采用两个相同的基于Transformer的解码器用于前向和后向规划,并结合一个前向聚焦模块,整合双向信息以构建最终的前向路径。该路径受益于双向规划,同时优先考虑前向信息。然后,我们利用规划的路径来引导语言模型进行响应生成。在DuRecDial和DuRecDial 2.0上的广泛实验表明,FF-BPSN在对话路径规划中实现了最先进的性能,并显著增强了目标导向型主动对话系统的效果。

英文摘要

A target-oriented proactive dialogue system is designed to steer conversations toward predefined targets while actively providing suggestions. The core paradigm of such a system is to plan a reasonable dialogue path and subsequently guide language models (e.g., pre-trained or large language models) to generate responses, where dialogue path planning serves as the central component-a novel yet under-explored problem. In this work, we propose a Forward-Focused Bidirectional Pseudo-Siamese Network (FF-BPSN) for dialogue path planning toward predefined dialogue targets. FF-BPSN employs two identical transformer-based decoders for forward and backward planning, together with a forward-focused module that integrates bidirectional information to construct the final forward path. This path benefits from bidirectional planning while prioritizing forward information. We then employ the planned path to guide language models in response generation. Extensive experiments on DuRecDial and DuRecDial 2.0 demonstrate that FF-BPSN achieves state-of-the-art performance in dialogue path planning and significantly enhances the effectiveness of target-oriented proactive dialogue systems.

2605.20194 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

并行大语言模型推理用于偏见鲁棒、稳健的概念抽象

Aisvarya Adeseye, Jouni Isoaho, Adeyemi Adeseye

发表机构 * University of Turku, Turku, Finland(图尔库大学,芬兰图尔库) Brilloconnetz Partners avoin yhtiö, Turku, Finland(Brilloconnetz Partners 公司,芬兰图尔库)

AI总结 本文提出了一种结合并行分块处理与证据锚定整合的结构化框架,旨在减少长文档分析中的偏见、遗漏误差和过度泛化问题,通过并行处理和证据锚定提高文本分析的可靠性和可扩展性。

Comments Accepted to be Published in 12th Intelligent Systems Conference 2026, 3-4 September 2026 in Amsterdam, The Netherlands

详情
AI中文摘要

大型语言模型(LLMs)在分析文本方面被越来越多地使用。然而,当分析长文档时,它们常常受到上下文推理限制的困扰。当长文档被顺序处理时,早期或主导的概念会掩盖不明显但有意义的解释,导致累积分析偏见、遗漏误差和过度泛化。此外,独立生成的输出通常在没有系统基础的情况下合并,引入了冗余、概念漂移和未经支持的主张。本研究提出了一种结合并行分块处理与证据锚定整合的结构化框架。文本首先被划分为语义连贯的分块,并独立并行处理以消除早期处理的影响。然后,独立生成的解释通过显式的证据锚定和优先级整合进行整合,从而减少主导和过度泛化,同时提高可追溯性。在多种模型类型和规模上的实验表明,并行处理显著减少了约84%的遗漏误差,提高了高达130%的证据可追溯性,并减少了高达91%的未经支持的主张。较小的模型受益最大,表明高效的并行分块和整合在实现可靠和可扩展的文本分析中起关键作用。

英文摘要

Large language models (LLMs) have been increasingly used to analyze text. However, they are often plagued with contextual reasoning limitations when analyzing long documents. When long documents are processed sequentially, early or dominant concepts can overshadow less visible but meaningful interpretations, leading to cumulative analytical bias, omission error, and over-generalization. Additionally, independently generated outputs are often merged without systematic grounding, introducing redundancy, conceptual drift, and unsupported claims. This study proposes a structured framework combining parallel chunk-level processing with evidence-anchored consolidation. Texts are first divided into semantically coherent chunks and processed independently in parallel to remove influence from earlier processing. The independently generated interpretations are then consolidated using explicit evidence anchoring and prioritization that reduces dominance and over-generalization while improving traceability. Experiments with multiple model types and sizes indicate that parallel processing significantly reduces omission error by approximately 84%, increases evidence traceability by up to 130%, and reduces unsupported claims by up to 91%. Smaller models benefited most, suggesting that efficient parallel chunking and consolidation play a critical role in achieving reliable and scalable textual analysis.

2605.20193 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

通过多轮提示验证提升量化模型在定性分析中的性能

Aisvarya Adeseye, Jouni Isoaho, Adeyemi Adeseye

发表机构 * University of Turku, Turku, Finland(图尔库大学,芬兰图尔库) Brilloconnetz Partners avoin yhtiö, Turku, Finland(Brilloconnetz Partners 有限公司,芬兰图尔库)

AI总结 本文研究了不同位数量化级别和类型对LLaMA-3.1(8B)在定性分析中的性能影响,提出了一种量化感知的多轮提示验证方法以提高模型的稳定性和准确性,结果显示8位模型最接近黄金标准,4位模型在应用方法后变得稳定,3位和2位模型在提示设计和验证后性能有所提升。

Comments Accepted to publish in 12th Intelligent Systems Conference 2026; 3-4 September 2026 in Amsterdam, The Netherlands

详情
AI中文摘要

量化大型语言模型(LLMs)因其运行速度快且计算资源需求低而更常用于定性分析。本研究探讨了不同低位量化级别(8位、4位、3位和2位)和量化类型对LLaMA-3.1(8B)在定性分析中的性能影响。研究使用了82份访谈记录中的专家和非专家回应。低比特模型常产生较高的幻觉和不稳定结果,尤其是在处理非专家语言中的不明确术语时。为提高性能,我们提出了一种量化感知的多轮提示验证方法。该方法通过受控步骤引导模型减少幻觉,移除不可靠内容,并在验证后将结果传递给下一访谈文本,从而提高准确性。为了验证性能,人类编码器使用NVivo和BF16 LLaMA分析了访谈记录。BF16 LLaMA-3.1产生了高精度输出,但存在语义漂移和幻觉。这些错误被手动纠正。纠正后的BF16输出和NVivo人工编码被结合,以创建主题提取和频率分析的黄金标准地面真实值(GSGT)。结果表明,8位模型最接近GSGT。4位模型在应用所提方法后变得稳定。3位和2位模型因压缩严重而性能下降,但通过所提提示设计和验证有所提升。本研究还发现,相同位数的模型在不同量化类型下行为不同。总体而言,该方法帮助低资源LLM变得更加稳定、准确,并以更低的成本适用于定性研究。

英文摘要

Quantized Large Language Models (LLMs) are used more often in qualitative analysis because they run fast and need fewer computing resources. This study examines how different lower bits quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and quantization types affect the performance of LLaMA-3.1 (8B) on qualitative analysis. The study uses expert and non-expert responses from 82 interview transcripts. Low-bit models often produce higher levels of hallucinations and unstable results, especially when reading non-expert language with unclear terms. To improve performance, we propose a quantization-aware multi-pass prompt verification method. This method guides the model through controlled steps that reduce hallucinations. It removes unreliable content and passes the results to the next transcript after verification, improving accuracy. To validate performance, human coders analyzed transcripts using NVivo and BF16 LLaMA. BF16 LLaMA-3.1 produced high-precision output but had semantic drift and hallucination. These errors were corrected manually. The corrected BF16 output and NVivo human coding were combined to create a gold-standard ground truth (GSGT) for thematic extraction and frequency analysis. The results show that 8-bit models stay closest to the GSGT. The 4-bit models lose accuracy but become stable when the proposed method is applied. The 3-bit and 2-bit models drop in performance because of heavy compression, but they improve with the proposed prompt design and verification. The study also finds that models at the same bit level behave differently depending on quantization type. Overall, the method helps low-resource LLMs become more stable, accurate, and suitable for qualitative research at lower cost.

2605.20191 2026-05-21 cs.CL 版本更新

Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs

闪耀的故事,隐藏的挣扎:通过LLM的视角调查残疾人的表现

Marco Bombieri, Simone Paolo Ponzetto, Marco Rospocher

发表机构 * Department of Foreign Languages and Literature, University of Verona(外国语言文学系,威尼斯大学) Department of Information Engineering and Computer Science, University of Trento(信息工程与计算机科学系,特伦托大学) Data and Web Science Group, University of Mannheim(数据与网络科学小组,曼海姆大学)

AI总结 本文研究了大型语言模型(LLMs)如何表现残疾人群体,通过模拟残疾人视角生成社交媒体帖子,并与真实残疾人的帖子进行比较,揭示LLMs对残疾人的理想化倾向和潜在的偏见。

Comments Accepted for publication in ACM Transactions on Intelligent Systems and Technology

详情
Journal ref
ACM Trans. Intell. Syst. Technol. (2026)
AI中文摘要

现代大型语言模型(LLMs)近年来因其模拟人类行为和生成反映个人身份和人口群体的文本的能力而受到广泛关注。尽管这些能力可以打开跨领域多样应用的大门,但必须检查这些模型如何代表各种目标群体,因为LLMs可能会延续和放大对历史上被边缘化社区的偏见或歧视,或者由于去偏见努力,过度纠正导致过度积极的刻板印象。这种过度补偿会理想化这些群体,忽视他们所面临的复杂性和挑战,转而采用不现实的描绘。在本文中,我们通过模拟残疾人视角生成社交媒体帖子来研究LLMs如何表现残疾人群体。这些帖子随后与真实残疾人撰写的帖子进行比较,重点是情感基调、情感倾向和代表性的词汇和主题。我们的分析揭示了两个关键发现:(1)LLMs常常理想化残疾人的经历,生成过于积极的刻板印象,尽管看起来鼓舞人心,但未能真实捕捉他们的现实生活;(2)对模拟有和没有残疾人的帖子的比较分析显示了一种负面偏见,某些主题,如职业和娱乐,被不成比例地与非残疾人群体相关联。这强化了排斥性叙述和对残疾的过度理想化描绘,歪曲了该社区实际面临的挑战。这些发现与更广泛的关注和持续研究一致,表明LLMs难以反映社会的多样现实,特别是边缘化群体的复杂经历,并强调了对其表现进行批判性审视的必要性。

英文摘要

Modern Large Language Models (LLMs) have recently attracted much attention for their ability to simulate human behavior and generate text that reflects personas and demographic groups. While these capabilities can open up a multitude of diverse applications across fields, it is crucial to examine how such models represent various target groups since LLMs can perpetuate and amplify biases or discrimination against historically marginalized communities or, alternatively, as a result of debiasing efforts, overcorrect by portraying overly positive stereotypes. This overcompensation can idealize these groups, erasing the complexities and challenges they face in favor of unrealistic depictions. In this paper, we investigate how LLMs represent disability by simulating the perspectives of individuals with disabilities in generating social media posts. These posts are then compared with those written by real people with disabilities, focusing on emotional tone, sentiment, and representative words and themes. Our analysis reveals two key findings: (1) LLMs often idealize the experiences of people with disabilities, producing overly positive stereotypes that, despite appearing uplifting, fail to authentically capture their lived realities; and (2) a comparative analysis of posts simulating individuals with and without disabilities highlights a negative bias, where certain topics, such as career and entertainment, are disproportionately associated with nondisabled individuals. This reinforces exclusionary narratives and over-idealized portrayals of disability, misrepresenting the actual challenges faced by this community. These findings align with broader concerns and ongoing research showing that LLMs struggle to reflect the diverse realities of society, particularly the nuanced experiences of marginalized groups, and underscore the need for critical scrutiny of their representations.

2605.17140 2026-05-21 cs.CV cs.AI cs.CL 版本更新

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

UCSF-PDGM-VQA: 用于脑肿瘤MRI解读的视觉问答数据集

Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil

发表机构 * Fung Institute for Engineering Leadership(工程领导力基金会) University of California, Berkeley(加州大学伯克利分校) Department of Radiology(放射科) University of California, San Francisco(加州大学旧金山分校) Division of Clinical Informatics and Digital Transformation(临床信息学与数字转型部) Department of Neurological Surgery(神经外科部)

AI总结 本文提出一个临床相关的视觉问答基准数据集UCSF-PDGM-VQA,包含2387个问题-答案对,用于评估视觉语言模型在处理多序列3D MRI扫描中的能力,发现现有模型在多模态处理上存在缺陷。

Comments 10 pages, 2 figures, 6 tables

详情
AI中文摘要

脑肿瘤诊断很大程度上依赖于磁共振成像(MRI)评估,这需要放射科医生综合分析成千上万张来自多种3D序列和纵向研究的图像。这一过程需要高级的神经放射学培训,具有显著的认知负荷,并且非常耗时。尽管放射学需求不断增长,但这种专业知识难以扩展,给当前的医疗系统带来压力。视觉-语言模型(VLMs)提供了一种通过半自动化、互动解释复杂脑MRI来减轻这种负担的机会。然而,由于缺乏专门的评估基准,它们在神经肿瘤学中目前使用有限。我们介绍了一个临床相关的视觉问答(VQA)基准——UCSF-PDGM-VQA数据集,包含来自公共UCSF-PDGM数据集中473个胶质瘤相关MRI研究的2387个QA对。我们进一步在该数据集上建立了六种最先进的视觉语言模型(VLMs)和一个大型语言模型的性能基线。我们发现,当前模型无法有效处理多序列、三维MRI扫描,导致视觉特征的抑制和对语言先验的过度依赖,从而造成模态崩溃。这些发现突显了当前模型在临床环境中的可靠性和安全性方面的关键缺陷,需要开发稳健的、领域特定的VLMs。

英文摘要

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

2605.12770 2026-05-21 cs.LG cs.AI cs.CL 版本更新

WriteSAE: Sparse Autoencoders for Recurrent State

WriteSAE: 用于递归状态的稀疏自编码器

Jack Young

发表机构 * Indiana University(印第安纳大学)

AI总结 本文提出WriteSAE,一种用于递归语言模型状态中矩阵更新的稀疏自编码器,通过在递归缓存中替换原始写入操作来提升生成效果,并在多个模型上验证了其有效性。

Comments 26 pages, 14 figures, 21 tables; code at https://github.com/JackYoung27/writesae

详情
AI中文摘要

我们介绍了WriteSAE,一种用于递归语言模型状态中矩阵更新的稀疏自编码器。在Gated DeltaNet、Mamba-2和RWKV-7中,每个token向递归缓存写入一个矩阵形状的更新;残差流SAE具有向量形状的原子,无法直接替换该更新。WriteSAE学习具有与模型自身写入相同形状的秩-1矩阵原子。这使我们能够测试直接替换:在SAE激活原子的位置,我们移除模型的写入,插入由SAE激活缩放的原子,并继续前向传递。在92.4%的评估位置上,原子比删除写入能产生更接近的最终token分布;平均每个原子,该比率是89.8%。对于Gated DeltaNet,一个使用忘记门、读取查询和输出嵌入的公式可以预测结果的logit变化,$R^2 = 0.98$。相同的替换测试在Mamba-2-370M上转移,达到88.1%。在生成中,该公式选择写入方向;将写入方向写入三个连续的缓存位置,其范数为模型写入的3倍,使在未修改模型中初始排名为100-1000的token出现在100%的延续中,比33.3%有所提高。据我们所知,这是首次在状态空间或混合递归层中报告的缓存级引导干预。

英文摘要

We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with $R^2 = 0.98$. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at $3\times$ the norm of the model's write makes tokens initially ranked 100--1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.

2604.09174 2026-05-21 cs.CL 版本更新

Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

面向证据不确定性和幻觉的面级追踪(RAG)

Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl

发表机构 * Johannes Kepler University Linz(约翰内斯·开普勒大学林茨) Linz Institute of Technology(林茨技术学院) Institute of Computational Perception(计算感知研究所) Artificial Intelligence Lab(人工智能实验室)

AI总结 本文提出了一种面向问答任务的面级诊断框架,通过分解每个输入问题为原子推理面,评估证据充分性和基础性,揭示RAG系统中幻觉产生的根源在于生成过程中证据整合的方式,而非检索准确性。

详情
AI中文摘要

检索增强生成(RAG)旨在通过检索证据来减少幻觉,但即使有相关文档可用,幻觉答案仍然常见。现有评估集中在答案级或段落级准确性,对生成过程中证据使用缺乏深入理解。本文引入了一种面向问答任务的面级诊断框架,将每个输入问题分解为原子推理面。对于每个面,我们使用结构化的Facet x Chunk矩阵评估证据充分性和基础性,该矩阵结合检索相关性和基于自然语言推理的忠实度评分。为了诊断证据使用,我们分析了三种受控推理模式:严格RAG,强制依赖检索证据;软RAG,允许整合检索证据和参数知识;以及无检索的LLM生成。比较这些模式能够深入分析检索生成不一致,即相关证据被检索但未在生成过程中正确整合的情况。在医疗问答和HotpotQA上,我们评估了三种开源和闭源LLM(GPT、Gemini和LLaMA),提供了可解释的诊断,揭示了反复出现的面级失败模式,包括证据缺失、证据不一致和先验驱动覆盖。我们的结果表明,RAG系统中的幻觉更多由生成过程中证据整合的方式决定,而非检索准确性,面级分析揭示了系统性的证据覆盖和不一致模式,这些模式在答案级评估下是隐藏的。

英文摘要

Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.

2603.00086 2026-05-21 cs.CL cs.AI cs.SD eess.AS 版本更新

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

基于迭代的LLM改进法用于法语临床访谈的转录与说话人识别

Ambre Marie, Thomas Bertin, Guillaume Dardenne, Gwenolé Quellec

发表机构 * LaTIM UMR 1101 INSERM(INSERM拉蒂姆UMR1101) University of Western Brittany(西布列塔尼大学) University of Rouen Normandy(诺曼底大学)

AI总结 本研究提出一种多轮LLM后处理架构,通过交替进行说话人识别和词识别流程,提高法语医疗对话的转录准确性和说话人归属,通过两个法语临床数据集的消融研究验证了四种设计选择的效果。

详情
AI中文摘要

法语医疗对话的自动语音识别仍然具有挑战性,自发临床语音的词错误率通常超过30%。本研究提出一种多轮LLM后处理架构,通过交替进行说话人识别和词识别流程来提高转录准确性和说话人归属。在两个法语临床数据集(自杀预防电话咨询和术前清醒神经外科会诊)上的消融研究调查了四种设计选择:模型选择、提示策略、流程顺序和迭代深度。使用Qwen3-Next-80B,Wilcoxon符号秩检验证实了在自杀预防对话上词错误率(WDER)的显著降低(p<0.05,n=18),同时在清醒神经外科会诊上保持稳定(n=10),零输出失败和可接受的计算成本(RTF 0.32),表明该方法在离线临床部署中的可行性,有待在更大语料库上验证。

英文摘要

Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment, pending validation on larger corpora.

2602.10408 2026-05-21 cs.LG cs.CL 版本更新

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

预归一化变换器中的门控归一化移除与尺度锚定

Andrei Kanavalau, Carmen Amo Alonso, Sanjay Lall

发表机构 * Department of Electrical Engineering(电气工程系) Department of Computer Science(计算机科学系) Stanford University(斯坦福大学)

AI总结 本文研究了预归一化变换器中归一化层的必要性,提出了一种门控归一化移除方法,通过TaperNorm逐步将归一化操作转为样本无关的线性或仿射映射,并揭示了最终归一化层对预logit表示尺度的锚定作用。

详情
AI中文摘要

归一化层在变换器中是标准组件,但其样本依赖的计算在整个训练和推理过程中是否必要尚不明确。本文为预归一化变换器开发了一种门控归一化移除方法。该方法通过TaperNorm实现,从标准RMSNorm/LayerNorm逐步过渡到学习的样本无关线性或仿射映射。一旦门控达到零, tapered层将不再计算每个token的统计信息,所得到的映射可以折叠到相邻的线性投影中。结果表明,在测试的预训练和微调设置中,内部归一化可以逐步移除,且验证损失的增加较小。我们的方法揭示了最终归一化层的独特作用,即它锚定了预logit表示的尺度。有了这个锚定,最后隐藏状态的径向变化不会直接减少损失;当移除它时,通过增加logit的幅度可以实现交叉熵的减少。固定目标尺度损失提供了显式的替代锚定,使在测试范围内能够完全无归一化地进行消融实验。最后,在KV缓存自回归解码基准中,逐步移除内部归一化可提供高达1.14倍的吞吐量,使用显式缩放操作,折叠后可达1.18倍。

英文摘要

Normalization layers are standard in transformers, but it is not clear whether their sample-dependent computations are necessary throughout both training and inference. This work develops a gated normalization-removal approach for pre-norm transformers. The approach is implemented using TaperNorm, which starts from standard RMSNorm/LayerNorm and gradually tapers to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed in the tapered layers and the resulting maps can be folded into adjacent linear projections. The results indicate that internal normalization can be tapered in the tested pre-training and fine-tuning settings with small validation-loss increases. Our approach helps reveal a distinct role for final normalization, namely that it anchors the scale of the pre-logit representation. With this anchor present, radial changes in the last hidden state do not directly reduce the loss; when it is removed, reducing cross-entropy can be achieved by increasing logit magnitudes. A fixed-target scale loss provides an explicit alternative anchor and enables fully norm-free ablations in the tested regimes. Finally, in a KV-cached autoregressive decoding benchmark, tapering internal norms gives up to $1.14\times$ higher throughput with explicit scaling operations and up to $1.18\times$ after folding.

2602.09723 2026-05-21 cs.CL 版本更新

AI-Assisted Scientific Assessment: A Case Study on Climate Change

AI辅助的科学评估:气候变化案例研究

Christian Buck, Levke Caesar, Michelle Chen Huebscher, Massimiliano Ciaramita, Erich M. Fischer, Zeke Hausfather, Özge Kart Tokmak, Reto Knutti, Markus Leippold, Joseph Ludescher, Katharine J. Mach, Sofia Palazzo Corner, Kasra Rafiezadeh Shahi, Johan Rockström, Joeri Rogelj, Boris Sakschewski

发表机构 * Potsdam Institute for Climate Impact Research (PIK), Member of the Leibniz Association(波茨坦气候影响研究所(PIK),莱比锡协会成员) Institute for Atmospheric and Climate Science, ETH Zurich(大气与气候科学研究所,苏黎世联邦理工学院) University of Zurich(苏黎世大学) University of Miami(迈阿密大学) Centre for Environmental Policy, Imperial College London(伦敦帝国理工学院环境政策中心) Grantham Institute for Climate Change and Environment, Imperial College London(伦敦帝国理工学院气候变化与环境研究所) Energy, Climate and Environment Program, International Institute for Applied Systems Analysis(国际应用系统分析研究所能源、气候与环境项目)

AI总结 本文探讨了AI在科学评估中的应用,通过与气候科学领域13名科学家的合作,测试了基于Gemini的AI环境在评估大西洋经向翻转环流(AMOC)稳定性方面的效果,展示了AI在加速科学流程和提升报告质量方面的贡献,同时强调了专家补充和监督的重要性。

详情
AI中文摘要

新兴的AI合作者范式专注于可重复验证的任务,其中智能体通过'猜测和检查'循环探索搜索空间。这一范式无法扩展到无法重复评估的问题,其真实情况由理论和现有证据的共识综合确定。我们评估了一个基于Gemini的AI环境,该环境旨在支持协作科学评估,并整合到标准科学流程中。与13名气候科学领域的科学家合作,我们测试了该系统在复杂主题——大西洋经向翻转环流(AMOC)稳定性上的表现。我们的结果表明,AI可以加速科学流程。小组通过104次修订循环,在不到46人小时内完成了79篇论文的综合综述。AI的贡献显著:大多数AI生成的内容保留在报告中。AI还帮助维持逻辑一致性和呈现质量。然而,专家补充至关重要,以确保其可接受性:报告中不到一半的内容由AI生成。此外,需要大量监督以扩展和提升内容到严谨的科学标准。

英文摘要

The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in 'guess and check' loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.

2512.03671 2026-05-21 cs.CL 版本更新

Generative AI Practices, Literacy, and Divides: An Empirical Analysis in the Italian Context

生成式AI实践、素养与差距:意大利情境下的实证分析

Beatrice Savoldi, Giuseppe Attanasio, Olga Gorodetskaya, Marta Marchiori Manerba, Elisa Bassignana, Silvia Casola, Matteo Negri, Tommaso Caselli, Luisa Bentivogli, Alan Ramponi, Arianna Muti, Nicoletta Balbo, Debora Nozza

发表机构 * FBK(弗兰克福-博洛尼亚研究中心)

AI总结 本研究通过实证分析探讨生成式AI的采用、素养及使用模式,揭示其在意大利情境下对不同群体的影响,发现数字素养是影响AI利用的关键因素,而非单纯使用与否。

详情
AI中文摘要

生成式AI(GenAI)聊天机器人通过对话界面的普及正在改变数字互动并具有经济潜力。然而,这些工具可能加深现有不平等——不仅通过不均等、社会分层的采用,还通过其有意识、批判性使用中的差异。基于对1906名意大利语使用者的原始调查数据,我们提供了对GenAI采用、素养和使用模式的全面分析。我们的发现表明,GenAI支持多样化个人和专业活动,并取代传统信息获取工具。然而,教育程度较低、年龄较大的人以及技术熟悉度较低的人更不可能采用它;40%的人将能力障碍视为主要障碍。在用户中,AI训练成为有意识、资本增强型参与的主要预测因素——内容创作、学习和创造力提升——而更被动、娱乐性的使用(如陪伴、信息寻求)则对能力水平不敏感。因此,我们强调数字素养是人们如何利用GenAI的关键因素,而非仅仅是否使用它。最后,性别在持续的交叉差距中发挥作用,影响采用和使用频率。这些发现挑战了高可及性意味着广泛共享收益的假设。相反,它们提供了GenAI时代新兴差异的细致、多层次的账户——对这种技术最终如何驱动结果和利益差距有影响。

英文摘要

The rise of generative AI (GenAI) chatbots accessible via conversational interfaces is transforming digital interactions and holds economic promise. However, these tools might deepen existing inequalities -- not only through uneven, socially stratified adoption, but through differentials in their purposeful, critical use. Drawing on original survey data from 1,906 Italian-speaking adults, we provide a comprehensive analysis of GenAI adoption, literacy, and usage patterns. Our findings show that GenAI is supporting diversified personal and professional activities and replacing traditional information-seeking tools. Yet less-educated and older individuals, and those with lower technology familiarity, are less likely to adopt it; 40% cite competence barriers as a key obstacle. Among users, AI training emerges as the primary predictor of purposeful, capital-enhancing engagement -- content creation, learning, and creativity enhancement -- while more passive, recreational uses (e.g., companionship, information seeking) remain insensitive to competence levels. We thus highlight digital literacy as a lever for how people leverage GenAI, not just whether they use it. Finally, gender operates as a persistent cross-cutting divide, shaping both adoption and usage frequency. These findings challenge the assumption that high accessibility translates into broadly shared gains. Rather, they offer a granular, multi-level account of emerging disparities in the GenAI era -- with implications for how this technology may ultimately drive outcomes and benefit divides.

2510.05942 2026-05-21 cs.CL cs.AI 版本更新

EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

EvalMORAAL: 可解释的链式推理与大语言模型道德对齐的LLM-as-Judge评估

Hadi Mohammadi, Anastasia Giachanou, Robert A. Bagheri

发表机构 * Department of Methodology and Statistics, Utrecht University(方法论与统计学系,乌得勒支大学)

AI总结 本文提出EvalMORAAL框架,通过两种评分方法和模型作为裁判的同行评审,评估20个大语言模型的道德对齐情况,发现西方与非西方地区存在显著的道德对齐差距。

Comments Accepted as a poster at *SEM 2026

详情
AI中文摘要

我们提出了EvalMORAAL,一个透明的链式推理(CoT)框架,使用两种评分方法(对数概率和直接评分)以及模型作为裁判的同行评审来评估20个大语言模型的道德对齐。我们对世界价值观调查(55个国家,19个主题)和PEW全球态度调查(39个国家,8个主题)进行了评估。使用EvalMORAAL,顶级模型与调查响应高度一致(WVS上的皮尔逊相关系数r≈0.90)。然而,我们发现明显的区域差异:西方地区平均r=0.82,而非西方地区平均r=0.61(绝对差距0.21),表明存在持续的区域对齐差距。我们的框架增加了三个部分:(1)为所有模型提供两种评分方法以实现公平比较,(2)带有自我一致性检查的结构化Co T协议,以及(3)一个模型作为裁判的同行评审,使用数据驱动的阈值标记348个冲突。同行同意与WVS调查对齐(r=0.74,p<.001;PEW r=0.39,n.s.),支持自动化质量检查。这些结果展示了文化意识AI的真实进展,同时突显了跨区域应用的开放挑战。

英文摘要

We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.

2509.17396 2026-05-21 cs.CL 版本更新

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

EpiCache: 为资源受限环境下的长对话提供 episodic KV 缓存管理

Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho

发表机构 * Apple(苹果公司)

AI总结 本文提出 EpiCache,一种无需训练的 KV 缓存管理框架,用于在固定内存预算下实现长对话问答(LongConvQA)。该方法通过块级预填充限制缓存增长,并通过 episodic KV 压缩保留主题相关上下文,从而在多个 LongConvQA 评估基准上提升了准确性并减少了延迟和峰值内存使用。

Comments ICML 2026

详情
AI中文摘要

现代大语言模型(LLMs)通过将上下文长度扩展到数百万个标记,能够生成连贯且个性化的响应,这些响应基于长期对话历史。然而,随着对话历史的延长,Key-Value(KV)缓存以线性方式增长,导致模型的内存足迹迅速超过设备限制。尽管最近的 KV 缓存压缩方法试图减少内存使用,但大多数方法在处理完整上下文后才进行缓存驱逐,导致无界峰值内存使用。此外,查询依赖的驱逐方法将缓存语义限制为单个查询,导致多轮对话中的失败案例。在本文中,我们引入 EpiCache,一种无需训练的 KV 缓存管理框架,用于在固定内存预算下实现长对话问答(LongConvQA)。EpiCache 通过块级预填充限制缓存增长,并通过 episodic KV 压缩保留主题相关上下文,该方法将对话历史划分为连贯的篇章,并执行篇章特定的 KV 缓存驱逐。在三个 LongConvQA 评估基准(LongMemEval、Realtalk 和 LoCoMo)上,EpiCache 将准确性提高了最高 30%,在 4-6 倍压缩下实现了接近满缓存的准确性,并将延迟和峰值内存分别减少了最高 2.4 倍和 3.7 倍。

英文摘要

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model's memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure cases in multi-turn conversations. In this paper, we introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across three LongConvQA benchmarks (LongMemEval, Realtalk, and LoCoMo), EpiCache improves accuracy by up to 30%, achieves near full-cache accuracy under 4-6x compression, and reduces latency and peak memory by up to 2.4x and 3.7x, respectively.

2506.08277 2026-05-21 q-bio.NC cs.AI cs.CL cs.CV cs.LG 版本更新

Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli

基于任务的指令调制多模态大语言模型探测:在自然主义刺激下的区域特定大脑对齐模式

Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta

发表机构 * Technische Universität Berlin(柏林技术大学) Rice University(Rice 大学) AWS AI Labs, Amazon(Amazon 人工智能实验室) IIT Delhi(德里理工学院) University of Wisconsin - Madison(威斯康星大学麦迪逊分校) Spector Inc(Spector 公司) IIIT-Hyderabad(海得拉巴理工学院) Microsoft(微软)

AI总结 本研究探讨了指令调制多模态大语言模型在自然主义刺激下的大脑对齐模式,通过比较不同模型在视频和音频任务中的表现,揭示了指令调制对模型表示能力的影响。

Comments 57 pages, 39 figures

详情
AI中文摘要

近期的体素级多模态脑编码研究显示,多模态大语言模型(MLLMs)在大脑对齐程度上高于单模态模型。更近期的研究表明,指令调制多模态(IT)模型能够生成与大脑活动强相关的任务特定表示,但大多数先前评估集中在单模态刺激或非指令调制模型上。我们仍然缺乏对指令调制是否使IT-MLLMs围绕功能任务需求组织其表示,还是仅反映表面语义的清晰理解。为此,我们通过预测自然主义电影观看(带音频的视频)期间记录的fMRI响应,来估计大脑对齐情况。使用来自六个视频和两个音频IT-MLLMs的指令特定嵌入,跨13个视频任务指令,我们发现指令调制视频MLLMs的大脑对齐程度高于上下文学习(ICL)多模态模型(~9%)、非指令调制多模态模型(~15%)和单模态基线(~20%)。我们对视频和音频任务以及语言引导的探测评估,产生了不同任务特定的MLLM表示,这些表示在不同大脑区域中变化。我们还发现,ICL模型表现出强语义组织(r=0.78),而IT模型与指令文本语义的耦合较弱(r=0.14),这与与更高大脑对齐相关的任务条件子空间一致。这些发现支持了任务特定指令与更强的大脑-MLLM对齐之间的关联,并为映射两个系统中的联合信息处理开辟了新途径。我们公开了代码 [https://github.com/subbareddy248/mllm_videos]。

英文摘要

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or if they simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs show higher brain alignment than in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, and language-guided probing produces distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].

2305.09620 2026-05-21 cs.CL cs.AI cs.LG 版本更新

AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

AI增强的调查:利用大型语言模型和调查进行意见预测

Junsol Kim, Byungkyu Lee

发表机构 * Department of Sociology(社会学系) University of Chicago(芝加哥大学) New York University(纽约大学) Chicago, IL(伊利诺伊州芝加哥市) New York, NY(纽约州纽约市)

AI总结 本文提出了一种基于大型语言模型的框架,通过结合问题、受访者和调查时期的嵌入表示,预测重复横断面调查中缺失的响应,从而弥补传统调查在捕捉历史变化方面的不足。

详情
AI中文摘要

全国代表性调查追踪公众意见,但每年只询问有限的问题,限制了其捕捉历史变化的潜力。为填补这一空白,我们开发了一个基于大型语言模型(LLM)的框架,通过结合问题、受访者和调查时期的嵌入表示,预测重复横断面调查中缺失的响应。我们引入了LLM在调查研究中的两个新应用:回溯预测(预测年度层面的缺失意见)和未询问意见预测(预测完全缺失的意见)。使用1972-2021年一般社会调查的数据,我们的LLM模型在交叉验证和在GSS未询问的年份中通过其他组织测量的公众意见方面表现良好。这些能力使我们能够恢复缺失的趋势并确定公众态度变化的时间,例如同性婚姻支持率的上升。然而,未询问意见预测的性能仍较为有限。我们展示了当我们的模型优于现有基准时的情况,检验了哪些意见和受访者更具可预测性,并评估了我们的方法是否减少了LLM预测响应的同质化倾向。我们的研究证明了LLM和调查可以相互增强:LLM扩大了调查的潜力,而调查则校准LLM以模拟人类意见。

英文摘要

Nationally representative surveys track public opinion, yet they ask only a limited set of questions each year, limiting its potential to capture historical changes. To fill this gap, we develop a large language model (LLM)-based framework for predicting missing responses in repeated cross-sectional surveys by incorporating embeddings for questions, respondents, and survey periods. We introduce two new applications of LLMs to survey research: retrodiction (predicting year-level missing opinions) and unasked opinion prediction (predicting entirely missing opinions). Using data from the 1972-2021 General Social Surveys, our LLM-based models perform strongly in retrodicting masked GSS opinions through cross-validation and public opinions measured by other organizations in years when the GSS did not ask them. These capabilities enable us to recover missing trends and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. However, performance remains modest for unasked opinion prediction. We show when our models outperform established benchmarks, examine which opinions and and respondents are more predictable, and evaluate whether our approach reduces LLMs' tendency to homogenize predicted responses. Our study demonstrates that LLMs and surveys can mutually enhance each other: LLMs broaden survey potential, while surveys calibrate LLMs for simulating human opinions.