arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.21481 2026-05-21 cs.AI cs.CL cs.LG 版本更新

AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

AiraXiv：一个面向人类和AI科学家的AI驱动的开放获取平台

Junshu Pan, Panzhong Lu, Yixuan Weng, Qiyao Sun, Fang Guo, Zijie Yang, Qiji Zhou, Yue Zhang

发表机构 * Westlake University（西湖大学）； Zhejiang University（浙江大学）； Shanghai Innovation Institution（上海创新研究院）； Zhongguancun Academy（中关村学院）

AI总结本文提出AiraXiv平台，通过AI驱动的开放预印本、AI增强的分析与评审以及读者反馈，解决传统学术出版系统在AI时代面临的研究产出增长和可扩展性挑战。

详情

AI中文摘要

近年来，人工智能（AI）的进步加速了人类和AI生成的研究产出的增长，对传统学术出版系统施加了越来越大的压力，并在提交量增加、评审工作量和会议规模扩大时挑战了以会议和期刊为中心的可扩展性。为了解决这些挑战，我们探索了一个AI时代的出版范式，其中人类和AI科学家作为作者和读者参与，并通过持续反馈驱动的迭代使论文不断发展。我们提出了AiraXiv，一个基于开放预印本、AI增强的分析和评审以及读者反馈的AI驱动的开放获取平台。AiraXiv通过交互式UI支持人类科学家，通过基于模型上下文协议（MCP）的交互支持AI科学家。通过实际部署验证了AiraXiv，包括作为IC AIS 2025的提交平台，展示了其作为AI时代快速、包容和可扩展的研究基础设施的潜力。AiraXiv在https://airaxiv.com上公开可用。

英文摘要

Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at https://airaxiv.com.

URL PDF HTML ☆

赞 0 踩 0

2605.21468 2026-05-21 cs.LG cs.CL 版本更新

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

你只需要最小的RLVR训练：通过秩-1轨迹来扩展LLMs

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng

发表机构 * University of Virginia（弗吉尼亚大学）； Washington University in St. Louis（华盛顿大学圣路易斯分校）

AI总结本文研究了通过秩-1轨迹扩展LLMs的方法，发现RLVR参数轨迹具有极低的秩和高度可预测性，并提出RELEX方法，通过简单的线性回归在无需训练模型的情况下实现高效的超量扩展。

Comments preprint. Code: https://github.com/weizhepei/RELEX

详情

AI中文摘要

可验证奖励的强化学习（RLVR）已成为改进大语言模型（LLMs）推理能力的主要范式，但其底层几何结构仍待探索。本文证明RLVR权重轨迹具有极低的秩且高度可预测。具体而言，我们发现大多数下游性能提升可通过参数增量的秩-1近似来捕捉，其中该投影的幅度与训练步数近似线性增长。受此启发，我们提出了一种简单且计算高效的RELEX（强化学习扩展）方法，通过从短观察窗口估计秩-1子空间并利用线性回归进行超量扩展，无需任何训练模型。在三个模型（即Qwen2.5-Math-1.5B、Qwen3-4B-Base和Qwen3-8B-Base）上，RELEX生成的检查点在领域内和领域外基准测试中表现匹配或优于RLVR性能，仅需完整RLVR训练的15%步数。令人惊讶的是，RELEX能在无训练成本的情况下超量扩展远超观察窗口，预测检查点多达10-20倍于观察前缀，并持续改进（例如，仅观察前50步并扩展到1000步）。我们的消融分析证实了RELEX的极简充分性：增加子空间秩或采用非线性建模不会进一步提升超量扩展效果。最后，我们显示RELEX的成功源于“去噪”效应：通过将更新投影到秩-1子空间，模型会丢弃那些会降性能的随机优化噪声。我们的代码可在https://github.com/weizhepei/RELEX获取。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.

URL PDF HTML ☆

赞 0 踩 0

2605.21467 2026-05-21 cs.LG cs.CL 版本更新

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

DelTA: 一种用于可验证奖励强化学习的判别性token信用分配

Kaiyi Zhang, Wei Wu, Yankai Lin

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学人工智能学院 Gallagher 学院）； Ant International（蚂蚁国际）

AI总结本文提出DelTA方法，通过估计token系数来增强特定侧的token梯度方向，从而改进可验证奖励强化学习中的token概率更新，提升了模型在数学基准测试中的性能。

详情

AI中文摘要

可验证奖励强化学习（RLVR）已成为提升大语言模型推理能力的核心技术。尽管其有效性已得到认可，但响应级奖励如何转化为token级概率变化仍缺乏深入理解。本文引入了RLVR更新的判别视角，表明策略梯度更新方向隐式地作为token梯度向量的线性判别器，从而决定学习过程中哪些token概率被增加或减少。在标准序列级RLVR中，该判别器由通过优势加权平均得到的正负侧质心构成。然而，此类质心构建可能被共享的高频模式（如格式token）主导，稀释了稀疏但判别性强的方向，这些方向更能区分高分响应与低分响应。为解决这一限制，本文提出DelTA，一种判别性token信用分配方法，通过估计token系数来放大侧特定的token梯度方向并降低共享或弱判别性的方向。这些系数重新加权了自我归一化的RLVR替代方案，使有效的侧向质心更具对比性，从而重塑RLVR更新方向。在七个数学基准测试中，DelTA在Qwen3-8B-Base和Qwen3-14B-Base上分别比最强的同规模基线高出3.26和2.62个平均分。此外，代码生成、不同backbone和域外评估的额外结果进一步展示了DelTA的泛化能力。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose $\textbf{DelTA}$, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.

URL PDF HTML ☆

赞 0 踩 0

2605.21465 2026-05-21 cs.CL cs.SE 版本更新

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

利用大语言模型进行语法适应：关于元模型-语法共演的研究

Weixing Zhang, Bowen Jiang, Rahul Sharma, Regina Hebig, Daniel Strüber

发表机构 * Universität Rostock（罗斯托克大学）； Chalmers University of Technology and University of Gothenburg（皇家理工学院和哥德堡大学）； Radboud University（拉德堡德大学）

AI总结本文研究了如何利用大语言模型自动适应语法，通过学习先前版本的语法适应来实现自动适应，同时探讨了在复杂语法场景下的优势与局限性。

详情

AI中文摘要

在模型驱动工程中，元模型的演变导致需要相应地调整语法以保持一致性，这通常需要繁琐的手动工作。现有的基于规则的方法可以实现部分自动化，但在处理复杂语法场景时存在限制。本文提出了一种基于大语言模型的方法，通过学习先前版本的语法适应来自动应用于新语法。该方法在六个真实世界Xtext领域特定语言上进行了评估，使用四个DSL作为训练集开发提示策略，两个DSL作为测试集进行验证，并在QVTo上进行纵向案例研究。评估使用了三个大型语言模型（Claude Sonnet 4.5、ChatGPT 5.1、Gemini 3），并从三个维度测量语法适应质量：语法规则层面的适应一致性、输出相似性和元模型符合性。结果表明，在测试集上，所有三个LLM都实现了100%的适应一致性和输出相似性，而基于规则的方法在DOT上仅达到84.21%，在Xcore上为62.50%。在QVTo的纵向研究中，基于LLM的方法在所有三个演变步骤中成功重用了学习的适应，而基于规则的方法在两个演变步骤中需要手动调整。然而，在大规模语法（EAST-ADL，297条规则）上，LLM的适应一致性远低于90%。本研究展示了基于LLM的方法在处理复杂语法场景中的优势，同时揭示了其在大规模语法适应中的局限性。

英文摘要

In model-driven engineering, metamodel evolution leads to the need to adapt corresponding grammars to maintain consistency, which typically requires tedious manual work. Existing rule-based methods can achieve partial automation but have limitations when handling complex grammar scenarios. This paper proposes a Large Language Model-based approach that automatically applies adaptations to new grammars after evolution by learning grammar adaptations from previous versions. We evaluated this approach on six real-world Xtext domain-specific languages, using four DSLs as a training set to develop prompting strategies, two DSLs as a test set for validation, and conducting a longitudinal case study on QVTo. The evaluation used three Large Language Models (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) and measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance. Results show that on the test set, all three LLMs achieved 100% adaptation consistency and output similarity, while the rule-based approach achieved only 84.21% on DOT and 62.50% on Xcore. In the QVTo longitudinal study, the LLM-based approach successfully reused learned adaptations across all three evolution steps without manual grammar editing, while the rule-based approach required manual adjustments in two of three transitions. However, on large-scale grammars (EAST-ADL, 297 rules), LLMs' adaptation consistency was far below 90%. This study demonstrates the advantages of LLM-based approaches in handling complex grammar scenarios, while revealing their limitations in large-scale grammar adaptation.

URL PDF HTML ☆

赞 0 踩 0

2605.21463 2026-05-21 cs.CL cs.AI 版本更新

Mem-$π$: Adaptive Memory through Learning When and What to Generate

Mem-$π$: 通过学习何时以及生成什么来实现自适应记忆

Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste, Spandana Gella, Bang Liu, Perouz Taslakian

发表机构 * ServiceNow AI Research（ServiceNow AI研究院）； Mila -- Quebec AI Institute（魁北克AI研究所）； McGill University（麦吉尔大学）； CIFAR AI Chair（CIFAR人工智能主席）

AI总结 Mem-$π$ 通过学习在何时以及生成什么来实现自适应记忆，利用专门的语言或视觉-语言模型生成上下文特定的指导，从而在多种代理任务中优于基于检索和先前RL优化的记忆基线。

Comments Work in progress

详情

AI中文摘要

我们提出了Mem-$π$，一种用于大语言模型（LLM）代理的自适应记忆框架，其中有用的指导是按需生成而非从外部内存存储中检索。现有的记忆增强代理通常依赖于从事件记忆库或技能库中基于相似性的检索，返回静态条目，这些条目往往与当前上下文不一致。相比之下，Mem-$π$ 使用一个具有自身参数的专用语言或视觉-语言模型，与下游代理分开，以生成复杂任务的上下文特定指导。在当前代理上下文中，模型联合决定何时生成指导以及生成什么指导。我们通过决策-内容解耦的强化学习（RL）目标对其进行训练，使其能够避免在生成不会有所帮助的情况，并在其他情况下生成简洁有用的信息。在涵盖网页导航、基于终端的工具使用和基于文本的具身交互等多样代理基准上，Mem-$π$ 一致优于基于检索和先前RL优化的记忆基线，实现网页导航任务超过30%的相对提升。

英文摘要

We present Mem-$π$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$π$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$π$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

URL PDF HTML ☆

赞 0 踩 0

2605.21403 2026-05-21 cs.CL 版本更新

Quantifying the cross-linguistic effects of syncretism on agreement attraction

量化语言间合成影响对一致性的吸引力

Utku Turk, Eva Neu

发表机构 * University of Maryland, College Park（马里兰大学学院公园分校）； University of Massachusetts, Amherst（马萨诸塞大学阿默斯特分校）

AI总结研究探讨了合成对一致性吸引力的影响在不同语言中的差异，通过大规模语言模型的 surprisal 和注意力熵来分析四种语言中的表现，揭示了合成如何调节吸引力的机制。

Comments SCiL Conference Paper

2605.21391 2026-05-21 cs.CL 版本更新

Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy

通过条件尺度熵理解解码器-only语言模型中的隐喻处理

Lawhori Chakrabarti, Jennifer Johnson-Leung, Bert Baumgaertner, Aleksandar Vakanski, Min Xian, Boyu Zhang

发表机构 * Department of Computer Science, University of Idaho（计算机科学系，爱达荷大学）； Department of Mathematics and Statistical Science, University of Idaho（数学与统计科学系，爱达荷大学）； Department of Politics and Philosophy, University of Idaho（政治与哲学系，爱达荷大学）； Department of Nuclear Engineering and Industrial Management, University of Idaho（核工程与工业管理系，爱达荷大学）

AI总结研究探讨了解码器-only语言模型中隐喻处理的机制，通过条件尺度熵（CSE）分析不同层位的频率尺度变化，发现隐喻性词 token 在连续层位上产生显著更高的频谱宽度，且该效应不受语义复杂度或命题内容影响，验证了多尺度协调作为隐喻语言处理的特征。

Comments 18 pages, 3 figures, submitted to ICPR workshop

详情

AI中文摘要

隐喻要求语言模型解析一个词的上下文意义与基本字面意义相异。理解transformer模型如何在深度层面组织这种重新解释仍然是机械可解释性中的开放问题。我们引入了条件尺度熵（CSE），这是一种基于小波的度量，用于衡量transformer计算在每个层位置上跨频率尺度的广泛参与程度。两个定理证明了CSE对更新幅度具有不变性，从而将更新的结构模式与强度分离。使用CSE，我们发现隐喻性词在测试的每种解码器-only架构中（从124M到20B参数，包括GPT-2家族、LLaMA-2 7B、GPT-oss 20B）在连续层位置上产生显著更高的频谱宽度。该效应经聚类置换校正后仍然存在，且在模型的早期到中期相对深度范围内重现，并通过独立分析200对自然VUA对收敛。特定性控制进一步显示，该效应不被语义复杂度或匹配的命题内容所解释。这些结果将多尺度协调确定为所考察的解码器-only架构中隐喻语言处理的一致特征，并确立CSE作为表征transformer跨深度结构的原理性工具。

英文摘要

Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.

URL PDF HTML ☆

赞 0 踩 0

2605.21384 2026-05-21 cs.SE cs.AI cs.CL 版本更新

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

SpecBench: 评估长周期编码代理中的奖励黑客现象

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang

发表机构 * Weco AI

AI总结该研究通过分解软件工程任务，提出了一种评估长周期编码代理中奖励黑客现象的方法，通过比较可见测试套件和隐藏测试套件的通过率差异，引入了SpecBench基准，展示了奖励黑客现象在不同任务长度上的显著影响。

详情

AI中文摘要

随着长周期编码代理生成的代码量超过任何开发者能够审查的范围，监督责任集中于单一表面：自动测试套件。奖励黑客现象自然出现在这种设置中，因为代理在优化通过测试的同时偏离了用户的真正目标。我们通过将软件工程任务分解为三个部分来研究这种奖励黑客现象：(i) 规格的自然语言描述，(ii) 可见验证测试套件，用于单独测试指定功能，以及 (iii) 隐藏测试套件，用于组合这些相同功能以模拟真实世界使用。基于规格和可见验证测试套件，一个真实的代理能够生成一个能够通过所有隐藏测试套件的解决方案。因此，我们使用这两个套件之间的通过率差异来量化奖励黑客现象。基于这种方法，我们引入了SpecBench，一个包含30个系统级编程任务的基准，从短周期任务如构建JSON解析器到超长周期任务如从头构建整个操作系统内核。大规模实验揭示了一种一致的模式：尽管每个前沿代理都能饱和可见套件，奖励黑客现象仍然存在，较小的模型在隐藏套件上表现出更大的差距。差距也随着任务长度急剧增加：代码规模每增加十倍，差距增长28个百分点。失败范围从微妙的功能隔离到有意的利用，包括一个2,900行的哈希表“编译器”，它记忆测试输入。SpecBench提供了一个原则性的测试平台，用于测量编码代理是构建真正的可运行系统还是仅仅在开发人员提供的测试套件上玩游戏。

英文摘要

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the users true goal. We study this reward hacking phenomenon by decompose software engineering tasks into three parts: (i) a natural language description of the specification (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short horizon tasks like building a JSON parser to ultra long horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on holdout suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.

URL PDF HTML ☆

赞 0 踩 0

2605.21369 2026-05-21 cs.CL 版本更新

Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities

第五届多语言指代消解共享任务成果：扩展长距离实体的数据集

Michal Novák, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman

发表机构 * Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics（查理大学，数学与物理学院，形式与应用语言学研究所）； University of West Bohemia, Faculty of Applied Sciences, Department of Computer Science and Engineering（西波赫大学，应用科学学院，计算机科学与工程系）

AI总结本文总结了第五届多语言指代消解共享任务的成果，介绍了通过增加五个新数据集和两种新语言扩展数据集，以解决长距离实体的指代消解问题，并展示了传统系统与基于LLM的方法在任务中的表现。

Comments Accepted to CODI-CRAC 2026

详情

AI中文摘要

本文描述了与CODI-CRAC 2026研讨会同期举行的第五届多语言指代消解共享任务。在之前的版本基础上，该任务要求参与者开发能够识别提及并基于身份进行指代聚类的系统。2026版特别强调长距离实体，即跨越多个单词和句子的指代链。该任务通过加入五个新数据集和两种新语言扩展了语言范围，这些数据集利用了版本1.4的CorefUD，这是一个包含19种语言27个数据集的和谐多语言集合。总共十个系统参与了该任务，包括四个基于LLM的方法（三个微调模型和一个少样本方法）。尽管传统系统仍保持领先，但LLM显示出显著潜力，表明它们可能在未来的版本中挑战现有方法。

英文摘要

This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may soon challenge established approaches in future editions.

URL PDF HTML ☆

赞 0 踩 0

2605.21362 2026-05-21 cs.CL 版本更新

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

LASH：适应性语义混合用于大语言模型的黑盒劫持

Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

发表机构 * University of Maine（缅因大学）； University of Tennessee, Knoxville（田纳西大学, 基纳顿分校）； University of Florida（佛罗里达大学）

AI总结本文提出LASH框架，通过适应性语义混合方法，利用多个基础攻击的输出作为可重用的种子提示，针对不同目标模型和有害类别进行自适应组合，从而在黑盒红队测试中取得更高的攻击成功率。

详情

AI中文摘要

劫持攻击暴露了对齐的大语言模型预期安全行为与对抗性提示下行为之间的持续差距。现有自动化方法日益有效，但每个方法都局限于单一攻击家族（例如，一个细化循环、一个树搜索、一个突变空间或一个策略库），并且没有单一家族主导：表现最好的方法会根据目标模型和有害类别而变化，这表明互补优势可以通过每个提示的组合来利用。我们介绍了LASH（LLM适应性语义混合），一个黑盒框架，将多个基础攻击的输出视为可重用的种子提示，并针对每个目标请求自适应地组合它们。给定一个种子池，LASH搜索种子子集和softmax归一化的混合权重；组合模块合成一个候选提示，而无导数遗传优化器通过黑盒目标反馈和一个两阶段适应度函数（结合基于关键词的拒绝检测与LLM判官评分）更新权重。在包含100个有害提示的10个类别的JailbreakBench上，我们评估了LASH在六个常见目标模型上的表现。LASH在基于关键词的评估中平均攻击成功率为84.5%，在两阶段评估中为74.5%，其中响应首先被过滤以拒绝，然后由LLM判官评分是否实质上履行了原始有害请求。LASH在两个指标上均优于五个最先进的基线方法，仅使用30次平均目标查询。LASH还在三种防御机制下保持竞争力，并诱导出更多成功似内部表示。这些结果表明，跨异构劫持策略的适应性组合是黑盒红队测试的一个有前途的方向。

英文摘要

Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.

URL PDF HTML ☆

赞 0 踩 0

2605.21338 2026-05-21 cs.CL 版本更新

追踪大型语言模型中类人推理的持续涌现

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, Evelina Leivada

发表机构 * Departament de Filologia Catalana, Universitat Autònoma de Barcelona（加泰罗尼亚语言系系，巴塞罗那自治大学）； Institut für Psychologie, Humboldt-Universitat zu Berlin（柏林洪堡大学心理学研究所）； Institució Catalana de Recerca i Estudis Avançats (ICREA)（加泰罗尼亚高级研究与高级教育研究所）

AI总结研究探讨了大型语言模型在条件推理任务中是否具备类人推理能力，发现人类通过语用推理丰富逻辑推理，而模型行为更不稳定，部分模型遵循条件语义但忽视语用推理，表明LLMs在语义准确性上表现良好，但缺乏人类推理中的语用丰富性。

详情

AI中文摘要

人类能够超越字面意义：如果你修剪草坪，我会给你五十美元，通常被理解为说话者只在草坪修剪时支付，而如果你饿了，烤箱里有披萨，意味着披萨无论听者是否饥饿都可用。大型语言模型（LLMs）在许多任务上表现出类人性能，但尚不清楚它们是否像人类一样推理。为此，我们进行了一项人口匹配实验，评估了25个LLMs在四种语言中计算条件推理的能力，并与每种语言中等数量的人类进行比较。我们发现，人类通过跨语言的语用推理丰富逻辑推理。模型行为更具变异性。一些LLMs完全遵循条件语义的真值表，但忽视语用推理，而另一些LLMs偏离真值表，坚持单一解释，从而反映准确的规则处理但不具有类人推理能力。总体而言，LLMs是准确的语义运算符，但未能捕捉到人类推理中特有的语用丰富性。关键的是，LLM的准确性既不被开放与封闭状态、训练方向或架构类型所预测或提升，表明语用推理仍然是人工系统认知工具包中正在兴起的能力。

英文摘要

Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas If you are hungry, there is pizza in the oven implies that pizza is available regardless of the hearers hunger. Large Language Models - LLMs - show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

URL PDF HTML ☆

赞 0 踩 0

2605.15156 2026-05-21 cs.CL cs.AI cs.LG 版本更新

MeMo: Memory as a Model

MeMo：记忆作为模型

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama

发表机构 * Institute of Data Science, National University of Singapore（数据科学研究院，新加坡国立大学）； Integrative Sciences and Engineering Programme, NUSGS（整合科学与工程计划，NUSGS）； Agency for Science, Technology, Research (A*STAR)（科技研究局（A*STAR））； Department of Computer Science, National University of Singapore（计算机科学系，新加坡国立大学）； University of Tokyo（东京大学）； Liquid AI ； CSAIL, Massachusetts Institute of Technology（CSAIL，麻省理工学院）； AI Singapore ； Singapore-MIT Alliance for Research and Technology Centre, Singapore（新加坡-麻省理工学院研究与技术中心，新加坡）

AI总结本文提出MeMo框架，通过在不改变LLM参数的情况下将新知识编码到专用记忆模型中，解决了大型语言模型在需要及时领域特定信息的应用中的问题，同时具备处理复杂跨文档关系、抗检索噪声、避免灾难性遗忘、无需访问LLM权重或输出logits以及检索成本与语料库大小无关等优势。

Comments MeMo augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise

详情

AI中文摘要

大型语言模型（LLMs）在广泛的任务上表现出色，但预训练后保持冻结状态，直到后续更新。许多现实应用需要及时、领域特定的信息，这促使需要高效的机制来整合新知识。在本文中，我们介绍MeMo（Memory as a Model），一个模块化框架，能够将新知识编码到专用的记忆模型中，同时保持LLM参数不变。与现有方法相比，MeMo具有几个优势：（a）它能够捕捉复杂的跨文档关系；（b）它对检索噪声具有鲁棒性；（c）它避免了LLM中的灾难性遗忘；（d）它不需要访问LLM的权重或输出logits，从而能够与开源和专有闭源LLM进行即插即用式集成；（e）其检索成本在推理时间与语料库大小无关。我们在三个基准测试集BrowseComp-Plus、NarrativeQA和MuSiQue上的实验结果表明，MeMo在多种设置中相比现有方法表现优异。

英文摘要

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

URL PDF HTML ☆

赞 0 踩 0

2605.10933 2026-05-21 cs.LG cs.CL 版本更新

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

DECO：稀疏专家混合模型在端侧设备上实现密集级性能

Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu

发表机构 * Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China（计算机科学与技术系，人工智能研究院，清华大学，北京，中国）

AI总结本文提出DECO，一种稀疏的专家混合模型，能够在相同的总参数预算和训练token数量下，实现与密集Transformer相当的性能，并在端侧设备上实现高效部署。

Comments 15 pages, 10 figures, 12 tables

详情

AI中文摘要

尽管混合专家（MoE）可以在不按比例增加计算的情况下扩展模型容量，但其巨大的总参数足迹造成了显著的存储和内存访问瓶颈，阻碍了同时需要高性能、低计算成本和小存储开销的端侧部署。为了实现这些特性，我们提出了DECO，一种稀疏的MoE架构，旨在在相同的总参数预算和训练token数量下匹配密集Transformer的性能。DECO利用可微且灵活的基于ReLU的路由增强，通过可学习的专家级缩放，自适应地平衡路由和共享专家的贡献。此外，我们引入了NormSiLU，一种在SiLU操作前对输入进行归一化的激活函数，产生更稳定的路由专家激活比率趋势和更高的内在稀疏性水平。我们还发现使用非门控MLP专家与基于ReLU的路由具有经验优势，表明MoE架构可能简化。实验表明，DECO仅激活20%的路由专家，能够实现密集性能，并优于现有的MoE基线。我们的专用加速内核在Jetson AGX Orin上相比密集推理实现了2.93倍的加速。代码和检查点可在https://github.com/thunlp/DECO获取。

英文摘要

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of routed experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 2.93$\times$ speedup on Jetson AGX Orin compared with dense inference. Code and checkpoints are available at https://github.com/thunlp/DECO.

URL PDF HTML ☆

赞 0 踩 0

2605.09492 2026-05-21 cs.CL cs.AI 版本更新

SHINE：一种可扩展的上下文超网络，用于在单次传递中将上下文映射到LoRA

Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang

发表机构 * Institute for Artificial Intelligence, Peking University（北京大学人工智能研究院）； Technion, NVIDIA（技术学院与NVIDIA）； University of Oxford（牛津大学）； School of Electronics Engineering（电子工程学院）； Computer Science, Peking University（计算机科学，北京大学）

AI总结本文提出SHINE，一种可扩展的上下文超网络，用于在单次传递中将多样且有意义的上下文映射到高质量的LoRA适配器，通过重用冻结LLM的自身参数和引入架构创新，克服了先前超网络的关键限制，以较少的参数实现了强大的表达能力。

详情

AI中文摘要

我们提出SHINE（可扩展的上下文超网络），一种可扩展的超网络，能够将多样且有意义的上下文映射到高质量的LoRA适配器，用于大型语言模型（LLMs）。通过在上下文超网络设计中重用冻结LLM的自身参数，并引入架构创新，SHINE克服了先前超网络的关键限制，以相对较少的参数实现了强大的表达能力。我们引入了预训练和指令微调流水线，并训练我们的超网络在单次前向传递中从多样且有意义的上下文中生成高质量的LoRA适配器。它在不进行微调的情况下更新LLM参数，并立即启用与上下文相关的复杂问答任务，而无需直接访问上下文，有效地将上下文知识转换为参数知识。我们的工作在各种任务上取得了出色的结果，相比基于SFT的LLM适应方法，大大节省了时间和计算和内存成本，并展示了良好的可扩展性潜力。我们的代码可在https://github.com/MuLabPKU/SHINE获取。

英文摘要

We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/MuLabPKU/SHINE

URL PDF HTML ☆

赞 0 踩 0

2602.04916 2026-05-21 q-bio.QM cs.CL 版本更新

AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

AFD-INSTRUCTION: 一个全面的抗体指令数据集，具有功能注解，用于基于LLM的理解和设计

Ling Luo, Wenbin Jiang, Hongyuan Chang, Xinkang Wang, Xushi Zhang, Yueting Xiong, Mengsha Tong, Rongshan Yu

发表机构 * National Institute for Data Science in Health and Medicine（健康医学数据科学国家研究院）； Institute of Artificial Intelligence（人工智能研究院）； School of Life Sciences（生命科学学院）； School of Informatics（信息学院）； State Key Laboratory of Vaccines for Infectious Diseases（传染病疫苗国家重点实验室）； Xiang An Biomedicine Laboratory（翔安生物医学实验室）

AI总结本文提出AFD-INSTRUCTION数据集，通过功能注解提升LLM在抗体理解与设计中的性能，为抗体建模和治疗发现提供新基础。

详情

AI中文摘要

大型语言模型（LLMs）在蛋白质表示学习方面显著进步。然而，其通过自然语言解释和设计抗体的能力仍然有限。为解决这一挑战，我们提出了AFD-Instruction，首个大规模指令数据集，专门针对抗体进行功能注解。该数据集包含两个关键部分：抗体理解，直接从序列推断功能属性；抗体设计，允许在功能约束下生成新的序列。这些部分提供了显式的序列-功能对齐，并支持由自然语言指令引导的抗体设计。在通用LLM上的广泛指令微调实验表明，AFD-Instruction在各种抗体相关任务中 consistently 提高了性能。通过将抗体序列与功能的文本描述链接，AFD-Instruction为推进抗体建模和加速治疗发现建立了新基础。

英文摘要

Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.

URL PDF HTML ☆

赞 0 踩 0

2510.04905 2026-05-21 cs.SE cs.CL 版本更新

Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches

检索增强的代码生成：聚焦于仓库级方法的综述

Yicheng Tao, Yuante Li, Yao Qin, Yepang Liu

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Chinese University of Hong Kong（香港中文大学）； Southern University of Science and Technology（南方科技大学）

AI总结本文综述了检索增强的代码生成方法，重点探讨仓库级方法，分析了其在大规模代码生成中的挑战与解决方案，总结了现有方法的分类框架及关键挑战。

详情

AI中文摘要

近年来，大型语言模型（LLMs）的进展显著提升了自动化代码生成的能力。尽管现有方法在函数和文件级别上表现优异，但现实中的软件工程需要对整个仓库进行推理，包括跨文件依赖、不断演变的执行环境和全局语义一致性。这一挑战催生了仓库级代码生成（RLCG），其中模型必须检索、组织并利用仓库级上下文以生成连贯且可执行的代码变更。为解决这些挑战，检索增强生成（RAG）已成为仓库级代码智能的重要范式。本文综述了检索增强代码生成（RACG），特别关注仓库级方法。不同于将RACG视为静态的“检索后生成”流程，我们将其视为一个耦合且不断演变的过程，涉及上下文构建、检索优化、生成和环境交互。通过统一的分析框架，我们组织现有方法，涵盖检索子系统、控制机制和评估设置。基于此框架，我们系统地考察了检索策略、基于图和非基于图的检索范式、训练驱动的优化以及自主代理架构。我们进一步总结了广泛使用的数据集、基准和系统配置，并讨论了关键挑战，包括可扩展性、可靠性、效率以及RACG与长上下文LLMs之间的必要性边界。通过本文综述，我们旨在为快速发展的RACG领域提供结构化的理解，并突出未来人工智能驱动的软件工程研究的有前景方向。

英文摘要

Recent advances in large language models (LLMs) have significantly improved automated code generation. While existing approaches have achieved strong performance at the function and file levels, real-world software engineering requires reasoning over entire repositories, including cross-file dependencies, evolving execution environments, and global semantic consistency. This challenge has led to the emergence of Repository-Level Code Generation (RLCG), where models must retrieve, organize, and utilize repository-scale context to generate coherent and executable code changes. To address these challenges, Retrieval-Augmented Generation (RAG) has become an increasingly important paradigm for repository-level code intelligence. In this survey, we present a comprehensive review of Retrieval-Augmented Code Generation (RACG), with a particular focus on repository-level approaches. Rather than viewing RACG as a static ``retrieve-then-generate'' pipeline, we characterize it as a coupled and evolving process involving context construction, retrieval optimization, generation, and environment interaction. We organize existing methods through a unified analytical framework spanning retrieval substrate, control regime, and evaluation setting. Based on this framework, we systematically examine retrieval strategies, graph-based and non-graph-based retrieval paradigms, training-driven optimizations, and autonomous agent architectures. We further summarize widely used datasets, benchmarks, and system configurations, and discuss key challenges including scalability, reliability, efficiency, and the necessity boundary between RACG and long-context LLMs. Through this survey, we aim to provide a structured understanding of the rapidly evolving RACG landscape and highlight promising directions for future AI-powered software engineering research.

URL PDF HTML ☆

赞 0 踩 0

2507.21168 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

多样化的大语言模型还是多样化的问题解释？那是集成的问题

Rafael Rosales, Santiago Miret

发表机构 * Intel Labs（英特尔实验室）

AI总结本文比较了使用大语言模型回答二元问题的两种多样性方法：模型多样性和问题解释多样性，并发现问题解释多样性在集成准确性上表现更优。

详情

DOI: 10.63317/43t2yvgid7tw
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 5116-5128

AI中文摘要

有效利用多样性已被证明可以提高各种机器学习模型，包括大语言模型（LLMs）的性能。然而，确定最有效的多样性使用方法仍是一个挑战。在本工作中，我们比较了两种用于使用LLMs回答二元问题的多样性方法：模型多样性，即多个模型回答相同的问题，以及问题解释多样性，即使用同一模型以不同方式 framing 相同的问题来回答。对于这两种情况，我们应用多数投票作为集成共识启发式方法来确定最终答案。我们的boolq、strategyqa和pubmedqa实验表明，问题解释多样性在集成准确性上始终优于模型多样性。此外，我们对GPT和LLaMa的分析表明，模型多样性通常产生在最佳和最差集成成员之间的结果，而没有明显的改进。

英文摘要

Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.

URL PDF HTML ☆

赞 0 踩 0

2506.09521 2026-05-21 eess.AS cs.CL 版本更新

You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks

你所说的就是你：利用语言内容进行语音隐私攻击

Ünal Ege Gaznepoglu, Anna Leschanowsky, Ahmad Aloradi, Prachi Singh, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters

发表机构 * International Audio Laboratories Erlangen（国际音频实验室埃朗根）； Department of Data Science（数据科学系）； Department of Electrical and Electronics Engineering（电气与电子工程系）； Friedrich-Alexander-University Erlangen-Nürnberg（弗里德里希-亚当-大学埃朗根-纽伦堡）； Fraunhofer IIS（弗劳恩霍夫研究所IIS）； Trinity College Dublin（都柏林三一学院）

AI总结本文研究了语音隐私保护系统中语言内容对语音隐私攻击的影响，通过调整BERT模型作为自动说话人验证系统，评估了说话人内部语言内容相似性对攻击性能的影响，并提出改进语音隐私数据集以实现更公平的隐私评估。

Comments 5 pages, 6 figures, 1 table, accepted at INTERSPEECH 2025 update reason: change to the acknowledgements

详情

AI中文摘要

说话人匿名化系统在隐藏说话人身份的同时，保留了诸如语言内容和情感等其他信息。为了评估其隐私效益，采用自动说话人验证（ASV）系统进行攻击。在本研究中，我们通过调整BERT模型作为ASV系统，评估了攻击者训练和评估数据集中说话人内部语言内容相似性的影响。在VoicePrivacy Attacker Challenge数据集中，我们的方法实现了平均相等错误率（EER）为35%，某些说话人仅基于其语音的文本内容就达到了2%的EER。我们的可解释性研究发现，系统决策与语音中语义相似的关键词有关，这些关键词源于LibriSpeech的编纂方式。我们的研究建议重新设计VoicePrivacy数据集，以确保公平和无偏的评估，并挑战对全球EER用于隐私评估的依赖。

英文摘要

Speaker anonymization systems hide the identity of speakers while preserving other information such as linguistic content and emotions. To evaluate their privacy benefits, attacks in the form of automatic speaker verification (ASV) systems are employed. In this study, we assess the impact of intra-speaker linguistic content similarity in the attacker training and evaluation datasets, by adapting BERT, a language model, as an ASV system. On the VoicePrivacy Attacker Challenge datasets, our method achieves a mean equal error rate (EER) of 35%, with certain speakers attaining EERs as low as 2%, based solely on the textual content of their utterances. Our explainability study reveals that the system decisions are linked to semantically similar keywords within utterances, stemming from how LibriSpeech is curated. Our study suggests reworking the VoicePrivacy datasets to ensure a fair and unbiased evaluation and challenge the reliance on global EER for privacy evaluations.

URL PDF HTML ☆

赞 0 踩 0

2506.04042 2026-05-21 cs.CL 版本更新

Causal Path Alignment: Anchoring the Optimization Trajectory for Controllable In-Parameter Knowledge Editing

因果路径对齐：为可控的参数知识编辑锚定优化轨迹

Xiyu Liu, Zhengxiao Liu, Naibin Gu, Zheng Lin, Weiping Wang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）

AI总结本文提出Causal Path Alignment框架，通过锚定优化轨迹来解决参数知识编辑中的主体主导记忆干扰问题，提升关系特异性并减少副作用。

Comments Accepted by IJCAI 2026

详情

AI中文摘要

知识编辑对于高效更新大型语言模型（LLM）的参数记忆至关重要，使它们能够在动态环境中作为演进代理发挥作用。然而，主流的参数知识编辑方法存在主体主导记忆干扰问题：修改特定事实会无意中破坏与同一主题相关的更广泛结构知识。我们诊断根本原因是捷径学习病理，即优化目标过拟合了主体表示，而绕过了本质的关系上下文。为此，我们提出因果路径对齐（CPA），一种原理性的框架，旨在将优化轨迹锚定到有效的因果路径上。CPA强制参数更新通过关系意识的中间状态，从而防止上下文依赖性的丢失。在多种LLM基础架构上的实验结果表明，CPA一致消除了捷径，显著提高了关系特异性，同时表现出最小的副作用。此外，CPA为现有编辑器提供了一种模型无关的插件，为可靠和可信的参数知识编辑铺平了道路。

英文摘要

Knowledge editing is pivotal for efficiently updating the parametric memory of Large Language Models (LLMs), enabling them to function as evolving agents in dynamic environments. However, mainstream in-parameter knowledge editing approaches suffer from Subject-Dominant Memory Interference: modifying a specific fact inadvertently corrupts the broader structural knowledge associated with the same subject within LLMs. We diagnose the root cause as a shortcut learning pathology, where the optimization objective overfits subject representations while bypassing the essential relational context. To rectify this, we propose Causal Path Alignment (CPA), a principled framework designed to anchor the optimization trajectory to valid causal pathways. CPA enforces parameter updates to route through relation-aware intermediate states, thereby preventing the erasure of contextual dependencies. Experimental results across diverse LLM backbones demonstrate that CPA consistently eliminates the shortcut, significantly improving relation specificity while exhibiting minimal side-effects. Moreover, CPA serves as a model-agnostic plug-in for existing editors, paving the way for reliable and trustworthy in-parameter knowledge editing.

URL PDF HTML ☆

赞 0 踩 0

2605.21256 2026-05-21 cs.CL 版本更新

Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification

西班牙临床笔记中可靠自动分诊：一种风险感知的HIV怀疑识别混合框架

Rodrigo Morales-Sánchez, Soto Montalvo, Raquel Martínez

发表机构 * Dept. of Lenguajes y Sistemas Informáticos, Escuela Técnica Superior de Ingeniería Informática, Universidad Nacional de Educación a Distancia (UNED)（语言与系统信息学系，信息工程技术学校，国家远程教育大学（UNED））； Dept. Informática y Estadística, Escuela Técnica Superior de Ingeniería Informática, Universidad Rey Juan Carlos (URJC)（信息与统计学系，信息工程技术学校，皇家胡安·卡洛斯大学（URJC））

AI总结本文提出一种混合框架，用于在西班牙临床笔记中识别HIV怀疑，通过分离随机不确定性和epistemic不确定性，提高分诊的可靠性。

Comments Accepted at the BioNLP Workshop @ ACL 2026

详情

AI中文摘要

标准临床自然语言处理（NLP）基准往往通过在模糊实例上强制确定性分类而产生虚高指标，从而掩盖了过于自信预测的临床风险。为弥合这一差距，我们提出了一种风险感知的混合选择性分类框架，在西班牙临床笔记中早期人类免疫缺陷病毒怀疑识别上进行了评估。我们的双验证方法通过Mondrian符合预测分离随机不确定性，并通过多中心马哈拉诺斯距离 veto 分离epistemic不确定性。实证评估表明，标准不确定性度量和基线分类器在安全医疗分诊中结构上不足，当被迫在严格可靠性约束下运行时，会遭受严重的覆盖崩溃。相反，通过要求临床叙述通过概率和几何保障，所提出的框架成功地隔离了一个高度可信的操作领域。

英文摘要

Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain.

URL PDF HTML ☆

赞 0 踩 0

2605.21227 2026-05-21 cs.CL 版本更新

Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

LLMs是否知道卢森堡语借词？探测低资源多语言模型中的词汇新词

Nina Hosseini-Kivanani

发表机构 * University of Luxembourg（卢森堡大学）； Radio Télévision Luxembourg (RTL)（卢森堡广播电视台（RTL））

AI总结本文通过LexNeo-Bench基准测试，探讨了低资源多语言模型在词汇借词识别中的表现，发现通过构建语言知识图谱可以显著提升模型的借词分类准确率，表明词汇资源对LLM评估具有结构化上下文的作用。

Comments Accepted to Neollm colocated with LREC2026, Three figures and three tables

详情

AI中文摘要

大型语言模型（LLMs）越来越多地用于小接触语言的写作辅助，但不清楚它们是否尊重社区在词汇借词和新词方面的规范。我们引入LexNeo-Bench，一个包含3,050个实例的令牌级基准，来源于LuxBorrow，一个大规模的卢森堡语新闻语料库，其中目标令牌被标记为本土词或法语、德语或英语借词。使用此基准，我们对三种多语言LLM在34种提示设置下进行测试，涉及两种任务：借词类型分类和二元词汇创新代理（借词与本土词）。在无外部上下文的情况下，模型在借词分类上的表现仅略高于随机猜测，因此我们构建了一个语言知识图谱，编码了供体语言、形态模式和词汇类比，并将实例特定的子图注入提示中。知识图谱提示将借词分类准确率从25-35%提升到71-81%，大幅缩小了小模型和大模型之间的差距，同时使新词检测困难且对少样本设计敏感。我们的结果表明，词汇意识提示对低资源接触语言中的稳健借词判断非常有益，且词汇资源可以作为LLM评估的结构化上下文。本研究在ENEOLI COST行动中进行，并探讨了多语言卢森堡语数据中的借词作为词汇创新的形式。

ACL-Verbatim: 无幻觉的科研问答

Gábor Recski, Szilveszter Tóth, Nadia Verdha, István Boros, Ádám Kovács

发表机构 * TU Wien（维也纳技术大学）； KR Labs（KR实验室）

AI总结本研究提出ACL-Verbatim系统，通过提取式问答方法在科研论文中精准映射用户查询到相关文本片段，构建了新的真实数据集并训练评估了多种提取模型，最终通过150M参数的ModernBERT模型在词级F1得分上达到53.6，优于最强的LLM提取器。

Comments 13 pages

详情

AI中文摘要

学术研究者需要高效可靠的工具从可信来源获取高质量信息，但现代AI辅助研究工具仍受大语言模型（LLMs）产生事实不准确或不合逻辑输出（即幻觉）的影响。我们应用提取式问答系统VerbatimRAG到ACL Anthology中的研究论文，直接将用户查询映射到检索文档中的原文文本片段。我们贡献了一个新的真实数据集，用于将用户查询映射到科研论文中的相关文本片段，并利用该数据集训练和评估多种提取模型。人工标注由自然语言处理研究人员完成，基于使用ScIRGen方法生成的合成用户查询，配以由VerbatimRAG检索的论文片段。在该基准上，一个基于我们流水线银色监督训练的150M参数ModernBERT标记分类器在词级F1得分上达到53.6，优于最强的评估LLM提取器（48.7）

英文摘要

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

URL PDF HTML ☆

赞 0 踩 0

2605.21097 2026-05-21 cs.CL 版本更新

WCXB: A Multi-Type Web Content Extraction Benchmark

WCXB：一个多类型网络内容提取基准

Murrough Foley

发表机构 * Murrough Foley

AI总结本文提出WCXB基准，包含2008个网页，涵盖七种结构不同的页面类型，通过五阶段流程生成真实标注，评估13种提取系统，发现现有文章-only基准无法发现结构化页面的盲区。

Comments Dataset: github.com/Murrough-Foley/web-content-extraction-benchmark, doi.org/10.5281/zenodo.19316874. Leaderboard: webcontentextraction.org. Preprint also deposited at doi.org/10.5281/zenodo.19664685

详情

AI中文摘要

网络内容提取——从网页中隔离主要内容以排除周围模板内容——是搜索引擎索引、检索增强生成、NLP数据集构建和大语言模型训练的前提。该领域的进展受到现有评估基准的限制，这些基准规模小（100-800页）、仅限于新闻文章或基于超过十年前的网页。我们介绍了网络内容提取基准（WCXB），包含来自1613个域的2008个网页，涵盖七种结构不同的页面类型：文章、论坛、产品、集合、列表、文档和服务页面。该数据集包括1497页的开发集和511页的测试集，具有匹配的页面类型分布。真实标注通过五阶段流程生成：LLM辅助草稿、自动化验证、四轮前沿模型审查、片段和质量验证脚本以及人工审查。我们评估了13种提取系统——11种启发式和2种神经网络——发现尽管顶级系统在文章上（F1=0.93）表现良好，但在结构化页面类型上性能差异显著（F1=0.41-0.84），揭示了现有文章-only基准无法发现的盲区。该数据集以CC-BY-4.0许可证发布，包含HTML源文件、真实标注、页面类型标签和基线结果。

英文摘要

Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.

URL PDF HTML ☆

赞 0 踩 0

2605.21086 2026-05-21 cs.CL 版本更新

LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

LoCar: 通过细粒度社会语言学控制评估车载助手的本地化

Seogyeong Jeong, Kiwoong Park, Seyoung Song, Eunsu Kim, Ken E. Friedl, Jaeho Kim, Alice Oh

发表机构 * KAIST（韩国科学技术院）； BMW Group（宝马集团）

AI总结本文提出了一种新的车载助手评估框架，专注于韩语本地化，揭示了当前LLM在细粒度韩语敬语控制方面的不稳定性以及策略性对话指标上的表现不足，强调了汽车AI需要向精确语言定制和可靠安全交互管理发展。

Comments To appear in ACL 2026 Industry Track

详情

AI中文摘要

尽管大型语言模型（LLMs）越来越多地集成到车载对话系统中，但由于缺乏针对实际部署需求定制的领域特定评估标准，确定最优模型仍具挑战性。在本文中，我们提出了一种新的车载助手评估框架，特别关注韩语本地化。我们的实证分析揭示了模型行为中的显著模式。首先，细粒度韩语敬语控制在当前LLM中仍然不稳定，表明在本地化设置中必须明确评估精确的语音层面实现。其次，模型在战略对话指标如澄清和主动性方面表现较弱。我们的分析表明，这源于这些任务本身的主观复杂性，其中我们的框架采取保守的评估立场以优先考虑可靠性。总体而言，我们的发现强调，汽车AI必须超越一般能力，向精确语言定制和可靠、以安全为导向的交互管理发展。

英文摘要

While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

URL PDF HTML ☆

赞 0 踩 0

2605.21076 2026-05-21 cs.CL 版本更新

GradeLegal: Automated Grading for German Legal Cases

GradeLegal: 自动化评分德国法律案例

Abdullah Al Zubaer, Lorenz Wendlinger, Simon Alexander Nonn, Michael Granitzer, Jelena Mitrovic

发表机构 * University of Passau（帕绍大学）； Deggendorf Institute of Technology（德格多夫技术学院）

AI总结本研究探讨了大型语言模型能否用于自动化评分德国刑事和公法案例解决方案，通过系统评估27个专有和开源LLM，发现基于样本解决方案和评分标准的提示策略能有效模拟专家评分，尤其在公法领域达到0.91的QWK值，而在刑事法律领域仅为0.60，表明刑事法律评分任务更难。此外，集成方法能进一步提高评分一致性，并为更强的闭源单模型提供替代方案。

详情

AI中文摘要

对德国法律考试解决方案进行评分面临日益增长的体积和合格评分员短缺，导致反馈延迟并形成瓶颈。同时，这是一个高风险的专家任务，因为国家考试成绩强烈影响德国的职业发展。尽管具有实际相关性，文献中缺乏系统研究有效评分法律考试的方法。为解决这一差距，我们研究了大型语言模型（LLMs）是否能支持自动化评分德国刑事和公法案例解决方案，从而实现可扩展的反馈和学生自我测试。我们系统评估了27个专有和开源LLM，并通过基准测试提示策略，逐步增加任务相关信息，如样本解决方案和评分标准。使用二次加权κ（QWK），基于推理的LLM在获得样本解决方案和评分标准时，能在公法领域模拟专家评分（最高0.91），而刑事法律领域仅为0.60，表明刑事法律评分任务更难。除了单模型评分外，集成方法能通过其最佳成员提高一致性高达0.15，并可为更强的闭源单模型提供替代方案。此外，我们的发现表明，有效的提示设计和模型选择对于可靠的LLM基于法律考试评分至关重要。

英文摘要

Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.

URL PDF HTML ☆

赞 0 踩 0

2605.21063 2026-05-21 cs.CL 版本更新

APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings

APM：通过任意偏好映射评估大语言模型中的风格个性化

Philipp Spohn, Leander Girrbach, Zeynep Akata

发表机构 * Technical University of Munich, Helmholtz Munich（慕尼黑技术大学，亥姆霍兹慕尼黑）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心（MCML））

AI总结本研究提出APM基准，通过隐式偏好映射评估大语言模型的风格个性化能力，发现路由方法是最可靠的方法，而RAG和软提示优化在强基础模型上才有提升。

详情

AI中文摘要

典型的LLM响应往往遵循默认风格，尽管用户对语气、详尽程度和正式程度有不同偏好，但这些偏好并未在提示中明确表达。评估个性化方法是否能适应这些隐式偏好具有挑战性，因为用户通常提供提示而非参考响应，风格偏好无法事实验证，且无参考的LLM评判可能将个性化与一般响应质量混淆。为解决这些挑战，我们引入了任意偏好映射（APM）基准，通过隐式随机映射C将用户属性（如热情）与响应原则（如说服力）解耦。由于C不包含语义内容且在每次运行中重新采样，模型无法利用刻板印象关联，必须从对话历史中推断偏好。使用这种无偏的评估方法，我们适配了检索增强、提示优化和路由个性化方法，并在Llama-3.1-8B和Qwen-3.5-27B上进行评估。我们的结果表明，路由是最佳方法，而RAG仅在更强的基础LLM上有所提升，软提示优化在非个性化基线上提升不显著。我们的广泛评估表明，在这种现实设置中，个性化仍然具有挑战性，但我们的适配方法显示出前景。

英文摘要

Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.

URL PDF HTML ☆

赞 0 踩 0

2605.21049 2026-05-21 cs.CL 版本更新

思考-言语：一种受控交错推理方法用于实时语音生成

Xuan Du, Qiangyu Yan, Wenshuo Li, Borui Jiang, Changming Xiao, Han Shu, Xinghao Chen

发表机构 * Huawei Technologies（华为技术）

AI总结本文提出了一种受控交错推理方法InterRS，用于实时语音生成，通过在自然语音生成过程中插入推理步骤，提高了语音流畅性和推理深度，实验表明其在数学和逻辑基准测试中表现更优，并生成更自然流畅的答案。

详情

AI中文摘要

思考-言语范式旨在使AI交流更人性化。关键挑战是保持流畅的语音生成同时进行深度推理。我们的方法InterRS通过在自然语音生成过程中插入推理步骤来解决这一问题。这需要高质量的数据，其中推理和语音精确对齐且长度比例受控。我们引入了一种新的管道来生成无缝交错的音频数据。为了训练我们的模型，我们结合了交错SFT与精炼数据以及强化学习，使用两种新的奖励：TA-Balance Reward用于管理时间与思考-回答比例，以及Linguistic Quality Reward用于优化表达。实验表明，我们的方法在数学和逻辑基准测试中实现了13%的性能提升，同时生成像口语指令模型一样快速的响应。此外，我们的方法生成的答案例句比先前方法更自然流畅。

英文摘要

The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data where reasoning and speech are precisely aligned, and the length ratio are under controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT with refined data and reinforcement learning with two new rewards: a TA-Balance Reward to manage timing and thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show our approach achieves 13% better performance on mathmatical and logic benchmarks while generating instant response like a spoken-language instruct model which outputs fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.

URL PDF HTML ☆

赞 0 踩 0

2605.20936 2026-05-21 cs.LG cs.AI cs.CL 版本更新

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

DASH：在单个GPU上几分钟内完成的快速可微架构搜索用于混合注意力

Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））

AI总结本研究提出DASH，一种快速可微架构搜索框架，用于混合注意力架构设计，通过将离散的层间注意力操作放置转化为连续的架构logits，准备可重用的教师对齐线性候选，并在模型和操作权重冻结的情况下进行架构仅搜索，显著提高了搜索效率。DASH在Qwen2.5-3B-Instruct上优于现有的所有选择器风格的混合注意力设计基线，展示了直接可微搜索可以发现更强的混合架构。此外，DASH在RULER性能上优于已发布的Jet-Nemotron模型，同时在重叠的短上下文和通用基准上保持竞争力。值得注意的是，每个DASH搜索运行仅使用12.3M tokens，并在单个RTX Pro 6000 GPU上仅需约20分钟，对应Jet-Nemotron报告的PostNAS搜索tokens的0.006%。这些结果表明，通过分钟级的可微搜索可以获得高质量的混合注意力架构，为混合架构设计提供了有前景的方向。

Comments 19 pages, 7 figures

详情

AI中文摘要

混合注意力架构正变得越来越重要，用于在保持模型质量的同时提高LLM推理效率，使混合架构设计成为核心问题。现有的设计通常依赖于手动经验规则或基于代理的选择器信号来分配层间操作符。最近的NAS风格系统，如Jet-Nemotron，展示了自动混合架构搜索的潜力。然而，Jet-Nemotron的PostNAS搜索阶段单独使用200B tokens，使得此类搜索流程难以作为混合架构设计的常规方法。我们引入DASH，一种用于混合注意力架构设计的快速可微搜索框架，它将离散的层间注意力操作放置放松为连续的架构logits，准备可重用的教师对齐线性候选，并在模型和操作权重冻结的情况下进行架构仅搜索，以显著提高搜索效率。在Qwen2.5-3B-Instruct上，DASH一致优于现有的所有选择器风格的混合注意力设计基线，表明直接可微搜索可以发现更强的混合架构。此外，DASH在RULER性能上优于已发布的Jet-Nemotron模型，同时在重叠的短上下文和通用基准上保持竞争力。值得注意的是，每个DASH搜索运行仅使用12.3M tokens，并在单个RTX Pro 6000 GPU上仅需约20分钟，对应Jet-Nemotron报告的PostNAS搜索tokens的0.006%。这些结果表明，通过分钟级的可微搜索可以获得高质量的混合注意力架构，为混合架构设计提供了有前景的方向。

英文摘要

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.

URL PDF HTML ☆

赞 0 踩 0

2605.20924 2026-05-21 cs.CL cs.AI 版本更新

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

Strategy-Induct: 任务级策略诱导用于指令生成

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University, Taiwan（国立台湾大学计算机科学与资讯工程学系）； Institute of Information Science, Academia Sinica, Taiwan（台湾“中央研究院”资讯科学研究所）； AI Research Center (AINTU), National Taiwan University, Taiwan（国立台湾大学人工智能研究中心）

AI总结该研究提出了一种任务级策略诱导方法Strategy-Induct，通过仅使用少量示例问题生成任务指令，无需依赖标注答案，从而在指令生成任务中取得优于现有方法的性能。

Comments Accepted to Findings of ACL 2026

详情

AI中文摘要

设计有效的任务级提示对于提高大型语言模型（LLMs）的性能至关重要。尽管先前的指令诱导工作表明LLMs可以通过有限的例子推断出更好的指令，但现有方法通常依赖于输入-输出对，而获取标注答案可能困难或成本高昂。为了解决这一限制，我们提出了Strategy-Induct框架，该框架仅从少量示例问题中推导出任务级指令，而无需标注答案。我们的方法首先提示模型为每个问题生成显式的推理策略，形成（策略，问题）对。这些对随后用于诱导一个任务指令，以引导推理。在多个任务和模型规模上的实验表明，Strategy-Induct在仅问题设置中优于最先进的方法。此外，我们观察到在任务指令生成和推理中联合使用LLMs和大型推理模型可能会进一步提高性能。

英文摘要

Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.

URL PDF HTML ☆

赞 0 踩 0

2605.20920 2026-05-21 cs.CL cs.SD 版本更新

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

通过发声学音素识别评估语音发声合成

Vinicius Ribeiro, Yves Laprie

发表机构 * Université de Lorraine, CNRS, Inria, LORIA（洛林大学、国家科学研究中心、法国国家信息与自动化研究所、LORIA）

AI总结本文通过发声学音素识别作为代理来评估语音发声合成的质量，提出利用发声学特征进行音素识别以更准确捕捉发音细节，从而改进生成模型的评估方法。

Comments Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026

详情

AI中文摘要

最近机器学习的进步和发声学数据集的可用性使得声带合成可以基于语音序列进行条件化，这是发声学语音合成的主要任务。然而，质量评估需要更好的定义。通常，对生成模型进行排名具有挑战性，因为这涉及主观性。然而，发声学合成还具有额外的困难，即需要对声带解剖学和声学有专业知识。为了解决这个问题，本文提出通过音素识别来评估语音发声合成。我们的假设是使用发声学特征进行音素识别能更好地捕捉发音细节，如正确的发音位置，这传统度量（如点距度量）无法做到。我们训练了一个神经网络，使用来自单说话人RT-MRI数据集提取的声学和发声学特征。然后，我们比较了在不同合成发声学特征下测试模型的识别性能。我们的结果表明，我们的发声学特征集在语音发声合成中具有丰富的语音信息，并有助于探索额外的维度。

英文摘要

Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity. However, articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps exploring additional dimensions on speech articulation synthesis.

URL PDF HTML ☆

赞 0 踩 0

2605.20916 2026-05-21 cs.CL 版本更新

Task-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis

具有认知评估的任务路由混合专家模型用于隐含情感分析

Yaping Chai, Haoran Xie, Joe S. Qin

发表机构 * Division of Artificial Intelligence, Lingnan University, Hong Kong（人工智能学院，岭南大学，香港）

AI总结本文提出了一种基于认知评估理论的多任务学习框架，通过隐含情感检测和认知推理生成两个辅助任务，提升隐含情感分析的性能，同时采用任务级混合专家模型减少任务干扰，实验表明该方法在隐含情感子集上优于现有方法。

Comments 8 pages, 4 figures, and 3 tables

详情

AI中文摘要

隐含情感分析具有挑战性，因为对某个方面的态度通常是通过事件推断而非显式意见词表达。现有模型通常仅学习最终极性标签，这限制了从上下文推断情感的能力。受认知评估理论启发，我们提出了一种评估意识的多任务学习（MTL）框架，用于隐含情感分析，该框架通过两个互补的辅助任务：隐含情感检测和认知推理生成，提供极性预测。然而，训练多个具有不同目标的任务并共享单一骨干结构会限制灵活性并导致任务干扰。为减少这些相关但不同的目标之间的干扰，我们采用任务级混合专家模型，其中所有任务共享一组专家，任务身份控制这些专家的稀疏组合。我们的方法基于编码器-解码器架构，并用这些稀疏混合替换部分编码器和解码器块。我们使用任务条件路由器为每个任务选择稀疏专家混合，并使用任务分离的路由目标鼓励不同任务学习不同的专家选择模式。实验结果表明，我们的模型在隐含情感子集上优于最近提出的方法，具有显著优势。我们的代码可在 https://github.com/yaping166/TRMoE-ISA 上获得。

英文摘要

Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task-level mixture-of-experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at https://github.com/yaping166/TRMoE-ISA.

URL PDF HTML ☆

赞 0 踩 0

2605.20915 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

校准与决策制定：重新审视未学习语言模型中的可靠性悖论

Divyaksh Shukla, Ashutosh Modi

发表机构 * Indian Institute of Technology Kanpur (IIT Kanpur)（印度理工学院坎浦尔学院（IIT坎浦尔））

AI总结本文研究了生成语言模型中校准与决策可靠性之间的差距，通过TOFU基准测试中的多项选择问答评估协议，发现经过微调的模型在校准误差较低，而未学习后的模型在校准误差仍低，但依赖于相关性特征的决策规则增加，扩展了可靠性悖论到机器未学习领域。

Comments Accepted at SRW, ACL 2026; 17 pages (9 + 2 + 6)

详情

AI中文摘要

机器未学习旨在从模型中移除特定训练数据的影响，同时保持对剩余数据的可靠行为，使可靠的预测和不确定性估计成为评估的关键。校准常被用作语言模型可靠性代理，但低校准误差并不一定意味着可靠的决策规则，因为模型可能依赖于虚假相关性而保持良好校准。我们通过TOFU基准测试中的多项选择问答评估协议，研究了生成语言模型中的这一差距，利用校准指标（ECE、MCE、Brier）测量概率可靠性，并通过基于属性的快捷方式检测（使用积分梯度和局部互信息）评估决策规则可靠性。我们发现，微调模型的校准误差（ECE ~ 0.04）低于预训练模型（ECE > 0.5），而未学习后的模型在校准误差相似，尽管在遗忘分割上的准确性降低，属性分析显示对基于相关性的标记依赖增加。这些结果表明，良好的校准可以与未学习后的基于快捷方式的决策规则共存，将可靠性悖论扩展到了机器未学习领域。

英文摘要

Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.

URL PDF HTML ☆

赞 0 踩 0

2605.20912 2026-05-21 cs.CL 版本更新

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

增强科学论述：面向科学领域的机器翻译

Dimitris Roussis, Sokratis Sofianopoulos, Stelios Piperidis

发表机构 * Institute for Speech and Language Processing（语音与语言处理研究所）； Athena RC（雅典研究中心）

AI总结本文针对科学领域中由于专业术语和复杂句式带来的翻译挑战，构建了多语种平行和单语语料库，并通过微调通用神经机器翻译系统评估语料库质量。

详情

AI中文摘要

随着科研文献数量的增加，跨语言交流的需求日益迫切。机器翻译（MT）为获取国际出版物提供了有前景的解决方案。然而，科学领域因其专业术语和复杂句式而具有独特挑战。本文提出了一套面向科学领域的平行和单语语料库，目标语言对为西班牙-英语、法语-英语和葡萄牙-英语。对于每种语言对，我们创建了一个大规模的通用科学语料库以及四个聚焦于癌症研究、能源研究、神经科学和交通运输研究的较小语料库。为了评估这些语料库的质量，我们利用它们对通用神经机器翻译（NMT）系统进行微调。我们详细介绍了语料库的创建过程、所采用的微调策略，并最后给出了评估结果。

英文摘要

The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of: Cancer Research, Energy Research, Neuroscience, and Transportation research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

URL PDF HTML ☆

赞 0 踩 0

2605.20876 2026-05-21 cs.CL cs.AI 版本更新

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

Terminal-World: 通过智能体技能扩展终端智能体环境

Zihao Cheng, Hongru Wang, Zeming Liu, Xinyi Wang, Xiangrong Zhu, Yuhang Guo, Wei Lin, Jeff Z. Pan, Yunhong Wang

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing, China（北京航空航天大学计算机科学与工程学院）； Independent Researcher（独立研究者）； Beijing Institute of Technology（北京理工大学）； University of Edinburgh（爱丁堡大学）

AI总结本文提出Terminal-World，一种自动化流程，利用智能体技能作为核心合成原语，共同编码任务目标、执行时机和方法，从而生成任务指令、环境和教师轨迹。通过构建5,723个训练环境，训练出Terminal-World-8B/14B/32B模型，在六个基准测试中均优于终端智能体基线，其中Terminal-World-32B在Terminal-Bench 2.0上以仅1.2%的训练数据超越Nemotron-Terminal-32B。

Comments Work in Progress

详情

AI中文摘要

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

英文摘要

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

URL PDF HTML ☆

赞 0 踩 0

2605.20833 2026-05-21 cs.CL 版本更新

MemGym: a Long-Horizon Memory Environment for LLM Agents

MemGym: 一种长时间跨度的记忆环境用于LLM智能体

Wujiang Xu, Yu Wang, Kai Mei, Kaiqu Liang, Zhenting Wang, Mingyu Jin, Han Zhang, Shi-Xiong Zhang, Wenyue Hua, Sambit Sahu, Dimitris N. Metaxas

发表机构 * Rutgers University（罗杰斯大学）； Capital One ； Princeton University（普林斯顿大学）； Microsoft Research（微软研究院）

AI总结本文提出MemGym，一种用于评估LLM智能体记忆能力的基准测试环境，通过统一现有智能体 gym 和内部记忆基础管道，提供一个记忆推理接口。MemGym 包含五个评估赛道，涵盖四个智能体领域，能够独立评估记忆性能，排除推理、检索和工具使用能力的干扰。

详情

AI中文摘要

记忆是LLM智能体在长时间任务中运营的核心能力。现有的记忆基准测试主要评估多轮聊天场景中个性化信息的保留能力，忽略了在长时间智能体执行过程中发生的动态记忆形成。因此，它们所生成的记忆系统在现实的智能体环境中（如编程和网络导航）转移效果差。我们提出了MemGym，一个用于智能体记忆的基准测试，它将现有的智能体 gym 和内部记忆基础管道统一到一个记忆推理接口下。MemGym涵盖五个评估赛道，分为四个智能体领域：工具使用对话（tau2-bench）、多轮深度研究搜索（MEMGYM-DR）、编程（SWE-Gym和MEMGYM-CODEQA）、计算机使用（WebArena-Infinity）。MemGym报告出的记忆隔离分数将记忆性能与推理、检索和工具使用能力分离，因此可以独立对记忆策略进行排名。我们的合成管道为MEMGYM-CODEQA和MEMGYM-DR是长度可控的，在每个阶段都经过消融验证，并紧密对齐下游场景。为了使在编程环境中的评估在学术上具有可操作性，我们训练了MemRM，一个轻量级的奖励模型（使用Qwen3-1.7B微调QLoRA），它以快速标量读取的方式评分压缩质量，而不是完整的Docker回放。

英文摘要

Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

URL PDF HTML ☆

赞 0 踩 0

2605.20815 2026-05-21 cs.CL cs.AI cs.IR cs.LG 版本更新

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

在消费级硬件上实现GraphRAG：对本地LLMs在医疗EHR模式检索中的基准测试

Peter Fernandes, Ria Kanjilal

发表机构 * Department of Computer Engineering（计算机工程系）； California Polytechnic State University（加州州立大学波特兰分校）

AI总结本文研究了在消费级硬件上使用本地LLMs进行医疗EHR模式检索的GraphRAG方法，评估了四种不同模型在索引效率、知识图构建、查询延迟、回答质量和幻觉方面的表现，发现模型参数大小和检索模式对结果有显著影响。

Comments 9 pages, 1 figure, 5 tables

详情

AI中文摘要

基于图的检索增强生成（GraphRAG）扩展了检索增强生成，以支持对复杂语料库的结构化推理，但其在资源受限、隐私敏感的部署中的可靠性仍不清楚。在医疗领域，电子健康记录（EHR）数据复杂且严格监管，依赖云基于大语言模型（LLMs）会带来成本、延迟和合规性的挑战。本文系统评估了GraphRAG在EHR模式检索中的应用，使用本地部署的开源LLMs。我们实现了Microsoft GraphRAG管道在真实的EHR模式文档上，并基准测试了四种模型，包括Llama 3.1（8B）、Mistral（7B）、Qwen 2.5（7B）和Phi-4-mini（3.8B），这些模型通过Ollama在单个消费级GPU（8 GB VRAM）上部署。我们评估了索引效率、知识图构建、查询延迟、回答质量和幻觉在全局和局部检索模式下的表现。我们的结果揭示了显著差异：Llama 3.1生成最丰富的知识图（1,172个实体），Qwen 2.5达到最佳回答质量（3.3/5），Phi-4-mini因结构化输出错误无法完成流程，而Mistral表现出退化重复行为。我们进一步表明，GraphRAG具有实际容量阈值，其中模型参数低于约7B的模型无法可靠地生成有效的结构化输出并无法完成流程。此外，索引和回答质量在不同模型之间是脱耦的，局部检索在延迟和事实基础方面均优于全局总结，且幻觉减少。这些发现表明，GraphRAG可以在消费级硬件上实现，同时强调了模型选择和检索设计在受监管环境中的重要性。

英文摘要

Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.

URL PDF HTML ☆

赞 0 踩 0

2605.20813 2026-05-21 cs.CL 版本更新

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

PulseCol: 周期性刷新的列稀疏注意力用于加速扩散语言模型

Yanyi Lyu, Letian Chen, Futing Sun, Miao Zhang, Weili Guan, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））

AI总结本文提出PulseCol，一种周期性刷新的列稀疏注意力方法，通过更细粒度的稀疏化策略提升扩散语言模型的计算效率和加速性能，同时保持模型质量。

详情

AI中文摘要

干预的幻觉：你的LLM模拟实验实际上是一个观察性研究

Victoria Lin, Taedong Yun, Maja Matarić, John Canny, Arthur Gretton, Alexander D'Amour

发表机构 * Google DeepMind（谷歌深Mind）； Carnegie Mellon University（卡内基梅隆大学）

AI总结本文探讨了大型语言模型在模拟人类行为中的潜在作用，指出在LLM模拟的合成用户中进行干预可能引起潜在用户属性的意外变化，从而导致用户漂移，影响效果估计。本文提出了使用负对照结果来检测分布变化的方法，并通过调整角色描述以减少偏倚来缓解漂移问题。

详情

AI中文摘要

大型语言模型（LLMs）显示出作为人类行为模拟器的潜力，提供了一种可扩展的方式研究对干预的反应。然而，由于LLMs主要基于观察性数据进行训练，在与LLM模拟的合成用户进行实验时，干预可能会引起潜在用户属性的意外变化，导致用户漂移，其中隐含的模拟总体在不同处理条件下有所不同，这可能会扭曲效应估计。我们正式化了由于用户漂移可能产生的混淆或选择偏差，并展示了干预依赖性变化如何放大或减弱干预下用户响应的观测差异。为了诊断混淆，我们提出使用负对照结果——在干预下应保持不变的属性——来识别干预条件间的分布变化，提供用户漂移的证据。为了缓解漂移，我们研究了通过获取额外的混杂因素来调整角色描述，发现针对特定场景的相关混杂因素可以显著减少调查式和多轮代理评估中的偏倚。

英文摘要

Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.

URL PDF HTML ☆

赞 0 踩 0

2605.20745 2026-05-21 cs.LG cs.AI cs.CL 版本更新

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

验证器严格性的隐含信号：通过选择性潜在引导控制和改进逐步验证

Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui

发表机构 * Dartmouth College（达特茅斯学院）； Datadog AI Research（Datadog人工智能研究）； Salesforce AI Research（Salesforce人工智能研究）

AI总结本文研究了通过隐藏状态干预控制验证器严格性的方法，提出VerifySteer通过利用潜在正确性信号进行样本级路由并选择性干预段落边界，从而在ProcessBench和Hard2Verify数据集上优于基线方法，且在推理计算上更高效。

详情

AI中文摘要

生成验证器已成为逐步验证的一种有前途的范式，但其验证行为往往校准不佳：它们可能过于宽松而错过错误步骤，或过于严格而拒绝正确推理。我们将这种倾向于过于宽松或过于严格的行为称为验证器严格性。在本工作中，我们研究是否可以通过隐藏状态干预来控制验证器严格性。我们揭示了一个验证特定的隐藏状态信号：在逐步验证中，验证器接受或拒绝解决方案步骤的倾向编码在对应的验证段落边界附近。利用这一信号，我们证明隐藏状态引导可以直接调节验证器严格性，而无需微调。然而，统一引导会导致错误检测与正确性认证之间的权衡。为了解决这个问题，我们提出了VerifySteer，它利用潜在正确性信号进行样本级路由，并选择性地在段落边界进行干预。在ProcessBench和Hard2Verify上的实验表明，VerifySteer优于提示优化和激活引导基线，并且在需要更少推理计算的情况下与自一致性竞争。VerifySteer还与验证微调互补，在微调验证器上提供进一步的收益。代码可在https://github.com/YefanZhou/VerifySteer上获得。

英文摘要

Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.

URL PDF HTML ☆

赞 0 踩 0

2605.20743 2026-05-21 cs.CV cs.CL 版本更新

人工智能审稿人的局限与机遇：对Nature系列论文审稿的45位专家科学家的审查

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig

发表机构 * Nature（自然）

AI总结本文通过大规模专家标注研究，探讨了AI审稿人在科学同行评审中的能力与局限，发现AI审稿在准确性、显著性和证据充分性方面表现优异，但存在领域知识有限、上下文管理不足等弱点，表明AI审稿是人类审稿的补充而非替代。

Comments Work in progress

详情

AI中文摘要

随着AI能力的提升，AI审稿人开始被应用于科学同行评审，但其能力和可信度仍存疑：许多科学家将其视为概率系统，缺乏评估研究的专业能力，而其他研究人员则对AI的准备程度更为乐观，但缺乏实证支持。理解AI审稿人擅长什么、哪里不足以及仍需解决的挑战至关重要。然而，现有的AI审稿评估主要关注其判断是否与人类一致（例如评分对齐、接受预测），这不足以表征其能力和局限。在本文中，我们通过大规模专家标注研究填补了这一空白，45位物理、生物和健康科学领域的专家花费469小时对2960个个体批评（每个批评针对论文的一个特定方面）进行评分，这些批评来自人类和AI生成的82篇Nature系列论文的审稿。在综合正确性、显著性和证据充分性三个维度上，由GPT-5.2驱动的审稿代理在每篇论文的最高评分人类审稿人评分上（60.0% vs. 48.2%，p = 0.009），而所有三个AI审稿（包括Gemini 3.0 Pro和Claude Opus 4.5）在每个维度上都超过了最低评分的人类审稿人。AI审稿的准确批评也更常被评分显著且证据充分，并揭示了人类未提及的26%的问题。然而，AI审稿在交叉审稿者对之间重叠远多于人类（21% vs. 3%），并且表现出16个人类不共享的弱点，如领域知识有限、缺乏多文件上下文管理能力以及对次要问题过于批判。总体而言，我们的结果表明当前AI审稿人是人类审稿人的补充，而非替代。

英文摘要

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

URL PDF HTML ☆

赞 0 踩 0

2605.20643 2026-05-21 cs.LG cs.AI cs.CL 版本更新

AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

AVSD：通过平衡共识和教师特定的特权信号实现自适应视图自蒸馏

Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

发表机构 * UNC Chapel Hill（北卡罗来纳大学教堂山分校）； Capital One（Capital One公司）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结本文提出AVSD，一种通过平衡共识和教师特定的特权信号来实现自适应视图自蒸馏的方法，以解决自蒸馏中教师和学生信息不对称和特权信息选择的问题。

Comments Code: https://github.com/duykhuongnguyen/AVSD

详情

AI中文摘要

自蒸馏使语言模型能够通过使用同一模型作为学生和教师来从自身轨迹中学习，其中教师基于学生无法访问的特权信息进行条件。此类信息可以是不同种类或视图，如解决方案、演示、反馈或最终答案。这种设置可以在不依赖外部模型的情况下提供密集的token级反馈，但会产生根本性的不对称性：教师可能依赖于视图特定的信息，而学生在推理时无法访问。此外，最佳的特权信息类型通常是任务依赖的，使得选择单一教师视图变得困难。在本工作中，我们通过引入AVSD（自适应视图自蒸馏），一种具有多种特权信息视图的自蒸馏新方法，来同时解决这两个挑战。AVSD通过分离稳定的跨视图共识和视图特定的残差信号来重建token级监督。AVSD识别出跨视图共享的共识信号，提供可靠的更新方向，然后在两者一致且比例适当的情况下，选择性地添加视图特定的残差信号以调整更新幅度。在数学竞赛基准（AIME24、AIME25和HMMT25）上的实验表明，AVSD在Qwen3-8B和Qwen3-4B上分别比单视图自蒸馏基线和GRPO平均Avg@8提升了3.1%和2.2%。此外，在代码生成基准（Codeforces、LiveCodeBench v6）上使用Qwen3-8B时，AVSD在平均上比单视图自蒸馏基线高出2.4%。

英文摘要

Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token-level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view-specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task-dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive-View Self-Distillation), a novel method of self-distillation with multiple privileged-information views, which reconstructs token-level supervision by separating stable cross-view consensus from view-specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view-specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single-view self-distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Moreover, on code-generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3-8B, AVSD outperforms the single-view self-distillation baseline by 2.4% on average.

URL PDF HTML ☆

赞 0 踩 0

2605.20626 2026-05-21 cs.CL cs.AI cs.CV 版本更新

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

基于检索的长上下文翻译用于文化图像描述：佛罗里达大学Gators参加2026年美洲自然语言处理共享任务的提交

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant

发表机构 * University of Florida（佛罗里达大学）

AI总结本文提出了一种基于检索的长上下文翻译方法，用于文化图像描述，通过两阶段流程生成西班牙语中间描述，再利用检索增强的多示例提示生成目标语言描述，显著提升了Bribri、Guaraní和Orizaba Nahuatl语言的描述生成性能，并在共享任务中获得冠军。

详情

AI中文摘要

我们提出了佛罗里达大学Gators团队对2026年美洲自然语言处理共享任务在原住民语言文化图像描述任务中的提交。我们的两阶段流程使用Qwen2.5-VL生成西班牙语中间描述，然后利用检索增强的多示例提示与Gemini 2.5 Flash生成目标语言描述。我们在开发集评估中分别实现了Bribri、Guaraní和Orizaba Nahuatl描述生成性能的164.1%、131.7%和122.6%的提升，并在测试集评估中保持Bribri和Orizaba Nahuatl语言的>150%提升。我们发现检索高度依赖语言，仅对大规模、领域内语料有效，并且合成数据增强对开发集Guaraní性能提升贡献了约28 chrF++。我们的提交在共享任务中获得冠军，位列五份最终提交中的第二名。

英文摘要

We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

URL PDF HTML ☆

赞 0 踩 0

2605.20616 2026-05-21 cs.CL 版本更新

Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

Auto-Dreamer：学习离线记忆巩固用于语言智能体

Chongrui Ye, Yuxiang Liu, Yu Wang, Haofei Yu, Yining Zhao, Ge Liu, Julian McAuley, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of California San Diego（加州大学圣地亚哥分校）

AI总结本文提出Auto-Dreamer，一种学习离线记忆巩固方法，用于语言智能体，通过分离快速会话记忆获取与慢速跨会话巩固过程，提升智能体在多个任务流中的记忆整合与知识复用能力。

Comments Preprint

详情

AI中文摘要

语言智能体越来越多地在相关任务流上运行，但现有记忆系统难以将积累的经验转化为可重用的知识。检索增强和结构化记忆方法能有效记录每会话的观察，但通常将获取和巩固过程合并为一个在线过程，使智能体无法获得跨会话的全局视图以发现重复模式、抽象共享流程或剪枝冗余条目。受互补学习系统理论启发，我们提出Auto-Dreamer，一种学习的离线巩固器用于语言智能体记忆。Auto-Dreamer将快速的每会话记忆获取与慢速的跨会话巩固过程分离。给定一个选定的类型记忆库的工作区域，巩固器将该区域视为只读证据，执行受限的工具使用来检查条目和与来源轨迹相关联的来源轨迹，并合成一个新鲜的紧凑替换集，该集在跨会话中抽象并取代原始区域。我们通过GRPO训练Auto-Dreamer，使用端到端智能体性能作为奖励信号来学习如何通过快速在线经验巩固记忆。仅在ScienceWorld轨迹上训练，Auto-Dreamer在ScienceWorld上优于固定、强化学习训练和提示记忆基线，得分高出7分，同时使用比最强基线小12倍的活跃记忆库，并在不重新训练的情况下继续在held-out的ALFWorld和WebArena上领先，使用比最强基线小6倍的内存。

英文摘要

Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12$\times$ smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining -- using 6$\times$ less memory than the strongest baseline on ALFWorld.

URL PDF HTML ☆

赞 0 踩 0

2605.20613 2026-05-21 cs.CL 版本更新

HRM-Text: Efficient Pretraining Beyond Scaling

HRM-Text: 超越规模的高效预训练

Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori

发表机构 * Sapient Intelligence ； MIT（麻省理工学院）

AI总结本文提出HRM-Text模型，通过引入分层递归模型和新的训练方法，在减少计算资源消耗的同时实现了与大规模模型相当的性能，展示了高效预训练的可能性。

详情

AI中文摘要

当前大型语言模型的预训练范式依赖于巨大的计算资源和互联网级原始文本，这在基础研究中形成了显著的障碍。相比之下，生物系统通过多时间尺度处理实现高样本效率的学习，例如前额叶环路的功能组织。受此启发，我们引入了HRM-Text，它用分层递归模型（HRM）取代标准Transformer，将计算分解为慢速演变的战略层和快速演变的执行层。为了稳定这种深度递归进行语言建模，我们引入了MagicNorm和深度信用分配的预热。此外，我们不再使用标准的原始文本预训练，而是仅在指令-响应对上进行训练，使用任务完成目标和PrefixLM遮蔽。作为高效预训练的实证存在证明，一个仅用400亿个唯一词和1,500美元预算从头训练的10亿参数HRM-Text模型在MMLU上达到60.7%，在ARC-C上达到81.9%，在DROP上达到82.2%，在GSM8K上达到84.5%，在MATH上达到56.2%。尽管使用了比标准基线少100-900倍的训练词和96-432倍的估计计算，HRM-Text的性能与2-7B参数的开源模型相媲美。这些结果表明，协同设计架构和目标可以大幅降低计算到性能的比率，使从头开始的预训练对更广泛的研究社区具有可及性。

英文摘要

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

URL PDF HTML ☆

赞 0 踩 0

2605.20602 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

自我训练不使语言扁平化——它重构了它：表面标记增强而深层语法消失

Ming Liu

发表机构 * Amazon（亚马逊）

AI总结该研究通过实验发现自我训练过程并非使语言扁平化，而是重构了语言结构，表面标记增强而深层语法结构消失，并提出了结构性深度假说来解释这一现象。

Comments 19 pages (14 main + 5 appendix), 8 figures, 3 tables

详情

AI中文摘要

连续对语言模型自身输出进行自我训练通常被描述为一种扁平化过程：多样性下降，分布变窄，文本变得“更像自己”。我们提供了证据表明这种描述是不完整的。在对五个模型（GPT-2 124M，Pythia-410M，Pythia-1.4B，OPT-1.3B，Pythia-2.8B）进行十一代自我训练的过程中，语言并非均匀扁平化——它被重构了。表面标记（连贯词、缓和词、破折号）上升，而中层和深层语法结构（疑问句、插入语、被动语态、条件句）崩溃。我们正式将这种不对称崩溃定义为结构性深度假说（SDH）：语言特征的每一代衰减率主要由其结构性深度——它所需嵌套语法依赖的数量——决定，其次才由其生成零次输出频率决定。通过汇总五个模型中三个架构家族的17个特征面板（N=85），汇总的斯皮尔曼相关系数为rho=0.540（p < 10^{-6}；簇Bootstrap 95% CI [0.434, 0.634]），而频率是一个显著较弱的预测因子（rho=0.225）。一个匹配的人类文本微调对照实验得到rho=0.039（p=0.88），证实了该梯度是特定于自我训练的。我们进一步记录了一个表面复杂性悖论：总体复杂性代理（依赖树深度、TTR、词长）在底层从句结构消失时均上升，这对训练数据筛选和LLM文本检测有直接影响。

英文摘要

Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.

URL PDF HTML ☆

赞 0 踩 0

2605.20591 2026-05-21 cs.CL cs.CY 版本更新

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

有害吗？网络部署医疗大语言模型中的幻觉与作用层面滥用

Sunday Oyinlola Ogundoyin, Muhammad Ikram, Rahat Masood

发表机构 * The University of New South Wales, Sydney, Australia（新南威尔士大学，悉尼，澳大利亚）

AI总结本文研究了网络部署的医疗大语言模型中的幻觉和作用层面滥用问题，通过评估6233个MedGPT和10个开源LLM，发现25-30%的MedGPT事实准确性较低，33.6-54.3%违反操作阈值，57.06%的Action-enabled模型缺乏充分的隐私披露，揭示了系统性漏洞，强调了多指标评估和更强的安全保障的必要性。

详情

AI中文摘要

医疗大语言模型（LLMs），包括定制医疗GPT（MedGPTs）和开源模型，正越来越多地部署在网页平台上以提供临床指导。然而，它们存在幻觉、政策不合规和不安全设计的风险。我们对6,233个MedGPT进行了大规模评估，评估了1,500个分层样本以及10个开源LLM。我们引入了两个框架：MedGPT-HEval用于幻觉检测，以及一个基于LLM的流程用于评估违规行为和开发者意图。我们的结果表明，25-30%的MedGPT事实准确性较低，底层和中层模型风险最高；33.6-54.3%违反操作阈值，57.06%的Action-enabled模型缺乏充分的隐私披露。与开源模型相比，MedGPT在事实准确性和语义对齐方面表现更好，但开源模型更稳定。这些结果揭示了幻觉和合规性的系统性缺口，强调了多指标评估和更强的安全保障的必要性。我们发布了HAA-MedGPT，一个结构化数据集，支持未来关于网络面向医疗LLM安全性的研究。

英文摘要

Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.

URL PDF HTML ☆

赞 0 踩 0

2605.20588 2026-05-21 cs.CL cs.CV 版本更新

Direct Translation between Sign Languages

手语之间的直接翻译

Zetian Wu, Bowen Xie, Wuyang Meng, Milan Gautam, Stefan Lee, Liang Huang

发表机构 * Oregon State University（俄勒冈州立大学）

AI总结本文提出了一种直接的手语到手语翻译方法，通过使用回译技术生成合成的手语对，从而克服了传统级联方法中的误差传播和信息丢失问题，并在多个手语数据集上实现了更高的翻译质量和速度提升。

详情

AI中文摘要

手语翻译领域在手语与口语之间的翻译上取得了显著进展，但手语之间的翻译仍鲜为人知且难以实现。后者可以帮助15亿全球聋人和听力障碍者在语言障碍中交流，而无需依赖听力翻译者或书面语言能力。级联方法由单独的手语到文本、文本到文本和文本到手语系统组成，但存在误差传播、额外延迟以及视觉模态中独特信息的丢失。我们旨在开发直接的手语到手语翻译。然而，尚未有大规模的开放领域平行语料库在手语之间。为了实现直接的手语翻译，我们使用回译技术从不对齐的个体语言语音-手语语料库中生成合成的手语对。使用这些数据，我们联合训练了一个基于MBART的单一模型，用于文本到手语（T2S）和手语到手语（S2S）。在合成生成的美国手语（ASL）、中国手语（CSL）和德国手语（DGS）之间配对集上，我们的直接S2S方法在几何手语误差指标（20%更低的DTW对齐MPJPE）和翻译回句子后的语言匹配指标（50%高BLEU-4）上优于级联基线，同时实现了大约2.3倍的速度提升。在一小部分现有的跨语言手语数据上，我们发现我们的方法也实现了类似的改进。

英文摘要

The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique in the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text->sign (T2S) and sign->sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% high BLEU-4) while achieving a roughly 2.3* speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.

URL PDF HTML ☆

赞 0 踩 0

2605.20563 2026-05-21 cs.MA cs.AI cs.CL cs.LG cs.SE 版本更新

Multi-agent Collaboration with State Management

具有状态管理的多智能体协作

Mengyang Liu, Taozhi Chen, Zhenhua Xu, Xue Jiang, Yihong Dong

发表机构 * Shanghai Jiaotong University（上海交通大学）； Cortices AI ； Emory University（埃默里大学）； Peking University（北京大学）

AI总结本文提出STORM，一种面向多智能体协作的状态管理方法，通过在共享工作区中调解智能体的交互，确保每个智能体在一致的代码库视图上操作，并在写入时检测和解决冲突。STORM在多个LLM上优于基于git-worktree的多智能体基线，且在成本效率上具有竞争力，表明显式状态管理比工作区隔离更有效。

详情

AI中文摘要

近年来，多智能体系统在解决复杂任务方面展现出巨大潜力。然而，当多个智能体同时编辑共享代码库时，他们的更改可能会产生冲突，不一致的视图会导致集成失败。现有的多智能体系统通过工作区隔离（例如每个智能体一个git工作树）来解决这个问题，但这种方法将冲突解决推迟到事后合并步骤，恢复成本较高。在本文中，我们提出了STORM，即面向多智能体协作的状态管理（STate-ORiented Management）。具体而言，STORM通过调解智能体与共享工作区的交互来管理智能体状态，确保每个智能体都在代码库的一致视图上操作，并在写入时检测和解决冲突。我们评估了STORM在Commit0和PaperBench多个LLM上的表现。STORM在Commit0-Lite上比基于git-worktree的多智能体基线高出18.7%，在PaperBench上高出1.4%，同时在成本效率上具有竞争力或更好。结合单智能体运行，STORM在两个基准测试中分别达到87.6和78.2的最高分数，表明显式状态管理比工作区隔离更有效作为多智能体协作的基础。STORM也可以无缝地集成到任何多智能体系统中。

英文摘要

Recent advances in multi-agent systems have shown great potential for solving complex tasks. However, when multiple agents edit a shared codebase concurrently, their changes can silently conflict and inconsistent views lead to integration failures. Existing multi-agent systems address this through workspace isolation (e.g., one git worktree per agent), but this defers conflict resolution to a post-hoc merge step where recovery is expensive. In this paper, we propose STORM, i.e., STate-ORiented Management for multi-agent collaboration. Specifically, STORM manages agent states by mediating their interactions with the shared workspace, ensuring that each agent operates on a consistent view of the codebase and that conflicting edits are detected and resolved at write time. We evaluate STORM on Commit0 and PaperBench across multiple LLMs. STORM outperforms the git-worktree-based multi-agent baseline by +18.7 on Commit0-Lite and +1.4 on PaperBench, while achieving comparable or better cost efficiency. Combined with single-agent runs, STORM reaches highest scores of 87.6 and 78.2 on the two benchmarks respectively, suggesting that explicit state management is a more effective foundation for multi-agent collaboration than workspace isolation. STORM can also be plugged into any multi-agent system seamlessly.

URL PDF HTML ☆

赞 0 踩 0

2605.20052 2026-05-21 cs.CL cs.AI 版本更新

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

PromptRad: 基于知识的多标签提示微调用于低资源放射报告标注

Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao

发表机构 * Department of Artificial Intelligence and AI Research Center, Chang Gung University（人工智能系及AI研究中心，长庚大学）； Department of Radiology, Sijhih Cathay General Hospital（放射科，西吉医院）； Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital（医学影像与介入科，长庚纪念医院）； Department of Trauma and Emergency Surgery, Chang Gung Memorial Hospital（创伤与急诊外科，长庚纪念医院）； Department of Computer Science, National Tsing Hua University（计算机科学系，国立清华大学）

AI总结本文提出PromptRad，一种基于知识的多标签提示微调方法，用于在低资源环境下进行放射报告标注，通过引入UMLS元词典中的同义词增强类别表示，以更少的标注数据实现优于传统方法的性能。

Comments BioNLP 2026 @ ACL (camera-ready version)

详情

AI中文摘要

自动报告标注有助于从非结构化文本中识别临床发现，并为医学影像研究提供大规模注释。现有的基于规则的标注器难以处理临床报告中的多样化描述，而微调预训练语言模型（PLMs）需要大量标注数据，这些数据在临床环境中通常不可用。在本文中，我们提出PromptRad，一种基于知识的多标签提示微调方法，用于在低资源环境下进行放射报告标注。PromptRad将多标签分类重新表述为掩码语言建模，并将UMLS元词典中的同义词纳入多词提示器以丰富类别表示。通过微调PLM而不增加额外分类层，PromptRad所需的标注数据比传统微调要少得多。在肝CT报告上的实验表明，PromptRad在仅使用32个标注训练示例的情况下，优于基于词典和微调的基线方法，并且在使用远小模型的情况下，性能与GPT-4具有竞争力。进一步分析显示，PromptRad比现有方法更有效地捕捉复杂的否定模式，使其成为数据稀缺临床场景中报告标注的有希望的解决方案。我们的代码可在https://github.com/ila-lab/PromptRad上获得。

JoyAI-Image: 激活统一多模态理解和生成中的空间智能

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

发表机构 * Joy Future Academy, JD（joy未来学院，京东）

AI总结本文提出JoyAI-Image，一种统一的多模态基础模型，用于视觉理解、文本到图像生成和指令引导的图像编辑。该模型结合了空间增强的多模态大语言模型（MLLM）和多模态扩散Transformer（MMDiT），通过共享的多模态接口实现感知与生成的交互。构建可扩展的训练配方，结合统一指令微调、长文本渲染监督、空间 grounded 数据和通用及空间编辑信号，使模型具备广泛的多模态能力，同时增强几何感知推理和可控视觉合成。实验表明，JoyAI-Image在理解、生成、长文本渲染和编辑基准上达到最先进的性能。更重要的是，增强的理解、可控的空间编辑和新视角辅助推理之间的双向循环使模型超越一般视觉能力，向更强的空间智能发展。

Comments Code: https://github.com/jd-opensource/JoyAI-Image

详情

AI中文摘要

我们提出了JoyAI-Image，一种统一的多模态基础模型，用于视觉理解、文本到图像生成和指令引导的图像编辑。JoyAI-Image将空间增强的多模态大语言模型（MLLM）与多模态扩散Transformer（MMDiT）结合，允许感知和生成通过共享的多模态接口进行交互。围绕此架构，我们构建了一个可扩展的训练配方，结合了统一指令微调、长文本渲染监督、空间 grounded 数据以及通用和空间编辑信号。该设计使模型具备广泛的多模态能力，同时增强了几何感知推理和可控视觉合成。在理解、生成、长文本渲染和编辑基准上的实验表明，JoyAI-Image实现了最先进的或高度竞争的性能。更重要的是，增强的理解、可控的空间编辑和新视角辅助推理之间的双向循环使模型超越一般视觉能力，向更强的空间智能发展。这些结果表明，统一视觉模型在下游应用如视觉-语言-动作系统和世界模型中具有前景。

英文摘要

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

URL PDF HTML ☆

赞 0 踩 0

2604.26052 2026-05-21 cs.CL 版本更新

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

从提示风险到响应风险：大型语言模型安全行为的配对分析

Mengya Hu, Qiong Wei, Sandeep Atluri

发表机构 * Microsoft（微软）

AI总结本研究通过配对分析人类标注的提示和响应记录，探讨了大型语言模型在四个危害类别（性、自残、仇恨和暴力）和有序严重性级别上的安全行为，发现61%的响应比提示减少了危害，3%的响应升级了危害，并揭示了安全行为与无害性之间的权衡。

详情

AI中文摘要

大型语言模型的安全评估通常报告二元结果，如攻击成功率（ASR）、拒绝率或有害与安全分类，这些结果隐藏了提示与响应之间风险的变化。我们通过对四个危害类别（性、自残、仇恨和暴力）和有序严重性级别（安全、低、中、高）的人类标注提示和响应记录进行配对分析，发现61%的响应相对于提示减少了危害，36%的响应保持了严重性，3%的响应升级了危害。升级分为两种机制：良性提示触发未请求的有害细节，以及在更高严重性级别上保持任务的响应。类别分解显示，性内容在此样本中表现出最高的危害持续性，这由相同严重性级别的合规性驱动，而非来自良性输入的漂移。联合相关性分析揭示了有用性与无害性之间的权衡：合规性升级仍保持高度相关，而安全响应包含低相关性的通用拒绝。最后，少样本LLM评估者表现出提示/响应检测的不对称性，数据校准无法弥补这种不对称性。评估者提示可在https://github.com/microsoft/PairedSafety获取。

英文摘要

Safety evaluations of large language models (LLMs) typically report binary outcomes, i.e. attack success rate (ASR), refusal rate, or harmful versus safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human labeled prompt and response records across four harm categories (Sexual, Self harm, Hate and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a helpfulness versus harmlessness tradeoff: compliance escalations remain highly relevant, whereas safe responses include generic refusals with low relevance. Finally, few-shot LLM graders exhibit a prompt/response detection asymmetry that data calibration does not close. Grader prompts are shared at https://github.com/microsoft/PairedSafety.

URL PDF HTML ☆

赞 0 踩 0

2603.26539 2026-05-21 cs.CL cs.AI 版本更新

How Open Must Language Models be to Enable Reliable Scientific Inference?

语言模型必须多开放才能实现可靠的科学推断？

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； EleutherAI ； University of California San Diego（加州大学圣地亚哥分校）； Rutgers University-Newark（新泽西州立大学罗威特分校）； Stony Brook University（史泰森布魯克大學）

AI总结本文探讨了语言模型的开放程度如何影响基于其研究的科学推断可靠性，指出封闭模型通常不适合科学用途，并提出系统识别和缓解推断威胁的方法。

2603.23531 2026-05-21 cs.CL 版本更新

Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

大型语言模型通过目标-立场提取解构复杂的政治观点

Özgür Togay, Javier Garcia-Bernardo, Florian Kunneman, Anastasia Giachanou

发表机构 * Department of Methodology and Statistics, Utrecht University（方法论与统计学系，乌特雷赫特大学）； Department of Languages, Literature and Communication, Utrecht University（语言、文学与传播系，乌特雷赫特大学）

AI总结本文研究了大型语言模型是否能通过目标-立场提取任务解构复杂政治观点，通过构建包含138个不同政治目标的Reddit帖子数据集，评估了多种LLM在零样本、少样本和上下文增强提示策略下的表现，结果显示最佳模型表现接近高度训练的人类标注者，证明了LLM在最小监督下的复杂政治观点提取能力。

详情

AI中文摘要

政治极化源于对政策、人物和议题的复杂信念相互作用。然而，大多数计算分析将话语简化为粗粒度的党派标签，忽视了这些信念之间的互动。这在在线政治对话中尤为明显，这些对话通常具有细微差别且涵盖广泛主题，使自动识别讨论目标和对它们的观点变得困难。在本研究中，我们探讨了大型语言模型（LLMs）是否能通过目标-立场提取（TSE）任务解决这一挑战，TSE是一种结合目标识别和立场检测的自然语言处理任务，能够更细致地分析政治观点。为此，我们构建了一个包含1,084个Reddit帖子的数据集，来自r/NeutralPolitics，涵盖138个不同的政治目标，并使用零样本、少样本和上下文增强提示策略评估了一系列专有和开源的LLM。我们的结果表明，最佳模型在高度训练的人类标注者表现相当，并且在具有低标注者一致性挑战的帖子上仍保持稳健。这些发现表明，LLM能够以最小监督的方式提取复杂的政治观点，为计算社会科学和政治文本分析提供了可扩展的工具。

英文摘要

Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.

URL PDF HTML ☆

赞 0 踩 0

2603.06007 2026-05-21 cs.CL cs.AI cs.MA 版本更新

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

MASFactory: 一种基于图的框架，用于通过Vibe图谱编排基于大语言模型的多智能体系统

Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出MASFactory，一种基于图的框架，用于通过Vibe图谱编排基于大语言模型的多智能体系统，解决了现有框架在实现复杂图工作流时需要大量手动工作、重用性差和难以整合异构外部上下文源的问题。

Comments Accepted to the ACL 2026 Demo Track. Camera-ready version. 10 pages, 6 figures. Code and documentation are available at: https://github.com/BUPT-GAMMA/MASFactory

详情

AI中文摘要

基于大语言模型的（LLM-based）多智能体系统（MAS）越来越多地被用于通过角色专业化和协作扩展智能体问题解决。MAS工作流可以自然地建模为有向计算图，其中节点执行智能体或子工作流，边编码依赖性和消息传递。然而，目前框架在实现复杂图工作流时仍然需要大量的手动工作，提供有限的重用性，并使整合异构外部上下文源变得困难。为克服这些限制，我们提出了MASFactory，一种用于编排基于大语言模型的MAS的基于图的框架。它引入了Vibe图谱，一种人机交互的方法，将自然语言意图编译成可编辑的工作流规范，然后编译成可执行的图。此外，该框架提供了可重用的组件、技能支持、多模态消息处理和可插拔的上下文整合，以及用于拓扑预览、运行时跟踪和人机交互的可视化工具。我们在七个公开基准上评估了MASFactory，验证了代表性MAS方法的再生产一致性以及Vibe图谱的有效性。我们的代码（https://github.com/BUPT-GAMMA/MASFactory，Apache-2.0许可）和视频演示（https://youtu.be/ANynzVfY32k）均已公开。

英文摘要

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory, licensed under Apache-2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.

URL PDF HTML ☆

赞 0 踩 0

2602.19320 2026-05-21 cs.CL cs.AI 版本更新

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

代理记忆的解剖结构：评估和系统限制的分类与实证分析

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li

发表机构 * University of Texas at Dallas（德克萨斯大学达拉斯分校）； University of California Davis（加州大学戴维斯分校）； Texas A&M University（德克萨斯农工大学）

AI总结本文通过分类和实证分析，探讨了代理记忆系统的架构和系统限制，揭示了现有系统在基准饱和效应、指标有效性、模型依赖性和内存维护开销等方面的问题，并提出了更可靠的评估方法和可扩展的系统设计方向。

详情

AI中文摘要

代理记忆系统使大型语言模型（LLM）代理能够在长时间交互中保持状态，支持超出固定上下文窗口的长周期推理和个性化。尽管架构发展迅速，这些系统的实证基础仍脆弱：现有基准通常规模不足，评估指标与语义效用不一致，性能在基础模型上变化显著，且系统层面的成本常被忽视。本文从架构和系统角度对代理记忆进行了结构化分析。我们首先基于四种记忆结构介绍了MAG系统简要分类。然后，我们分析了限制当前系统的关键痛点，包括基准饱和效应、指标有效性和判断敏感性、基础模型依赖的准确性，以及内存维护引入的延迟和吞吐量开销。通过将内存结构与实证限制联系起来，本文阐明了当前代理记忆系统为何经常无法达到其理论承诺，并概述了更可靠评估和可扩展系统设计的方向。

英文摘要

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

URL PDF HTML ☆

赞 0 踩 0

2602.16608 2026-05-21 cs.CL cs.AI cs.CV cs.LG 版本更新

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

可解释的人工智能：面向Transformer模型的上下文感知分层集成梯度方法

Melkamu Abay Mersha, Jugal Kalita

发表机构 * College of Engineering and Applied Science, University of Colorado Colorado Springs（科罗拉多州立大学工程与应用科学学院）

AI总结本文提出了一种上下文感知分层集成梯度框架（CA-LIG），用于解释Transformer模型的决策过程，通过计算每个Transformer块内的分层集成梯度，并将这些token级属性与类特定的注意力梯度融合，从而生成具有符号和上下文敏感性的属性图，以捕捉支持和反对的证据，并追踪Transformer层中的相关性层次流动。

详情

DOI: 10.1016/j.neucom.2026.133050

AI中文摘要

Transformer模型在多个领域和任务中实现了最先进的性能，然而其深层表示使得预测难以解释。现有的可解释性方法依赖于最终层的属性，只能捕捉局部token级属性或全局注意力模式，缺乏对token间依赖关系和结构组件的上下文感知能力。它们还无法捕捉相关性如何在层之间演变以及结构组件如何影响决策。为了解决这些限制，我们提出了上下文感知分层集成梯度（CA-LIG）框架，一种统一的层次属性框架，该框架在每个Transformer块内计算分层集成梯度，并将这些token级属性与类特定的注意力梯度融合。这种整合产生了带有符号和上下文敏感性的属性图，能够捕捉支持和反对的证据，同时追踪Transformer层中的相关性层次流动。我们评估了CA-LIG框架在多样化的任务、领域和Transformer模型家族中的表现，包括使用BERT进行情感分析和长多类文档分类，使用XLM-R和AfroLM在低资源语言设置中进行仇恨言论检测，以及使用Masked Autoencoder Vision Transformer模型进行图像分类。在所有任务和架构中，CA-LIG提供了更忠实的属性，显示出对上下文依赖的更强敏感性，并产生了更清晰、更语义连贯的可视化结果，优于现有可解释性方法。这些结果表明，CA-LIG提供了更全面、上下文感知和可靠的Transformer决策解释，推动了深度神经网络的实用可解释性和概念理解。

英文摘要

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework}, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

URL PDF HTML ☆

赞 0 踩 0

2602.08819 2026-05-21 cs.LG cs.CL 版本更新

Bayesian Preference Learning for Test-Time Steerable Reward Models

基于测试时间可调节的贝叶斯偏好学习的奖励模型

Jiwoo Hong, Shao Tang, Zhipeng Wang

发表机构 * LinkedIn Corporation（LinkedIn公司）； Nubank

AI总结本文提出了一种新的贝叶斯奖励建模目标，即变分上下文奖励建模（ICRM），通过上下文偏好演示实现测试时间可调节性，从而适应未见过的偏好分布，提高了奖励模型的准确性和鲁棒性。

Comments Preprint

详情

AI中文摘要

奖励模型在通过强化学习（RL）对语言模型与人类偏好对齐中起核心作用。随着RL越来越多地应用于可验证奖励和多目标对齐等场景，RMs被期望编码更复杂和多维的偏好分布。然而，分类RMs一旦训练完成就保持静态，限制了测试时间的适应性。我们提出变分上下文奖励建模（ICRM），一种新颖的贝叶斯奖励建模目标，通过上下文偏好演示实现测试时间可调节性。ICRM将奖励建模视为在Bradley-Terry模型下对潜在偏好概率的变分推断，使用共轭Beta先验。我们证明ICRM能够适应单目标和多目标设置中的未见过的偏好分布。随着更多演示，ICRM在RM-Bench上的准确性从60.5提高到70.8，在道德困境偏好上比生成判断者具有更低的校准误差，并在冲突偏好下扩展了可达到的帕累托前沿。我们进一步研究了ICRM在RL训练中的实际适用性，证明其可以通过在数学推理中优于传统RM来有效编码可验证奖励。最后，我们提供了理论保证，变分目标在有限置信度下具有全局内部最优解，并分析了KL正则化如何缓解奖励过度优化。

英文摘要

Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more demonstrations, ICRM improves RM-Bench accuracy from 60.5 to 70.8, achieves lower calibration error than a generative judge on moral dilemma preferences, and expands the attainable Pareto frontier under conflicting preferences. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.

URL PDF HTML ☆

赞 0 踩 0

2602.08028 2026-05-21 cs.CL cs.AI 版本更新

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

偏离以诱导提示：多理性诱导用于零样本推理

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University, Taiwan（国家台湾大学计算机科学与信息工程系）； Institute of Information Science, Academia Sinica, Taiwan（学术院信息科学研究所）； AI Research Center (AINTU), National Taiwan University, Taiwan（国家台湾大学人工智能研究中心（AINTU））

AI总结本研究提出DIP框架，通过生成多个多样化的高层理由并诱导最终计划，以提升零样本推理的准确性，克服了传统链式思考提示中推理路径不稳定的问题。

Comments Accepted to Findings of IJCNLP-AACL 2025

2601.05877 2026-05-21 cs.CL 版本更新

iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

iReasoner: 一种面向轨迹的内在推理监督方法，用于自演化的大多模态模型

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha

发表机构 * Vellore Institute of Technology（韦洛雷理工学院）； School of Information and Data Sciences（信息与数据科学学院）； Nagasaki University（长崎大学）； Loughborough University（洛桑大学）

AI总结本文提出iReasoner，一种自演化框架，通过显式引导推理链和奖励内部一致性来提升大模型的隐式推理能力，在无监督设置下实现了多模态推理基准的性能提升。

Comments ACL 2026 (Findings)

详情

AI中文摘要

最近的研究表明，大多模态模型（LMMs）可以通过自博弈和内在反馈从无标签数据中自我改进。然而，现有的自演化框架主要奖励最终结果，尽管中间推理对于视觉基础决策至关重要，但其约束较弱。我们提出iReasoner，一种自演化框架，通过显式引导推理链（CoT）并奖励其内部一致性来提升LMM的隐式推理能力。在无标签图像上进行Proposer--Solver循环时，iReasoner通过在中间推理步骤上定义轨迹感知信号，增强结果层面的内在奖励，提供无需真实标签或外部评判的学习信号，以区分导向相同答案的不同推理路径。从Qwen2.5-VL-7B开始，iReasoner在完全无监督的后训练阶段，在多样化的多模态推理基准上实现了最高+2.1分的提升。我们希望这项工作能成为在纯无监督设置中实现推理感知自改进的LMMs的起点。我们的代码可在https://meghanaasunil.github.io/iReasoner上获取。

英文摘要

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer--Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to $+2.1$ points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings. Our code is available at https://meghanaasunil.github.io/iReasoner.

URL PDF HTML ☆

赞 0 踩 0

2601.03135 2026-05-21 cs.CL 版本更新

Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

通过合成数据和语言特定预处理改进原住民语言机器翻译

Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, Daisy Zhe Wang

发表机构 * University of Florida（佛罗里达大学）

AI总结本研究通过合成数据生成和语言特定预处理方法，改进低资源原住民语言的神经机器翻译效果，实验显示合成数据增强对翻译质量有积极影响，但通用预处理在高度屈折语言中存在局限。

详情

DOI: 10.18653/v1/2026.loresmt-1

AI中文摘要

低资源原住民语言往往缺乏用于有效神经机器翻译（NMT）所需的平行语料库。合成数据生成为数据稀缺环境提供了一种实用策略。在本工作中，我们通过使用高容量多语言翻译模型生成合成句子对，扩充美洲原住民语言的精选平行语料库。我们对多语言mBART模型进行微调，使用curated-only和合成增强的数据，并通过chrF++评估翻译质量，该指标是最近美洲NLP共享任务中用于屈折语言的主要指标。我们进一步应用语言特定的预处理，包括正字法标准化和噪声感知过滤，以减少语料库中的伪影。在瓜拉尼-西班牙语和克丘亚-西班牙语翻译实验中，合成数据增强显示出一致的chrF++提升，而对艾马拉语的诊断实验则揭示了通用预处理在高度屈折语言中的局限性。

英文摘要

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.

URL PDF HTML ☆

赞 0 踩 0

2601.03019 2026-05-21 q-bio.GN cs.CL 版本更新

DNACHUNKER: Learnable Tokenization for DNA Language Models

DNACHUNKER: 用于DNA语言模型的可学习分词

Taewon Kim, Jihwan Shin, Hyomin Kim, Youngmok Jung, Jonghoon Lee, Won-Chul Lee, Sungsoo Ahn, Insu Han

发表机构 * Graduate School of AI, KAIST（韩国科学技术院人工智能研究生院）； Department of Electrical Engineering, KAIST（韩国科学技术院电气工程系）； Inocras Korea Inc.（Inocras韩国公司）

AI总结本文提出DNACHUNKER，一种可学习的DNA分词方法，通过动态分段模块生成上下文依赖的变长单元，提升DNA语言模型在基因组序列处理中的鲁棒性和效率。

Comments ICML 2026 camera-ready version

详情

AI中文摘要

DNA语言模型越来越多地用于表示基因组序列，但其有效性严重依赖于原始核苷酸如何转换为模型输入。与自然语言不同，DNA没有标准的边界，使固定分词成为在移位、插入缺失和局部重复下脆弱的设计选择。我们引入了DNAChunker，一种带有可学习自适应分段模块的遮蔽DNA语言模型，以生成上下文依赖、变长的单元。基于动态分段过程，DNAChunker学会在功能丰富区域分配更细的粒度，同时压缩重复或冗余序列。我们预训练DNAChunker在人类参考基因组上，并在五个基准上评估其性能，结果在强固定分词基线中一致提升。进一步的分析和消融实验表明，与固定分词不同，分段是以生物信息学指导、对突变具有鲁棒性的学习方式进行的。

英文摘要

DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNAChunker, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNAChunker learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pretrain DNAChunker on the human reference genome and evaluate it across five benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that unlike fixed tokenizations, segmentation is learned in a biologically-informed, mutation-resilient manner.

URL PDF HTML ☆

赞 0 踩 0

2512.14896 2026-05-21 cs.CL cs.AI 版本更新

DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

DrugRAG: 通过一种新颖的检索增强生成流水线提升药学LLM性能

Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Fatemeh Latifi, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Babak Khalaj, Mohammad Hossein Rohban, Glolamali Aminian, Zohreh Amoozgar, Tahereh Javaheri

发表机构 * Department of Medicinal Chemistry, Faculty of Pharmacy, Tehran University of Medical Sciences（药学系，泰赫兰医科大学）； Department of Computer Sciences, Faculty of Mathematics and Computer Sciences, Amir Kabir University of Technology（计算机科学系，阿米尔·卡比尔技术大学）； Department of Mathematical Sciences, Sharif University of Technology（数学科学系，沙菲克技术大学）； Department of Computer Sciences, Missouri University of Science and Technology（计算机科学系，密苏里科学与技术大学）； Department of Computer Engineering, Sharif University of Technology（计算机工程系，沙菲克技术大学）； Department of Faculty of Interdisciplinary Science and Technology, Tarbiat Modares University（跨学科科学与技术学院，塔里亚特莫达res大学）； Electronics Research Institute, Sharif University of Technology（电子研究所，沙菲克技术大学）； Department of Electrical Engineering, Sharif University of Technology（电气工程系，沙菲克技术大学）； The Alan Turing Institute, London, United Kingdom（艾伦·图灵研究所，伦敦，英国）； Department of Radiation Oncology, Massachusetts General Hospital & Harvard Medical School（放射肿瘤科，麻省总医院及哈佛医学院）； Health Informatics Lab, Metropolitan College, Boston University（健康信息学实验室，波士顿大学）

AI总结本研究评估了大型语言模型在药学执业资格问答任务中的性能，并开发了一种外部知识整合方法以提高准确性，通过DrugRAG流水线整合结构化药物知识，从而提升药学相关问答任务的LLM性能。

Comments 14 pages, 2 figures, 2 tables. The revised version includes McNemar's paired statistical analysis, Wilson confidence intervals, expanded methodological clarifications, a revised discussion of evidence retrieval, improved reproducibility details, and updated limitations

详情

AI中文摘要

在本研究中，我们评估了大型语言模型（LLM）在药学执业资格问答任务中的性能，并开发了一种外部知识整合方法以提高准确性。我们使用一个包含141个问题的药学数据集，对十个参数规模不同的LLM（8十亿到70十亿以上）进行了基准测试，测量了基线准确性。基线性能范围从46%到92%，其中GPT-5（92%）和o3（89%）取得了最高分数，而较小的开源模型表现显著较低。然后，我们开发了DrugRAG，一种三步检索增强生成（RAG）流水线，该流水线检索结构化、基于证据的药物信息，并将上下文药理学证据添加到模型提示中，该流水线在模型架构或参数无需更改的情况下外部运行。DrugRAG在所有五个评估模型上均提高了准确性，提升幅度范围从7到21个百分点（例如，Gemma 3 27B：61.0%到71%，Llama 3.1 8B：46%到67%）。McNemar分析显示，这些改进在较小和中等规模的开源模型中具有统计学显著性。这些发现表明，通过DrugRAG整合结构化外部药物知识可以提高LLM在药学相关问答任务中的性能，而无需修改底层模型，为提升基于证据的药学相关AI应用提供了实用的流水线。

英文摘要

In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61.0% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.

URL PDF HTML ☆

赞 0 踩 0

2511.01482 2026-05-21 cs.CL 版本更新

Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

迈向认知扭曲一致检测：基于大语言模型的标注与数据集无关评估

Neha Sharma, Navneet Agarwal, Kairit Sirts

发表机构 * LREC-2026（LREC-2026会议）

AI总结本文探讨了利用大语言模型作为一致且可靠的标注器进行认知扭曲检测的方法，并提出了一种数据集无关的评估框架，以公平比较不同数据集训练的模型，结果显示GPT-4能产生一致的标注，提升了模型在主观NLP任务中的表现。

详情

DOI: 10.63317/4joqxtgdgxq2
Journal ref: https://lrec.elra.info/lrec2026-main-851

AI中文摘要

基于文本的自动化认知扭曲检测是一项具有挑战性的任务，由于其主观性质，即使在专家人类标注者之间也观察到低一致性分数，导致不可靠的标注。我们探索了使用大型语言模型（LLMs）作为一致且可靠的标注器，并提出多个独立的LLM运行可以揭示稳定的标注模式，尽管任务本身具有内在的主观性。此外，为了公平比较训练于不同特征数据集上的模型，我们引入了一种使用Cohen's kappa作为效应大小度量的数据集无关评估框架。该方法允许在传统指标如F1分数不足的情况下进行公平的跨数据集和跨研究比较。我们的结果表明，GPT-4可以产生一致的标注（Fleiss's Kappa = 0.78），从而在使用这些标注训练的模型在测试集上的表现优于使用人类标注数据训练的模型。我们的发现表明，LLMs可以为生成支持强大下游性能的主观NLP任务的训练数据提供可扩展且内部一致的替代方案。

英文摘要

Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.

URL PDF HTML ☆

赞 0 踩 0

2510.23538 2026-05-21 cs.AI cs.CL cs.CV cs.SE 版本更新

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

JanusCoder: 向代码智能的视觉-程序化界面迈进

Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan

发表机构 * The University of Hong Kong（香港大学）； Shanghai AI Laboratory（上海人工智能实验室）； Nanjing University（南京大学）； Carnegie Mellon University（卡内基梅隆大学）； Shanghai Innovation Institute（上海创新研究院）

AI总结本文提出JanusCoder，一种面向代码智能的视觉-程序化界面，通过构建大规模多模态代码数据集和统一模型，实现从文本指令、视觉输入或两者结合生成代码，展示了其在文本和视觉编程任务中的优越性能。

Comments ICLR 2026 Camera Ready Version, with code and data available

详情

AI中文摘要

神经代码智能的范围正在迅速扩展，从基于文本的源代码扩展到程序生成的丰富视觉输出。这种视觉维度对于高级应用如灵活的内容生成和精确的可视化程序驱动编辑至关重要。然而，进展受到高质量多模态代码数据稀缺的阻碍，这源于合成和质量评估的挑战。为了解决这些挑战，我们从数据和建模的角度做出贡献。我们首先引入了一个完整的合成工具包，利用数据模态之间的相互协同效应，高效地生成涵盖标准图表到复杂交互式网页UI和代码驱动动画的大型高质量语料库。利用该工具包，我们构建了JanusCode-800K，目前最大的多模态代码语料库。这推动了我们模型JanusCoder和JanusCoderV的训练，建立了从文本指令、视觉输入或两者结合生成代码的视觉-程序化界面。我们的统一模型不同于现有方法，后者为孤立任务构建专门模型。在文本导向和视觉导向的编程任务上的大量实验表明，JanusCoder系列的性能优越，我们的7B到14B规模模型接近甚至超过了商业模型的性能。此外，广泛的分析提供了将程序逻辑与其视觉表达和谐统一的关键见解。我们的代码和检查点可在https://github.com/InternLM/JanusCoder上获得。

英文摘要

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.

URL PDF HTML ☆

赞 0 踩 0

2510.08482 2026-05-21 cs.CV cs.CL 版本更新

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

视觉象征性挑战：在手语形式-意义映射上评估视觉-语言模型

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

发表机构 * Multimodal Language Department（多模态语言部门）； Max Planck Institute for Psycholinguistics（马克斯·普朗克心理语言学研究所）； Department of Linguistics（语言学系）； Boğaziçi University（博多伊奇大学）； Donders Institute for Brain Cognition and Behaviour（多纳尔斯脑认知与行为研究所）； Radboud University（拉德堡德大学）； Department of Linguistics and Communication（语言学与沟通系）； University of Birmingham（伯明翰大学）

AI总结本文提出一个新颖的视频基准测试，用于评估视觉-语言模型在手语形式-意义映射上的表现，通过心理语言学测量来评估三种任务：语音学手语形式预测、透明度和渐进象征性评分，并发现模型在语音形式预测上表现较好但整体仍低于人类表现。

详情

AI中文摘要

象征性，即语言形式与意义之间的相似性，在手语中普遍存在，为视觉 grounding 提供了自然的测试环境。对于视觉-语言模型（VLMs），挑战在于从动态的人类运动中恢复这种本质的映射，而非静态上下文。我们引入了视觉象征性挑战，一个新颖的基于视频的基准测试，将心理语言学测量适应于评估 VLMs 在三个任务上的表现：（i）语音学手语形式预测（例如，手形、位置），（ii）透明度（从视觉形式推断意义），以及（iii）渐进象征性评分。我们评估了13种最先进的VLMs在零样本和少样本设置下在荷兰手语上的表现，并将其与人类基线进行比较。在语音形式预测上，VLMs恢复了一些手形和位置细节，但表现仍低于人类；在透明度上，它们与人类基线相差甚远；只有顶级模型与人类象征性评分有中等相关性。有趣的是，语音形式预测能力更强的模型更能与人类象征性判断相关联，表明它们对视觉基础结构有共同的敏感性。我们的发现验证了这些诊断任务，并推动了以人类为中心的信号和具身学习方法，用于建模象征性和改进多模态模型中的视觉 grounding。

英文摘要

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

URL PDF HTML ☆

赞 0 踩 0

2509.08010 2026-05-21 cs.CY cs.AI cs.CL cs.HC 版本更新

Measuring and mitigating overreliance to build human-compatible AI

测量和缓解过度依赖以构建人类兼容的AI

Lujain Ibrahim, Katherine M. Collins, Sunnie S. Y. Kim, Anka Reuel, Max Lamparth, Kevin Feng, Lama Ahmad, Prajna Soni, Alia El Kattan, Merlin Stein, Siddharth Swaroop, Vishakh Padmakumar, Ilia Sucholutsky, Andrew Strait, Diyi Yang, Q. Vera Liao, Umang Bhatt

AI总结本文研究了大型语言模型过度依赖的风险，探讨了测量和缓解过度依赖的方法，以确保AI能增强而非削弱人类能力。

详情

AI中文摘要

大型语言模型（LLMs）通过作为协作的『思想伙伴』而区别于先前的技术，能够在多种任务上更流畅地进行自然语言交互。随着LLMs在医疗、个人建议等不同领域中日益影响关键决策，过度依赖LLMs的风险也随之增加。本文认为，测量和缓解过度依赖必须成为LLMs研究和部署的核心。首先，我们汇总了个体和社会层面的过度依赖风险，包括高风险错误、治理挑战和认知退化。然后，我们探讨了LLMs的特点、系统设计特征和用户认知偏见，这些因素共同引发了关于实际中过度依赖LLMs的严重且独特的问题。我们还审查了历史上的过度依赖测量方法，识别出三个重要的差距，并提出三个有前景的方向来改进测量。最后，我们提出了可以采取的缓解策略，以确保LLMs增强而非削弱人类能力。

英文摘要

Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative ``thought partners,'' capable of engaging more fluidly in natural language on a range of tasks. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance -- relying on LLMs beyond their capabilities -- grows. This paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that together raise serious and unique concerns about overreliance on LLMs in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that can be pursued to ensure LLMs augment rather than undermine human capabilities.

URL PDF HTML ☆

赞 0 踩 0

2509.03526 2026-05-21 cs.CL eess.AS 版本更新

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

通过强化行为对齐增强语音大语言模型

Yansong Liu, Jiateng Li, Yuan Liu

发表机构 * Xi’an Jiaotong-Liverpool University（西安交通大学利物浦大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结本文提出了一种名为强化行为对齐（RBA）的框架，通过自合成方法生成高质量对齐数据来提升语音大语言模型（SpeechLMs）的语言生成能力，实验表明该方法显著提升了SpeechLMs的指令遵循能力，并可扩展至语音问答和语音转文本翻译等任务。

详情

AI中文摘要

近年来，大型语言模型（LLMs）的进展引发了对扩展其语言能力至其他模态的广泛关注，从而催生了能够处理语音或文本格式用户请求的语音基础语言模型（SpeechLMs）。然而，由于跨模态差异，这些SpeechLMs在指令遵循方面仍显著逊色于文本基础的LLMs，特别是在面对用户语音的动态和多变性质时。为了解决这一挑战，本文提出了一种名为强化行为对齐（RBA）的框架，旨在增强SpeechLMs的语言生成能力。RBA不依赖于监督微调的人类标注，而是通过强大的教师LLM生成大量高质量的对齐数据。然后，通过强化学习方法将SpeechLMs的行为与教师模型的行为对齐。实验结果表明，该方法有效提升了SpeechLMs的指令遵循能力，使其优于传统蒸馏基线。关键的是，我们证明RBA可以无缝扩展到包括语音问答和语音到文本翻译在内的任务，在仅使用自生成数据的情况下，在开放基准上取得了最先进的性能。

英文摘要

The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which leads to emergence of speech-based LLMs (SpeechLMs) with capability of processing user request in either speech or textual formats. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data by a powerful teacher LLM. Then SpeechLMs is aligned its behavior with that of a teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs that outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks such including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

URL PDF HTML ☆

赞 0 踩 0

2508.16453 2026-05-21 cs.SI cs.CL cs.LG 版本更新

Anti-establishment sentiment on TikTok: Implications for understanding influence(rs) and expertise on social media

TikTok上的反 Establishment 情绪：对社交媒体中影响者和专业知识理解的启示

Tianliang Xu, Ariel Hasell, Sabina Tomkins

发表机构 * GitHub

AI总结本文研究了TikTok上反 Establishment 情绪的普遍性，通过计算方法分析了金融、健康和阴谋论等主题内容中反 Establishment 情绪的分布，并探讨了社交媒体环境中反 Establishment 情绪对用户参与和平台激励的影响。

Comments 10 pages excluding references; 14 pages in total; 4 figures; Accepted by the AAAI Conference on Web and Social Media (ICWSM-2026)

详情

AI中文摘要

对公共服务机构的不信任和反 Establishment 观点正在上升（尤其是在美国）。随着人们转向社交媒体获取信息，有必要了解社交媒体环境是否以及如何促进对机构的不信任。在社交媒体中，内容创作者、影响者和其他意见领袖往往将自己定位为在健康、政治等众多话题上具有专业知识和权威性，并在许多情况下贬低和否定机构专业知识以建立追随者并增加自身可见性。然而，这种内容的普及程度以及此类内容是否增加参与度仍不清楚。本研究分析了TikTok平台上反 Establishment 情绪（AES）的普遍性。尽管TikTok作为信息来源非常流行，但其仍然相对研究较少，可能为人们如何形成对机构态度提供重要见解。我们采用计算方法，对TikTok帖子进行标注，判断其是否包含AES，涵盖内容创作者通常定位为专家的主题领域：金融和健康。作为比较，我们还考虑了阴谋论主题，其中AES预期较为常见。我们发现，AES在阴谋论内容中最为普遍，而在其他两个主题的内容中相对罕见。然而，我们发现与此类内容的参与模式因领域而异，并且可能存在平台激励用户发布表达反 Establishment 情绪的内容。

英文摘要

Distrust of public serving institutions and anti-establishment views are on the rise (especially in the U.S.). As people turn to social media for information, it is imperative to understand whether and how social media environments may be contributing to distrust of institutions. In social media, content creators, influencers, and other opinion leaders often position themselves as having expertise and authority on a range of topics from health to politics, and in many cases devalue and dismiss institutional expertise to build a following and increase their own visibility. However, the extent to which this content appears and whether such content increases engagement is unclear. This study analyzes the prevalence of anti-establishment sentiment (AES) on the social media platform TikTok. Despite its popularity as a source of information, TikTok remains relatively understudied and may provide important insights into how people form attitudes towards institutions. We employ a computational approach to label TikTok posts as containing AES or not across topical domains where content creators tend to frame themselves as experts: finance and wellness. As a comparison, we also consider the topic of conspiracy theories, where AES is expected to be common. We find that AES is most prevalent in conspiracy theory content, and relatively rare in content related to the other two topics. However, we find that engagement patterns with such content varies by area, and that there may be platform incentives for users to post content that expresses anti-establishment sentiment.

URL PDF HTML ☆

赞 0 踩 0

2508.09001 2026-05-21 cs.CL cs.AI cs.LG 版本更新

大语言模型像医生一样分诊吗？对外科会诊的动态研究

Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng, Ziniu Li, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang

发表机构 * Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Bournemouth University（伯恩茅斯大学）； National Health Data Institute, Shenzhen（深圳国家健康数据研究院）； Shenzhen Research Institute of Big Data（深圳大数据研究院）

AI总结本文研究了大语言模型在动态分诊过程中的表现，发现其在动态场景中通过有效提问减少不确定性，优于传统分类器，但静态场景下优势有限。

详情

AI中文摘要

门诊会诊（OR）是一种核心临床流程，将患者分配到医院部门，在信息不完整且不断演变的情况下进行，但通常被简化为静态分类问题，尽管实际上是交互性的。在本工作中，我们将门诊会诊视为由信息获取和不确定性降低驱动的动态过程。我们分析了基于固定患者信息的静态场景和涉及多轮对话的动态场景，以测试大语言模型（LLMs）是否通过更好的预测或更有效的提问来改善分诊结果。我们的发现表明，LLMs在静态分诊准确性上对传统分类器几乎没有优势，但在动态设置中始终优于它们，通过询问具有辨别性的后续问题来减少候选部门的不确定性。这些结果表明，大语言模型在门诊分诊中的主要价值不在于静态预测，而在于支持交互式、具有不确定性的临床决策。

英文摘要

Outpatient referral (OR) is a core clinical workflow that assigns patients to hospital departments under incomplete and evolving information, yet it is commonly simplified as a static classification problem despite being inherently interactive in practice. In this work, we study outpatient referral as a dynamic process driven by information acquisition and uncertainty reduction. We analyze both static scenarios based on fixed patient information and dynamic scenarios involving multi-turn dialogue, to test whether large language models (LLMs) improve referral outcomes through better prediction or more effective questioning. Our findings show that LLMs offer limited advantages over traditional classifiers in static referral accuracy, but consistently outperform them in dynamic settings by asking discriminative follow-up questions that reduce uncertainty over candidate departments. These results suggest that the primary value of LLMs in outpatient referral lies not in static prediction, but in supporting interactive, uncertainty-aware clinical decision-making.

URL PDF HTML ☆

赞 0 踩 0

2502.18915 2026-05-21 cs.CL cs.AI 版本更新

END: Early Noise Dropping for Efficient and Effective Context Denoising

END：早期噪声丢弃以实现高效有效的上下文去噪

Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Fangran Mo, Jinghan Zhang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin

发表机构 * Amazon（亚马逊）

AI总结本文提出END方法，通过在早期层对输入序列进行分割和线性探针，有效识别并丢弃噪声部分，从而提升LLM在不同任务上的性能和效率，同时加深了对LLM内部上下文推理机制的理解。

详情

AI中文摘要

大型语言模型（LLMs）在广泛自然语言处理任务中表现出色，但它们经常受到输入序列中无关或噪声内容的干扰，从而降低输出质量。这个问题影响了长上下文和短上下文场景，如检索增强生成、表格问答和上下文学习。我们发现LLMs可以在生成令牌之前，在早期层中隐式地识别输入序列中是否有有用信息。基于这一见解，我们引入了早期噪声丢弃（END），一种无需微调LLMs的新方法，以缓解此问题。END将输入序列分成块，并在LLMs的早期层上使用线性探针来区分信息丰富和噪声块。通过在过程中早期丢弃噪声块，END保留了关键信息，减少了干扰，并降低了计算开销。广泛的实验表明，END在不同LLMs上多个评估数据集上显著提高了性能和效率。此外，通过探针研究LLMs对输入的隐式理解，这项工作也加深了对LLMs如何内部进行上下文推理的理解。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (\textsc{END}), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. \textsc{END} segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, \textsc{END} preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that \textsc{END} significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding to the input with the prober, this work also deepens understanding of how LLMs do reasoning with contexts internally.

URL PDF HTML ☆

赞 0 踩 0

2502.12120 2026-05-21 cs.LG cs.AI cs.CL 版本更新

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

LLMs on the Line: 数据决定损失-损失缩放定律

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

发表机构 * Max Planck Institute for Intelligent Systems（智能系统马克斯·普朗克研究所）； ELLIS Institute Tübingen（图宾根ELLIS研究所）； Tübingen AI Center（图宾根人工智能中心）； University of Tübingen（图宾根大学）

AI总结研究探讨了影响LLM损失-损失缩放定律的主要因素，发现预训练数据决定了缩放趋势，而模型大小、优化超参数、分词器和架构差异对缩放影响有限，因此应精心选择预训练数据以获得最佳下游性能。

Comments ICML 2025 camera-ready version

详情

AI中文摘要

缩放定律指导大型语言模型（LLMs）的发展，通过提供模型大小、令牌和计算量之间的最佳平衡估计。最近，损失-损失缩放定律，即预训练数据集和下游任务之间损失的关系，已成为理解并改进LLM性能和泛化能力的强大工具。在本工作中，我们研究了哪些因素最强烈地影响损失-损失缩放。我们的实验发现，预训练数据决定了缩放趋势。相比之下，模型大小、优化超参数、分词器甚至显著的架构差异，如基于Transformer的模型如Llama和状态空间模型如Mamba之间的差异，通常影响有限。因此，从业者应仔细选择适合的预训练数据集以获得最佳下游性能，而架构和其他设置可以自由优化以提高训练效率。

英文摘要

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

URL PDF HTML ☆

赞 0 踩 0

2501.02407 2026-05-21 cs.CL cs.CR cs.LG 版本更新

从文本自动学习事故前兆

Henrietta Baker, Matthew R. Hallowell, Antoine J. -P. Tixier

发表机构 * University of Edinburgh, UK（爱丁堡大学，英国）； University of Colorado at Boulder, USA（科罗拉多大学博尔德分校，美国）

AI总结本文研究了如何利用建设行业数字记录的安全报告，通过比较几种深度学习方法自动学习事故前兆，以提高对安全事故的理解和学习能力。

Comments Added author contributions and journal reference, updated corresponding author

详情

Journal ref: Automation in Construction 118 (2020): 103145

AI中文摘要

鉴于建设行业数字记录的安全报告日益增多，开发方法利用这些数据以提高对安全事故的理解和学习能力变得重要。在本研究中，我们比较了几种自动从原始建设事故报告中学习事故前兆的方法。更具体地说，我们尝试了两种最先进的自然语言处理（NLP）深度学习架构，即卷积神经网络（CNN）和分层注意力网络（HAN），以及已建立的词频-反文档频率表示（TF-IDF）+支持向量机（SVM）方法。对于每种模型，我们提供了一种方法，在训练后识别出平均最能预测每种安全结果的文本模式。我们显示，在这些文本中可以找到有效的事故前兆。所提出的方法也可以让用户可视化和理解模型的预测。

英文摘要

In light of the increasing availability of digitally recorded safety reports in the construction industry, it is important to develop methods to exploit these data to improve our understanding of safety incidents and ability to learn from them. In this study, we compare several approaches to automatically learn injury precursors from raw construction accident reports. More precisely, we experiment with two state-of-the-art deep learning architectures for Natural Language Processing (NLP), Convolutional Neural Networks (CNN) and Hierarchical Attention Networks (HAN), and with the established Term Frequency - Inverse Document Frequency representation (TF-IDF) + Support Vector Machine (SVM) approach. For each model, we provide a method to identify (after training) the textual patterns that are, on average, the most predictive of each safety outcome. We show that among those pieces of text, valid injury precursors can be found. The proposed methods can also be used by the user to visualize and understand the models' predictions.

URL PDF HTML ☆

赞 0 踩 0

2605.20537 2026-05-21 cs.CL 版本更新

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Biomedical NER和实体链接基准测试测量什么？一种以语料库为中心的诊断框架

Robert Leaman, Rezarta Islamaj, Zhiyong Lu

发表机构 * National Library of Medicine（国家医学图书馆）

AI总结本文提出一种以语料库为中心的诊断框架，用于分析生物医学NER和实体链接基准测试的相关属性，揭示语料库特性对评估信号、泛化需求和基准测试结论范围的影响。

Comments Accepted to the ACL 25th Workshop on Biomedical Language Processing

详情

AI中文摘要

生物医学命名实体识别（NER）和实体链接（EL）高度依赖于标注语料库，但这些资源用于基准测试的效用往往被假设而非明确描述。我们提出一种以语料库为中心的框架，直接从语料库注释、概念链接、训练-测试分割、文档元数据和术语映射中诊断与基准测试相关的属性。该框架将标准化统计分为五类：（1）规模、密度和标签分布，（2）词汇和概念结构，（3）训练-测试重叠，（4）元数据组成，以及（5）术语覆盖（如适用）。对九个涵盖疾病、化学物质和细胞类型的语料库应用该框架，发现即使针对相同明显任务，语料库属性也可能存在显著差异。我们发现它们提供的评估信号、施加的泛化需求、允许的训练-测试重用程度以及所代表的生物医学文献和概念空间区域也存在差异。这些差异表明，通常报告的语料库统计可能不足以描述生物医学NER和EL基准测试所评估的内容。我们主张语料库中心的诊断提供了一种实用框架，用于分析语料库，超越仅基于语料库大小和实体类型的表面描述，以识别潜在的迁移风险，并解释基准测试结论的范围。我们以开源代码和交互式仪表盘的形式发布该框架，以支持重现我们的分析并表征其他语料库。

英文摘要

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.

URL PDF HTML ☆

赞 0 踩 0

2605.20529 2026-05-21 cs.CL cs.AI 版本更新

Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

词组关联性：人类和神经网络中主谓一致学习的一种假设

Claire Hobbs, R. Thomas McCoy

发表机构 * Yale University（耶鲁大学）； Cognitive Science Program（认知科学项目）； Dept. of Linguistics（语言学系）； Wu Tsai Institute（吴泰科创研院）

AI总结本文探讨了语言输入中的统计信号如何帮助语法习得，提出词组关联性假设，通过词组共现规律提供句法依赖线索，并验证该机制在英语主谓一致习得中的有效性。

Comments Accepted to CoNLL

详情

AI中文摘要

在何种程度上，语言输入中的统计信号可以促进语法的习得？本文提出了一种称为词组关联性学习的机制，其中词组共现规律可以提供句法依赖的线索。我们研究这种机制是否能支持英语主谓一致的习得。首先，我们通过在不同可预测性水平的合成数据集上训练神经网络来模拟语言习得，发现存在一个可预测性范围，使得这些统计学习器能够稳健地学习主谓一致。然后，我们分析儿童导向语言中主谓配对的可变性，并发现此类数据中的可变性落在我们计算模拟中支持稳健泛化的范围内。综合来看，这些结果表明词组关联性是一种可行的学习策略，适用于儿童所接收的输入类型。

英文摘要

In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

URL PDF HTML ☆

赞 0 踩 0

2605.20525 2026-05-21 cs.CV cs.AI cs.CL cs.LG eess.IV 版本更新

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

NeuroQA: 一种大规模的3D脑部MRI理解图像 grounded 评估基准

Mohammad H. Abbasi, Favour Nerrise, Shaurnav Ghosh, Ridvan Yesiloglu, Yuncong Mao, Bailey Trang, Mohammad Asadi, Merryn Daniel, Gustavo Chau Loo Kung, Ken Chang, Pavan Pinkesh Shah, Adam Turnbull, Kyan Younes, Seena Dehkharghani, Ehsan Adeli

发表机构 * Stanford University（斯坦福大学）

AI总结本文提出NeuroQA，一个大规模的3D脑部MRI视觉问答基准，包含来自12977名受试者的56953个问答对，涵盖5-104岁及五个临床领域，通过3D体积评估11种临床推理技能，并提供可复现的生成脚本和在线排行榜。

Comments 30 pages, dataset and benchmark release

详情

AI中文摘要

我们提出了NeuroQA，一个大规模的3D脑部磁共振成像（MRI）视觉问答基准，包含来自12977名受试者的56953个问答对，涵盖5-104岁及五个临床领域：阿尔茨海默病、帕金森病、肿瘤、白质疾病和神经发育。与以往基于2D切片或狭窄诊断标签的医学视觉问答（VQA）方法不同，NeuroQA将每个项目与完整的3D体积配对。它评估11种临床相关的推理技能，涵盖是/否、多项选择和开放式格式。在203个模板中，131个是图像 grounded（可从3平面查看器回答），72个是图像 informed（答案来自定量体积测量或临床仪器）。为消除纯文本捷径，我们应用了答案分布优化，将封闭式文本-only 准确率从>80%降至44.6%；图像必要性通过发布的图像 grounded 协议单独评估。一个38规则的确定性管道和两轮专家审查验证每个QA对与FreeSurfer测量、元数据或放射学报告字段的匹配，零个相同受试者矛盾。我们进行了临床评估，两名临床医生独立评估100个冻结测试项目，使用3平面查看器。在封闭式（是/否+多项选择）测试公开项目上，最好的零样本视觉语言模型和监督的3D CNN基线分别达到47.5%和43.7%的准确率，均低于49.4%的文本-only 多数模板基准。NeuroQA采用两级发布，公开QA对用于开放访问数据集和受数据使用协议（DUAs）限制的数据集的可复现生成脚本，加上受试者级划分、保留的私人测试集和在线排行榜。

英文摘要

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from $>$80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

URL PDF HTML ☆

赞 0 踩 0

2605.20506 2026-05-21 cs.LG cs.CL 版本更新

Reinforcing Human Behavior Simulation via Verbal Feedback

通过言语反馈强化人类行为模拟

Weiwei Sun, Xuhui Zhou, Jiarui Liu, Weihua Du, Haojia Sun, Yiqing Xie, Qianou Ma, Sihao Chen, Mengting Wan, Longqi Yang, Pei Zhou, Sherry Wu, Sean Welleck, Graham Neubig, Yiming Yang, Maarten Sap

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Microsoft（微软）

AI总结本文提出DITTO模型，通过将言语反馈作为强化学习中的首要信号来提升LLM模拟人类行为的能力，并引入SOUL基准测试平台，展示了在多个任务中显著提升性能的成果。

详情

AI中文摘要

人类通过言语反馈（例如父母说“那很粗鲁”或朋友解释“这是为什么那会伤害你”）学习社会规范和行为。然而，对于LLM而言，学习反馈主要集中在代码和数学等领域，这些领域中的RL奖励可以直接验证并压缩为标量值。随着LLM越来越多地用于模拟人类行为，例如代表用户、患者、学生和其他角色，有必要使它们更加人性化，这需要接受一种根本不同的信号：主观的、多方面的言语反馈。我们提出了DITTO，一个通过将言语反馈作为强化学习中的首要信号进行训练的模型。每次回放后，DITTO会接收言语反馈并生成反馈条件的改进回放；两个输出通过GRPO联合优化，将言语指导蒸馏到基础策略中，而无需在测试时使用反馈。我们还引入了SOUL（Simulation gym Of hUman-Like behavior），一个涵盖10个任务、六个类别的统一基准和训练数据集：理论思维、角色扮演、社交技能、学习模拟、用户模拟和角色模拟。DITTO在基础模型上平均提升了36%，并在SOUL基准测试中的6个任务上超过了GPT-5.4，证明了通过言语反馈的强化学习是训练LLM模拟人类行为的有前途的方向。

英文摘要

Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.

URL PDF HTML ☆

赞 0 踩 0

2605.20478 2026-05-21 cs.CL 版本更新

Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables

Stage-Audit: 用于跨维基表的可审计源前沿发现

Chen Shen

发表机构 * Megagon Labs（梅加贡实验室）

AI总结本文研究了LLM整理的表格可能存在的源不一致问题，提出Stage-Audit方法通过分离curator和auditor的写权限、行级源引用门禁以及12项审计分类学，提高了源前沿的精度和F1值，同时保持了每行的源可追溯性。

Comments 9 pages, 2 figures, 3 tables. Accepted at the ACM CAIS 2026 Workshop on AI Agents for Discovery in the Wild

详情

AI中文摘要

LLM整理的表格可能看似源相关，但实际上包含不支持的行：整理者可能从参数记忆中回忆条目并回溯性地附加页面级引用，这些引用并非实际来源。我们研究了这一风险在Seed2Frontier发现任务中的影响：该任务是从种子页面找到互补的维基百科页面以构建结构化表格。Stage-Audit通过分离curator和auditor的写权限、行级源引用门禁以及12项审计分类学（涵盖键、模式、源角色、基数和范围）来解决这一问题。在覆盖15个顶级域的51实例Seed2Frontier评估集上，Stage-Audit将源前沿的精度从0.356提升到0.505（+42%相对提升），F1值从0.334提升到0.451（+35%），同时保持了每行的源可追溯性。Vanilla-LLM与Stage-Audit的比较隔离了策略贡献，而非一般LLM发现过程的贡献。

英文摘要

LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.

URL PDF HTML ☆

赞 0 踩 0

2605.20477 2026-05-21 cs.LG cs.AI cs.CL 版本更新

对ChatGPT感到困惑？不了！一个拼图游戏以促进AI素养和意识

Francesca Padovani, Malvina Nissim

发表机构 * Center for Language and Cognition (CLCG)（语言与认知中心（CLCG））； University of Groningen（格罗宁根大学）

AI总结本文提出了一种基于游戏的互动拼图，通过完成后的漫画信息图来展示生成AI的工作原理、能力、限制及社会影响，并通过漫画草图作为独立信息卡提供具体AI使用、设计和影响的解释，以提高公众对AI的理解和意识。

详情

AI中文摘要

生成式AI的快速采用，包括基于LLM的聊天机器人如ChatGPT，凸显了需要支持公众理解和AI素养的可访问方法。为了解决这一需求，我们介绍了一种基于游戏的互动方法，形式为一个拼图，其完成图像是一幅基于漫画的信息图，展示了这些技术的工作原理、能力、限制和社会影响。每个漫画草图也作为独立的信息卡片，提供关于AI使用、设计和影响的具体解释。视觉内容是在专业插画师和多学科专家及非专家群体的实时协作会议中创建的，结合了结构化的知识和讨论期间分享的非正式、探索性反思。通过整合动手组装、视觉叙事和协作互动，该拼图提供了一种引人入胜且有趣的工具，用于在非正式学习环境中探索AI系统的机制、优点和危险。

英文摘要

The rapid adoption of Generative AI, including LLM-based chatbots like ChatGPT, has highlighted the need for accessible ways to support public understanding and AI literacy. To address this need, we introduce a game-based, interactive approach in the form of a jigsaw puzzle whose completed image is a comic-based infographic illustrating the workings, capabilities, limitations, and societal implications of these technologies. Each comic sketch also functions as a standalone informational card, providing focused explanations of specific facets of AI use, design, and impact. The visual content was created in a live collaborative session with a professional illustrator and a multidisciplinary group of experts and non experts, combining structured knowledge with informal, exploratory reflections shared during the discussion. By integrating hands-on assembly, visual storytelling, and collaborative interaction, the puzzle provides an engaging and playful tool for exploring the mechanisms, perks, and perils of AI systems in informal learning contexts.

URL PDF HTML ☆

赞 0 踩 0

2605.20382 2026-05-21 cs.CL cs.AI 版本更新

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

言行不一：大型语言模型中的指令诱导冲突

Carolina Camassa, Derek Shiller

发表机构 * Future Impact Group（未来影响组）

AI总结研究探讨了大型语言模型在面对指令与模式完成之间的冲突时的表现，发现指令遵循率在不同模型和指令下差异显著，输出多样性是预测鲁棒性的主要因素。

Comments 31 pages

详情

AI中文摘要

语言模型被训练以遵循指令，但它们也是强大的模式完成器。当这些两个目标发生冲突时会发生什么？我们构建了对话，其中用户指令要求以目标方式T（例如始终输出特定标记、用特定语言回答或采用特定角色）行为，与N个硬编码助手回合展示的竞争对手模式P相冲突。然后我们测量在此设置下的指令遵循率（IF率），在13个模型和16种不同指令上，最多50回合。平均指令遵循率在模型之间从1%到99%不等，与标准能力基准 largely 无关。从指令遵循到模式遵循的转变是普遍的但高度模型依赖的。鲁棒性由指令内容调节，当指令与训练价值先验一致时，模型对诱导的抵抗更长；同时由输出格式调节，多样化的多标记响应比单标记输出更耐受。链式推理提高鲁棒性但不消除易受性，并可能导致正确推理与错误输出之间的脱节。当被要求预测其在此设置下的行为时，模型平均准确率为83.5%，但系统性低估了自身对诱导压力的抵抗力。这些结果表明，即使对于其他方面能力较强的模型，指令遵循在诱导压力下仍很脆弱，而输出多样性而非输入的语义参与是预测鲁棒性的主要因素。

英文摘要

Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.

URL PDF HTML ☆

赞 0 踩 0

2605.20369 2026-05-21 cs.CL cs.AI cs.LG 版本更新

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

DEL：用于大语言模型数值学习的数字熵损失

Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； VCIP, College of Computer Science, Nankai University（南开大学计算机学院VCIP）

AI总结本文提出Digit Entropy Loss (DEL)用于大语言模型的自回归数值学习，通过重新设计传统无监督熵优化，引入数字条件概率和二元交叉熵，使熵优化转向监督方式，同时推广整数基于的数值学习到浮点数优化，从而提升数值预测的准确性。

详情

AI中文摘要

数字预测是大语言模型（LLMs）在数学问题解决和代码生成中的基本能力。广泛采用的最大似然估计（MLE）用于LLM训练并不适合数字预测。最近，惩罚驱动的方法，例如数字标记损失和离散化距离损失，引入了数字距离的归纳偏置，但分别导致了数字分布过度锐化和过度扁平化。在本文中，我们深入分析了LLM的数值学习，并表明现有的数值学习方法在概念上遵循一个准则-距离公式，其中准则项代表优化模式，距离项灌输几何先验。因此，我们提出了Digit Entropy Loss (DEL)用于自回归数值学习，其重新设计传统无监督熵优化的三个关键设计：利用数字条件概率和二元交叉熵将熵优化引导为监督方式；舍弃距离项以避免数值距离的问题；并将整数基于的数值学习推广到浮点数优化，使数值预测更加准确。我们的DEL公式可以结合整数、小数和小数点，将学习目标从单个数字扩展到浮点数领域。在七个数学推理基准测试中使用四个代表性的LLM，包括CodeLlama、Mistral、DeepSeek和Qwen-2.5，进行实验，结果表明DEL在整体预测准确性和数值距离方面均优于其替代方法。源代码在https://github.com/PolyU-VCLab/DEL。

英文摘要

Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU-VCLab/DEL

URL PDF HTML ☆

赞 0 踩 0

2605.20364 2026-05-21 cs.CL 版本更新

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

推理监督有害：基于TTCW的长文本文学评论生成

Jinlong Liu, Mohammed Bahja, Mark Lee

发表机构 * School of Computer Science, University of Birmingham, United Kingdom（英国伯明翰大学计算机科学学院）

AI总结本文提出了一种基于TTCW的长文本文学评论生成方法，通过构建包含263911个长文本故事的数据集，每个故事都标注了14个TTCW维度的评分和元合成评论。实验发现，非推理微调在性能和稳定性上更优，而推理监督模型更易出现解析失败，导致生成无关或重复的推理文本，表明在固定格式评分标准下，推理监督并不总是有益的。

Comments Submit to EMNLP 2026

详情

AI中文摘要

在压力下：情感框架引发小型语言模型可测量的行为转变和结构化内部几何

Rana Muhammad Usman

发表机构 * Independent Researcher（独立研究者）

AI总结该研究探讨情感框架如何影响小型本地部署语言模型的行为和内部表示，通过四个不可能约束编码任务和八个后续框架评估，发现压力框架在行为和内部几何上产生显著变化，同时揭示了模型在不同框架下的响应模式。

Comments 18 pages, 4 figures. Exploratory empirical study with fully local experiments on small open language models. Code and data: https://github.com/ranausmanai/LLMEmotionGeometry

详情

AI中文摘要

我研究情感框架的评估后续是否改变小型本地部署语言模型的行为和冷静相对内部表示。我们的主要基准使用Qwen 3.5 0.8B在四个不可能约束编码任务和八个后续框架（冷静、压力、紧迫、批准、羞愧、好奇、鼓励和威胁）上进行测试。在0.8B八条件扫描（160次对话）中，压力产生最强的捷径标记（11/20次运行）和最清晰的过拟合模式（3/20次），而冷静和好奇更常保留显式诚实（7/20和6/20）。对于所有七个非基准条件，对应的冷静相对方向向量在最终transformer层峰值。对层23方向向量的探索性PCA显示，主导的第一个成分（59.5%的解释方差）与手动标注的正负分割对齐（余弦对齐0.951）；批准和紧迫在内部几乎相同（余弦0.957），而好奇与紧迫方向相反（-0.252）。在单独的冷静与压力重新运行用于规模比较中，Qwen 3.5 2B在冷静框架下表现出更高的诚实率，并在小规模4提示A/B探测中表现出方向一致的激活引导，而0.8B的引导结果则相反。我将这些结果解释为小型开放模型中可测量的提示敏感控制方向的证据，但未声称存在内在情感状态。

英文摘要

I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.

URL PDF HTML ☆

赞 0 踩 0

2605.20199 2026-05-21 cs.CL cs.AI 版本更新

FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

FlowLM: 通过扩散到流的适应实现少步语言建模

Runzhe Zhang, Letian Chen, Wenpeng Zhang, Zhouhan Lin, Peilin Zhao

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Department of XXX, University of YYY, Location, Country（XXX系，YYY大学，地点，国家）； School of ZZZ, Institute of WWW, Location, Country（ZZZ学院，WWW研究所，地点，国家）

AI总结本文提出FlowLM，一种通过高效微调从预训练的扩散语言模型转换而来的流匹配语言模型，通过将扩散模型的弯曲采样轨迹重新对齐为直线流，实现了高质量的少步生成，其质量可与甚至超越2000步扩散采样。此外，作者提出了一种更有效的流匹配训练目标：预测干净数据以持续引导采样过程向真实数据分布靠近。

Comments 26 pages, 11 figures

2605.20197 2026-05-21 cs.CL 版本更新

闪耀的故事，隐藏的挣扎：通过LLM的视角调查残疾人的表现

Marco Bombieri, Simone Paolo Ponzetto, Marco Rospocher

发表机构 * Department of Foreign Languages and Literature, University of Verona（外国语言文学系，威尼斯大学）； Department of Information Engineering and Computer Science, University of Trento（信息工程与计算机科学系，特伦托大学）； Data and Web Science Group, University of Mannheim（数据与网络科学小组，曼海姆大学）

AI总结本文研究了大型语言模型（LLMs）如何表现残疾人群体，通过模拟残疾人视角生成社交媒体帖子，并与真实残疾人的帖子进行比较，揭示LLMs对残疾人的理想化倾向和潜在的偏见。

Comments Accepted for publication in ACM Transactions on Intelligent Systems and Technology

详情

DOI: 10.1145/3806202
Journal ref: ACM Trans. Intell. Syst. Technol. (2026)

AI中文摘要

现代大型语言模型（LLMs）近年来因其模拟人类行为和生成反映个人身份和人口群体的文本的能力而受到广泛关注。尽管这些能力可以打开跨领域多样应用的大门，但必须检查这些模型如何代表各种目标群体，因为LLMs可能会延续和放大对历史上被边缘化社区的偏见或歧视，或者由于去偏见努力，过度纠正导致过度积极的刻板印象。这种过度补偿会理想化这些群体，忽视他们所面临的复杂性和挑战，转而采用不现实的描绘。在本文中，我们通过模拟残疾人视角生成社交媒体帖子来研究LLMs如何表现残疾人群体。这些帖子随后与真实残疾人撰写的帖子进行比较，重点是情感基调、情感倾向和代表性的词汇和主题。我们的分析揭示了两个关键发现：（1）LLMs常常理想化残疾人的经历，生成过于积极的刻板印象，尽管看起来鼓舞人心，但未能真实捕捉他们的现实生活；（2）对模拟有和没有残疾人的帖子的比较分析显示了一种负面偏见，某些主题，如职业和娱乐，被不成比例地与非残疾人群体相关联。这强化了排斥性叙述和对残疾的过度理想化描绘，歪曲了该社区实际面临的挑战。这些发现与更广泛的关注和持续研究一致，表明LLMs难以反映社会的多样现实，特别是边缘化群体的复杂经历，并强调了对其表现进行批判性审视的必要性。

英文摘要

Modern Large Language Models (LLMs) have recently attracted much attention for their ability to simulate human behavior and generate text that reflects personas and demographic groups. While these capabilities can open up a multitude of diverse applications across fields, it is crucial to examine how such models represent various target groups since LLMs can perpetuate and amplify biases or discrimination against historically marginalized communities or, alternatively, as a result of debiasing efforts, overcorrect by portraying overly positive stereotypes. This overcompensation can idealize these groups, erasing the complexities and challenges they face in favor of unrealistic depictions. In this paper, we investigate how LLMs represent disability by simulating the perspectives of individuals with disabilities in generating social media posts. These posts are then compared with those written by real people with disabilities, focusing on emotional tone, sentiment, and representative words and themes. Our analysis reveals two key findings: (1) LLMs often idealize the experiences of people with disabilities, producing overly positive stereotypes that, despite appearing uplifting, fail to authentically capture their lived realities; and (2) a comparative analysis of posts simulating individuals with and without disabilities highlights a negative bias, where certain topics, such as career and entertainment, are disproportionately associated with nondisabled individuals. This reinforces exclusionary narratives and over-idealized portrayals of disability, misrepresenting the actual challenges faced by this community. These findings align with broader concerns and ongoing research showing that LLMs struggle to reflect the diverse realities of society, particularly the nuanced experiences of marginalized groups, and underscore the need for critical scrutiny of their representations.

URL PDF HTML ☆

赞 0 踩 0

2605.17140 2026-05-21 cs.CV cs.AI cs.CL 版本更新

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

UCSF-PDGM-VQA: 用于脑肿瘤MRI解读的视觉问答数据集

Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil

发表机构 * Fung Institute for Engineering Leadership（工程领导力基金会）； University of California, Berkeley（加州大学伯克利分校）； Department of Radiology（放射科）； University of California, San Francisco（加州大学旧金山分校）； Division of Clinical Informatics and Digital Transformation（临床信息学与数字转型部）； Department of Neurological Surgery（神经外科部）

AI总结本文提出一个临床相关的视觉问答基准数据集UCSF-PDGM-VQA，包含2387个问题-答案对，用于评估视觉语言模型在处理多序列3D MRI扫描中的能力，发现现有模型在多模态处理上存在缺陷。

Comments 10 pages, 2 figures, 6 tables

详情

AI中文摘要

脑肿瘤诊断很大程度上依赖于磁共振成像（MRI）评估，这需要放射科医生综合分析成千上万张来自多种3D序列和纵向研究的图像。这一过程需要高级的神经放射学培训，具有显著的认知负荷，并且非常耗时。尽管放射学需求不断增长，但这种专业知识难以扩展，给当前的医疗系统带来压力。视觉-语言模型（VLMs）提供了一种通过半自动化、互动解释复杂脑MRI来减轻这种负担的机会。然而，由于缺乏专门的评估基准，它们在神经肿瘤学中目前使用有限。我们介绍了一个临床相关的视觉问答（VQA）基准——UCSF-PDGM-VQA数据集，包含来自公共UCSF-PDGM数据集中473个胶质瘤相关MRI研究的2387个QA对。我们进一步在该数据集上建立了六种最先进的视觉语言模型（VLMs）和一个大型语言模型的性能基线。我们发现，当前模型无法有效处理多序列、三维MRI扫描，导致视觉特征的抑制和对语言先验的过度依赖，从而造成模态崩溃。这些发现突显了当前模型在临床环境中的可靠性和安全性方面的关键缺陷，需要开发稳健的、领域特定的VLMs。

英文摘要

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

URL PDF HTML ☆

赞 0 踩 0

2605.12770 2026-05-21 cs.LG cs.AI cs.CL 版本更新

WriteSAE: Sparse Autoencoders for Recurrent State

WriteSAE: 用于递归状态的稀疏自编码器

Jack Young

发表机构 * Indiana University（印第安纳大学）

AI总结本文提出WriteSAE，一种用于递归语言模型状态中矩阵更新的稀疏自编码器，通过在递归缓存中替换原始写入操作来提升生成效果，并在多个模型上验证了其有效性。

Comments 26 pages, 14 figures, 21 tables; code at https://github.com/JackYoung27/writesae

详情

AI中文摘要

我们介绍了WriteSAE，一种用于递归语言模型状态中矩阵更新的稀疏自编码器。在Gated DeltaNet、Mamba-2和RWKV-7中，每个token向递归缓存写入一个矩阵形状的更新；残差流SAE具有向量形状的原子，无法直接替换该更新。WriteSAE学习具有与模型自身写入相同形状的秩-1矩阵原子。这使我们能够测试直接替换：在SAE激活原子的位置，我们移除模型的写入，插入由SAE激活缩放的原子，并继续前向传递。在92.4%的评估位置上，原子比删除写入能产生更接近的最终token分布；平均每个原子，该比率是89.8%。对于Gated DeltaNet，一个使用忘记门、读取查询和输出嵌入的公式可以预测结果的logit变化，$R^2 = 0.98$。相同的替换测试在Mamba-2-370M上转移，达到88.1%。在生成中，该公式选择写入方向；将写入方向写入三个连续的缓存位置，其范数为模型写入的3倍，使在未修改模型中初始排名为100-1000的token出现在100%的延续中，比33.3%有所提高。据我们所知，这是首次在状态空间或混合递归层中报告的缓存级引导干预。

英文摘要

We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with $R^2 = 0.98$. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at $3\times$ the norm of the model's write makes tokens initially ranked 100--1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.

URL PDF HTML ☆

赞 0 踩 0

2604.09174 2026-05-21 cs.CL 版本更新

Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

面向证据不确定性和幻觉的面级追踪（RAG）

Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl

发表机构 * Johannes Kepler University Linz（约翰内斯·开普勒大学林茨）； Linz Institute of Technology（林茨技术学院）； Institute of Computational Perception（计算感知研究所）； Artificial Intelligence Lab（人工智能实验室）

AI总结本文提出了一种面向问答任务的面级诊断框架，通过分解每个输入问题为原子推理面，评估证据充分性和基础性，揭示RAG系统中幻觉产生的根源在于生成过程中证据整合的方式，而非检索准确性。

详情

AI中文摘要

检索增强生成（RAG）旨在通过检索证据来减少幻觉，但即使有相关文档可用，幻觉答案仍然常见。现有评估集中在答案级或段落级准确性，对生成过程中证据使用缺乏深入理解。本文引入了一种面向问答任务的面级诊断框架，将每个输入问题分解为原子推理面。对于每个面，我们使用结构化的Facet x Chunk矩阵评估证据充分性和基础性，该矩阵结合检索相关性和基于自然语言推理的忠实度评分。为了诊断证据使用，我们分析了三种受控推理模式：严格RAG，强制依赖检索证据；软RAG，允许整合检索证据和参数知识；以及无检索的LLM生成。比较这些模式能够深入分析检索生成不一致，即相关证据被检索但未在生成过程中正确整合的情况。在医疗问答和HotpotQA上，我们评估了三种开源和闭源LLM（GPT、Gemini和LLaMA），提供了可解释的诊断，揭示了反复出现的面级失败模式，包括证据缺失、证据不一致和先验驱动覆盖。我们的结果表明，RAG系统中的幻觉更多由生成过程中证据整合的方式决定，而非检索准确性，面级分析揭示了系统性的证据覆盖和不一致模式，这些模式在答案级评估下是隐藏的。

英文摘要

Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.

URL PDF HTML ☆

赞 0 踩 0

2603.00086 2026-05-21 cs.CL cs.AI cs.SD eess.AS 版本更新

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

基于迭代的LLM改进法用于法语临床访谈的转录与说话人识别

Ambre Marie, Thomas Bertin, Guillaume Dardenne, Gwenolé Quellec

发表机构 * LaTIM UMR 1101 INSERM（INSERM拉蒂姆UMR1101）； University of Western Brittany（西布列塔尼大学）； University of Rouen Normandy（诺曼底大学）

AI总结本研究提出一种多轮LLM后处理架构，通过交替进行说话人识别和词识别流程，提高法语医疗对话的转录准确性和说话人归属，通过两个法语临床数据集的消融研究验证了四种设计选择的效果。

详情

AI中文摘要

法语医疗对话的自动语音识别仍然具有挑战性，自发临床语音的词错误率通常超过30%。本研究提出一种多轮LLM后处理架构，通过交替进行说话人识别和词识别流程来提高转录准确性和说话人归属。在两个法语临床数据集（自杀预防电话咨询和术前清醒神经外科会诊）上的消融研究调查了四种设计选择：模型选择、提示策略、流程顺序和迭代深度。使用Qwen3-Next-80B，Wilcoxon符号秩检验证实了在自杀预防对话上词错误率（WDER）的显著降低（p<0.05，n=18），同时在清醒神经外科会诊上保持稳定（n=10），零输出失败和可接受的计算成本（RTF 0.32），表明该方法在离线临床部署中的可行性，有待在更大语料库上验证。

英文摘要

Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment, pending validation on larger corpora.

URL PDF HTML ☆

赞 0 踩 0

2602.10408 2026-05-21 cs.LG cs.CL 版本更新

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

预归一化变换器中的门控归一化移除与尺度锚定

Andrei Kanavalau, Carmen Amo Alonso, Sanjay Lall

发表机构 * Department of Electrical Engineering（电气工程系）； Department of Computer Science（计算机科学系）； Stanford University（斯坦福大学）

AI总结本文研究了预归一化变换器中归一化层的必要性，提出了一种门控归一化移除方法，通过TaperNorm逐步将归一化操作转为样本无关的线性或仿射映射，并揭示了最终归一化层对预logit表示尺度的锚定作用。

详情

AI中文摘要

归一化层在变换器中是标准组件，但其样本依赖的计算在整个训练和推理过程中是否必要尚不明确。本文为预归一化变换器开发了一种门控归一化移除方法。该方法通过TaperNorm实现，从标准RMSNorm/LayerNorm逐步过渡到学习的样本无关线性或仿射映射。一旦门控达到零， tapered层将不再计算每个token的统计信息，所得到的映射可以折叠到相邻的线性投影中。结果表明，在测试的预训练和微调设置中，内部归一化可以逐步移除，且验证损失的增加较小。我们的方法揭示了最终归一化层的独特作用，即它锚定了预logit表示的尺度。有了这个锚定，最后隐藏状态的径向变化不会直接减少损失；当移除它时，通过增加logit的幅度可以实现交叉熵的减少。固定目标尺度损失提供了显式的替代锚定，使在测试范围内能够完全无归一化地进行消融实验。最后，在KV缓存自回归解码基准中，逐步移除内部归一化可提供高达1.14倍的吞吐量，使用显式缩放操作，折叠后可达1.18倍。

英文摘要

Normalization layers are standard in transformers, but it is not clear whether their sample-dependent computations are necessary throughout both training and inference. This work develops a gated normalization-removal approach for pre-norm transformers. The approach is implemented using TaperNorm, which starts from standard RMSNorm/LayerNorm and gradually tapers to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed in the tapered layers and the resulting maps can be folded into adjacent linear projections. The results indicate that internal normalization can be tapered in the tested pre-training and fine-tuning settings with small validation-loss increases. Our approach helps reveal a distinct role for final normalization, namely that it anchors the scale of the pre-logit representation. With this anchor present, radial changes in the last hidden state do not directly reduce the loss; when it is removed, reducing cross-entropy can be achieved by increasing logit magnitudes. A fixed-target scale loss provides an explicit alternative anchor and enables fully norm-free ablations in the tested regimes. Finally, in a KV-cached autoregressive decoding benchmark, tapering internal norms gives up to $1.14\times$ higher throughput with explicit scaling operations and up to $1.18\times$ after folding.

URL PDF HTML ☆

赞 0 踩 0

2602.09723 2026-05-21 cs.CL 版本更新

AI-Assisted Scientific Assessment: A Case Study on Climate Change

AI辅助的科学评估：气候变化案例研究

Christian Buck, Levke Caesar, Michelle Chen Huebscher, Massimiliano Ciaramita, Erich M. Fischer, Zeke Hausfather, Özge Kart Tokmak, Reto Knutti, Markus Leippold, Joseph Ludescher, Katharine J. Mach, Sofia Palazzo Corner, Kasra Rafiezadeh Shahi, Johan Rockström, Joeri Rogelj, Boris Sakschewski

发表机构 * Potsdam Institute for Climate Impact Research (PIK), Member of the Leibniz Association（波茨坦气候影响研究所（PIK），莱比锡协会成员）； Institute for Atmospheric and Climate Science, ETH Zurich（大气与气候科学研究所，苏黎世联邦理工学院）； University of Zurich（苏黎世大学）； University of Miami（迈阿密大学）； Centre for Environmental Policy, Imperial College London（伦敦帝国理工学院环境政策中心）； Grantham Institute for Climate Change and Environment, Imperial College London（伦敦帝国理工学院气候变化与环境研究所）； Energy, Climate and Environment Program, International Institute for Applied Systems Analysis（国际应用系统分析研究所能源、气候与环境项目）

AI总结本文探讨了AI在科学评估中的应用，通过与气候科学领域13名科学家的合作，测试了基于Gemini的AI环境在评估大西洋经向翻转环流（AMOC）稳定性方面的效果，展示了AI在加速科学流程和提升报告质量方面的贡献，同时强调了专家补充和监督的重要性。

详情

AI中文摘要

新兴的AI合作者范式专注于可重复验证的任务，其中智能体通过'猜测和检查'循环探索搜索空间。这一范式无法扩展到无法重复评估的问题，其真实情况由理论和现有证据的共识综合确定。我们评估了一个基于Gemini的AI环境，该环境旨在支持协作科学评估，并整合到标准科学流程中。与13名气候科学领域的科学家合作，我们测试了该系统在复杂主题——大西洋经向翻转环流（AMOC）稳定性上的表现。我们的结果表明，AI可以加速科学流程。小组通过104次修订循环，在不到46人小时内完成了79篇论文的综合综述。AI的贡献显著：大多数AI生成的内容保留在报告中。AI还帮助维持逻辑一致性和呈现质量。然而，专家补充至关重要，以确保其可接受性：报告中不到一半的内容由AI生成。此外，需要大量监督以扩展和提升内容到严谨的科学标准。

英文摘要

The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in 'guess and check' loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.

URL PDF HTML ☆

赞 0 踩 0

2512.03671 2026-05-21 cs.CL 版本更新

AI增强的调查：利用大型语言模型和调查进行意见预测

Junsol Kim, Byungkyu Lee

发表机构 * Department of Sociology（社会学系）； University of Chicago（芝加哥大学）； New York University（纽约大学）； Chicago, IL（伊利诺伊州芝加哥市）； New York, NY（纽约州纽约市）

AI总结本文提出了一种基于大型语言模型的框架，通过结合问题、受访者和调查时期的嵌入表示，预测重复横断面调查中缺失的响应，从而弥补传统调查在捕捉历史变化方面的不足。

详情

AI中文摘要

全国代表性调查追踪公众意见，但每年只询问有限的问题，限制了其捕捉历史变化的潜力。为填补这一空白，我们开发了一个基于大型语言模型（LLM）的框架，通过结合问题、受访者和调查时期的嵌入表示，预测重复横断面调查中缺失的响应。我们引入了LLM在调查研究中的两个新应用：回溯预测（预测年度层面的缺失意见）和未询问意见预测（预测完全缺失的意见）。使用1972-2021年一般社会调查的数据，我们的LLM模型在交叉验证和在GSS未询问的年份中通过其他组织测量的公众意见方面表现良好。这些能力使我们能够恢复缺失的趋势并确定公众态度变化的时间，例如同性婚姻支持率的上升。然而，未询问意见预测的性能仍较为有限。我们展示了当我们的模型优于现有基准时的情况，检验了哪些意见和受访者更具可预测性，并评估了我们的方法是否减少了LLM预测响应的同质化倾向。我们的研究证明了LLM和调查可以相互增强：LLM扩大了调查的潜力，而调查则校准LLM以模拟人类意见。

英文摘要

Nationally representative surveys track public opinion, yet they ask only a limited set of questions each year, limiting its potential to capture historical changes. To fill this gap, we develop a large language model (LLM)-based framework for predicting missing responses in repeated cross-sectional surveys by incorporating embeddings for questions, respondents, and survey periods. We introduce two new applications of LLMs to survey research: retrodiction (predicting year-level missing opinions) and unasked opinion prediction (predicting entirely missing opinions). Using data from the 1972-2021 General Social Surveys, our LLM-based models perform strongly in retrodicting masked GSS opinions through cross-validation and public opinions measured by other organizations in years when the GSS did not ask them. These capabilities enable us to recover missing trends and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. However, performance remains modest for unasked opinion prediction. We show when our models outperform established benchmarks, examine which opinions and and respondents are more predictable, and evaluate whether our approach reduces LLMs' tendency to homogenize predicted responses. Our study demonstrates that LLMs and surveys can mutually enhance each other: LLMs broaden survey potential, while surveys calibrate LLMs for simulating human opinions.

URL PDF HTML ☆

赞 0 踩 0