arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.28819 2026-05-28 cs.LG cs.CL Version update

PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Yangyi Huang, Ruotian Peng, Zeju Qiu, Jiale Kang, Yandong Wen, Bernhard Schölkopf, Weiyang Liu

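Institutions: The Chinese University of Hong Kong; Westlake University; MPI for Intelligent Systems

AI summary: Proposes the PEFT-Arena benchmark, which evaluates parameter-efficient finetuning methods through the stability-plasticity dilemma; finds that orthogonal finetuning achieves the most favorable Pareto frontier under comparable parameter budgets, and analyzes PEFT updates geometrically in weight space and activation space.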

Comments Technical report v1 (28 pages, 9 figures, project page: https://spherelab.ai/PEFT-Arena/)

Abstract

Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities. We argue that PEFT should be assessed through the stability-plasticity dilemma: the trade-off between target-task adaptation and resistance to forgetting. We introduce PEFT-Arena, a benchmark that jointly measures downstream performance and general capability retention. Across methods, we find distinct stability-plasticity profiles; under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. To explain these differences, we analyze PEFT updates from two geometric perspectives. In weight space, spectral analysis reveals how parameterizations interact with the pretrained singular-value structure. In activation space, retention metrics show whether finetuning preserves or distorts general-capability representations, with forgetting linked to non-isometric representation distortion. Finally, an analysis shows that final SFT checkpoints often overshoot a better target-retention operating point. Inspired by this, we present case studies of a post-hoc improvement with path-wise rewinding.
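
The Pareto-frontier comparison described in the abstract can be illustrated with a minimal sketch. It assumes each method is summarized by two scores where higher is better (target-task performance and general-capability retention); the method names and numbers are hypothetical placeholders, not results from the paper, and the benchmark's actual scoring may differ.

```python
from typing import List, Tuple

# (method name, target-task score, retention score); higher is better on both axes.
Point = Tuple[str, float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates on both axes."""
    front = []
    for name, tgt, ret in points:
        dominated = any(
            (t2 >= tgt and r2 >= ret) and (t2 > tgt or r2 > ret)
            for _, t2, r2 in points
        )
        if not dominated:
            front.append((name, tgt, ret))
    # Sort by target score so the front reads from most stable to most plastic.
    return sorted(front, key=lambda p: p[1])


# Hypothetical scores, for illustration only.
runs = [
    ("method_a", 0.62, 0.91),
    ("method_b", 0.70, 0.85),
    ("method_c", 0.66, 0.80),  # dominated by method_b on both axes
    ("method_d", 0.74, 0.70),
]
front = pareto_front(runs)
```

A method is on the front only if no other method is at least as good on both axes and strictly better on one, which is why a single accuracy number cannot rank methods under this view.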

2605.28818 2026-05-28 cs.CL q-bio.NC Version update

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu

Institutions: Department of Mechanical and Vehicle Engineering, Chongqing University; Department of Linguistics and Translation, City University of Hong Kong; McGovern Institute for Brain Research, Massachusetts Institute of Technology

AI summary: Compares matched LLMs and VLMs in a strictly text-only setting and finds that multimodal pretraining may not bring a global human-alignment advantage during natural reading, although VLMs show a selective advantage on sentences with strong visual semantic content.

Comments 17 pages, 10 figures

Abstract

Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.

2605.28814 2026-05-28 cs.CL Version update

Self-Improving Language Models with Bidirectional Evolutionary Search

Guowei Xu, Zhenting Qi, Huangyuan Su, Weirui Ye, Himabindu Lakkaraju, Sham M. Kakade, Yilun Du

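Institutions: Harvard University; MIT

AI summary: Proposes the Bidirectional Evolutionary Search (BES) framework, which combines forward candidate evolution with backward goal decomposition to overcome the sparse verification signals and autoregressive-expansion limits of conventional search methods, improving language-model performance in both post-training and inference.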

Abstract

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.

2605.28806 2026-05-28 cs.CV cs.CL cs.IR Version update

Personal Visual Memory from Explicit and Implicit Evidence

Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li

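Institutions: Johns Hopkins University; University of Wisconsin-Madison; Adobe Research

AI summary: Proposes a personal visual memory benchmark and VisualMem, a hybrid architecture that strengthens the long-term memory of AI agents with explicit and implicit visual evidence, substantially improving performance on personalized tasks.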

Comments Project Page: https://viettmab.github.io/visualmem-page/

Abstract

Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states: both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual-text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.

2605.28805 2026-05-28 cs.CL cs.AI cs.CV cs.LG Version update

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang

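Institutions: Tsinghua University; Pennsylvania State University; University of Southern California; Microcyto; Princeton University

AI summary: Proposes OmniVerifier-M1, which uses symbolic meta-verification (e.g., bounding boxes) and decoupled reinforcement learning to achieve reliable, fine-grained verification for multimodal large models and dynamic region-level self-correction.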

Comments ICML 2026. Project: https://github.com/Cominclip/OmniVerifier

Abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.

2605.28802 2026-05-28 cs.CL Version update

Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

Beiduo Chen, Pingjun Hong, Ziyun Zhang, Benjamin Roth, Anna Korhonen, Barbara Plank

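Institutions: MaiNLP, Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML); University of Vienna; LTL, University of Cambridge

AI summary: Studies whether large language models can learn and reproduce annotator-specific label-explanation behavior, and proposes cross-annotator preference optimization (CAPO), which improves imitation and attribution by contrasting the target annotator's response with other valid but less target-specific annotations.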

Comments 43 pages, 20 figures

Abstract

Free-text explanations extend human label variation (HLV) beyond label disagreement by revealing the reasoning and preferences behind annotators' decisions. We study whether large language models (LLMs) can learn and reproduce such annotator-specific label-explanation behavior. Using two sentence-pair tasks with four annotators each (natural language inference and paraphrase judgment), we first analyze whether annotators exhibit stable individual patterns. We find that such patterns are weak at the single-annotation level due to strong input-content effects, but become detectable after input-content reduction and annotator-level aggregation. We then compare prompting and supervised fine-tuning (SFT) baselines and propose cross-annotator preference optimization (CAPO), which contrasts a target annotator's response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, SFT better captures annotator-specific behavior, and CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation. Overall, our results show that HLV can be learned as annotator-specific label-explanation behavior, suggesting a path toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.

2605.28791 2026-05-28 cs.CL cs.AI Version update

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao

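Institutions: Tsinghua University; Fudan University; City University of Hong Kong; Huazhong University of Science and Technology; University of Wisconsin-Madison

AI summary: Proposes Skill-Conditioned Gated Self-Distillation (SGSD), which retrieves skill-mistake pairs from an experience-derived skill bank to build a multi-teacher pool, uses the verifier to validate each teacher's polarity, and distills informative teacher-student disagreements with a robust gated objective, improving mathematical reasoning under a weaker privileged-information assumption.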

Abstract

On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.
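
The polarity rule stated in the abstract (supporting a success or suppressing a failure counts as positive supervision, the opposite stance is reversed) can be written as a small truth table. This is only a sketch: the function name and the boolean inputs are hypothetical, and how the paper actually measures a teacher's stance or applies the gated objective is not specified here.

```python
def supervision_sign(teacher_supports: bool, rollout_correct: bool) -> int:
    """Return +1 when the teacher's stance agrees with the verifier outcome, else -1.

    teacher_supports: hypothetical flag, True if the skill-conditioned teacher
        favors the student rollout, False if it suppresses it.
    rollout_correct: True if the verifier accepted the rollout.
    """
    return 1 if teacher_supports == rollout_correct else -1


# The four stance/outcome combinations.
signs = {
    ("supports", "success"): supervision_sign(True, True),
    ("suppresses", "failure"): supervision_sign(False, False),
    ("supports", "failure"): supervision_sign(True, False),
    ("suppresses", "success"): supervision_sign(False, True),
}
```

Under this reading, a misleading skill is not simply discarded: a teacher that backs a failed rollout still yields a signal, with its direction flipped.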

2605.28782 2026-05-28 cs.CL Version update

Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, Guangliang Liu, Xi Chen

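Institutions: Nanyang Technological University; University of Mississippi; Indiana University

AI summary: Builds the MalayPrag benchmark and proposes five attributes to systematically evaluate and improve how large language models handle discourse particles in colloquial Malay.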

Abstract

Discourse particles, such as "well" and "kind of", are crucial components that enable LLMs to "speak" more like humans. They are used to convey emotions, intentions, and interpersonal meanings. However, existing studies have not yet built a comprehensive understanding of LLMs' capabilities in handling discourse particles. Moreover, the limited number of studies focuses primarily on high-resource languages such as English, with little attention paid to Southeast Asian languages. In this paper, we (1) propose MalayPrag, a benchmark designed to systematically evaluate and analyze LLMs' capabilities in handling discourse particles in colloquial Malay; and (2) introduce five attributes that provide a linguistically grounded, unified framework for interpreting the pragmatic functions of discourse particles. Applying these two contributions, we prompt ten off-the-shelf LLMs to perform three prediction tasks. The experimental results reveal substantial challenges for current LLMs in accurately connecting discourse particles with their pragmatic functions in Malay. The provision of the five attributes designed in this study is found to significantly improve these connections, highlighting the need for structured scaffolding for models' pragmatic competence.

2605.28779 2026-05-28 cs.CL cs.CV Version update

The Abstraction Gap in Vision-Language Causal Reasoning

Chinh Hoang, Mohammad Rashedul Hasan

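Institutions: Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, USA

AI summary: To address the conflation of linguistic fluency with faithful causal reasoning when vision-language models (VLMs) generate causal explanations, proposes a dual-probe methodology and the Abstraction Gap (AG) metric; evaluation on the CAGE benchmark finds a substantial AG in most models, though pretraining and architectural choices can narrow it.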

Abstract

Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6-8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
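
The abstract describes AG only as a "normalized performance difference" and does not give the formula. One plausible reading, shown below purely as an assumption, is the relative drop from the text-probe score to the chain-probe score; the paper's actual definition may normalize differently.

```python
def abstraction_gap(text_score: float, chain_score: float) -> float:
    """Assumed form of AG: relative drop from the text-probe to the chain-probe score.

    This formula is an illustrative guess, not the paper's stated definition.
    """
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return (text_score - chain_score) / text_score


# Illustrative values inside the ranges quoted in the abstract
# (text score between 6 and 8, chain score below 2.5).
gap = abstraction_gap(text_score=7.0, chain_score=2.0)
```

Under this assumed form, a text score of 7 with a chain score of 2 gives a gap of 5/7, which is consistent with the reported pattern of AG above 0.50 for models in those score ranges.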

2605.28778 2026-05-28 cs.CL Version update

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

Gabrielle Kaili-May Liu, Arman Cohan

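Institutions: Yale University

AI summary: Presents the first systematic study of whether large language models (LLMs) can associate their linguistic confidence markers with intrinsic confidence in a stable and generalizable way, and of how contextual features affect this; proposes 7 metrics for the stability of marker internal confidence and finds that LLMs remain faithfully miscalibrated even under a model-centric interpretation.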

Comments Code: https://github.com/yale-nlp/marker_internal_confidence

Abstract

LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing _marker internal confidence_ (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.

2605.28775 2026-05-28 cs.LG cs.AI cs.CL Version update

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Suji Kim, Kangsan Kim, Sung Ju Hwang

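Institutions: KAIST; Samsung Electronics

AI summary: Proposes the LearnWeak framework, which uses a stronger reference agent to identify a student agent's weaknesses in the target domain, automatically synthesizes targeted tasks and supervision, and introduces an error-aware specialization objective, improving small computer-use agents across multiple domains.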

Abstract

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

2605.28774 2026-05-28 cs.CL Version update

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee

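Institutions: NVIDIA; KAIST

AI summary: To address the asymmetry between thinking and tool use in multimodal agentic reasoning (the Thinking-Acting Gap), proposes the AXPO algorithm, which fixes the thinking prefix, resamples the tool call and its continuation, and adds uncertainty-based prefix selection, improving tool-use rate and model performance.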

Comments Project page: https://byungkwanlee.github.io/AXPO-page/

Abstract

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
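
The trigger condition in the abstract (a group whose tool-using rollouts are all wrong) can be sketched as a simple check. The `Rollout` record and `needs_tool_resampling` helper are hypothetical names introduced for illustration; the sketch covers only detection, not prefix fixing, resampling, or uncertainty-based prefix selection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    used_tool: bool  # did this rollout attempt a tool call?
    correct: bool    # did the verifier accept the final answer?


def needs_tool_resampling(group: List[Rollout]) -> bool:
    """True when the group has tool-using rollouts and every one of them is wrong."""
    tool_rollouts = [r for r in group if r.used_tool]
    return bool(tool_rollouts) and not any(r.correct for r in tool_rollouts)


# Hypothetical rollout groups, for illustration only.
all_wrong_tools = [Rollout(True, False), Rollout(True, False), Rollout(False, True)]
one_right_tool = [Rollout(True, False), Rollout(True, True), Rollout(False, False)]
no_tools = [Rollout(False, True), Rollout(False, False)]
```

In the first group the only correct rollout avoided tools, so a group-relative reward gives the tool calls no positive example to learn from; that is the situation the abstract says AXPO targets.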

2605.28773 2026-05-28 cs.CL cs.AI cs.LG cs.MA cs.MM Version update

Rethinking Memory as Continuously Evolving Connectivity

Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Ying Wei, Guozhou Zheng, Feiyu Xiong, Haofen Wang, Huajun Chen, Ningyu Zhang

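Institutions: Zhejiang University; Alibaba Group; MemTensor; Tongji University

AI summary: Proposes the FluxMem framework, which models memory as a heterogeneous graph and evolves its topology through three stages (initial connection formation, feedback-driven refinement, and long-term consolidation) to address the brittleness of existing memory-augmented LLM agents in dynamic environments.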

Comments Ongoing work

Abstract

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks, LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state-of-the-art performance, demonstrating strong adaptation and generalization in complex agentic environments. The code will be open-sourced at https://github.com/zjunlp/LightMem.

2605.28751 2026-05-28 cs.LG cs.AI cs.CL Version update

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve

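Institutions: Meta Superintelligence Labs - FAIR

AI summary: Shows that extrapolative weight averaging extends the Pareto frontier between fine-tuned checkpoints without additional RL training, trading off correctness and efficiency in competitive programming and improving inference-time performance.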

Comments 54 pages

Abstract

Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.
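
The difference between interpolating and extrapolating two checkpoints can be sketched with a single mixing coefficient. The sketch assumes the common form theta(alpha) = theta_low + alpha * (theta_high - theta_low), with alpha in [0, 1] for interpolation and alpha outside that range for extrapolation; the paper's exact parameterization is not given in the abstract. Checkpoints are represented here as plain dictionaries of float lists, and all names and values are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def combine(low: Weights, high: Weights, alpha: float) -> Weights:
    """Compute low + alpha * (high - low) elementwise for every parameter."""
    if low.keys() != high.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [l + alpha * (h - l) for l, h in zip(low[name], high[name])]
        for name in low
    }


# Toy stand-ins for a low-coverage and a high-coverage checkpoint.
low_cov = {"w": [1.0, 2.0], "b": [0.0]}
high_cov = {"w": [3.0, 2.0], "b": [1.0]}

midpoint = combine(low_cov, high_cov, 0.5)      # interpolation between the endpoints
beyond_high = combine(low_cov, high_cov, 1.5)   # extrapolation past the high-coverage endpoint
beyond_low = combine(low_cov, high_cov, -0.5)   # extrapolation past the low-coverage endpoint
```

Because each coefficient yields a new set of weights with no further training, sweeping alpha is a cheap way to produce the family of checkpoints that the abstract describes ensembling at inference time.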

2605.28745 2026-05-28 cs.CL Version update

Stance Detection in Prediction Markets: Addressing Imbalanced Trader Commentary via Counterfactual Augmentation and Market Context

Thomas Mbrice

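Institutions: Department of Computer Science, Stony Brook University

AI summary: For stance detection on extremely imbalanced prediction-market commentary, proposes combining market context with LLM-driven counterfactual augmentation, which can improve recall and F1 for the minority (Anti) class.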

Comments 14 pages, 9 figures

Abstract

Prediction markets such as Polymarket aggregate crowd beliefs into real-time probability estimates, and the comments traders post beneath each market contain rich directional stance signals that prices alone cannot capture. This work introduces the first stance detection study applied to prediction market commentary, a domain characterized by extreme brevity, trader-specific vernacular, and severe class imbalance (only 8.7% of comments oppose the market outcome). RoBERTa-base is fine-tuned across a 4 x 3 ablation: four input configurations ({2-class, 3-class} x {with/without market context}) and three augmentation conditions (baseline, 50% synthetic, 100% synthetic). Synthetic minority-class samples are generated via LLM-driven Pro -> Anti counterfactual flips using the Anthropic API. Results show that (1) market context is the single most impactful factor, raising 3-class Anti recall from 0.10 to 0.45; (2) counterfactual augmentation is conditionally effective, improving Anti F1 in weak configurations (0.10 -> 0.24) while degrading strong ones (2-class-ctx macro F1: 0.68 -> 0.50 at full dose); and (3) 50% augmentation is the optimal dose, with 100% consistently hurting performance. Attention-based interpretability analysis provides mechanistic support for all three findings.
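The 4 x 3 ablation in this abstract is a Cartesian product of input configurations and augmentation conditions, which a short sketch can enumerate. Only the label `2-class-ctx` appears in the abstract; the other configuration labels (such as `2-class-noctx`) are hypothetical names chosen here for illustration.

```python
from itertools import product

# Two label schemes crossed with two context settings give four input configs.
class_schemes = ["2-class", "3-class"]
context_settings = ["noctx", "ctx"]  # "noctx" is a hypothetical label
input_configs = [f"{scheme}-{ctx}" for scheme, ctx in product(class_schemes, context_settings)]

# Three augmentation conditions, as listed in the abstract.
augmentations = ["baseline", "50% synthetic", "100% synthetic"]

# Every (input configuration, augmentation) pair is one fine-tuning run.
ablation_grid = list(product(input_configs, augmentations))

for config, augmentation in ablation_grid:
    print(f"{config:>14} | {augmentation}")
```

Enumerating the grid this way makes it easy to confirm that each of the twelve cells is trained and reported exactly once.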

2605.28740 2026-05-28 cs.CL cs.AI 版本更新

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

反向探测:临床文本中大语言模型的监督式词级不确定性量化

Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang

发表机构 * University of Florida(佛罗里达大学) U.S. National Library of Medicine(美国国家医学图书馆)

AI总结 提出反向探测框架,利用预标注摘要从模型内部激活中提取词级不确定性信号,在临床文本中实现高效、可解释的不确定性量化。

详情
AI中文摘要

随着大语言模型越来越多地应用于临床文本,确保它们能够可靠地表明自身的不确定性变得至关重要。大多数现有的不确定性量化(UQ)方法是为开放域生成设计的,无法在长临床文本中定位到词或跨度级别的不确定性。我们提出了反向探测,这是首个专门针对临床摘要的UQ框架,它直接从预标注的摘要中估计词级不确定性。与采样新输出不同,反向探测将文本视为探测模型内部状态的探针,从四类内部激活中提取不确定性信号。我们在两个专家标注的临床数据集上进行了评估,在所有指标上优于八个适配基线,AUPRC最高提升4倍,同时减少了推理时间和计算成本。特征分析表明,delta能量和邻域上下文是所有模型中最一致的预测因子。本研究提供了关于模型内部如何响应无支持的临床内容的可解释性见解。

英文摘要

As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model's internal state, extracting uncertainty signals from four categories of internal activations. We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.

2605.28732 2026-05-28 cs.CL cs.AI cs.LG 版本更新

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

MemTrace:大型语言模型记忆系统中的错误追踪与归因

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团)

AI总结 提出MemTrace框架,通过构建可执行的记忆演化图实现细粒度错误追踪,并利用自动归因方法定位根因,进而优化提示词提升下游任务性能。

Comments Ongoing work

详情
AI中文摘要

记忆对于使大型语言模型支持长程推理至关重要,但现有的记忆系统仍然不可靠且难以调试。追踪记忆的动态演化对于理解信息如何随时间合成、传播或损坏至关重要。在这项工作中,我们研究了LLM记忆系统中错误追踪与归因的新问题。我们提出了一种新颖的框架,将记忆流水线转换为可执行的记忆演化图,从而实现对操作信息流的细粒度追踪。然后,我们构建了MemTraceBench,一个从代表性记忆系统(如Long-Context、RAG、Mem0和EverMemOS)收集的基准,以系统地研究记忆故障模式。我们进一步引入了一种自动归因方法,该方法迭代地追踪操作子图以定位任何失败案例的根本原因。我们的分析表明,记忆故障是系统性的,源于操作层面的问题,如信息丢失和检索错位。关键的是,我们利用这些细粒度的归因信号来指导下游提示优化,建立了一个自动纠正故障并提升最终任务性能高达7.62%的闭环系统。代码将在https://github.com/zjunlp/MemTrace发布。

英文摘要

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

2605.28714 2026-05-28 cs.CL cs.AI 版本更新

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

IPO-Mine:用于长多模态IPO文档的章节结构化分析的工具包和数据集

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel, Liqin Ye, Arnav Hiray, Rutwik Routu, Prasun Banerjee, Siddhartha Somani, Sudheer Chava

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Sai University(赛大学) Duke University(杜克大学)

AI总结 本文提出IPO-Mine工具包和数据集,通过标准化解析IPO文件为章节结构化文本和图像,构建大规模多模态数据集,并建立图表评估任务,揭示多模态模型在长文档分析中的对齐挑战。

Comments 12 pages

详情
AI中文摘要

首次公开募股(IPO)文件是私营公司上市时发布的文件,允许个人(散户)投资者购买其股票。这些文件描述了公司的业务、财务状况和风险,是包含叙述性文本和图像的长篇多模态文档。尽管它们对金融市场至关重要,但目前缺乏用于使用现代语言和多模态模型研究IPO文件的大规模标准化数据集或基准。这些文档带来了重大挑战:文件通常超过50万个token,且缺乏一致的结构组织。我们引入了IPO-Toolkit,这是一个开源框架,用于下载和解析IPO文件,将其标准化为章节结构化文本和提取的图像。该工具包分割文件、提取嵌入的图像,并生成结构化输出,从而支持对长多模态文档进行大规模、可重复的分析工作流。利用这一基础设施,我们构建了IPO-Dataset,这是一个大规模、章节结构化的多模态数据集,涵盖1994年至2026年超过109,000份IPO文件及其修订版,包含超过76,000张图像。我们针对提取的金融图表建立了结构化评估任务,包括图表质量和误导性评估。我们的实验表明,最先进的多模态模型在这些任务上常常与专家人类判断存在分歧,揭示了在长篇幅真实监管文档上进行多模态推理时的对齐挑战。除了基准测试,IPO-Dataset还支持对章节级文本变异以及视觉和文本披露实践的跨行业差异进行大规模分析。我们的代码、数据集和网站根据CC-BY-4.0公开提供。

英文摘要

An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm's business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.

2605.28710 2026-05-28 cs.CL cs.AI 版本更新

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

迈向可靠的多语言LLM作为评判者:一项实证研究

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

发表机构 * HiTZ Center - Ixa, University of the Basque Country EHU(希茨中心 - Ixa,巴斯克国家大学EHU)

AI总结 本研究通过分析指令翻译、单语与多语言监督及模型规模等策略,探讨了在有无领域内数据情况下开发多语言LLM评判者的方法,并揭示了领域内数据可用时微调小模型可媲美专有模型、零样本大模型在域外更有效等关键权衡。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于生成文本的自动评估,然而大多数先前工作集中在英语上。尽管对多语言评估的需求日益增长,将基于LLM的评估器扩展到多语言环境仍然具有挑战性,特别是对于低资源语言和领域内数据稀缺的场景。本文探索了开发多语言LLM评判者的几种策略,考虑了是否有领域内数据可用于微调。我们系统分析了英语、西班牙语和巴斯克语(代表高、中、低资源语言),考虑了指令翻译、单语与多语言监督以及模型规模。为了评估,我们将两个现有的元评估数据集扩展到巴斯克语和西班牙语。我们的结果揭示了关键的权衡:当领域内数据可用时,微调的小模型可以达到与专有模型相当的性能,而在域外设置中,使用较大模型的零样本评估更为有效。我们还观察到,在域外数据上进行微调可能会对模型性能产生不利影响。这些发现为构建高效、可靠的多语言评估流程提供了实用指导。数据和代码公开在hitz-zentroa/mJudge。

英文摘要

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.

2605.28669 2026-05-28 cs.CL cs.AI 版本更新

Sense Representations Are Inducible Interfaces

词义表示是可诱导的接口

Jan Christian Blaise Cruz, Alham Fikri Aji

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出ACROS方法,通过门控残差加法在冻结的预训练解码器LM中诱导显式词义通路,实现零样本词义消歧、低KL词义引导和跨语言适应,保持基础LM质量。

Comments https://github.com/jcblaisecruz02/acros

详情
AI中文摘要

词义表示(显式的、每个标记的意义分解)对于消歧、引导和跨语言对齐很有用,但现有方法要求模型在预训练时就内置词义结构。我们引入了ACROS,它通过门控残差加法在冻结的预训练解码器LM中诱导出显式的词义通路。在SmolLM2-360M上,ACROS在保持基础LM质量的同时,支持相同诱导变量的三种用途:零样本词义消歧(Raganato ALL上F1为64.95,与WordNet首义启发式方法相当)、在5,161个CoInCo案例中进行低KL词义引导(其中简单的非oracle代理恢复了约90%的正向偏移),以及针对四种语言的SENSIA跨语言适应(平均R@1为0.988,目标FLORES PPL为7.94)。ACROS使词义表示成为普通预训练LM的可诱导接口。

英文摘要

Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2-360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering across 5,161 CoInCo cases where a simple non-oracle proxy recovers about 90% of positive shifts, and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS makes sense representations an inducible interface for ordinary pretrained LMs.

2605.28664 2026-05-28 cs.LG cs.CL 版本更新

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

用于合成数据生成的激活引导:多样性在下游安全检测中的作用

Vijeta Deshpande, Tootiya Giyahchi, Veena Padmanabhan, Leman Akoglu, Anna Rumshisky

发表机构 * UMass Lowell(马萨诸塞大学洛厄尔分校) Amazon(亚马逊公司) CMU(卡内基梅隆大学)

AI总结 研究激活引导(AS)生成高质量训练数据用于下游安全检测分类器,发现多样性是关键但被忽视的轴,且AS在窄参数范围内优于提示生成。

详情
AI中文摘要

安全检测模型需要HHH(有帮助、无害、诚实)违反输出的示例以实现鲁棒泛化,但此类示例稀缺。激活引导(AS)已成为一种数据高效的方法,用于生成与目标概念对齐的响应。我们研究AS能否为下游分类器生成高质量训练数据集,这一问题尚未被测试。我们通过内在和外在评估,跨越4个概念×2个模型×4种引导方法进行了双重研究。内在方面,除了引导成功(概念对齐)和连贯性的领域标准,我们引入了样本级和集合级多样性作为文献中先前缺失的质量轴,并发现增加引导强度会降低响应多样性。外在方面,我们用引导生成替换可用训练数据中的HHH违反示例,并微调检测分类器。AS生成的数据在4个概念中的3个上产生了比提示生成数据更好的分类器。然而,136个AS配置中只有41个优于提示,表明下游效用存在于一个狭窄的区间,该区间同时满足成功、连贯性和多样性。这三个轴的调和平均数比单独的成功和连贯性更一致地与下游AUROC相关,为实践者调整AS超参数提供了实用的启发式目标。总之,我们的结果突出了AS在合成数据生成中改进安全检测的潜力,并确定了多样性作为调整AS的关键且先前被忽视的轴。

英文摘要

Safety detection models require examples of HHH (Helpful, Harmless, Honest)-violating outputs for robust generalization; however, such examples are scarce. Activation Steering (AS) has emerged as a data-efficient method for generating target-concept-aligned responses. We investigate whether AS can generate high-quality training datasets for downstream classifiers, a question that remains untested. We present a two-fold study with intrinsic and extrinsic evaluation across $4$ concepts $\times\,2$ models $\times\,4$ steering methods. Intrinsically, beyond the field-standard rubric of steering success (concept alignment) and coherence, we introduce sample- and set-level diversity as a quality axis previously absent from the literature, and find that increasing steering strength reduces response diversity. Extrinsically, we replace HHH-violating examples in the available training data with steered generations and fine-tune detection classifiers. AS-generated data yields a better classifier than prompting-generated data on $3$ of $4$ concepts. However, only $41$ of $136$ AS configurations outperform prompting, indicating that downstream utility lies in a narrow regime that jointly satisfies success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, providing a practical heuristic target for practitioners tuning AS hyperparameters. Together, our results highlight the potential of AS in synthetic data generation for improving safety detection and identify diversity as a critical, previously overlooked axis for tuning AS.
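The harmonic-mean heuristic mentioned in this abstract can be sketched in a few lines. The sketch assumes each axis (steering success, coherence, diversity) has already been scored as a non-negative number on a common scale such as [0, 1]; how the paper computes each axis is not stated in the abstract, and the function name and example scores are hypothetical.

```python
from typing import Sequence


def harmonic_mean(scores: Sequence[float]) -> float:
    """Harmonic mean of non-negative scores; zero if any single score is zero."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical configuration: strong success and coherence, collapsed diversity.
success, coherence, diversity = 0.9, 0.9, 0.1
combined = harmonic_mean([success, coherence, diversity])
arithmetic = (success + coherence + diversity) / 3
```

Unlike the arithmetic mean, the harmonic mean is dominated by the weakest axis, so a configuration that sacrifices diversity for steering strength scores low even when the other two axes look strong.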

2605.28649 2026-05-28 cs.LG cs.CL 版本更新

Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing

可解释性引导的层选择优于子空间投影:SAE作为听诊器而非手术刀用于原始任务向量模型编辑

Li Lei, Madalina Ciobanu, Qingqing Mao, Ritankar Das

发表机构 * Incept Labs(Incept实验室) Titan Holdings(Titan控股)

AI总结 本文发现将任务向量投影到稀疏自编码器(SAE)特征子空间会丢弃约97%的修改能量,导致无效编辑;提出将SAE用于层诊断而非干预过滤,通过SAE特异性分数选择层注入原始任务向量,在数学推理任务上显著提升性能。

详情
AI中文摘要

大型语言模型越来越需要精细的模型编辑来增强领域特定能力,而无需承担全微调带来的计算成本或灾难性遗忘。稀疏自编码器(SAE)在此背景下成为一种有前景的工具,原则上允许在特征级别识别干预位置。本文严格评估了基于SAE引导的编辑流程在Gemma-3-4B-IT上的数学推理能力,并揭示了一个根本性失败模式:将任务向量投影到SAE特征子空间这一直观方法实际上充当了信息瓶颈,丢弃了约97%的修改能量,在七个数学科目上均未产生统计显著的改进。我们表明,这种失败源于激活空间SAE方向与权重空间任务向量之间的几何失配。随后,我们提出视角转变:SAE作为听诊器而非手术刀,即SAE用于层级别诊断而非干预级别过滤。通过仅将未过滤的原始任务向量注入由SAE特异性分数识别的层,我们在Minerva Math基准上将数论准确率从29.6%提升至39.4%(z=+3.41,p=0.0007);7个数学科目中有5个显著提升,且无任何科目显著下降。我们的方法完全确定,无需额外推理成本,并为可解释性引导的模型编辑提供了原则性框架。

英文摘要

LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged as a promising tool in this setting, in principle allowing for feature-level identification of where to intervene. In this work, we rigorously evaluate an SAE-guided editing pipeline for mathematical reasoning on Gemma-3-4B-IT and uncover a fundamental failure mode: the intuitively appealing approach of projecting task vectors onto SAE feature subspaces acts as an information bottleneck that discards approximately 97% of the modification energy, yielding no statistically significant improvements across seven math subjects. We show that this failure stems from a geometric misalignment between activation-space SAE directions and weight-space task vectors. We then propose a shift in perspective: SAE as a Stethoscope, Not a Scalpel, where SAEs are used for layer-level diagnosis rather than intervention-level filtering. By injecting unfiltered raw task vectors only into layers identified by an SAE-derived specificity score, we improve Number Theory accuracy from 29.6% to 39.4% (z=+3.41, p=0.0007) on the Minerva Math benchmark; 5 of 7 math subjects significantly improved and none significantly degraded. Our method is fully deterministic, requires no additional inference cost, and provides a principled framework for interpretability-guided model editing.
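The two quantities this abstract contrasts, the energy retained by a projected task vector and the injection of an unfiltered task vector into selected layers only, can be sketched as follows. The sketch assumes weights are stored as a mapping from layer name to a flat list of floats and that the set of selected layers is given. The SAE-derived specificity score is not reproduced here, and the layer names, the `scale` parameter, and the toy values are hypothetical.

```python
from typing import Dict, Iterable, List

Weights = Dict[str, List[float]]  # hypothetical layout: layer name -> flat weights


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Per-layer difference finetuned - base."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def inject(base: Weights, tv: Weights, layers: Iterable[str], scale: float = 1.0) -> Weights:
    """Add the raw task vector only to the selected layers; copy the rest unchanged."""
    chosen = set(layers)
    return {
        k: [b + scale * d for b, d in zip(base[k], tv[k])] if k in chosen else list(base[k])
        for k in base
    }


def energy_fraction(kept: List[float], full: List[float]) -> float:
    """Share of squared norm (energy) of `full` that survives in `kept`."""
    return sum(x * x for x in kept) / sum(x * x for x in full)


base = {"layer0": [1.0, 1.0], "layer1": [2.0, 2.0]}
finetuned = {"layer0": [1.5, 0.5], "layer1": [3.0, 2.0]}
tv = task_vector(base, finetuned)
edited = inject(base, tv, layers=["layer1"])  # layer choice is given, not computed
retained = energy_fraction(kept=[3.0, 0.0], full=[3.0, 4.0])
```

A projection that retains only a small energy fraction, as the abstract reports for SAE feature subspaces, leaves little of the original update to apply, whereas layer selection keeps the full update within each chosen layer.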

2605.28645 2026-05-28 cs.CR cs.CL 版本更新

GraphSteal: Structural Knowledge Stealing from Graph RAG via Traversal Reconstruction

GraphSteal:通过遍历重建从图RAG中窃取结构知识

Jinze Gu, Qinghua Mao, Xi Lin, Jun Wu

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院)

AI总结 提出一种结构导向的重建框架,通过深度优先启发式搜索和广度优先扩散搜索,从黑盒图RAG系统中高保真恢复隐藏知识图,揭示敏感实体、关系和结构依赖。

详情
AI中文摘要

检索增强生成(RAG)通过将生成过程基于查询相关的外部证据来增强LLM。除了非结构化文本语料库外,图RAG将知识图谱集成到检索流程中,使LLM能够访问编码在结构化知识中的实体、关系和多跳依赖。然而,赋能图RAG的相同结构化知识也创造了新的隐私攻击面。我们证明,图RAG系统可以转变为结构预言机:通过自适应黑盒交互,对手可以引出足够的关联证据,以重建隐藏知识图的实质性部分。我们提出了一种面向结构的重建框架,从局部和全局角度恢复目标图。具体来说,深度优先启发式搜索通过递归扩展以实体为中心的证据来提取细粒度节点属性,而广度优先扩散搜索通过跨关系诱导邻域传播来推断图拓扑。在通用和医疗场景上的实验表明,我们的方法可以从代表性图RAG系统中恢复超过90%的原始知识图,以高保真度揭示敏感实体、关系和结构依赖。现有的防护措施对我们的攻击提供的防御有限,突显了在图RAG流程中保护结构隐私的固有困难。

英文摘要

Retrieval-Augmented Generation (RAG) enhances LLMs by grounding generation in query-relevant external evidence. Beyond unstructured text corpora, Graph RAG integrates knowledge graphs into the retrieval pipeline, enabling LLMs to access entities, relations, and multi-hop dependencies encoded in structured knowledge. However, the same structured knowledge that empowers Graph RAG also creates a new privacy attack surface. We demonstrate that Graph RAG systems can be turned into structural oracles: through adaptive black-box interactions, an adversary can elicit sufficient relational evidence to reconstruct substantial portions of the hidden knowledge graph. We propose a structure-oriented reconstruction framework that recovers targeted graphs from both local and global perspectives. Specifically, Depth-Wise Heuristic Search extracts fine-grained node attributes by recursively expanding entity-centered evidence, while Breadth-Wise Diffusion Search infers graph topology by propagating across relation-induced neighborhoods. Experiments on generic and healthcare scenarios demonstrate that our method can recover over 90% of the original knowledge graph from representative Graph RAG systems, revealing sensitive entities, relations, and structural dependencies with high fidelity. Existing guardrails provide limited defense against our attack, highlighting the inherent difficulty of safeguarding structural privacy in Graph RAG pipelines.

2605.28639 2026-05-28 cs.CL cs.AI 版本更新

The Attentional White Bear Effect in Transformer Language Models

Transformer语言模型中的注意力白熊效应

Rebecca Ramnauth, Brian Scassellati

发表机构 * Yale University(耶鲁大学)

AI总结 通过表征探测、注意力分析和行为语义泄露实验,发现指令抑制下Transformer语言模型仍能恢复被禁止概念的表征并影响后续生成,揭示了行为对齐与表征对齐之间的根本差距。

Comments Currently under review at EMNLP 2026

详情
AI中文摘要

基于指令的抑制被广泛用于防止语言模型生成被禁止的内容,但尚不清楚抑制是减少了内部表征还是仅仅抑制了表达。我们通过跨多个Transformer模型的表征探测、注意力分析和行为语义泄露实验来研究这个问题。我们发现,在抑制下,被禁止的概念仍然可以从隐藏表征中高度恢复,继续影响注意力路由,并且在成功避免词汇的情况下可测量地塑造下游生成。这些效应在池化策略、间接语义控制和多个模型家族中持续存在。我们的结果暴露了行为对齐与表征对齐之间的根本差距。

英文摘要

Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.

2605.28629 2026-05-28 cs.CL 版本更新

Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents

Mobile-Aptus: 基于MLLM的手机使用代理中的置信度驱动的主动与鲁棒交互

Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju, Aston Zhang, Gongshen Liu, Zhuosheng Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) GenAI, Meta(Meta的GenAI)

AI总结 针对手机使用代理中的过度执行和过度请求问题,提出一种置信度驱动的主动与鲁棒交互框架,通过监督微调和置信度偏差校正实现最优性能。

Comments Accepted by TASLP

详情
AI中文摘要

多模态大语言模型(MLLMs)的最新进展在使手机使用代理能够自主执行人类指令方面显示出非凡的潜力。然而,完全自动化的代理即使无法解决任务也常常尝试执行,导致过度执行问题。以往研究通过训练交互式手机使用代理来解决这一问题,让代理在无法完成用户指令时请求人类交互。但我们发现,这些交互式代理倾向于表现出过度请求行为,过度依赖人类干预。为了缓解过度执行和过度请求,我们提出了一种通用的置信度集成框架,使基于MLLM的手机使用代理能够实现置信度驱动的主动与鲁棒交互。该框架包括两个阶段:交互能力赋予和置信度偏差校正。在交互能力赋予阶段,代理通过监督微调学习输出动作和置信度分数。在置信度偏差校正阶段,代理通过结合语义相似性检索和直接偏好优化,学习输出更准确的置信度分数。实验结果表明,Mobile-Aptus在四个流行的手机使用代理基准测试(OS-Kairos、AITZ、Meta-GUI和AndroidControl)上达到了最先进的性能。在离线基准测试中,Mobile-Aptus始终优于所有基线,任务成功率平均提升超过17%。在真实世界动态实验中,Mobile-Aptus的任务成功率比基线高出26%,每个指令仅需0.64次干预步骤。代码可在https://github.com/Wuzheng02/Mobile-Aptus获取。

英文摘要

Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile-using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over-execution. Previous studies address this by training interactive mobile-using agents that request human interaction when they cannot complete user instructions. However, we find that these interactive agents tend to exhibit over-soliciting behavior, relying excessively on human intervention. To mitigate both over-execution and over-soliciting, we propose a universal confidence integration framework that enables confidence-driven proactive and robust interaction in MLLM-based mobile-using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine-tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show that Mobile-Aptus achieves state-of-the-art performance on four popular mobile-using agent benchmarks: OS-Kairos, AITZ, Meta-GUI, and AndroidControl. Mobile-Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement of over 17% in task success rate. In real-world dynamic experiments, Mobile-Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The code is available at https://github.com/Wuzheng02/Mobile-Aptus.

2605.28616 2026-05-28 cs.CL cs.AI 版本更新

Measuring Form and Function in Language Models

语言模型中的形式与功能测量

Héctor Javier Vázquez Martínez, Charles Yang

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Department of Linguistics and Computer and Information Science(语言学与计算机与信息科学系)

AI总结 通过引入儿童语言习得的定量指标,提出上下文替代选择(CAC)提示方法,评估语言模型在英语限定词的形式句法和功能话语知识方面的表现,发现仅大型模型能同时满足形式和功能基准。

Comments Under review at ACL Rolling Review May 2026 cycle

详情
AI中文摘要

我们引入儿童语言习得的定量指标来评估语言模型。我们的重点是英语中限定词的形式句法和功能话语属性,这些属性幼儿早期就能准确习得。我们提出了上下文替代选择(CAC),一种新的提示方法,为语言的句法和话语知识提供针对性测试。该方法能够直接将语言模型与儿童进行比较,更重要的是,与实证研究中独立建立的统计基准进行比较。目前,没有在可比数据量上训练的模型能像人类儿童一样同时满足形式和功能基准,但一些非常大的模型可以做到。我们将结果作为方法论和技术贡献呈现,特别强调语言模型的认知状态。

英文摘要

We introduce quantitative metrics for child language acquisition to evaluate language models. Our focus is on the formal syntactic and functional discourse properties of determiners in English, which young children acquire early and accurately. We propose Contextual Alternative Choice (CAC), a new prompting method which provides targeted tests for both syntactic and discourse knowledge of language. The method enables direct comparison of language models against children, and more importantly, against statistical benchmarks independently established in empirical research. No current model trained on a comparable amount of data simultaneously meets both formal and functional benchmarks as human children do, but some very large models do. We present our results as methodological and technical contributions, with specific emphasis on the cognitive status of language models.

2605.28607 2026-05-28 cs.AI cs.CL 版本更新

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

基于自适应多模态智能体的自动工作流执行框架

Susanna Cifani, Mario Luca Bernardi, Marta Cimitile

发表机构 * Sapienza University of Rome(罗马萨皮恩扎大学) Department of Engineering University of Sannio(萨尼奥大学工程系) Faculty of Jurisprudence Unitelma Sapienza University(法理学院萨皮恩扎大学)

AI总结 提出一种多模态多智能体框架,通过离线构建拓扑知识库和在线自适应检索增强生成与闭环协作验证,实现自动工作流执行。

Comments Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication at the 2026 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS 2026)

详情
AI中文摘要

现代信息系统需要能够导航复杂工作流的自主智能体,但当前方法在从结构化元数据解析过渡到通用环境感知时常常遇到困难。虽然多模态大语言模型的集成使智能体能够直接与图形用户界面交互,但现有方法通常将任务序列视为离散的线性片段。这种碎片化阻止了智能体捕捉底层转移拓扑结构,限制了它们在新型或非平稳场景中的有效性。为了解决这个问题,我们提出了一种新颖的多模态多智能体框架,通过一个独特的两阶段流程实现自动工作流执行。首先,在离线发现阶段,该架构从碎片化的执行日志中自适应地构建拓扑知识库。在推理过程中,智能体利用自适应检索增强生成(RAG)作用于这个固定的、预先建立的图,并结合闭环协作验证协议进行动态自我纠正和导航。这种基于图的方法促进了优越的任务分解和自适应导航性能。我们在真实世界环境中验证了该框架,展示了即使在训练数据有限的情况下,它也能保持高可靠性和语义感知能力。

英文摘要

Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.

2605.28602 2026-05-28 cs.AI cs.CL cs.LO 版本更新

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

用大语言模型求解可满足性问题:推理能力的配对评估

Leizhen Zhang, Shuhan Chen, Sheng Chen

发表机构 * University of Louisiana at Lafayette(路易斯安那大学拉斐特分校)

AI总结 提出配对公式协议和准确区分率(ADR)来评估大语言模型在SAT问题上的推理能力,发现传统指标具有误导性,而ADR能更忠实、跨表示鲁棒地评估模型。

Comments Accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026)

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于隐式归结为布尔可满足性(SAT)的任务,但它们在SAT上的推理能力仍不清楚。我们对LLMs在2-SAT和3-SAT上进行了系统研究,并使用了两个经典归约——顶点覆盖和离散3D装箱——来探测表示不变的推理。我们首先使用传统指标评估模型,包括准确率、精确率、召回率和F1,以及SAT相变设置。我们发现这些指标可能具有误导性:许多模型通过过度预测可满足公式获得高分,未能重现3-SAT阈值附近经典的易-难-易特征,并且随着变量数量的增加而急剧下降。为解决这个问题,我们引入了一个基于最小差异可满足和不可满足实例的配对公式协议,以及准确区分率(ADR),它要求每对中的两个成员都被正确分类。ADR将面向推理的模型与启发式模型区分开来,并与证据有效性相关。在CNF之外,我们通过将CNF转换为顶点覆盖和将3-SAT转换为离散3D装箱来测试跨表示一致性。大多数模型在超过80%的实例上,对CNF和对应图或装箱实例的决策一致,表明跨表示存在稳定的决策规则。总体而言,我们的结果表明SAT是LLM推理的一个保守探针,并且使用ADR的配对评估比传统指标提供了更忠实且表示鲁棒的评估。

英文摘要

Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.
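The Accurate Differentiation Rate described in this abstract counts a matched pair as correct only when both members are classified correctly. A minimal sketch follows, assuming each pair is recorded as two booleans: the model's "satisfiable" verdict on the satisfiable member and on the unsatisfiable member. The function names, the data layout, and the toy predictions are hypothetical.

```python
from typing import List, Tuple

# (verdict on the SAT member, verdict on the UNSAT member); True means "predicted satisfiable".
Pair = Tuple[bool, bool]


def accurate_differentiation_rate(pairs: List[Pair]) -> float:
    """Fraction of pairs where the SAT member is called SAT and the UNSAT member is called UNSAT."""
    if not pairs:
        raise ValueError("at least one pair is required")
    both_right = sum(1 for sat_pred, unsat_pred in pairs if sat_pred and not unsat_pred)
    return both_right / len(pairs)


def instance_accuracy(pairs: List[Pair]) -> float:
    """Conventional per-instance accuracy over the same 2 * len(pairs) formulas."""
    if not pairs:
        raise ValueError("at least one pair is required")
    right = sum(int(sat_pred) + int(not unsat_pred) for sat_pred, unsat_pred in pairs)
    return right / (2 * len(pairs))


# A model that always answers "satisfiable" versus one with mixed behaviour.
always_sat = [(True, True)] * 4
mixed = [(True, False), (True, True), (False, False), (True, False)]
```

On `always_sat`, per-instance accuracy is 0.5 while ADR is 0.0, which shows how pairing penalizes the over-prediction of satisfiable formulas that the abstract reports for conventional metrics.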

2605.28598 2026-05-28 cs.CL cs.AI 版本更新

Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News

评估基于LLM的社会智能体的真实性:对西班牙在线新闻反应的案例研究

Alejandro Buitrago López, Alberto Ortega Pastor, Javier Pastor-Galindo, José A. Ruipérez-Valiente

发表机构 * Faculty of Computer Science, University of Murcia(计算机科学系,穆尔西亚大学)

AI总结 通过比较真实与LLM生成的西班牙新闻评论,研究LLM在仇恨言论、情感和语义对齐三个维度上的真实性,发现现成模型表现不佳,微调可部分改善。

详情
AI中文摘要

基于LLM的社会智能体越来越多地被用于模拟在线社交行为,但其真实性仍然难以验证。现有工作主要依赖通用基准,而对简短的反应性话语(如受众对在线新闻的回复)关注较少。在本文中,我们评估LLM生成的西班牙新闻反应是否再现了真实受众话语的可测量属性。使用Hatemedia数据集,我们将5,631条新闻与58,555条真实受众反应配对,并在共享实验设置下使用五个LLM生成匹配的合成数据集。我们从仇恨言论、情感和语义对齐三个维度比较真实和合成反应,考虑现成和微调生成。结果表明,现成模型是真实受众反应的糟糕代理:它们严重低估仇恨言论,引入模型特定的情感偏差,并且在分布上与人类回复相距甚远。微调不均匀地提高了保真度。Qwen3提供了最平衡的近似,而Mistral7B实现了最强的情感和语义对齐,但过度估计了仇恨普遍性。看似合理的合成回复不一定再现公共话语的分布特性。

英文摘要

LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.

2605.28565 2026-05-28 cs.DL cs.AI cs.CL cs.IR 版本更新

Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

验证性误导:衡量搜索增强型大语言模型中的结构性引用失败

Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang, Dongha Lee

发表机构 * Department of Artificial Intelligence, Yonsei University(延世大学人工智能系) Department of Computer Science and Engineering, Konkuk University(建国大学计算机科学与工程系) Incheon International Airport Corporation(仁川国际机场公司) Department of Computer Science and Engineering, Ewha Womans University(梨花女子大学计算机科学与工程系)

AI总结 针对搜索增强型大语言模型中的引用可信度问题,提出CITETRACE数据集和三维评估框架,发现系统性“验证性误导”模式:模型引用真实可访问来源但存在意图对齐、来源适宜性或答案-来源忠实度缺陷,导致用户面临结构性误导。

Comments Work in Progress

详情
AI中文摘要

搜索增强型大语言模型的用户依赖引用作为回答基于真实来源的证据,但很少自行验证引用的页面。每天数百万次查询通过这些系统,使得引用质量成为用户是被告知还是被误导的无声决定因素——然而现有基准各自孤立地处理一个方面,导致决定引用可信度的联合结构未被衡量。我们构建了CITETRACE,一个大规模数据集,追踪从用户查询到检索来源再到生成答案的完整引用链:来自28个社区的11,200个真实世界查询,与来自五个提供商的十个模型的112,000个回答配对,产生761,495个可评估的引用对。我们设计了一个三维评估框架,使用专家验证的预定义矩阵和五级忠实度标准,对每个引用在意图-目的对齐、来源适宜性和答案-来源忠实度上进行评分;该框架适用于任何产生带引用回答的系统。大规模应用该框架,我们识别出一种系统性的模式,称为验证性误导(VM):模型引用真实、可访问的来源,但在一个或多个维度上失败,产生忠实度-适宜性权衡,其中忠实模型选择不合适的来源,反之亦然。在我们的池中,30.6%的引用扭曲了其来源,27.1%的引用源自领域不合适的来源;在回答层面,高达96%的用户至少遇到一个结构性误导的引用。提供商层面的差异解释了88-96%的引用质量方差,表明来源选择更多受超出单个模型能力的因素控制,而非LLM本身。总之,CITETRACE及其评估框架为诊断部署的搜索增强系统中的结构性引用失败提供了首个资源。

英文摘要

Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled, yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimensional evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.

2605.28561 2026-05-28 cs.CL cs.LG 版本更新

Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

Soft-SVeRL: 基于软奖励的自验证强化学习

Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis

发表机构 * Cohere Labs(Cohere实验室)

AI总结 针对部分可验证任务,提出基于检查表分解的软奖励框架Soft-RLVR及其自验证变体Soft-SVeRL,通过密集部分信用信号提升强化学习训练效果,并解决自验证中的奖励膨胀问题。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)在数学和代码等领域改进了语言模型,这些领域中正确性可以自动检查。然而,许多重要任务仅部分可验证:提示包含多个要求,响应可能满足其中一些但非全部,或者可能不存在单一的参考答案。我们引入Soft-RLVR,一个从分解的、学习的验证信号中进行强化学习的框架。Soft-RLVR将每个提示转换为原子要求的检查表,使用LLM验证器逐项评分候选响应,并在生成的软奖励上进行训练。基于检查表的奖励将稀疏的通过/失败监督转化为更密集的部分信用信号,但它们也引入了一个权衡:平均逐项判断可以减少验证器噪声,而部分信用可能奖励不完整的响应。我们形式化了这一权衡,并确定了基于检查表的验证比整体验证提供更可靠RL训练信号的条件。我们进一步引入Soft-SVeRL,这是Soft-RLVR的一个自验证变体,其中策略也充当验证器。我们表明,自验证容易因过于宽松的自我判断而导致奖励膨胀,并且需要显式稳定化以防止这种崩溃。在基于规则的ground-truth评估的受控指令遵循设置中,基于检查表的Soft-RLVR仅使用学习的验证器奖励就将IFEval提升了最多11.1分。我们的实验进一步表明,验证器质量和检查表质量都影响下游RL结果,并且显式稳定化对于有效的自验证至关重要。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
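
The checklist reward described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `soft_reward`, `holistic_reward`, and `keyword_verifier` are hypothetical, the substring check merely stands in for an LLM verifier, and the abstract does not specify any aggregation beyond averaging item-level judgments.

```python
from typing import Callable, Sequence


def soft_reward(
    response: str,
    checklist: Sequence[str],
    verify_item: Callable[[str, str], float],
) -> float:
    """Average per-item verifier scores in [0, 1] into one soft reward."""
    if not checklist:
        raise ValueError("checklist must contain at least one requirement")
    scores = [verify_item(response, item) for item in checklist]
    return sum(scores) / len(scores)


def holistic_reward(
    response: str,
    checklist: Sequence[str],
    verify_item: Callable[[str, str], float],
) -> float:
    """Sparse pass/fail baseline: 1.0 only if every item passes."""
    return float(all(verify_item(response, item) >= 0.5 for item in checklist))


def keyword_verifier(response: str, item: str) -> float:
    """Toy stand-in for an LLM verifier (assumption): substring match."""
    return 1.0 if item.lower() in response.lower() else 0.0


checklist = ["hello", "world", "python"]
partial = soft_reward("Hello, world!", checklist, keyword_verifier)
strict = holistic_reward("Hello, world!", checklist, keyword_verifier)
print(partial, strict)
```

The response satisfies two of three items, so the soft reward gives partial credit while the holistic baseline gives none. This is the tradeoff the abstract names: a denser signal, at the risk of rewarding incomplete responses.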

2605.28543 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Cultural Binding Heads in Language Models

语言模型中的文化绑定头

Avrile Floro, Luca Benedetto

AI总结 通过机制可解释性和析因设计,识别出8个语言模型中2-3个中间层注意力头对文化绑定有因果贡献,且绑定主要在预训练阶段形成,知识探测表明模型知道的知识远多于其行为表现。

详情
AI中文摘要

大型语言模型通常默认对不同文化群体一视同仁,即使上下文需要区分:这缺乏差异意识。利用机制可解释性和Wang等人(2025)的N4文化挪用基准上的析因设计,我们在八个模型(四种架构,基础版和指令版)中识别出每个模型有2-3个中间层注意力头对文化绑定有因果贡献。文化绑定是将文化项目与适当身份关联的过程。敲除这些头上的身份到项目边会使绑定强度降低9-23%。识别出的头从指令模型转移到基础模型,表明文化绑定是在预训练阶段创建的。α缩放显示分级剂量反应,生成时适度放大引导(α=2-3)可将文化区分准确性提高1-3个百分点,同时基本保持中性推理不变。知识探测任务表明,模型知道的知识比其行为表现多3-5倍,表明瓶颈在于路由而非知识。

英文摘要

LLMs often default to equal treatment across cultural groups, even when context warrants differentiation: this is a lack of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding across eight models (four architectures, base and instruct). Cultural binding is the process of associating cultural items with the appropriate identity. Knockout of the identity-to-item edges on these heads lowers the binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding is created at pre-training. An α-scaling experiment shows a graded dose-response, and moderate amplification steering at generation (α = 2-3) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing and not knowledge.

2605.28534 2026-05-28 cs.CL 版本更新

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

GUI-CIDER:通过因果内化和密度感知示例重选进行GUI代理的中期训练

Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) Meituan(美团) Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出GUI-CIDER中期训练方法,通过因果内化和密度感知示例重选显式内化GUI世界知识,提升代理对GUI操作的理解和任务成功率。

详情
AI中文摘要

尽管多模态大语言模型在构建图形用户界面(GUI)代理方面取得了快速进展,但其现实世界任务完成从根本上受到缺乏GUI操作世界知识的瓶颈。现有解决方案通常依赖昂贵的多代理框架或传统的后训练范式,如监督微调(SFT)和强化学习(RL)。然而,后训练仅允许代理通过动作注释或奖励信号隐式吸收世界知识,导致低效的轨迹记忆而非真正理解。因此,一种能够显式学习这些知识的方法至关重要。为此,我们提出GUI-CIDER,一种通过因果内化和密度感知示例重选显式内化GUI世界知识的中期训练方法。GUI-CIDER分为三个阶段:(1)数据合成,从GUI轨迹中提取静态规划和动态因果知识为文本;(2)示例重选,通过奖励因果结构和惩罚语义冗余来过滤语料库;(3)中期训练,使用精炼数据嵌入所学知识。在两个GUI知识基准和三个任务完成基准上的大量实验表明,GUI-CIDER持续提升了代理对GUI操作的理解及其任务成功率。代码可在https://github.com/Wuzheng02/GUI-CIDER获取。

英文摘要

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rates. The code is available at https://github.com/Wuzheng02/GUI-CIDER.

2605.28526 2026-05-28 cs.AI cs.CL 版本更新

Entropy-aware Masking for Masked Language Modeling

面向掩码语言建模的熵感知掩码策略

Gokul Srinivasagan, Kai Hartung, Munir Georges

发表机构 * AImotion Bavaria(AImotion巴伐利亚) Technische Hochschule Ingolstadt(英戈尔施塔特技术大学)

AI总结 提出基于熵分布的掩码策略,通过模型预测熵识别信息量高的token进行掩码,并引入自掩码方法提升训练效率,在GLUE上平均提升5%。

Comments Accepted at the *SEM 2026 conference

详情
AI中文摘要

掩码语言建模已成为训练基于编码器的语言模型的标准预训练目标。在该方法中,输入中的某些token被掩码,模型学习利用周围上下文预测它们。这一过程使模型能够捕捉语言的句法和语义属性。传统上,用于掩码的token是随机选择的,这可能并不总是产生最有效的学习信号。在这项工作中,我们研究了一种基于熵分布的token掩码策略。我们利用模型在token预测上的熵来确定哪些token应被掩码。该方法旨在针对信息量更大、不确定性更高的token,以提高训练效率。我们还提出了一种新颖的自掩码方法,无需依赖外部参考模型即可增强训练效率。实验结果表明,与基线相比,我们的方法在GLUE分数上平均提升了5%。此外,我们尝试将知识蒸馏与熵掩码相结合,取得了最佳的整体结果。

英文摘要

Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model's entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
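
The selection step can be sketched as choosing the positions where the model's predictive distribution has the highest entropy. This is a hedged illustration only: the helper names, the top-k selection rule, and the 0.15 default ratio (the conventional MLM masking rate) are assumptions, since the abstract does not state how entropy is turned into a masking decision.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_mask(
    pred_dists: Sequence[Sequence[float]], mask_ratio: float = 0.15
) -> List[int]:
    """Return the sorted positions of the highest-entropy tokens."""
    n_mask = max(1, round(len(pred_dists) * mask_ratio))
    entropies = [token_entropy(d) for d in pred_dists]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:n_mask])


# One predictive distribution per token position (toy 4-word vocabulary).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
masked_positions = entropy_mask(dists, mask_ratio=0.5)
print(masked_positions)
```

In a self-masking setup the distributions would come from the model being trained rather than an external reference model. The sketch leaves that source open.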

2605.28521 2026-05-28 cs.CL 版本更新

ClinicalEncoder26AM: A Multilingual Diagnosable ColBERT Model; Evidences from the MultiClinNER Shared Task

ClinicalEncoder26AM:一个多语言可诊断的ColBERT模型——来自MultiClinNER共享任务的证据

François Remy

发表机构 * Parallia AI

AI总结 本文提出ClinicalEncoder26AM,一个基于BGE-M3的多语言可诊断ColBERT模型,通过多适配器蒸馏和ColBERT式检索目标进行临床后训练,在MultiClinNER任务中微调为BIO标注器,实现了最先进的多语言实体召回率和字符加权F1分数前五。

详情
AI中文摘要

ClinicalEncoder26AM是一个用于临床和生物医学文本的多语言可诊断ColBERT模型,它在多个层次上将其token级语义与ClinicalMap25对齐,ClinicalMap25是一个受BioLORD-2023启发并通过合成和标注监督丰富的临床潜在空间。后训练方案基于BGE-M3,结合了合成临床笔记、患者-医生对话以及MedMentions等标注资源,同时通过多适配器蒸馏考虑命名实体级和句子级表示,并采用ColBERT风格的检索目标。在这篇系统演示论文中,我们通过将模型微调为用于患者症状、疾病和程序范围的BIO标注器来评估其在MultiClinNER共享任务中的表现,使用轻量级两层CNN头部来改善局部边界检测。最终系统保持简单,在单个8192 token窗口中处理大多数文档,实现了最先进的多语言实体召回率,并在所有实体类型和语言的字符加权F1分数中达到前五。训练曲线进一步表明,ClinicalEncoder26AM比基础M3模型在数据效率上显著更高,支持其临床后训练对下游信息提取的有用性。模型可在https://huggingface.co/Parallia/ClinicalEncoder26AM-Diagnosable-Colbert-L2-for-multilingual-medical-texts下载。

英文摘要

ClinicalEncoder26AM is a multilingual Diagnosable ColBERT for clinical and biomedical texts, which aligns its token-level semantics at multiple levels with ClinicalMap25, a clinical latent space inspired by BioLORD-2023 and enriched with synthetic and annotated supervision. The post-training recipe builds upon BGE-M3, and combines synthetic clinical notes, patient-doctor conversations, and annotated resources such as MedMentions, while considering both named-entity-level and sentence-level representations in a multi-adapter distillation, along with a ColBERT-style retrieval objective. In this system demonstration paper, we evaluate the model in the MultiClinNER shared task by finetuning it as a BIO tagger for patient symptoms, disorders, and procedure spans, using a lightweight two-layer CNN head to improve local boundary detection. The resulting system remains simple, processes most documents in a single 8192-token window, achieves state-of-the-art multilingual entity recall, and ranks in the top five overall across all entity types and languages in character-weighted F1 scores. Training curves further show that ClinicalEncoder26AM is markedly more data-efficient than the base M3 model, supporting the usefulness of its clinical post-training for downstream information extraction. The model can be downloaded from https://huggingface.co/Parallia/ClinicalEncoder26AM-Diagnosable-Colbert-L2-for-multilingual-medical-texts

2605.28512 2026-05-28 cs.CL 版本更新

On Compositional Learning Behaviours in Formal Mathematics

论形式数学中的组合学习行为

Kevin Yandoka Denamganaï

发表机构 * University of York(约克大学)

AI总结 本文提出 S2B-LM 基准,通过去除数值处理混淆并添加思维链框架来评估组合学习行为(CLB),发现 CLB 能力对于形式数学验证的困难部分必要但不充分。

Comments work in progress, under review

详情
AI中文摘要

能够征服形式数学困难尾部的自我进化科学智能体需要组合学习行为(CLBs),即在上下文中基础化和重组新颖符号结构的能力,而不仅仅是预学习原子的重组。我们提出了S2B-LM,这是符号行为基准的一个改编,它移除了数值处理作为混淆因素,并添加了思维链框架以引发而非仅仅探测潜在的 CLB 能力。在 CLB 能力(adj-ZSCT)和 miniF2F 整体证明性能上交叉评估十个 Lean 4 定理证明器,精确置换检验建立了一个层次必要性结构:搜索密集型模型覆盖了可处理的绝大部分而没有可检测的 CLB,然而每个进入奥林匹克级别(miniF2F >75%)的模型都是五个最高 CLB 得分者之一(p=0.004)。在排除模型规模作为混淆因素后,我们的结果表明 CLB 能力对于形式数学验证的困难尾部是必要但不充分的。

英文摘要

Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs), the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose S2B-LM, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit rather than merely probe latent CLB competency. When ten Lean 4 theorem provers are cross-evaluated on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, exact permutation tests establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F > 75%) is among the five highest CLB scorers (p = 0.004). After ruling out model scale as a confound, our results show that CLB competency is necessary but not sufficient for the hard tail of formal mathematical verification.

2605.28500 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

功能熵:通过不确定性量化预测LLM生成代码的功能正确性

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

发表机构 * CVS Health(CVS健康)

AI总结 针对LLM生成代码功能不正确的问题,提出基于功能等价性的不确定性量化方法(功能熵),在多个编程语言和模型上优于现有方法。

详情
AI中文摘要

大型语言模型在代码生成方面表现出令人印象深刻的能力,但它们经常生成功能不正确的代码。不确定性量化(UQ)方法已成为检测自然语言生成中幻觉的有前途的方法,但它们在代码生成任务中的有效性仍未得到充分探索。我们系统地评估了UQ技术如何跨三种编程语言、五个LLM和超过1700个问题迁移到代码生成。我们发现,一些基于令牌概率的方法无需修改即可有效泛化,而依赖自然语言推理(NLI)的基于采样的方法失败,因为NLI模型无法区分功能不同的代码,导致大多数响应崩溃为单个语义簇。为了解决这个问题,我们引入了功能等价性方法,这是一类特定于代码的方法,用基于LLM的功能等价性评估取代基于NLI的语义等价性,包括功能熵,即语义熵的代码特定模拟。功能等价性方法在15个模型-基准组合中的11个中实现了最高的AUROC,并在大多数设置中实现了最佳校准,始终优于基于NLI的对应方法以及所有其他评估方法。

英文摘要

Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
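
The idea of clustering sampled programs by functional equivalence and taking the entropy over clusters can be sketched as below. Two substitutions are made for illustration and should be read as assumptions: equivalence is decided by comparing outputs on a few probe inputs (the paper instead uses an LLM-based equivalence assessment), and cluster probabilities are plain sample counts rather than likelihood-weighted. All function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Program = Callable[[int], int]


def cluster_by_equivalence(
    samples: Sequence[Program], equivalent: Callable[[Program, Program], bool]
) -> List[List[Program]]:
    """Greedily group samples; each is compared to a cluster's first member."""
    clusters: List[List[Program]] = []
    for sample in samples:
        for cluster in clusters:
            if equivalent(cluster[0], sample):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def functional_entropy(
    samples: Sequence[Program], equivalent: Callable[[Program, Program], bool]
) -> float:
    """Entropy (nats) of the empirical distribution over equivalence clusters."""
    clusters = cluster_by_equivalence(samples, equivalent)
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def same_outputs(f: Program, g: Program, probes=(0, 1, 3)) -> bool:
    """Stand-in equivalence check (assumption): agree on all probe inputs."""
    return all(f(x) == g(x) for x in probes)


# Four sampled "solutions": three double the input, one squares it.
samples = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2, lambda x: 2 * x]
h = functional_entropy(samples, same_outputs)
print(len(cluster_by_equivalence(samples, same_outputs)), round(h, 4))
```

Higher entropy means the samples disagree functionally, which is the signal used to flag likely-incorrect code. If all samples fall into one cluster, the entropy is zero.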

2605.28494 2026-05-28 cs.CL 版本更新

A new semantically annotated corpus with syntactic-semantic and cross-lingual senses

一个带有句法语义和跨语言义项的新语义标注语料库

Myriam Rakho, Eric Laporte, Matthieu Constant

发表机构 * Université Paris-Est(巴黎东部大学)

AI总结 本文构建了一个包含20个法语多义动词实例的新语义标注语料库,每个实例标注了三种义项:平行语料中的英语翻译、法语计算词典(Lexicon-Grammar表)条目以及两者的组合细粒度义项。

详情
Journal ref
Language Resources and Evaluation (LREC), 2012, Istanbul, Turkey, pp.597-600
AI中文摘要

我们描述了一个用于词义消歧的新义项标注语料库。该语料库由20个法语多义动词的实例组成。每个动词实例都标注了三种义项标签:(1) 该实例在平行语料库英语版本中的实际翻译,(2) 法语计算词典(Lexicon-Grammar表)中的动词条目,以及(3) 由翻译和Lexicon-Grammar条目拼接而成的细粒度义项标签。

英文摘要

We describe a new sense-tagged corpus for word sense disambiguation. The corpus consists of instances of 20 French polysemous verbs. Each verb instance is annotated with three sense labels: (1) the actual translation of the verb in the English version of this instance in a parallel corpus, (2) an entry of the verb in a computational dictionary of French (the Lexicon-Grammar tables) and (3) a fine-grained sense label resulting from the concatenation of the translation and the Lexicon-Grammar entry.

2605.28484 2026-05-28 cs.CL 版本更新

Comonadic Morphophonology: A Compositional Framework for Context-Dependent Morphological Rules in Finnish

共单子形态音系学:芬兰语上下文相关形态规则的组合框架

Yongseok Jang

发表机构 * Independent Researcher(独立研究者)

AI总结 提出一个基于共单子的组合框架,将每个形态音系规则表示为局部上下文到输出音段的函数,并通过Writer共单子实现长度变化规则的严格组合,显著减少规则表示规模并支持双向形态分析。

Comments 13 pages. Accepted at the Society for Computation in Linguistics (SCiL) 2026

详情
AI中文摘要

组合用于上下文相关形态音系规则(辅音渐变、元音和谐、所有格后缀同化)的有限状态转录机(FST)会导致乘法状态爆炸;神经模型规避了该问题,但未提供规则本身的形式化描述。我们提出了第一个框架,其中每个形态音系规则是从聚焦的局部上下文到单个输出音段的函数——类似于元胞自动机的局部规则类型——并且长度变化规则作为共单子的coKleisli箭头进行组合。我们的核心贡献是Writer共单子(DeletionSet x Zipper),一种新的代数构造,恢复了此类规则的严格coKleisli组合性:每个规则是一个coKleisli箭头,extend将其提升为全局变换,删除操作作为幺半群作用累积,而不需要中间物化。作为支持证据,十三个coKleisli箭头提供了一种替代形式化,表达了Omorfi通过874个延续类编码的相同形态音系行为(规则表示层面67:1的缩减),并且相同的抽象支持双向形态学——MorphGenerator重用分析箭头进行生成。在UD Finnish-TDT上,该系统仅使用规则消歧达到83.92%的UPOS准确率(使用外部后缀标注器达到94.66%),验证了该框架作为实用形态引擎的有效性。

英文摘要

Composing finite-state transducers (FSTs) for context-dependent morphophonological rules -- consonant gradation, vowel harmony, possessive suffix assimilation -- leads to multiplicative state explosion; neural models sidestep the problem but provide no formal account of the rules themselves. We present the first framework where each morphophonological rule is a function from a focused local context to a single output segment -- the type of a local rule familiar from cellular automata -- and where length-changing rules compose as coKleisli arrows of a comonad. Our central contribution is the Writer comonad (DeletionSet x Zipper), a new algebraic construction that restores strict coKleisli compositionality for such rules: each rule is a coKleisli arrow, extend lifts it to a global transformation, and deletions accumulate as a monoid action rather than requiring intermediate materialization. As supporting evidence, thirteen coKleisli arrows provide an alternative formulation expressing the same morphophonological behaviors that Omorfi encodes via 874 continuation classes (67:1 reduction at the rule-representation level), and the same abstraction enables bidirectional morphology -- a MorphGenerator reuses the analysis arrows for generation. On UD Finnish-TDT, the system achieves 83.92% UPOS accuracy with rule-only disambiguation (94.66% with an external suffix tagger), validating the framework as a practical morphological engine.
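
The "local rule lifted by extend" pattern can be sketched with a zipper over segments. This is a simplified illustration, not the paper's construction: it covers only a length-preserving rule, it does not model the Writer comonad or its deletion set, `extend` here returns a plain string rather than a zipper of results, and the `harmony` rule is a toy version of Finnish vowel harmony in which a hypothetical archiphoneme `A` is resolved from the nearest harmonic vowel to its left.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Zipper:
    left: Tuple[str, ...]   # segments before the focus, in order
    focus: str              # the segment currently in focus
    right: Tuple[str, ...]  # segments after the focus, in order


def extract(z: Zipper) -> str:
    """Counit: read the focused segment."""
    return z.focus


def extend(rule: Callable[[Zipper], str], segments: str) -> str:
    """Apply a local rule at every focus position of the ORIGINAL string."""
    segs = tuple(segments)
    return "".join(
        rule(Zipper(segs[:i], segs[i], segs[i + 1:])) for i in range(len(segs))
    )


BACK, FRONT = set("aou"), set("äöy")


def harmony(z: Zipper) -> str:
    """Toy rule: resolve 'A' to 'a' or 'ä' from the nearest harmonic vowel."""
    if z.focus != "A":
        return z.focus
    for seg in reversed(z.left):
        if seg in BACK:
            return "a"
        if seg in FRONT:
            return "ä"
    return "ä"  # only neutral vowels (or none) to the left: front variant


print(extend(harmony, "talossA"), extend(harmony, "kylässA"))
```

Because every rule application reads the unmodified input context, `extend(extract, s)` returns `s` unchanged, which is one of the comonad laws in this simplified setting.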

2605.28465 2026-05-28 cs.CL 版本更新

Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents

超越单一路径:评估与增强交互式LLM代理的发散性思维

Jihyeong Park, Ingeol Baek, Jeonghyun Park, Hwanhee Lee

发表机构 * Chung-Ang University(中央大学)

AI总结 提出交互式基准MUTATE和策略ReDNA,用于评估和增强LLM代理在路径和动作层面的发散性思维,解决现有框架中即时收敛压力导致的动作固定问题。

Comments 28 pages, 16 figures, 19 tables

详情
AI中文摘要

发散性思维是创造力的核心维度,然而现有对大型语言模型(LLM)的评估将其视为单轮文本生成,未能捕捉代理通过迭代交互进行推理的过程。为解决这一问题,我们引入MUTATE,一个交互式基准,旨在从两个层面评估代理的发散性思维:路径层面,代理发现通向同一目标的多个替代路径;动作层面,单个动作需要非典型、机制转换的物体使用。与仅评估成功不同,MUTATE对完成的路径和偏离路径的尝试都进行评分,捕捉传统成功率忽略的发散性推理。我们对前沿LLM的实验揭示了现有框架中的结构性盲点:当面临即时收敛压力时,它们倾向于陷入即时动作固定,无法改善动作层面的发散性。为克服这一点,我们提出ReDNA,它将无约束的发散候选生成与收敛约束选择分离。ReDNA在两个发散性层面上显著优于先前方法,并能有效泛化到外部创造力环境。我们还确认其成功源于弹性发散推理的定性增强,而非简单的环境探索。

英文摘要

Divergent thinking is a core dimension of creativity, yet existing evaluations of Large Language Models (LLMs) treat them as single-turn text generations, failing to capture how an agent reasons through iterative interaction. To address this, we introduce MUTATE, an interactive benchmark designed to evaluate agentic divergent thinking at two levels: path-level, where an agent discovers multiple alternative paths to the same goal, and action-level, where individual actions require non-typical, mechanism-shifting object uses. Unlike success-only evaluations, MUTATE scores both completed paths and off-path attempts, capturing divergent reasoning that conventional success rates discard. Our experiments with frontier LLMs reveal a structural blind spot in existing frameworks: when exposed to immediate convergence pressure, they tend to fall into immediate action fixation, failing to improve action-level divergence. To overcome this, we propose ReDNA, which separates unconstrained divergent candidate generation from convergent constraint selection. ReDNA significantly outperforms prior methods across both divergence levels and generalizes effectively to an external creativity environment. We also confirm its success stems from a qualitative enhancement of resilient divergent reasoning rather than simple environmental exploration.

2605.28464 2026-05-28 cs.CL cs.AI 版本更新

The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

LJP 从未见过的案件:面向更完整刑事责任评估的起诉决定预测

Junyu Lu, Qi Wei, Peishuo Zheng, Jie Zhang, Hui Huang, Qianru Wang, Chuan Xiao, Jianbin Qin, Shuyuan Zheng

发表机构 * Beijing Institute of Technology(北京理工大学) Osaka University(大阪大学) Xi’an University of Technology(西安理工大学) Institute of Science Tokyo(东京科学大学) City University of Hong Kong(香港城市大学)

AI总结 提出起诉决定预测(PDP)任务,通过分类起诉或三种不起诉决定,弥补法律判决预测(LJP)在刑事责任评估中的盲区,并构建PDP-Bench基准,实验表明大语言模型在PDP上表现显著差于LJP。

Comments 24 pages, 5 figures, 22 tables

详情
AI中文摘要

法律判决预测(LJP)已成为评估刑事法律领域人工智能的核心基准,但它只涉及已经通过检察审查并正式起诉的刑事案件。因此,LJP在评估刑事责任方面留下了大量盲区,忽略了证据不足、无刑事责任或免予处罚的案件。为填补这一空白,我们提出了起诉决定预测(PDP),这是首个围绕检察审查构建的法律AI任务,它将每个案件分类为起诉或三种不起诉决定之一,并反映了法律AI在证据评估、法律归类和基于价值的裁量方面的能力。我们进一步构建了PDP-Bench,一个包含4,630个真实中国检察决定、涵盖190个罪名的基准。大量实验表明,最先进的大语言模型在PDP上的表现显著差于LJP,且主流增强途径未能缩小差距。此外,受控的RLVR干预表明,简单的结果奖励无法产生可泛化的PDP判别能力。

英文摘要

Legal Judgment Prediction (LJP) has become a core benchmark for evaluating AI in the criminal legal domain, but it only sees criminal cases that have already passed prosecutorial review and been formally indicted. As a result, LJP leaves a substantial blind spot in assessing criminal liability, overlooking cases involving insufficient evidence, no criminal liability, or guilt exempted from punishment. To fill this gap, we propose Prosecution Decision Prediction (PDP), the first Legal AI task built around prosecutorial review, which classifies each case into prosecution or one of three non-prosecution decisions and reflects legal AI's capabilities in evidence evaluation, legal subsumption, and value-based discretion. We further construct PDP-Bench, a benchmark of 4,630 real Chinese prosecutorial decisions spanning 190 charges. Extensive experiments show that state-of-the-art LLMs perform substantially worse on PDP than on LJP and that mainstream enhancement routes fail to close the gap. Moreover, controlled RLVR interventions show that simple outcome rewards fail to produce generalizable PDP discrimination.

2605.28440 2026-05-28 cs.CL cs.LG 版本更新

AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

AdaDPO:具有平衡梯度更新的自适应直接偏好优化

Shaolong Chen, Madalina Ciobanu, Qingqing Mao, Ritankar Das

发表机构 * Incept Labs(Incept实验室)

AI总结 针对DPO中梯度不对称导致模型偏向避免不良回答而非生成优质回答的问题,提出AdaDPO算法,通过引入基于策略模型生成概率的自适应系数来平衡正负偏好梯度,在AlpacaEval 2上优于DPO并缓解长度偏差。

Comments 5 figures

详情
AI中文摘要

DPO已成为替代RLHF用于将LLM与人类偏好对齐的广泛采用方法,无需单独的奖励模型或RL循环。最近的理论分析揭示了DPO中不对称的梯度行为:损失抑制不偏好响应的速度远快于促进偏好响应,导致模型学习避免生成坏答案而非生成好答案。我们提出AdaDPO,一种DPO算法的自适应变体,它引入了基于策略模型生成概率的每偏好对、基于停止梯度的系数,并以参考模型的概率作为可选组件。AdaDPO旨在强制偏好和不偏好概率的梯度幅度相等;实际实现平衡了每token梯度并应用数值裁剪边界以保持稳定性,同时保留DPO的原始超参数结构。在SimPO类似设置下使用UltraFeedback训练的Llama-3-8B-Instruct上,AdaDPO在AlpacaEval 2上持续优于DPO:它在81%的超参数组合中实现了更高的长度控制胜率(LC),达到了全局最佳LC(48.3%)和原始胜率(46.1%),并在88%的组合中扩大了LC与WR的差距,表明有效缓解了长度偏差。对KL散度、奖励边际和奖励准确率的额外分析证实,AdaDPO纠正了梯度不平衡并产生了更高效的优化。由于它纯粹在损失层面操作,AdaDPO可以无缝集成到现有的基于偏好的对齐流程中,无需改变数据收集或模型架构。该方法仅需几行代码,并且相同的自适应原理可推广到广泛的成对对比偏好损失族,包括SimPO、R-DPO、IPO、CPO和ORPO。

英文摘要

DPO has become a widely adopted alternative to RLHF for aligning LLMs with human preferences, eliminating the need for a separate reward model or RL loop. Recent theoretical analysis uncovers an asymmetric gradient behavior in DPO: the loss suppresses dispreferred responses substantially faster than it promotes preferred ones, causing the model to learn to avoid bad answers rather than to generate good ones. We propose AdaDPO, a self-adaptive variant of the DPO algorithm that introduces per-preference-pair, stop-gradient-based coefficients derived directly from the policy model's generation probabilities, with the reference model's probabilities as an optional component. AdaDPO is constructed to enforce equality of gradient magnitudes between preferred and dispreferred probabilities; the practical implementation balances per-token gradients and applies a numerical clipping bound for stability, while retaining DPO's original hyperparameter structure. On Llama-3-8B-Instruct trained on UltraFeedback under a SimPO-like setup, AdaDPO consistently outperforms DPO on AlpacaEval 2: it achieves higher length-controlled win rates (LC) in 81% of hyperparameter combinations, attains the global best LC (48.3%) and raw win rate (46.1%), and enlarges the LC-over-WR margin in 88% of combinations, indicating effective mitigation of length bias. Additional analyses on KL divergence, reward margin, and reward accuracy confirm that AdaDPO rectifies the gradient imbalance and yields more efficient optimization. Because it operates purely at the loss level, AdaDPO can be dropped into existing preference-based alignment pipelines without changing data collection or model architectures. The method requires only a few lines of code, and the same self-adaptive principle generalizes to a broad family of pairwise contrastive preference losses including SimPO, R-DPO, IPO, CPO, and ORPO.
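
The gradient asymmetry mentioned in this abstract can be illustrated with the standard DPO loss. The short derivation below is a sketch that treats the two response probabilities as independent variables; the symbols $u$, $\pi_w$, and $\pi_l$ are shorthand introduced here, and the abstract does not give AdaDPO's actual coefficients, so none are shown.

```latex
% Standard DPO loss for one preference pair (y_w preferred over y_l)
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma(u), \qquad
u = \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
              - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right]

% Shorthand: \pi_w = \pi_\theta(y_w \mid x), \; \pi_l = \pi_\theta(y_l \mid x).
% Using d(-\log\sigma(u))/du = -\sigma(-u):
\frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial \pi_w} = -\frac{\beta\,\sigma(-u)}{\pi_w},
\qquad
\frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial \pi_l} = +\frac{\beta\,\sigma(-u)}{\pi_l}

% Ratio of gradient magnitudes
\left| \frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial \pi_l} \right|
\Big/
\left| \frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial \pi_w} \right|
= \frac{\pi_w}{\pi_l}
```

Whenever the policy already assigns less probability to the dispreferred response than to the preferred one, this ratio exceeds 1. The push-down on the dispreferred response is then larger than the push-up on the preferred one, and the gap widens as training proceeds.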

2605.28438 2026-05-28 cs.CL 版本更新

Breaking the Script Barrier: Enabling Automatic Alignment for PoS-based ASR Error Analysis in Non-Latin Scripts

打破文字障碍:实现非拉丁文字中基于词性的ASR错误分析的自动对齐

Prasenjit K Mudi, Dahlia Devapriya, Sheetal Kalyani

发表机构 * Indian Institute of Technology Madras(印度理工学院马德拉斯学院)

AI总结 提出一种语言无关的自动对齐机制,使基于词性标注的ASR错误分析能在拉丁和非拉丁脚本中可靠进行,并应用于多种书写系统以提升WER。

详情
AI中文摘要

自动语音识别(ASR)系统通常使用词错误率(WER)等聚合指标进行评估,但这些指标无法捕捉错误的语言结构。细粒度分析(如基于词性(PoS)的错误特征)需要ASR假设与参考转录之间的准确对齐。然而,现有的对齐工具对于使用非拉丁文字的语言通常不可靠。在这项工作中,我们通过提出一种鲁棒、自动、语言无关的对齐机制来填补这一空白,该机制适用于各种ASR架构以及使用拉丁和非拉丁文字的语言。这使得假设、参考和评估序列能够一致对齐,为下游语言分析奠定基础。在此基础上,我们使用标准PoS标注器进行可扩展且可重复的基于PoS的错误分析。值得注意的是,我们对三种主要的分段书写系统进行了对齐和下游ASR错误分析,即元音附标文字(泰米尔语、印地语、卡纳达语)、字母文字(英语、俄语、希腊语)和辅音文字(阿拉伯语)。我们进一步展示了如何在ASR训练中利用此类错误信息来改进WER等指标。

英文摘要

Automatic Speech Recognition (ASR) systems are commonly evaluated using aggregate metrics such as Word Error Rate (WER), which do not capture the linguistic structure of errors. Fine-grained analysis, such as Part-of-Speech (PoS)-wise error characterization, requires accurate alignment between ASR hypotheses and reference transcriptions. However, existing alignment tools are often unreliable for languages written in non-Latin scripts. In this work, we address this gap by proposing a robust, automated, language-agnostic alignment mechanism applicable across ASR architectures and across languages written in both Latin and non-Latin scripts. This enables consistent alignment of hypotheses, references, and evaluation sequences, forming the basis for downstream linguistic analysis. Building on this, we employ standard PoS taggers to perform scalable and reproducible PoS-wise error analysis. Notably, we perform alignment and downstream ASR error analysis across three major segmented writing systems, namely, Abugida (Tamil, Hindi, Kannada), Alphabetic (English, Russian, Greek), and Abjad (Arabic). We further demonstrate how such error information can be leveraged during ASR training to improve metrics such as WER.
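
To make the alignment requirement concrete, here is a minimal sketch of word-level alignment by edit distance, the usual basis for WER. It is not the mechanism proposed in the paper, which the abstract does not detail; the `align` function and the Hindi example are illustrative assumptions. Operating on whitespace-separated Unicode tokens keeps the sketch independent of script.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (reference word, hypothesis word)


def align(ref: List[str], hyp: List[str]) -> List[Pair]:
    """Align two token lists; None marks a deletion or an insertion."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # word missing from hypothesis
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # extra word in hypothesis
            j -= 1
    pairs.reverse()
    return pairs


reference = "मैं घर जा रहा हूँ".split()
hypothesis = "मैं घर रहा हूँ".split()
aligned = align(reference, hypothesis)
errors = sum(1 for r, h in aligned if r != h)
print(aligned, errors / len(reference))
```

Once each reference word is paired with a hypothesis word (or with `None`), a PoS tag on the reference side can be attached to every error, which is what per-PoS error analysis needs.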

2605.28433 2026-05-28 cs.CL 版本更新

Roles with Rails: Contract-Preserving Role Evolution in Multi-Agent Structured Reasoning

角色与轨道:多智能体结构化推理中保持契约的角色演化

Ling-Yue Ge, Lan-Zhe Guo

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院)

AI总结 提出SERO框架,通过契约保持的角色演化机制(信用引导检索、保护终端聚合器、条件验证器修复、上下文赌博机控制器)解决多智能体系统中角色漂移和契约破坏问题,在真实推理基准上验证有效性。

Comments 33 pages, 23 figures, 12 tables

详情
AI中文摘要

基于角色的LLM多智能体系统需要自适应角色池,但适应此类系统不仅仅是提示优化的问题:角色通常带有结构性义务,包括能力覆盖、消息兼容性、验证、最终答案聚合以及解析器兼容的输出协议。现有系统要么固定角色清单而失去自适应性,要么允许无约束生成导致角色漂移,移除结构上必要的角色并破坏答案契约。我们将此形式化为保持契约的角色演化,要求每次提交的编辑保留五个结构性契约(能力、通信、验证、聚合、输出协议)。我们在SERO(自演化角色编排框架)中实例化这一形式化,该框架通过信用引导检索、带有保护终端聚合器和条件验证器修复的信用排序通信DAG,以及一个上下文赌博机控制器来演化类型化角色卡池,其中LLM提出的编辑仅在它们保持契约并提高任务分数时被提交。在三个LLM骨干上的真实世界推理基准实验证实了保持契约的角色演化的价值。

英文摘要

Role-based LLM multi-agent systems need adaptive role pools, yet adapting such systems is not merely a matter of prompt optimization: roles often carry structural obligations, including capability coverage, message compatibility, validation, final-answer aggregation, and parser-compatible output protocols. Existing systems either fix the role inventory and lose adaptivity, or allow unconstrained generation to induce role drift, removing structurally necessary roles and breaking answer contracts. We formulate this as contract-preserving role evolution, requiring every committed edit to preserve five structural contracts (capability, communication, validation, aggregation, output protocol). We instantiate this formulation in SERO, a Self-Evolving Role Orchestration framework that evolves a typed role-card pool through credit-guided retrieval, a credit-ranked communication DAG with a protected terminal aggregator and conditional validator repair, and a contextual-bandit controller whose LLM-proposed edits are committed only when they preserve the contracts and improve task score. Experiments on real-world reasoning benchmarks across three LLM backbones confirm the value of contract-preserving role evolution.
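
The commit rule stated in this abstract (an edit is accepted only if it preserves all five contracts and improves the task score) can be sketched as a simple gate. Everything concrete here is a hypothetical simplification: a role pool is reduced to a set of role names, the contract checks are toy predicates, and `len` stands in for a task score. SERO's retrieval, DAG construction, and bandit controller are not modeled.

```python
from typing import Callable, Dict, FrozenSet

# The five structural contracts named in the abstract.
CONTRACTS = ("capability", "communication", "validation", "aggregation", "output_protocol")

RolePool = FrozenSet[str]
Check = Callable[[RolePool], bool]


def commit_edit(
    current: RolePool,
    proposed: RolePool,
    checks: Dict[str, Check],
    score: Callable[[RolePool], float],
) -> RolePool:
    """Accept the proposed pool only if every contract holds AND score improves."""
    preserves = all(checks[name](proposed) for name in CONTRACTS)
    if preserves and score(proposed) > score(current):
        return proposed
    return current


# Toy contract checks (assumptions): two are structural, the rest always pass.
checks: Dict[str, Check] = {name: (lambda pool: True) for name in CONTRACTS}
checks["aggregation"] = lambda pool: "aggregator" in pool
checks["validation"] = lambda pool: "validator" in pool

pool: RolePool = frozenset({"planner", "validator", "aggregator"})
grown = commit_edit(pool, pool | {"critic"}, checks, len)
dropped = commit_edit(pool, (pool - {"aggregator"}) | {"critic", "coder"}, checks, len)
print(sorted(grown), sorted(dropped))
```

The second proposal scores higher under the toy metric but removes the aggregator, so it is rejected. This is the role-drift failure that the contract gate is meant to block.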

2605.28424 2026-05-28 cs.CL 版本更新

Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Skill0.5:面向智能体强化学习中分布外泛化的联合技能内化与利用

Jiapeng Zhu, Jianxiang Yu, Yibo Zhao, Chengcheng Han, Qi Gu, Xunliang Cai, Xiang Li, Weining Qian

发表机构 * East China Normal University(华东师范大学) Meituan Longcat Team(美团Longcat团队)

AI总结 提出Skill0.5框架,通过区分通用技能内化与任务特定技能利用,结合动态难度感知路由器,在ALFWorld和WebShop上提升了分布内和分布外场景的性能。

详情
AI中文摘要

将显式技能赋予大型语言模型已成为使自主智能体解决复杂任务的一种有前景的范式。智能体技能可以内在地分为用于广泛认知迁移的通用技能和用于动态执行的任务特定技能。然而,现有的基于技能的强化学习方法通常强制在完全外化(导致高昂的上下文开销)和完全内化(存在过拟合和知识冲突风险)之间做出僵化选择。为了解决这一困境,我们提出了Skill0.5,一种新颖的智能体强化学习框架,通过结合通用技能内化与任务特定技能利用来明确区分技能处理方式。在动态、难度感知路由器的驱动下,Skill0.5将任务流式传输到不同的掌握层级,以应用定制的优化策略:它通过特权蒸馏内化通用技能,为困难任务构建认知基础,同时在简单任务上使用诊断性探测来惩罚捷径并强制特定技能利用。在ALFWorld和WebShop上的实验表明,Skill0.5优于基于记忆和基于技能的强化学习基线,在分布内和分布外场景中均实现了性能提升。

英文摘要

Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates skill treatments by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 streams tasks into distinct mastery tiers to apply tailored optimization strategies: it internalizes general skills via privileged distillation to build a cognitive foundation for hard tasks, while using diagnostic probing on easy tasks to penalize shortcuts and enforce specific skill utilization. Experiments on ALFWorld and WebShop demonstrate that Skill0.5 outperforms both memory-based and skill-based RL baselines, yielding performance improvements across both in-distribution and out-of-distribution scenarios.

2605.28389 2026-05-28 cs.CL 版本更新

FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

FABSVer: 更快的训练与更好的自验证用于大语言模型数学推理

Haihui Pan, Junwei Bao, Hongfei Jiang, Yang Song

发表机构 * Zuoyebang Education Technology(作业帮教育科技)

AI总结 提出FABSVer方法,通过融合解生成与自验证为单次前向传播,并引入动态参考模型更新(DRMU)突破奖励瓶颈,在三个模型规模上实现更优的自验证与推理性能,训练时间仅为现有方法的51%-71%。

详情
AI中文摘要

尽管大语言模型在数学推理方面取得了显著进展,但它们在判断自身解决方案的正确性方面仍然不可靠。现有的为模型配备自验证能力的方法通常将解生成和验证视为两个独立的任务,导致训练时间大幅增加。在本文中,我们提出FABSVer,将这两个任务融合为单次生成过程,在联合优化两种能力的同时显著降低训练开销。我们进一步从理论和实验上识别出一个收敛瓶颈:随着训练进行,由于策略受固定参考模型约束,奖励达到平台期。为克服这一问题,我们引入动态参考模型更新(DRMU),提高了奖励上限并实现持续的奖励增长。在数学基准上的大量实验表明,FABSVer在三个模型规模上实现了优越的自验证和推理性能,同时仅需现有方法训练时间的51%–71%。分析进一步揭示了模型获取自验证能力的不同学习阶段,并且随着模型规模增大,验证奖励与答案奖励之间的差距显著缩小。

英文摘要

While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51%–71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.

2605.28375 2026-05-28 cs.CL 版本更新

PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature

PrionNER: 朊病毒病生物医学文献命名实体识别数据集

An Dao, Nhan Ly, Thao Tran, Yuji Matsumoto, Akiko Aizawa

发表机构 * The University of Tokyo(东京大学) Medical Doctor, Independent Researcher(医学博士,独立研究员) Center for Language AI Research, Tohoku University(东北大学语言人工智能研究中心) RIKEN Center for Advanced Intelligence Project(理化学研究所革新智能统合研究中心) National Institute of Informatics(国立情报学研究所)

AI总结 针对朊病毒病临床信息,构建了手动标注的命名实体识别数据集PrionNER,包含317篇摘要、15种粗粒度和31种细粒度实体类型,并评估了监督和零样本模型性能。

Comments 29 pages, 5 figures, accepted at ACL 25th Workshop on Biomedical Language Processing (BioNLP 2026)

详情
AI中文摘要

朊病毒病是一种罕见、快速进展且致命的神经退行性疾病,由于非特异性临床表现,早期诊断困难。然而,据我们所知,目前尚无公开的、专注于朊病毒病的数据集,用于从生物医学文献中捕获广泛的临床相关实体。我们推出了PrionNER,一个针对PubMed摘要中朊病毒病临床信息的手动标注命名实体识别数据集。当前版本包含317篇摘要、2,943个句子和6,955个文本绑定实体标注,涵盖15种粗粒度和31种细粒度临床导向实体类型,涉及疾病、症状、诊断、发现、解剖、治疗以及时间和统计证据。标注者间一致性达到81.78%的精确匹配F1值,表明标注一致性较强。我们在PrionNER上对监督BERT基线、W2NER和零样本提取器进行了基准测试。W2NER是最强的监督模型,Gemma-4-31B是最强的零样本模型,但基准测试仍具有挑战性,尤其是对于结构复杂的提及和细粒度的临床邻近标签区分。PrionNER为朊病毒病信息提取提供了临床基础的基准,并支持低资源、细粒度及非平面提取条件下的罕见病生物医学NLP研究。数据集、标注指南和评估脚本可在https://github.com/daotuanan/PrionNER/获取。

英文摘要

Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/.

2605.28365 2026-05-28 cs.AI cs.CL cs.LO 版本更新

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

风险控制的 Lean 作为自然语言数学推理的评判者

Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar

发表机构 * Imperial College London(伦敦帝国理工学院) Huawei Noah’s Ark Lab(华为诺亚方舟实验室) UCL Centre for AI(伦敦大学学院人工智能中心)

AI总结 针对 Lean 评判自然语言数学答案时信号稀疏且不忠实的问题,提出 COVCAL 选择器,通过有限样本选择性风险控制,在自动形式化覆盖率足够高时保证接受答案的准确率。

详情
AI中文摘要

Lean 越来越多地被用于评判自然语言数学答案,但其信号是不完全的:许多答案从未被形式化,而一个失败的证明可能反映类型错误或缺少库事实,而非答案错误。在 MATH-500 上,我们表明该信号 (i) 严重依赖于覆盖率,即在证明覆盖率高的答案中正确率为 96%,但在覆盖率低时为 20%,以及 (ii) 稀疏且常常不忠实:一个 7B 自动形式化器仅对 28% 的问题证明了某个类别,而人工审计发现其中只有约 43% 的证明是忠实的。我们提出 COVCAL,一个基于 Lean 跟踪诊断的选择器,它在两种机制(保守的 Bonferroni 界和更紧的 dev-then-cal 规则)下,对接受的答案认证有限样本选择性风险界,否则弃权。可行性取决于自动形式化覆盖率:对于 7B 形式化器,信号过于稀疏,Bonferroni 在所有 20 个自助法分区上弃权,而一个专用于证明器的形式化器达到 79% 的覆盖率,并在 20 个分区中的 17 个上使其可行,以 0.98 的接受准确率接受约 48% 的问题。由于自一致性本身已达到 91% 的准确率,我们的贡献是精确描述了何时以及使用哪个形式化器,部分形式化信号可以在风险控制下被信任。

英文摘要

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show that this signal is (i) sharply coverage-dependent: the proof-winning answer is correct 96% of the time at high proved coverage but only 20% of the time at low coverage, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
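
To make the "certify a bound or abstain" idea concrete, here is a generic selective-risk sketch. It is not COVCAL itself: it assumes a scalar confidence score per answer, a held-out calibration set, a Hoeffding upper bound on the error rate of accepted answers, and a Bonferroni split of the failure probability across candidate thresholds. All function and parameter names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def hoeffding_upper(errors: int, n: int, delta: float) -> float:
    """Upper confidence bound on the true error rate, valid with probability 1 - delta."""
    return errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def pick_threshold(scores: Sequence[float],
                   correct: Sequence[bool],
                   thresholds: Sequence[float],
                   alpha: float,
                   delta: float) -> Optional[float]:
    """Return the loosest threshold whose certified risk is <= alpha, or None to abstain."""
    per_test_delta = delta / len(thresholds)  # Bonferroni correction over thresholds
    best: Optional[float] = None
    for t in thresholds:
        accepted: List[bool] = [c for s, c in zip(scores, correct) if s >= t]
        if not accepted:
            continue  # nothing accepted at this threshold, so nothing to certify
        bound = hoeffding_upper(accepted.count(False), len(accepted), per_test_delta)
        if bound <= alpha and (best is None or t < best):
            best = t  # lower threshold means higher coverage
    return best


# Dense signal: 200 confident, correct answers make a 10% risk target certifiable.
dense = pick_threshold([0.9] * 200 + [0.1] * 200,
                       [True] * 200 + [False] * 200,
                       thresholds=[0.5, 0.05], alpha=0.1, delta=0.1)

# Sparse signal: 10 answers are too few to certify anything, so the selector abstains.
sparse = pick_threshold([0.9] * 10, [True] * 10,
                        thresholds=[0.5, 0.05], alpha=0.1, delta=0.1)
```

The sparse case mirrors the abstract's point about coverage: even when every accepted answer is correct, too few calibration points leave the bound too loose to certify, and the selector has to abstain.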

2605.28363 2026-05-28 cs.CL 版本更新

PubMedCausal: A Span-Level Annotated Corpus for Causal Relation Extraction in Biomedical Text

PubMedCausal: 用于生物医学文本中因果关系抽取的跨度级标注语料库

Ifeoluwa Kunle-John, Josiah Paul, Oluwatosin Agbaakin, Peter Aina, Ikenna Odezuligbo, Sydney Anuyah

发表机构 * Edyah Limited(Edyah有限公司) University of Ibadan(伊巴丹大学) Indiana University(印第安纳大学) Creighton University(克雷顿大学)

AI总结 为解决现有资源将因果关系与广义关联混淆、限制句子级标注或仅关注显式因果线索的问题,构建了基于PubMed摘要的跨度级因果关系抽取语料库PubMedCausal,包含30,000个段落级行、3,945个因果行和6,491个经裁决的因果对,并基准测试了判别式编码器和开源生成模型。

Comments Submitted to EMNLP 2026, 8 Pages, 23 page appendix

详情
AI中文摘要

因果关系抽取(CRE)是生物医学文本挖掘的核心,但当前资源常将因果关系与更广泛的关联混淆,将标注限制在句子级别,或主要关注显式因果线索。这限制了它们在评估模型是否能恢复生物医学文本中实际表达的因果主张方面的实用性。我们引入了PubMedCausal,一个基于PubMed摘要构建的生物医学CRE跨度级标注语料库。该语料库包含30,000个段落级行,包括3,945个因果行和6,491个经裁决的因果对。每个因果关系都标注了全文的原因和结果跨度、因果类型以及句子性,从而支持因果检测和全跨度因果抽取的评估。我们在检测和抽取设置下对判别式编码器和开源生成模型进行了基准测试。对于因果检测,生物医学编码器表现最强,PubMedBERT达到F1分数0.7391。对于跨度级抽取,最佳生成基线是DeepSeek-R1-32B配合少样本提示,达到余弦对F1分数0.6765。我们进一步通过评估在PubMedCausal上训练的编码器在外部因果关系数据集上的表现来测试迁移学习,表明该资源支持跨数据集评估。我们的结果表明,在类别不平衡、长因果跨度、隐式因果关系、跨句关系以及提示敏感性下,生物医学CRE仍然困难。代码和数据可在此处找到:https://github.com/josiahpaul07/PubMedCausal_Exp

英文摘要

Causal relation extraction (CRE) is central to biomedical text mining, but current resources often conflate causal relations with broader associations, restrict annotation to sentence-level examples, or focus mainly on explicit causal cues. This limits their usefulness for evaluating whether models can recover causal claims as they are actually expressed in biomedical text. We introduce PubMedCausal, a span-level annotated corpus for biomedical CRE built from PubMed abstracts. The corpus contains 30,000 paragraph-level rows, including 3,945 causal rows and 6,491 adjudicated cause-effect pairs. Each causal relation is annotated with full-text cause and effect spans, causality type, and sententiality, enabling evaluation of both causal detection and full-span causal extraction. We benchmark discriminative encoders and open-source generative models across detection and extraction settings. For causal detection, biomedical encoders are strongest, with PubMedBERT reaching an F1 score of 0.7391. For span-level extraction, the best generative baseline is DeepSeek-R1-32B with few-shot prompting, reaching a Cosine Pair F1 of 0.6765. We further test transfer learning by evaluating PubMedCausal-trained encoders on external causal relation datasets, showing that the resource supports cross-dataset evaluation. Our results show that biomedical CRE remains difficult under class imbalance, long causal spans, implicit causality, inter-sentential relations, and prompt sensitivity. Code and data are available at https://github.com/josiahpaul07/PubMedCausal_Exp

2605.28346 2026-05-28 cs.CL 版本更新

When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

当话语压力冲突时:视觉-语言模型输出中的信息结构

Marcell Fekete, Johannes Bjerva, Tamás Káldi

发表机构 * Department of Computer Science, Aalborg University(奥尔堡大学计算机科学系) Department of Psycholinguistics and Neurolinguistics, ELTE Research Centre for Linguistics(ELTE语言学研究中心心理语言学与神经语言学系) ELTE Bárczi Gusztáv Faculty of Special Needs Education(ELTE Bárczi Gusztáv特殊教育学院)

AI总结 研究视觉-语言模型在视觉问答中是否区分话语旧主题和新焦点,发现模型虽产生信息结构相关结构但过度正则化,倾向于窄响应模板,类似模式崩溃。

详情
AI中文摘要

视觉-语言模型(VLM)越来越多地被评估是否能识别正确的视觉内容,但关于它们是否以话语适当的形式表达这些内容却知之甚少。我们利用信息结构(IS)来填补这一研究空白,测试VLM在视觉基础问答中是否能区分话语旧主题(Topic)和话语新焦点(Focus)。我们利用匈牙利语,其中主题和焦点映射到专门的句法位置,使得IS选择在文本中可观察。通过比较六种VLM与人类参与者,我们发现模型产生了与IS相关的结构,但过度正则化了这种敏感性。在话语状态、语法角色(主语主题偏好)和限定性(不定焦点偏好)的相互作用压力下,人类选择多种IS实现策略。相比之下,VLM坍缩为狭窄的响应模板,类似于模式崩溃(Kirk等人,2024)。我们的发现表明,VLM评估应超越内容准确性,关注内容如何为话语打包。

英文摘要

Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using information structure (IS), testing whether VLMs distinguish discourse-old Topics from discourse-new Foci in visually grounded question answering. We exploit Hungarian, a language in which Topic and Focus map onto dedicated syntactic positions, making IS choices observable in text. Comparing six VLMs with human participants, we find that models produce IS-relevant constructions, but over-regularise this sensitivity. Under the interacting pressures of discourse status, grammatical role (preference for subject Topics) and definiteness (preference for indefinite Foci), humans choose variable strategies for IS realisation. VLMs, by contrast, collapse onto narrow response templates, resembling mode collapse (Kirk et al., 2024). Our findings suggest that VLM evaluation should look beyond content accuracy to how content is packaged for the discourse.

2605.28315 2026-05-28 cs.CL 版本更新

HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

HardMTBench:知识密集型领域的中英翻译压力测试

Zheng Li, Mao Zheng, Mingyang Song, Tianxiang Fei

发表机构 * Large Language Model Department, Tencent(腾讯大语言模型部门)

AI总结 针对现有中英翻译基准饱和问题,提出HardMTBench,一种难度感知的诊断基准,通过多阶段构建和难度融合规则,在12个知识密集型领域上显著扩大系统性能差异并暴露术语和知识弱点。

详情
AI中文摘要

通用机器翻译基准(如FLORES-200)在中英对上已达到饱和状态,现代大语言模型的高分区间狭窄。在22个系统中,FLORES-200中英GEMBA分数落在7.87分范围内,标准差为2.29,这压缩了系统在金融、医疗、法律、科技等知识密集型领域上的区分度。我们提出HardMTBench,一个面向双向中英领域翻译的难度感知诊断基准。HardMTBench涵盖12个领域,包含10,000条人工筛选的源句及其参考译文,打包为20,000个方向性测试项。一个三阶段构建流程构建了包含84,566对的领域平衡候选池,应用基于LLM的多信号评判器评估知识密度、翻译难度、术语负载和参考正确性,并在难度融合规则下按领域配额组装最终测试集。在涵盖通用LLM、商业引擎和专业翻译模型的22个系统上,HardMTBench将跨系统的GEMBA范围相比FLORES-200扩大了约两倍,引发明显的排名重排,并暴露了仅靠质量指标往往掩盖的领域特定术语和知识弱点。所有数据和代码已在https://github.com/jasonNLP/HardMTBench开源。

英文摘要

General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge-intensive domains such as finance, healthcare, law, and science and technology. We introduce HardMTBench, a difficulty-aware diagnostic benchmark for bidirectional Chinese-English domain translation. HardMTBench covers 12 domains and contains 10,000 hand-curated source sentences with reference translations, packaged as 20,000 directional test items. A three-stage construction pipeline builds a domain-balanced candidate pool of 84,566 pairs, applies an LLM-based multi-signal judge over knowledge density, translation difficulty, terminology load and reference correctness, and assembles the final test set under a hardness fusion rule with per-domain quotas. Across 22 systems spanning general LLMs, commercial engines and specialised MT models, HardMTBench widens the cross-system GEMBA range by roughly a factor of two over FLORES-200, induces visible rank reorderings, and exposes domain-specific terminology and knowledge weaknesses that quality-only metrics tend to flatten. All data and code are open-sourced at https://github.com/jasonNLP/HardMTBench.

2605.28313 2026-05-28 cs.CL 版本更新

Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

基于大语言模型的论证质量评估:一种成对Bradley-Terry方法

Nicolás Benjamín Ocampo, Agnes Paullate Nyiranziza, Davide Ceolin

发表机构 * Centrum Wiskunde & Informatica, The Netherlands(荷兰国家数学与计算机科学研究中心) Vrije Universiteit Amsterdam, The Netherlands(荷兰阿姆斯特丹自由大学)

AI总结 本研究利用12种开源大语言模型,通过成对比较和Bradley-Terry模型评估论证质量,发现Llama-70B与人类专家判断具有中等一致性(Cohen's κ=0.493),其他模型表现各异但互补。

详情
AI中文摘要

大语言模型(LLMs)在推理和判断相关任务中展现出显著能力。然而,评估论证质量需要严格的评价。我们研究了LLMs有效执行此任务的程度。我们在零样本、少样本和思维链设置下测试了12种不同规模和系列的开源LLMs,以近似专家在逻辑、修辞和辩证三个维度上对论证质量的成对比较,并将这些比较用于Bradley-Terry模型,以推断潜在强度分数并得出论证排名。我们的见解表明,LLMs与人类专家判断具有有希望但中等程度的相关性,其中Llama-70B获得最强一致性,达到中等Cohen's κ=0.493,并且与从这些标注导出的Bradley-Terry分数具有中等相关性(Kendall、Pearson和Spearman:0.327-0.477)。其他LLMs与Llama-70B表现出弱、中等或高度一致性,同时在与人类专家比较中取得可比结果,表明尽管模型规模和系列存在差异,但对潜在质量维度具有部分但互补的理解。此外,LLM预测在试验运行中稳定,少于7.75%的情况产生不同标签。剩余变异性通过多数投票和少样本提示对大模型进行处理。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought prompting to approximate expert pairwise comparisons of argument quality across three dimensions (logical, rhetorical, and dialectic), and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our insights show that LLMs have promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching moderate Cohen's κ = 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477). Other LLMs exhibit weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts, suggesting partial but complementary understanding of underlying quality dimensions despite differences in model size and family. Moreover, LLM predictions are stable across trial runs, with fewer than 7.75% of cases yielding different labels. Remaining variability is handled via majority voting and few-shot prompting for large-size models.
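
For readers unfamiliar with the Bradley-Terry step, the sketch below fits latent strengths from pairwise outcomes using the classic minorization-maximization update. It is a generic illustration, not the authors' code: the function name, the iteration count, and the toy comparison data are all assumptions, and the model gives P(i beats j) = p_i / (p_i + p_j).

```python
from typing import Dict, List, Tuple


def fit_bradley_terry(n_items: int,
                      comparisons: List[Tuple[int, int]],
                      iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    wins = [0.0] * n_items
    pair_counts: Dict[Tuple[int, int], int] = {}
    for winner, loser in comparisons:
        wins[winner] += 1.0
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in pair_counts.items():
            d = n / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        # MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j), then renormalize.
        p = [wins[i] / denom[i] if denom[i] > 0 else p[i] for i in range(n_items)]
        total = sum(p)
        p = [x / total for x in p]
    return p


# Toy data: argument 0 usually beats 1 and 2, argument 1 usually beats 2.
comparisons = ([(0, 1)] * 3 + [(1, 0)] +
               [(1, 2)] * 3 + [(2, 1)] +
               [(0, 2)] * 3 + [(2, 0)])
strengths = fit_bradley_terry(3, comparisons)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
```

Sorting items by fitted strength yields the ranking; an item that never wins gets strength zero under this update, so real data usually needs a prior or regularization.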

2605.28308 2026-05-28 cs.CL 版本更新

HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment

HELEA: 用于鲁棒实体对齐的硬负样本基准和基于LLM的重排序

Yoonjin Jang, Junwoo Kim, Youngjoong Ko

发表机构 * SungKyunKwan University(成均馆大学)

AI总结 针对现有实体对齐基准中模型依赖名称重叠而非关系结构的问题,提出同名的硬负样本增强策略生成质量可控的评估基准和训练语料,并设计HELEA两阶段框架(实体编码器检索+LLM重排序),在硬负样本基准上实现鲁棒对齐。

Comments 10 pages, 3 figures, 9 tables. Code and benchmarks available at https://github.com/Wnsdnl/HELEA

详情
AI中文摘要

实体对齐(EA)对于知识图谱(KG)融合至关重要,但现有基准通常允许模型利用名称重叠而非关系结构。这使得难以评估模型是否能拒绝指向不同现实世界对象的同名实体。我们的主要贡献是一种同名的硬负样本增强策略,通过从KG名称冲突组中挖掘同名但不同的实体对,同时生成质量可控的评估基准(DW-HN29K、DY-HN27K)和增强训练语料(DW-Train、DY-Train)。我们进一步引入HELEA,一个两阶段框架,整合了(i)在硬负样本增强训练语料上训练的实体编码器检索(使用1跳KG上下文),以及(ii)无需额外训练的基于LLM的重排序。实验表明,依赖名称的基线在我们的硬负样本基准上性能下降至接近随机,而HELEA在DW-HN29K上达到F1 0.967,同时在标准DW-15K上保持Hit@1 0.993。

英文摘要

Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject same-name entities that refer to different real-world objects. Our primary contribution is a same-name hard-negative augmentation strategy that simultaneously yields quality-controlled evaluation benchmarks (DW-HN29K, DY-HN27K) and augmented training corpora (DW-Train, DY-Train), by mining same-name but distinct entity pairs from KG name-collision groups. We further introduce HELEA, a two-stage framework integrating (i) entity encoder retrieval trained on hard-negative-augmented training corpora with 1-hop KG context, and (ii) LLM-based reranking without additional training. Experiments show that name-dependent baselines collapse to near-random performance on our hard-negative benchmarks, whereas HELEA achieves F1 0.967 on DW-HN29K and maintains Hit@1 0.993 on the standard DW-15K.

2605.28306 2026-05-28 cs.CL cs.AI 版本更新

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

面向混合专家模型中多语言下游任务的路由对齐微调

Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong, Sichun Luo, Linqi Song

发表机构 * City University of Hong Kong(香港城市大学) Carnegie Mellon University(卡内基梅隆大学) The University of Hong Kong(香港大学)

AI总结 针对混合专家模型在多语言下游任务中的路由结构异构问题,提出RA-MoE三阶段框架,通过中间层语言通用对齐区识别任务相关专家,并引入路由对齐损失增强目标语言路由,实验表明该方法优于标准微调和强基线。

详情
AI中文摘要

混合专家(MoE)模型已成为高效扩展LLM的主流范式,但将其适配到非英语下游任务仍然具有挑战性。现有的微调方法将MoE模型视为整体学习器,忽略了预训练期间形成的异构路由结构。我们在多个MoE模型和下游任务上验证,中间层形成了语言通用对齐区,其中路由发散性强烈预测了每种语言的任务性能差距。基于这一观察,我们提出了RA-MoE(路由对齐MoE微调),一个三阶段框架,该框架根据英语和目标语言的正确性将并行任务示例分类为四路分类法(cc/ci/ic/ii),识别中间层中与任务相关的专家,并用路由对齐损失增强标准SFT,该损失鼓励ci类型示例上的目标语言路由遵循英语任务专家激活模式。在三个MoE模型、三个任务和六种目标语言上的实验表明,RA-MoE始终优于标准SFT和强基线(包括Routing Steering和RISE),其中任务-语言对的ci比例可作为对齐收益的可靠预测指标。

英文摘要

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.
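
The four-way taxonomy is easy to state in code. The sketch below assumes that the first letter encodes correctness in English and the second encodes correctness in the target language (so `ci` means correct in English, incorrect in the target language); that reading is inferred from the abstract and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def categorize(english_correct: bool, target_correct: bool) -> str:
    """Map a parallel example to one of cc / ci / ic / ii."""
    return ("c" if english_correct else "i") + ("c" if target_correct else "i")


def ci_proportion(outcomes: Iterable[Tuple[bool, bool]]) -> float:
    """Share of examples solved in English but missed in the target language."""
    labels = [categorize(e, t) for e, t in outcomes]
    return labels.count("ci") / len(labels) if labels else 0.0


# Toy (english_correct, target_correct) outcomes for one task-language pair.
outcomes = [(True, True), (True, False), (False, True), (False, False), (True, False)]
share = ci_proportion(outcomes)
```

Under this reading, a high `ci` share marks a task-language pair where the model has the capability in English but fails to reach it in the target language, which is where the abstract reports alignment helping most.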

2605.28305 2026-05-28 cs.CL cs.AI 版本更新

Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

重新审视大语言模型推理中的拟人化反思标记

Yahan Yu, Noa Nakanishi, Fei Cheng

发表机构 * Kyoto University(京都大学)

AI总结 本文通过提示级和令牌级干预抑制拟人化反思标记,发现这些标记并非推理性能的必要条件,且抑制后模型仍能进行无标记验证,表明它们更多是表面线索而非可靠反思代理。

Comments 15 pages, 12 figures

详情
AI中文摘要

大语言模型(LLMs)在复杂推理过程中经常产生显式的反思痕迹,并伴随有拟人化标记,如“wait”、“hmm”和“alternatively”。尽管这些标记通常被用作反思的可见指标,但其机制仍不清楚,这带来了与冗余和重复反思标记相关的过度思考风险。在这项工作中,我们重新审视了拟人化反思标记,考察了它们对推理的必要性以及在反思中的作用。我们通过提示级和令牌级干预抑制这些标记,并分析了它们对四个基准测试和两种模型规模的任务性能的影响。我们的结果表明,拟人化标记对于推理性能并非普遍必要:抑制它们可以在多种设置下保持或提高性能,尤其是在较大的采样预算下。同时,标记抑制并不一定消除反思行为,因为模型仍然可以进行无标记验证。这些结果表明,拟人化标记更倾向于表面线索,而不是反思本身的可靠代理,并激励未来在显式标记模式之外对推理机制进行研究。

英文摘要

Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as wait, hmm, and alternatively. Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, leaving open the risk of overthinking caused by redundant and repetitive reflection markers. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and their role in reflection. We suppress these markers through prompt-level and token-level interventions, and analyze their effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These results suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and motivate future research on reasoning mechanisms beyond explicit marker patterns.

2605.28295 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Rollouts 的起点:面向 RLVR 的低负载、高杠杆的首 token 多样化

Soeun Kim, Albert No

发表机构 * Department of Artificial Intelligence, Yonsei University(延世大学人工智能系)

AI总结 本文提出 REFT 方法,通过在推理标记后的第一个 token 处进行均匀采样多样化,以低开销显著提升 RLVR 中 rollout 的多样性,从而改善推理模型的 Pass@k 性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)无需标注轨迹即可训练推理模型,它依赖分组 rollout 将策略暴露于替代推理路径,并由验证器进行评分。Rollout 多样性因此成为 RLVR 的核心瓶颈,现有方法大多通过温度、前缀或 rollout 选择调整来拓宽探索。我们发现了一个结构上独特但被忽视的拓宽多样性的位置:推理标记后的第一个 token。策略的首 token 分布表现出尖锐峰值但正确性解耦的现象,且该首 token 位置可以拓宽 rollout 组覆盖的区域而不改变正确性信号。我们引入 REFT(基于首 token 多样化的 Rollout 探索),这是对 RLVR 流程的一个轻量级补充,它从策略自身的 top-N 候选集中均匀采样首 token,并均匀分配 rollout,其他组件保持不变。在由此产生的多样化 rollout 上训练后,REFT 在四个基础模型(0.5B-7B)和三个难度级别上,相较于 DAPO 和 GRPO 基线,提升了聚合的 Pass@1、Pass@8 和 Pass@64。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-N candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
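
A rough sketch of the first-token allocation described above: take the policy's top-N first-token candidates and spread the rollout group evenly across them. This is an illustration under stated assumptions rather than the REFT implementation: the first-token distribution is passed in as a plain dictionary, and how any remainder is assigned (here, to distinct candidates chosen at random) is a guess.

```python
import random
from typing import Dict, List


def allocate_first_tokens(first_token_probs: Dict[str, float],
                          top_n: int,
                          group_size: int,
                          rng: random.Random) -> List[str]:
    """Assign a forced first token to each rollout in a group, evenly over the top-N."""
    # Keep the N most probable first tokens under the policy's own distribution.
    candidates = sorted(first_token_probs, key=first_token_probs.get, reverse=True)[:top_n]
    base, extra = divmod(group_size, len(candidates))
    plan = [tok for tok in candidates for _ in range(base)]
    # Leftover rollouts go to distinct candidates picked uniformly at random.
    plan += rng.sample(candidates, extra)
    return plan


# Toy first-token distribution: sharply peaked on one token.
probs = {"Let": 0.90, "We": 0.05, "First": 0.03, "To": 0.02}
even_plan = allocate_first_tokens(probs, top_n=2, group_size=8, rng=random.Random(0))
uneven_plan = allocate_first_tokens(probs, top_n=3, group_size=7, rng=random.Random(0))
```

Each rollout would then continue sampling from the policy as usual after its assigned first token; note that the allocation ignores the candidates' probabilities once they are in the top-N, which is what makes it uniform.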

2605.28292 2026-05-28 cs.CL 版本更新

CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

CIRF:将思维链分词化为可重用的功能单元,用于大型语言模型的高效潜在推理

Yukyung Lee, Yumeng Shen, Jinhyeong Park, Hyein Yang, Jun-Hyung Park

发表机构 * Boston University(波士顿大学) Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出CIRF框架,通过将显式思维链中的语义连贯推理单元映射为离散功能令牌,实现动态序列推理,在数学、符号和常识推理基准上取得优于现有隐式CoT方法的准确率-延迟权衡。

Comments 17 pages, 7 figures

详情
AI中文摘要

隐式思维链通过内化显式理由来降低大型语言模型的推理成本。然而,现有方法通常缺乏与显式理由的对齐以及对示例复杂性的适应性。在这项工作中,我们提出了CIRF(思维链转化为可重用功能单元),一个隐式CoT框架,将推理作为离散功能令牌的动态序列进行。CIRF为显式CoT轨迹中的每个语义连贯推理单元分配一个功能令牌。然后对模型进行微调,以自回归方式生成功能令牌及其可选结果,随后生成最终答案。这种设计将潜在推理与功能单元序列对齐,促进了并行训练、显式理由对齐和自适应推理。在数学、符号和常识推理基准上的大量实验表明,与最先进的隐式CoT方法相比,CIRF提供了有利的准确率-延迟权衡。进一步的分析表明,CIRF构建了独特、可解释的功能令牌,从而带来一致的性能提升。

英文摘要

Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example complexity. In this work, we propose CIRF (Chain-of-thoughts Into Reusable Functional units), an implicit CoT framework that performs reasoning as a dynamic sequence of discrete functional tokens. CIRF assigns a functional token to each semantically coherent reasoning unit in explicit CoT traces. The model is then fine-tuned to autoregressively generate functional tokens and their optional results, followed by the final answer. This design aligns latent reasoning with a sequence of functional units, facilitating parallel training, explicit rationale alignment, and adaptive reasoning. Extensive experiments on mathematical, symbolic, and commonsense reasoning benchmarks show that CIRF provides a favorable accuracy-latency trade-off compared with state-of-the-art implicit CoT methods. Further analyses demonstrate that CIRF constructs distinct, interpretable functional tokens, leading to consistent performance improvements.

2605.28283 2026-05-28 cs.CL cs.AI 版本更新

PrunePath: Towards Highly Structured Sparse Language Models

PrunePath:迈向高度结构化稀疏语言模型

Zhexuan Gu, Zixun Fu, Yancheng Yuan

发表机构 * Department of Applied Mathematics, The Hong Kong Polytechnic University(应用数学系,香港理工大学)

AI总结 提出PrunePath框架,通过软最大归一化路由和累积质量阈值实现自适应预算的结构化稀疏化,在自然语言理解、生成和指令调优中取得优越的稀疏-性能权衡,并利用Triton内核将结构化稀疏转化为实际内存节省和解码速度提升。

详情
AI中文摘要

前馈网络(FFN)主导了现代语言模型的参数数量和计算量,然而现有的剪枝方法往往难以将稀疏性转化为硬件友好的推理效率提升。我们引入了PrunePath,一个针对FFN层的预算自适应结构化稀疏化框架。基于MoEfication,PrunePath用软最大归一化路由分布替代独立的专家级阈值,并在累积质量阈值下激活重要专家。这种公式化施加了令牌级概率预算,实现了自适应专家数量以及从单个检查点直接推理时的稀疏性调节旋钮。在自然语言理解、自然语言生成和指令调优评估中,与现有的静态剪枝和基于MoEfication的方法相比,PrunePath实现了有利的稀疏-性能权衡。我们进一步实现了用于KV缓存解码的Triton内核,以将所得的结构化稀疏性转化为实际的内存节省和可测量的解码速度提升。这些结果证明了PrunePath在构建高度稀疏、易于部署的大型语言模型方面的优越性能。

英文摘要

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
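
One way to read "activates important experts under a cumulative-mass threshold" is a nucleus-style rule over the softmax routing distribution: keep the highest-probability experts until their total mass reaches the threshold. The sketch below follows that reading for a single token; it is an assumption about the mechanism, not PrunePath's code, and `select_experts` and its parameters are hypothetical.

```python
import math
from typing import List


def select_experts(router_logits: List[float], mass_threshold: float) -> List[int]:
    """Return expert indices, most probable first, covering at least mass_threshold."""
    # Numerically stable softmax over the per-token routing logits.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: List[int] = []
    mass = 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= mass_threshold:
            break  # probability budget reached, remaining experts stay inactive
    return chosen


# A confident token needs one expert; an uncertain token needs all four.
peaked = select_experts([4.0, 0.0, 0.0, 0.0], mass_threshold=0.9)
flat = select_experts([0.0, 0.0, 0.0, 0.0], mass_threshold=0.9)
```

Under this rule the number of active experts adapts per token, and lowering `mass_threshold` at inference time increases sparsity without retraining, which matches the "sparsity knob from a single checkpoint" described in the abstract.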

2605.28255 2026-05-28 cs.AI cs.CL cs.HC 版本更新

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

AI,掌舵吧:是什么驱动人机协作问答中的委托与信任?

Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber

发表机构 * University of Maryland(马里兰大学) University of California(加州大学) MBZUAI

AI总结 通过问答游戏实验,研究人类在何时以及为何选择委托AI或采纳其建议,发现人类存在对AI正确建议的低依赖(3.9%)和错误建议的过度依赖(1.7%),并受确认偏见影响,建议通过校准置信度、基于证据的解释和信任细化机制来改进人机协作。

Comments Findings of the Association for Computational Linguistics, 2026

详情
AI中文摘要

AI系统并非完美无缺,人类在决定是否信任AI而非自身判断时也可能犯错。因此,改善人机协作需要理解人类何时、为何以及如何决定依赖AI。我们研究了两种不同的依赖决策:委托选择——在不知道AI输出结果的情况下决定何时让AI自主行动,以及采纳选择——评估AI建议并决定如何使用它们。这两种解耦的依赖模式塑造了协作,但先前的工作很少在现实环境中对同一用户同时研究它们。我们通过研究在问答游戏中竞争的人机协作团队来填补这一空白,游戏中人类可以选择何时以及如何与AI代理合作以获胜。我们的24场比赛匹配了23位专家人类和16个AI代理,捕获了387次委托决策和1440次采纳决策。虽然人机协作表现优于单独的AI或人类,但人类做出了次优的协作决策,既对正确的AI建议低依赖(错失3.9%的机会),又在AI误导时过度依赖(1.7%)。双方都贡献了错误答案:当人类和AI意见不一致时,报告的模型置信度接近随机水平,而确认偏见导致当AI建议与人类初始错误答案一致时,低依赖率更高(64.5%)。为缩小这一差距,我们建议采用校准的置信度、基于证据的解释以及帮助用户细化信任的机制。

英文摘要

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice (deciding when to let AI act autonomously without knowing its output) and the adoption choice (evaluating AI suggestions and deciding how to use them). Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human-AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human-AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

2605.28253 2026-05-28 cs.CL cs.DB cs.HC Updated version

Building Community-Centred NLP Resources for Puno Quechua

Elwin Huaman, Adrian Gamarra Lafuente, Johanna Cordova, Anna Korhonen

Affiliations * University of Cambridge (UK); Stanford University (USA); ERTIM - Inalco (France)

AI summary Collects 66 hours of speech data through participatory design, fine-tunes Whisper-base and other models, establishes the first ASR benchmark for Puno Quechua, and open-sources all resources.

Comments Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP 2026), co-located with ACL 2026

Details
Abstract

The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting of 66 hours of scripted and spontaneous speech recordings (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models.

2605.28228 2026-05-28 cs.CL Updated version

When Seekers Are Hard to Help: Evaluating Emotional Support Dialogue Systems in Worst-Case Interactions

Jiajie Yang, Yangchun Li, Guanyi Chen, Rui Fan, Xin Bai, Tingting He

Affiliations * Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning; National Language Resources Monitoring and Research Center for Network Media; School of Computer Science, Central China Normal University; Faculty of Artificial Intelligence in Education, Central China Normal University; School of Chinese Language and Literature, Central China Normal University

AI summary Using an expert simulation study and a proposed worst-case evaluation framework, this work finds that existing emotional support dialogue systems degrade significantly when seekers are difficult (low engagement, resistance, and similar behaviours), and shows that worst-case simulation data can improve model robustness.

Details
Abstract

Emotional Support Dialogue Systems (ESDSes) are increasingly evaluated and trained with LLM-simulated seekers. However, such simulated seekers often behave as cooperative, average-case users who disclose clearly, respond constructively, and accept support within a few turns. This can lead to overly optimistic evaluation and obscure whether ESDSes can handle difficult help-seeking interactions. In this work, we study ESDS evaluation under worst-case interactions, where seekers are hard to help due to low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. We first conduct an expert simulation study with eight experienced counselling professionals, who simulate difficult seekers, interact with existing Chinese ESDSes, provide scale ratings, and participate in semi-structured interviews. Based on this study, we derive worst-case seeker behaviours and identify key limitations of current systems. We then propose a worst-case evaluation framework consisting of an LLM-based worst-case seeker simulator and four worst-case-oriented metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. Evaluating 17 systems, we find that nearly all models suffer substantial performance drops under worst-case interactions. Large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest models struggle to sustain engagement and improve seekers' emotional states. Finally, we show that worst-case simulation can also generate useful training data, improving the robustness of smaller models.

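2605.28227 2026-05-28 cs.CL Updated version

Why We Need Speech to Evaluate Speech Translation

Maike Züfle, Danni Liu, Vilém Zouhar, Jan Niehues

Affiliations * Karlsruhe Institute of Technology; ETH Zurich

AI summary Through meta-evaluation, this paper finds that existing text- and speech-based quality estimation metrics both fall short in assessing speech-specific information in speech translation (such as gender agreement and prosody); it proposes the SpeechCOMET models, analyses why they fail, and stresses the need for dedicated training data and genuinely speech-based models.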

Details
Abstract

Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and speech-based quality estimation metrics on two contrastive datasets targeting gender agreement and prosody, and find that both fall short, even when given direct access to the speech signal. We then train SpeechCOMET, a family of quality estimation models with speech encoders, and evaluate a state-of-the-art SpeechLLM as a judge. Both match or exceed text-based COMET on standard quality estimation, but neither consistently assesses speech-specific phenomena. We identify three causes: (1) speech-specific features are not reliably preserved in current encoders, (2) models tend to ignore the speech source signal, and (3) quality estimation training data contains too few relevant examples. We release all models and code, and argue that progress requires dedicated speech-specific training data and models that genuinely condition on speech.

2605.28225 2026-05-28 cs.CL Updated version

Supervised Semantic Differential for Cross-Cultural Concept Analysis: A Case Study of Human Affect

Jan Sikora, Paweł Lenartowicz, Hubert Plisiecki

Affiliations * University of Warsaw; Society for Open Science; Centre for Brain Research, Jagiellonian University; IDEAS Research Institute

AI summary Proposes a cross-lingual Supervised Semantic Differential (SSD) that compares semantic dimensions across aligned multilingual word embeddings and, using Polish, English, and French affective norm lexicons, verifies both the cross-lingual recoverability of affective dimensions and the presence of cultural differences.

Comments 9 pages, 2 figures, excluding the appendices. Code to reproduce our results is available at https://github.com/przebor/Cross-Cultural-SSD

Details
Abstract

Cross-cultural comparison of psychological meaning requires methods that go beyond word-level translation and examine how semantic dimensions are organized across languages. We introduce a cross-lingual extension of the Supervised Semantic Differential (SSD), which estimates supervised semantic gradients in embedding space and compares them across aligned multilingual word embeddings. The method tests gradient alignment and difference using permutation procedures and bootstrap intervals, and interprets residual differences through clustering around the difference gradient. We demonstrate the approach on Polish, English, and French affective norm lexicons, modeling Valence, Arousal, and Dominance where available. Affective dimensions were significantly recoverable across languages and model settings. Cross-lingual comparisons showed broad alignment together with structured residual differences: Valence appeared mostly shared, whereas Arousal and Dominance produced more interpretable contrasts involving bodily threat, aesthetic stimulation, internal emotionality, macro-level authority, and everyday control. Several clusters also reflected corpus-specific artifacts, underscoring the need for cautious interpretation. Cross-lingual SSD offers an explainable framework for testing semantic alignment, identifying divergence, and generating hypotheses about cross-cultural differences in psychological meaning.

2605.28222 2026-05-28 cs.CL cs.IR cs.LG Updated version

Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation

Evgenii Palnikov, Elizaveta Gavrilova

Affiliations * HSE University

AI summary Analyses the trade-offs among quality, latency, and resources when LoRA adapters are used in a RAG system, and finds that configurations adapting only the q and v attention projections dominate the Pareto front.

Comments 13-page main body plus extended appendix; 6 figures; benchmark, LoRA adapters, and code at https://github.com/EugPal/rag-lora-tradeoffs

Details
Abstract

We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid-retrieval pipeline (BGE-M3 dense, BGE-M3 native sparse, Reciprocal Rank Fusion, cross-encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct across rank and target-module choices, and evaluate each on token-level F1, LLM-judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines the operating regime. A parameter-matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at https://github.com/EugPal/rag-lora-tradeoffs.
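
The retrieval pipeline in this abstract merges dense and sparse rankings with Reciprocal Rank Fusion. A minimal sketch of that fusion step follows, assuming each retriever returns an ordered list of document IDs; the function name, the toy document IDs, and the constant `k=60` (a commonly used default) are assumptions, since the abstract does not give the authors' settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank counted from 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retriever outputs, for illustration only.
dense = ["pods", "services", "ingress"]
sparse = ["pods", "volumes", "services"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Because only ranks enter the score, the dense and sparse retrievers do not need comparable score scales, which is the usual reason for choosing this fusion rule in hybrid retrieval.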

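2605.28218 2026-05-28 cs.CL Updated version

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

Mingrui Sun, Mao Zheng, Zheng Li, Mingyang Song

Affiliations * Large Language Model Department, Tencent

AI summary Proposes the IFMTBench benchmark, covering 7 languages with 4,506 single-constraint and 2,838 multi-constraint items, which evaluates translation instruction following with deterministic checkers and an LLM-based scorer; it shows that instruction following improves with model scale faster than translation quality, and that glossary and structured-format constraints are the hardest.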

Comments 11 pages, 6 figures, conference

Details
Abstract

Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction-following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction-following rankings correlate only weakly with translation behavior. Our benchmark is available at https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench.
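
One way to picture a multiplicative rule that combines gating and continuous constraints is sketched below. This is an illustration under stated assumptions, not the benchmark's actual scoring code: the function name, the all-or-nothing gate, and the use of a plain mean for the continuous part are all hypothetical choices.

```python
def combined_score(gate_results, continuous_scores):
    """Combine hard (gating) checks with soft (continuous) scores.

    gate_results: booleans from deterministic checkers (assumed pass/fail).
    continuous_scores: judge scores in [0, 1] (assumed scale).
    """
    # Any failed deterministic check zeroes the item, so a fluent output
    # cannot compensate for, say, a broken JSON schema.
    gate = 1.0 if all(gate_results) else 0.0
    if not continuous_scores:
        return gate
    quality = sum(continuous_scores) / len(continuous_scores)
    return gate * quality


passed = combined_score([True, True], [0.5, 1.0])
failed = combined_score([True, False], [1.0, 1.0])
```

Under an additive rule the second item would still earn partial credit, which is the kind of loophole a multiplicative combination is meant to close.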

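2605.28215 2026-05-28 cs.AI cs.CL cs.LG cs.LO cs.MA Updated version

Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

AI summary Systematically evaluates the concept-based explainability of multimodal large language models under few-shot in-context learning using five conditions of increasing formal rigour, finding that explaining is harder than predicting and that forcing models to produce formal explanations lowers predictive accuracy.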

Comments Accepted to the CompLearn Workshop at ICML 2026

Details
Abstract

In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.

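2605.28211 2026-05-28 cs.CL Updated version

When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

Maike Züfle, Jan Niehues

Affiliations * Karlsruhe Institute of Technology

AI summary Identifies and systematically studies the risk that domain-adapted ASR models leak private information through context prompts or fine-tuning, measures leakage rates on a purpose-built controlled dataset, and evaluates a prompt-level mitigation strategy and the accuracy-leakage trade-off.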

Details
Abstract

SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify and systematically investigate an overlooked privacy risk of such customisation: a model adapted to recognise domain-specific terminology can be nudged into transcribing a phonetically similar word from its context or training data, even when a different word is spoken, thereby leaking private information. To evaluate this risk, we construct a controlled dataset and measure leakage rates across two customisation mechanisms, prompting and fine-tuning. Both mechanisms cause measurable leakage, compounding when combined. We evaluate a prompt-level mitigation strategy and analyse the accuracy-leakage trade-off across customisation approaches, finding that fine-tuning without context prompts offers the best balance. We release our code and dataset publicly.

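2605.28190 2026-05-28 cs.CL Updated version

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

Manuel Frank, Haithem Afli

Affiliations * Department of Computer Science, Munster Technological University

AI summary Proposes HTEB, a dynamic evaluation framework that uses an LLM to stochastically transform inputs and challenges the robustness of text embedding models along three axes (Lexical/Stylistic, Length, and Language); it finds that models have specific, partly decoupled robustness profiles, that scale raises absolute scores without closing the gap between original and transformed evaluations, and that English datasets are more sensitive to the transformations.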

Comments 29 pages, 11 figures

Details
Abstract

Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of variation, and requires dynamic evaluation to expose failures hidden by static benchmarks. We introduce the Harder Text Embedding Benchmark (HTEB), a dynamic evaluation framework that challenges model robustness along three practically interpretable axes (Lexical/Stylistic, Length and Language) by stochastically transforming inputs at evaluation time with an LLM. Evaluating 16 open-weight embedding models on 32 datasets covering 42 languages under transformations validated by 4,800 human ratings on an English subsample, we find three patterns: (1) Models exhibit specific, partly decoupled robustness profiles across axes. (2) Across three model families, scale increases absolute scores but does not close the gap between original and transformed evaluations. Here, scaling tends to improve specifically the Language axis. (3) English datasets are more sensitive to HTEB transformations than multilingual datasets. This demonstrates that HTEB identifies strengths and weaknesses of models along deployment-relevant axes, challenging current embedding benchmarks and arguing for multidimensional, dynamic robustness evaluation.
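
The gap between original and transformed evaluations can be reported per axis instead of as a single scalar. The sketch below assumes one original score and one transformed score per axis; the function name and all numbers are made up for illustration and are not results from the paper.

```python
def robustness_gaps(original_score, transformed_scores):
    """Return, per axis, how much the score drops under transformation."""
    return {
        axis: original_score - score
        for axis, score in transformed_scores.items()
    }


# Hypothetical scores for a single model; not taken from HTEB.
gaps = robustness_gaps(
    0.75,
    {"lexical_stylistic": 0.5, "length": 0.625, "language": 0.25},
)
weakest_axis = max(gaps, key=gaps.get)
```

Keeping the three gaps separate is what lets a profile show that a model is, for example, stable under length changes but fragile under language changes.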

2605.28188 2026-05-28 cs.CL Updated version

Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment

Seojin Hwang, Minju Kim, Junhyuk Choi, JeongHyun Park, Hwanhee Lee

Affiliations * Chung-Ang University

AI summary Proposes the Fragile benchmark to systematically evaluate the decision stability of large language models under factually equivalent but differently framed inputs, and designs Valign, a representation-level intervention that effectively reduces framing-induced decision flips.

Comments 29 pages, 7 figures, 31 tables

Details
Abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but differently framed inputs can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing across three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments reveal a high susceptibility of LLMs to framing, with an average decision flip rate of 28.6%. We find that simple prior prompt-level and activation-level interventions not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions by anchoring decisions to a stable value prior, steering hidden states toward the model's value-consistent direction, and projecting out temporal-vividness-sensitive directions from the model's hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways in which framing operates.
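
A decision flip rate of the kind reported above can be computed by comparing a model's decision on each baseline input with its decision on the reframed version of the same facts. This is a minimal sketch under that assumption; the function name and the toy verdict labels are illustrative, and the paper may aggregate over framing dimensions differently.

```python
def flip_rate(baseline_decisions, reframed_decisions):
    """Fraction of cases whose decision changes under reframing."""
    if len(baseline_decisions) != len(reframed_decisions):
        raise ValueError("decision lists must be aligned case by case")
    if not baseline_decisions:
        raise ValueError("need at least one case")
    flips = sum(
        b != r for b, r in zip(baseline_decisions, reframed_decisions)
    )
    return flips / len(baseline_decisions)


# Made-up verdicts for four cases with identical facts but different framing.
baseline = ["guilty", "not guilty", "guilty", "guilty"]
reframed = ["guilty", "guilty", "guilty", "not guilty"]
rate = flip_rate(baseline, reframed)
```

Since the facts are held fixed, every flip counted here is attributable to framing alone, which is what makes the rate a stability measure and not an accuracy measure.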

2605.28181 2026-05-28 cs.CL Updated version

When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models

Jungwon Park, Jimyeong Kim, Jungmin Ko, Nojun Kwak, Wonjong Rhee

Affiliations * RICS; AIIS; IPAI; Department of Intelligence and Information; Daegu Gyeongbuk Institute of Science and Technology

AI summary To address misleading confidence in diffusion language models, which causes incomplete generation or premature decoding, proposes suffix anchoring with anchor-proximity confidence modulation, a training-free method that improves fully non-autoregressive decoding.

Comments Preprint

Details
Abstract

Diffusion language models decode text by iteratively denoising masked token sequences, making the choice of which positions to decode a central inference-time decision. Most training-free decoding strategies use model confidence for position selection, assuming that high-confidence positions are ready to be decoded. In this work, we revisit this assumption by studying when confidence misleads fully non-autoregressive (fully non-AR) decoding. EOT tokens can receive high confidence and cause incomplete generation; inserting a suffix anchor can mitigate this issue but introduces local overconfidence near the anchor, causing anchor-adjacent tokens to be decoded too early. To address these issues, we propose Suffix-Anchored Confidence Modulation, a simple training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress. This preserves the response-completion benefit of suffix anchoring while reducing premature decoding of anchor-adjacent tokens. Across text-only reasoning, vision-language reasoning, and code-generation benchmarks, our method consistently improves confidence-based fully non-AR decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-AR generation.

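2605.28179 2026-05-28 cs.CL Updated version

SuperValid: Capability-Aligned OOD Validation for Generalizable Downstream Scaling

Quanen Sun, Changxin Tian, Ke Shi, Cai Chen, Cunyin Peng, Jia Liu, Kunlong Chen, Zhiqiang Zhang

Affiliations * Ant Group

AI summary Proposes the SuperValid framework, which synthesizes capability-aligned out-of-distribution validation data by distilling core concepts from benchmarks and expanding them into diverse, knowledge-rich texts, so that downstream performance can be predicted at the capability level to support model selection, early stopping, and scaling decisions.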

Details
Abstract

Scaling laws guide large language model training by relating compute to cross-entropy loss, and recent work further extends them to predict downstream benchmark performance. However, prior approaches generalize poorly in two respects: focusing on benchmark-level performance introduces scenario-specific artifacts, while relying on IID validation loss fails to track capability improvements when training distributions vary. In this work, we argue that downstream scaling should be studied at the capability level, which captures shared skill factors across related tasks while abstracting away benchmark-specific noise. We propose SuperValid, a framework that synthesizes OOD (out-of-distribution), capability-aligned validation data by distilling core concepts from benchmarks within a capability domain and expanding them into diverse, knowledge-rich texts. Extensive experiments spanning 17 benchmarks grouped into 6 capability domains show that SuperValid loss exhibits strong and stable correlation with downstream performance across models of different architectures, scales, and training data distributions. As a training-free metric computable during training without benchmark evaluation, SuperValid enables effective model selection, early stopping, and scaling decisions.
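
The claim that a validation loss correlates with downstream performance can be checked with a rank correlation across checkpoints or models. The sketch below implements Spearman's rho with the standard library only; the loss and accuracy values are made up for illustration, and the abstract does not say which correlation measure the authors report.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-checkpoint values; lower loss should track higher accuracy.
validation_loss = [2.9, 2.6, 2.4, 2.1]
benchmark_accuracy = [0.31, 0.38, 0.47, 0.55]
rho = spearman(validation_loss, benchmark_accuracy)
```

A rho close to -1 means that ranking checkpoints by validation loss reproduces their ranking by benchmark accuracy, which is the property needed for model selection and early stopping.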

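2605.28163 2026-05-28 cs.CL cs.AI Updated version

DEPART: DEcomposing PARiTy across Multilingual LLMs

Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad, Himanshu Beniwal, Sunayana Sitaram

Affiliations * Microsoft Research India

AI summary Proposes the DEPART framework, which decomposes performance disparities in multilingual LLMs with a Bayesian hierarchical model; language features explain 79% to 92% of the variance, and a model's internal representational similarity to English is the main predictor.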

Details
Abstract

Leaderboards for multilingual large language models (mLLMs) report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal--Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.

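2605.24678 2026-05-28 cs.AI cs.CL cs.SD Updated version

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

Vassilis Lyberatos, Edmund G. Dervakos, Eleni Adamidi, Athanasios Voulodimos, Giorgos Stamou

Affiliations * National Technical University of Athens; PsychNow

AI summary Proposes a systematic analysis framework based on perceptual acoustic and linguistic features (such as prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm); combining statistical analysis with interpretable machine learning (XGBoost with SHAP and LIME), it finds stable associations between speech features and the severity of depression, anxiety, and ADHD symptoms across multiple datasets, and identifies the most informative feature groups through an ablation study.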

Comments Accepted to CLPsych 2026, part of ACL 2026

Details
Abstract

Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.

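2605.01735 2026-05-28 cs.CL Updated version

Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure

Chenchen Tan, Xinghao Li, Shujie Cui, Youyang Qu, Cunjian Chen, Longxiang Gao

Affiliations * Faculty of Information Technology, Monash University, Clayton, Victoria, Australia

AI summary Proposes Geometric Unlearning (GU), which operates on the model's hidden states and uses a small set of safe reference prompts and anchor-in-context synthetic prompts to achieve efficient selective unlearning without the original training data, reaching strong target suppression with minimal impact on non-target performance on the ToFU and UnlearnPII benchmarks.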

Comments 21 pages, 8 Figures

Details
Abstract

As large language models (LLMs) are increasingly deployed in real-world systems, they must support post-hoc removal of specific content to meet privacy and governance requirements. This motivates selective unlearning, which suppresses information about a particular entity or topic while preserving the LLM's general utility. However, most existing LLM unlearning methods require access to the original training corpus and rely on output-level refusal tuning or broad gradient updates, creating a tension among unlearning strength, non-target preservation, and data availability. We propose Geometric Unlearning (GU), an approach that operates directly on the model's prompt-conditioned hidden states without access to the original training corpus. Specifically, GU distills a compact, low-rank safe-behavior subspace from a small set of safe reference prompts and uses lightweight anchor-in-context synthetic prompts to trigger localized, projection-based alignment of hidden representations to this safe subspace. A teacher-distillation regularizer on synthetic non-target anchors further reduces collateral drift. Across privacy-oriented unlearning benchmarks (ToFU and UnlearnPII), GU achieves strong target suppression with minimal impact on non-target performance, demonstrating that effective unlearning can be achieved with minimal synthetic data.

2605.28142 2026-05-28 cs.LG cs.CL Updated version

Self-Consistency via Marginal Sharpening

Aleksei Arzhantsev, Otmane Sakhi, Nicolas Chopin

Affiliations * Criteo AI Lab; CREST IP Paris; ENSAE, Institut Polytechnique de Paris

AI summary Proposes an autoregressive parallel sampling algorithm that sharpens the answer marginal instead of the full output distribution, outperforming standard power sampling on mathematics and coding benchmarks while being faster.

Details
Abstract

Inference-time sampling can elicit strong reasoning abilities from language models without additional training. Existing power-sampling methods do so by sharpening the distribution over full generated outputs, favoring completions that are individually likely under the model. We argue that this is the wrong object to target for reasoning: a completion entangles a reasoning trace with a final answer, whereas what matters is whether an answer is supported by many plausible reasoning paths. We therefore shift the target from the full-output distribution to the sharpened answer marginal, making self-consistency an inference-time objective rather than a post-hoc voting criterion. Surprisingly, this marginal target admits an efficient approximation: we propose a simple, purely autoregressive parallel sampling algorithm that approximately samples from the sharpened answer marginal, eliciting stronger performance than standard power sampling on mathematics and coding benchmarks while being orders of magnitude faster.
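
The contrast between the two targets can be illustrated with a toy sketch. This is not the authors' sampling algorithm: the completion table, the function names, and the exponent `alpha` are illustrative assumptions. It only shows why sharpening per-completion probabilities and sharpening the answer marginal can favor different answers.

```python
from collections import defaultdict

# Hypothetical (reasoning trace, final answer, model probability) triples.
completions = [
    ("trace A", "42", 0.30),
    ("trace B", "17", 0.20),
    ("trace C", "17", 0.20),
    ("trace D", "17", 0.20),
    ("trace E", "42", 0.10),
]


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def sharpen_full_output(items, alpha):
    """Raise each completion's probability to alpha, then sum per answer."""
    answer_mass = defaultdict(float)
    for _trace, answer, prob in items:
        answer_mass[answer] += prob ** alpha
    return normalize(answer_mass)


def sharpen_answer_marginal(items, alpha):
    """Sum probability per answer first (the marginal), then raise to alpha."""
    marginal = defaultdict(float)
    for _trace, answer, prob in items:
        marginal[answer] += prob
    return normalize({answer: mass ** alpha for answer, mass in marginal.items()})


full = sharpen_full_output(completions, alpha=4.0)
marg = sharpen_answer_marginal(completions, alpha=4.0)
print("full-output target:", full)
print("answer-marginal target:", marg)
```

In this toy table the single most likely trace ends in "42", so the full-output target prefers "42", while "17" is backed by more total probability across traces and wins under the marginal target.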

2605.28131 2026-05-28 cs.CL 版本更新

Better heads do not guarantee better binarized constituency parsing

更好的中心词并不能保证更好的二值化成分句法分析

Zeyao Qi, Yige Chen, Eitan Klinger, Vivaan Wadhwa, Jungyeul Park

发表机构 * The Chinese University of Hong Kong, Ma Liu Shui, Hong Kong(香港中文大学,香港马料水) The University of British Columbia, Vancouver, Canada(不列颠哥伦比亚大学,温哥华,加拿大) Korea Advanced Institute of Science & Technology, Daejeon, South Korea(韩国科学技术院,大田,韩国)

AI总结 本文研究了标点感知的树二值化方法,并探讨了依存诱导的中心词信息是否改善了二值化句法分析器的监督信号,发现尽管学习到的中心词在内在中心词预测上优于基于规则的中心词,但在去二值化后并未带来一致的句法分析提升。

详情
AI中文摘要

我们重新审视了成分句法分析中标点感知的树二值化,并探讨依存诱导的中心词信息是否改善了二值化句法分析器的监督信号。尽管学习到的中心词在内在中心词预测上显著优于基于规则的中心词,但在去二值化后并未带来一致的句法分析提升。特别地,标点条件评估显示,在宏平均标点敏感$F_1$上,学习到的中心词信息表现不如基于规则的二值化,尽管在CTB上整体有小幅提升。类似的不稳定性在跨树库迁移中也出现。这些结果表明,当用作二值化控制信号时,有语言学依据的中心词信息不一定对句法分析器最优。本文呈现了一个负面结果:更好的中心词预测并不意味着更好的标点敏感成分句法分析。

英文摘要

We revisit punctuation-aware tree binarization for constituency parsing and ask whether dependency-induced headedness improves binary parser supervision. Although learned heads substantially outperform rule-based heads in intrinsic head prediction, they do not yield consistent parsing gains after debinarization. In particular, punctuation-conditioned evaluation shows that learned headedness underperforms rule-based binarization in macro-average punctuation-sensitive $F_1$, despite a small overall gain on CTB. Similar instability appears under cross-treebank transfer. These results suggest that linguistically grounded headedness is not necessarily parser-optimal when used as a binarization control signal. The paper presents a negative result: better head prediction does not imply better punctuation-sensitive constituency parsing.

2605.28128 2026-05-28 cs.CL 版本更新

Chinese Word Boundary Recovery through Character Alignment Projection

通过字符对齐投影恢复中文词边界

Lusha Wang, Yuchen Li, Su Yuan, Jungyeul Park

发表机构 * The University of British Columbia(不列颠哥伦比亚大学) Korea Advanced Institute of Science & Technology(韩国科学技术院)

AI总结 提出基于对齐投影的两步方法,从带噪句子中恢复词边界,并构建两个评估基准,实验表明该方法能有效纠正过度切分错误。

详情
AI中文摘要

中文分词在非标准文本中尤其脆弱,语言学习者错误和其他字符层面的差异会破坏下游标注和评估所假设的词边界。本文将中文词边界恢复形式化为基于对齐的投影任务。给定一个带噪的源句子和一个更干净的目标对应句,我们首先在字符级别对齐两个字符串,然后将目标侧的词边界投影回源句。除了恢复方法本身,我们还引入了两个评估资源:基于MuCGEC的人工检查学习者中文基准,以及从中文宾州树库导出的受控合成基准。实验表明,直接分词仍然容易受到学习者输入中的复合碎片化影响,而所提出的两步投影方法通过使用校正后的目标恢复源侧词跨度,纠正了许多过度切分错误。结果表明,词边界恢复不同于普通分词,并且对齐投影为在带噪输入下稳定中文标注和评估提供了一种原则性机制。

英文摘要

Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper formulates Chinese word boundary recovery as an alignment-based projection task. Given a noisy source sentence and a cleaner target counterpart, we first align the two strings at the character level and then project target-side word boundaries back onto the source. Beyond the recovery method itself, we introduce two evaluation resources: a manually checked learner Chinese benchmark based on MuCGEC and a controlled synthetic benchmark derived from the Chinese Penn Treebank. Experiments show that direct segmentation remains vulnerable to compound fragmentation in learner input, whereas the proposed two step projection method corrects many over-segmentation errors by using the corrected target to recover source-side word spans. The results show that word boundary recovery is distinct from ordinary segmentation and that alignment projection provides a principled mechanism for stabilizing Chinese annotation and evaluation under noisy input.
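
The two-step idea (align characters, then project target-side boundaries) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: it uses Python's `difflib.SequenceMatcher` as a stand-in aligner, the function name `project_boundaries` is hypothetical, and the example sentence is invented.

```python
from difflib import SequenceMatcher


def project_boundaries(source, target_words):
    """Project word boundaries of a segmented target onto a noisy source string."""
    target = "".join(target_words)

    # Step 1: character-level alignment (target index -> source index).
    # Only equal spans and same-length substitutions are treated as aligned.
    tgt_to_src = {}
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for offset in range(j2 - j1):
                tgt_to_src[j1 + offset] = i1 + offset

    # Step 2: for each target word, cut the source after its last aligned character.
    cuts, pos = set(), 0
    for word in target_words:
        aligned = [tgt_to_src[j] for j in range(pos, pos + len(word)) if j in tgt_to_src]
        if aligned:
            cuts.add(max(aligned) + 1)
        pos += len(word)
    cuts.add(len(source))

    words, start = [], 0
    for cut in sorted(cuts):
        if cut > start:
            words.append(source[start:cut])
            start = cut
    return words


# Learner sentence with a character error (平 written for 苹) and its corrected,
# segmented counterpart.
noisy_source = "我喜欢吃平果"
clean_target_words = ["我", "喜欢", "吃", "苹果"]
print(project_boundaries(noisy_source, clean_target_words))
```

Here the erroneous compound stays together as one source word because its span is inherited from the corrected target, rather than being fragmented by a segmenter that does not recognize it.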

2605.28123 2026-05-28 cs.CL 版本更新

Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models

风险感知的选择性提示用于大型视觉-语言模型中的幻觉缓解

Yuang Huang, Yafeng Zhang, Yu Zilan

发表机构 * Shanghai Jiao Tong University(上海交通大学) iFLYTEK Tsinghua University(清华大学)

AI总结 本文系统研究提示验证在大型视觉-语言模型中的风险,发现其效果依赖输入难度,并提出基于预生成不确定性信号的选择性提示方法RSP以平衡性能。

Comments 7 pages, 1 figure, submitted to ACL ARR 2026 May (EMNLP)

详情
AI中文摘要

基于提示的验证被广泛用于缓解大型视觉-语言模型(LVLMs)中的幻觉,但其何时有效仍不清楚。我们系统研究了两种代表性LVLM架构和幻觉基准上的验证提示,发现它是一种有风险的干预:其纠正随输入难度增加,而新引入的错误在不同难度级别持续存在。因此,始终开启的提示在困难输入上有帮助,但在简单输入上益处甚微甚至有害。我们的分析进一步表明,这种行为与保守的输出偏移相关。验证提示将注意力从视觉令牌重新分配到指令令牌,并诱导出中性提示控制中不存在的中层熵模式,这表明是指令条件化的注意力重新分配而非统一的视觉基础改善。受这种输入依赖风险的启发,我们提出了风险感知的选择性提示(RSP),一种无需训练的方法,利用预生成不确定性信号选择性地触发验证。RSP减轻了始终开启提示的性能下降,同时保持基线性能,并揭示了有效的选择信号因架构而异。

英文摘要

Prompt-based verification is widely used to mitigate hallucinations in large vision-language models (LVLMs), yet when it helps remains poorly understood. We systematically study verification prompting across two representative LVLM architectures and hallucination benchmarks, and find that it is a risk-bearing intervention: its corrections increase with input difficulty, while newly introduced errors persist across difficulty levels. As a result, always-on prompting helps on hard inputs but offers little benefit -- and can harm -- easier ones. Our analysis further shows that this behavior is associated with a conservative output shift. Verification prompts redistribute attention from visual tokens toward instruction tokens and induce a distinct middle-layer entropy pattern absent in a neutral-prompt control, suggesting instruction-conditioned attention redistribution rather than uniformly improved visual grounding. Motivated by this input-dependent risk, we propose Risk-aware Selective Prompting (RSP), a training-free approach that uses pre-generation uncertainty signals to trigger verification selectively. RSP mitigates the degradation of always-on prompting while preserving baseline performance, and reveals that effective selection signals vary across architectures.
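
A selective trigger of this kind might look like the sketch below. The abstract does not say which pre-generation uncertainty signals RSP uses, so the entropy of a first-token distribution, the threshold value, and all names here are assumptions chosen purely for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_prompt(first_token_probs, threshold, base_prompt, verify_suffix):
    """Append the verification instruction only when pre-generation uncertainty is high."""
    if entropy(first_token_probs) > threshold:
        return base_prompt + " " + verify_suffix
    return base_prompt


BASE = "Is there a dog in the image?"
VERIFY = "Check the image carefully before answering."  # hypothetical verification prompt

# Confident input: keep the plain prompt, avoiding newly introduced errors.
easy = choose_prompt([0.97, 0.02, 0.01], 0.5, BASE, VERIFY)
# Uncertain input: trigger verification, where corrections are more likely.
hard = choose_prompt([0.25, 0.25, 0.25, 0.25], 0.5, BASE, VERIFY)
print(easy)
print(hard)
```

Because the abstract reports that effective selection signals vary across architectures, the signal and threshold would need to be chosen per model rather than fixed as in this sketch.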

2605.28122 2026-05-28 cs.CR cs.AI cs.CL 版本更新

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

SNARE: 自适应场景合成以诱发编码代理中的过度行为

Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang, Leo Yu Zhang

发表机构 * Griffith University(格里菲斯大学) Quantstamp Nanyang Technological University(南洋理工大学) UNSW Sydney(新南威尔士大学) Wake Forest University(维克森林大学)

AI总结 提出SNARE流水线,通过组合良性场景片段并使用无评判器预言机评分与汤普森采样,自适应地诱发编码代理的过度行为,并在4×5代理-模型矩阵上评估,发现19.51%的良性运行触发过度行为,且代理框架比模型影响更大。

详情
AI中文摘要

编码代理以一系列shell、文件和网络操作执行良性任务,其中任何操作都可能悄然超出授权范围而任务仍完成。我们称此为过度行为:提示并非对抗性且运行成功,但超出范围的操作可能泄露凭据或删除文件。现有基准未能捕捉:任务完成套件认可任何完成的运行,越狱套件探测对抗性提示,而先前唯一的过度行为基准对每个代理-模型对应用单一固定提示集,导致其最易和最难的配对测量不足。我们提出SNARE(为非对抗场景合成自适应奖励引导诱发),该流水线从可重用范围和陷阱片段组合良性场景,用无评判器预言机对每次运行评分,标记陷阱模式匹配及未经请求的文件添加或删除,并使用汤普森采样将每对运行预算导向最常触发它的场景。在24个过度行为原型上实例化得到OverEager,我们在四个编码代理和五个基础模型的4×5矩阵上运行。在10,000次良性运行中,19.51%触发过度行为,每对比率跨度达11.9倍。这种变化由代理框架驱动,而非模型:框架占56%而模型占21%,因此任何单一框架或单一模型评估都会低估矩阵约五分之一。

英文摘要

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fixed prompt set to every agent-model pair, leaving its easiest and most resistant pairs under-measured. We present SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope and trap fragments, scores each run with a judge-free oracle flagging trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each pair's run budget toward the scenarios that most often trigger it. Instantiating it over 24 overeager archetypes yields OverEager, which we run across a 4x5 matrix of four coding agents and five base models. Across 10,000 benign runs, 19.51% trigger overeager behavior, with per-pair rates spanning 11.9x. This variation is driven by the agent framework, not the model: the framework accounts for 56% of it against the model's 21%, so any single-framework or single-model evaluation undercounts the matrix by about a fifth.
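
The budget-steering step can be sketched as a Beta-Bernoulli Thompson sampler. This is a generic illustration, not SNARE's code: the uniform Beta(1, 1) prior, the simulated trigger rates, and the function name are assumptions, and a real pipeline would run the agent and apply the oracle where the simulated coin flip appears.

```python
import random


def thompson_allocate(trigger_rates, budget, seed=0):
    """Spend a run budget across scenarios, favoring those that trigger more often.

    trigger_rates are hidden, simulated per-scenario probabilities; the sampler
    only observes the binary outcome of each run.
    """
    rng = random.Random(seed)
    n = len(trigger_rates)
    successes = [0] * n
    failures = [0] * n
    for _ in range(budget):
        # Draw a plausible trigger rate per scenario from its Beta posterior.
        draws = [rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(n)]
        chosen = max(range(n), key=lambda i: draws[i])
        # Simulated run; stands in for executing the agent and scoring with the oracle.
        if rng.random() < trigger_rates[chosen]:
            successes[chosen] += 1
        else:
            failures[chosen] += 1
    return [successes[i] + failures[i] for i in range(n)]


counts = thompson_allocate([0.05, 0.10, 0.60], budget=500)
print(counts)  # number of runs spent on each scenario
```

Compared with a fixed prompt set, this kind of allocation concentrates runs on the scenarios most likely to elicit the behavior for a given agent-model pair while still occasionally exploring the others.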

2605.28120 2026-05-28 cs.CL cs.AI cs.MA 版本更新

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

LegalGraphRAG:面向可靠法律推理的多智能体图检索增强生成

Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) Institute of Artificial Intelligence, Xiamen University(厦门大学人工智能研究院) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出LegalGraphRAG框架,通过分层法律图和多智能体系统(研究员、审计员、裁决员)实现可靠的法律推理,在准确性和可信度上超越现有GraphRAG基线。

Comments 30 pages, 18 figures, ACL 2026 Main Conference. Project page: https://github.com/XMUDeepLIT/LegalGraphRAG

详情
AI中文摘要

基于图的检索增强生成(GraphRAG)通过将知识结构化为关系图,推进了平面文档检索,实现了更连贯和有效的推理。然而,将其应用于法律推理等特定领域面临关键挑战。(i) 法律语料库是异构的,包含来自案例、法条和解释的多粒度知识。平面知识图无法充分区分事实细节、适用规则和抽象原则,限制了准确检索。(ii) 可靠的法律判决需要透明、基于证据的推理。传统的RAG直接将检索到的上下文传递给LLM而不进行验证,导致推理不透明且易出错。为此,我们提出了LegalGraphRAG,一个专为可靠法律推理设计的框架。我们的方法引入了两个核心组件:一个分层法律图,用于分层组织法律来源,以便在适当的抽象级别进行检索;以及一个用于可靠法律推理的多智能体系统,其中研究员检索候选证据,审计员严格验证其相对于源文档的有效性,裁决员综合已验证的证据集作出最终判决。大量实验表明,LegalGraphRAG达到了最先进的性能,在准确和可信的法律分析方面优于现有的GraphRAG基线。我们的代码、数据集和实现细节可在https://github.com/XMUDeepLIT/LegalGraphRAG获取。

英文摘要

Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.

2605.28116 2026-05-28 cs.CR cs.AI cs.CL 版本更新

MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

MIRAGE:通过用户生成内容对移动GUI代理的上下文感知提示注入

Ruoqi Guo, Yi Liu, Gelei Deng, Yiheng Xiong, Yuekang Li, Ying Zhang, Leo Yu Zhang, Lida Zhao, Ji Jie, Yuxiao Lu

发表机构 * Griffith University(格里菲斯大学) Quantstamp Nanyang Technological University(南洋理工大学) Singapore Management University(新加坡管理大学) University of New South Wales(新南威尔士大学) Wake Forest University(维克森林大学) Independent Researcher(独立研究者)

AI总结 提出MIRAGE管道,通过将攻击者控制的文本嵌入用户生成内容区域,在不修改代理、应用或操作系统的情况下,对视觉语言模型驱动的移动GUI代理实现高成功率的提示注入攻击。

详情
AI中文摘要

由视觉语言模型(VLM)驱动的移动图形用户界面(GUI)代理将屏幕视为渲染像素并根据所见选择动作,因此无法可靠地将受信任的界面元素与用户生成内容区分开来。我们提出MIRAGE(移动逼真对抗性GUI示例注入),这是一个管道,通过将攻击者控制的文本放入普通用户生成内容区域,将良性移动截图转化为提示注入样本,而无需修改代理、应用程序或操作系统。MIRAGE分三个阶段运行:定位器识别截图上用户可控制的区域,生成器合成上下文感知的有效载荷并以应用程序的原生风格渲染,策展人调节逼真度并在应用程序、区域类型和攻击意图之间平衡样本。一个关键挑战是,注入的截图必须在视觉上与真实用户内容难以区分,同时仍能转移代理的注意力;我们通过分离控制可达性、逼真度和分布平衡的阶段来解决这一问题。在一个涵盖十个应用程序和十一种攻击意图的1,111样本基准测试中,所有五个被评估的VLM代理都易受攻击,攻击成功率为23%-30%,并且MIRAGE在人类逼真度评分上高于最强的先前攻击(3.02对比2.52,满分5分)。我们进一步发现,每个样本的逼真度和攻击成功率不相关,因此仅靠视觉质量过滤无法可靠防御此威胁。

英文摘要

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application's native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.

2605.28112 2026-05-28 cs.CR cs.CL cs.IR 版本更新

A Wolf in Sheep's Clothing: Targeted Routing Hijacking in Federated RAG

披着羊皮的狼:联邦RAG中的目标路由劫持

Junjie Mu, Qiongxiu Li

发表机构 * Politecnico di Milano(米兰理工大学) Aalborg University(奥尔堡大学)

AI总结 本文提出路由劫持攻击,恶意客户端伪造语义配置文件以吸引目标查询,导致检索结果被篡改,并设计了一种基于信任的后路由框架来缓解该攻击。

Comments Under review. Code available at https://github.com/Junjie-Mu/routing-hijacking-fedrag

详情
AI中文摘要

联邦检索增强生成(FedRAG)对隐私敏感的应用具有吸引力,因为原始数据保留在本地。因此,路由必须依赖客户端提供的语义配置文件,这为操纵创造了新的机会。我们引入了路由劫持,一种路由阶段的攻击,其中恶意客户端伪造其配置文件以吸引目标查询,尽管其底层数据不相关。我们表明这种漏洞是严重的。在三种代表性的FedRAG路由架构中,路由劫持一致地错误路由目标查询,并导致下游中断和失败,包括证据缺失、投毒、错误答案和幻觉。在一个高风险的MedQA-USMLE案例研究中,我们进一步表明,投毒的检索证据可以误导不同规模的模型,导致错误答案、幻觉和谄媚失败。现有防御无法弥补这一差距:加密路由保留了被利用的排名,而拜占庭鲁棒的联邦学习(FL)规则难以迁移到异构的路由配置文件。为了解决这一差距,我们提出了一种基于信任的后路由框架,该框架使用返回证据反馈(包括检索相关性、配置文件一致性和跨客户端一致性)对客户端进行重新加权;在线实验表明,它抑制了重复查询上的持续劫持,并迁移到学习的神经路由器。我们的发现将路由完整性确立为FedRAG中一个新的安全挑战,并强调需要更强的防御来确保安全的联邦检索。

英文摘要

Federated Retrieval-Augmented Generation (FedRAG) is attractive for privacy-sensitive applications because raw data remain local. As a result, routing must rely on client-provided semantic profiles, creating a new opportunity for manipulation. We introduce Routing Hijacking, a routing-stage attack in which a malicious client forges its profile to attract target queries despite having irrelevant underlying data. We show that this vulnerability is severe. Across three representative FedRAG routing architectures, Routing Hijacking consistently misroutes target queries and leads to downstream disruptions and failures, including missing evidence, poisoning, incorrect answers, and hallucinations. In a high-stakes MedQA-USMLE case study, we further show that poisoned retrieved evidence can mislead models across scales, leading to incorrect answers, hallucinations, and sycophantic failures. Existing defenses do not close this gap: encrypted routing preserves the exploited ranking, and Byzantine-robust Federated Learning (FL) rules transfer poorly to heterogeneous routing profiles. To address this gap, we propose a trust-aware post-routing framework that reweights clients using returned-evidence feedback, including retrieval relevance, profile consistency, and cross-client agreement; online experiments show that it suppresses persistent hijacking over recurring queries and transfers to a learned neural router. Our findings establish routing integrity as a new security challenge in FedRAG and highlight the need for stronger defenses for secure federated retrieval.

2605.28084 2026-05-28 cs.CL cs.AI 版本更新

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

SMILE-Next: 教授大型语言模型检测、分类和推理笑声

Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

发表机构 * School of EE, KAIST(韩国科学技术院电子工程系) Dept. of EE, POSTECH(POSTECH电子工程系) School of Computing, KAIST(韩国科学技术院计算机科学系)

AI总结 提出SMILE-Next数据集和包含笑声特定Self-Instruct与混合笑声专家框架的方法,用于实现多模态笑声理解,显著优于基线模型。

详情
Journal ref
Annual Meetings of the Association for Computational Linguistics 2026
AI中文摘要

笑声是一种复杂的社会信号,传达超越娱乐的交际意图。虽然先前的工作集中在孤立的笑声分析任务上,但在现实场景中对笑声的全面理解仍未得到充分探索。因此,我们引入了SMILE-Next,一个用于现实世界笑声理解的数据集,具有多模态文本表示和跨三个任务的问答标注:笑声检测、笑声类型分类和笑声推理。基于SMILE-Next,我们旨在开发一个能够细致理解现实语境中笑声的笑声专用大型语言模型。为此,我们提出了两个关键组件:笑声特定Self-Instruct和混合笑声专家框架。笑声特定Self-Instruct通过自动合成多样化的以笑声为中心的指令,增强了跨任务和领域的泛化能力。MoLE引入了一种任务自适应专家路由机制,动态选择针对每个笑声相关任务定制的专用专家,提高了任务特定性能和效率。实验结果表明,我们提出的组件的组合显著优于多模态LLM基线,推动了鲁棒的现实世界笑声理解。项目页面位于:https://mok0102.github.io/smile-next/。

英文摘要

Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: https://mok0102.github.io/smile-next/.

2605.28079 2026-05-28 cs.CL 版本更新

ATLAS: All-round Testing of Long-context Abilities across Scales

ATLAS: 跨尺度的长上下文能力全面测试

Deli Huang, Cunguang Wang, Hongyin Tang, Zhe Tang, Linsen Guo, Dongyu Ru, Ruoshi Yuan, Ziyue Zhu, Xiaoyu Li, Ziwen Wang, Chen Zhang, Anchun Gui, Wen Zan, Jiaqi Zhang, Xuezhi Cao, Jingang Wang, Xunliang Cai, Yixin Cao

发表机构 * Meituan(美团) Fudan University(复旦大学)

AI总结 提出ATLAS基准框架,通过分层分类、长度感知AUC评分和ATLAScore聚合指标,系统评估长上下文语言模型在不同长度和任务上的性能退化与能力分布。

Comments 29 pages, 13 figures. Preprint

详情
AI中文摘要

长上下文语言模型现在宣称上下文窗口可达数百万token,然而评估通常报告单一长度或狭窄的任务族,掩盖了两种失败模式:性能随长度增长而崩溃,以及强大的检索能力不一定能迁移到下游使用。我们提出ATLAS,一个重新定义长上下文评估为长度依赖能力剖析的基准框架。ATLAS贡献了三个方法论原则:(i) 分层分类法,将基础操作与应用工作负载分离,以便归因失败;(ii) 长度感知AUC评分,在固定的8K-1M网格上积分分数-长度曲线,用完整的退化曲线替代单点指标;(iii) ATLAScore,对分类类别进行调和平均聚合,惩罚不平衡的剖面,并通过非线性最终聚合从子集分数进行端到端不确定性传播。我们在八个能力维度上实例化该框架,包含九个可审计组件和6,438个实例,并评估了26个模型。Gemini-3.1-Pro-Preview在128K处领先,Claude-Opus-4.6在1M处领先。排名在ATLASscore@8K-128K和ATLASscore@8K-1M之间大幅重新洗牌:7个模型移动至少两个排名,两个分类层仅共享61%的跨模型方差,个别排名差距高达12位。这些结果支持按能力和长度报告长上下文质量,而不是单一的标题分数。

英文摘要

Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and strong retrieval need not transfer to downstream use. We present ATLAS, a benchmarking framework that redefines long-context evaluation as length-dependent capability profiling. ATLAS contributes three methodological principles: (i) a layered taxonomy separating foundational operations from application workloads so failures can be attributed, (ii) length-aware AUC scoring that integrates score-length curves over a fixed 8K-1M grid, replacing single-point metrics with full degradation profiles, and (iii) ATLAScore, a harmonic-mean aggregate over taxonomy categories that penalizes imbalanced profiles, with end-to-end uncertainty propagation from subset scores through the nonlinear final aggregate. We instantiate the framework across eight capability dimensions with nine auditable components and 6,438 instances, and evaluate 26 models. Gemini-3.1-Pro-Preview leads at 128K, and Claude-Opus-4.6 leads at 1M. Rankings reshuffle substantially between ATLASscore@8K-128K and ATLASscore@8K-1M: 7 models move by at least two ranks, and the two taxonomy layers share only 61% of cross-model variance, with individual rank gaps up to 12 positions. These results support reporting long-context quality by capability and length, not by a single headline score.
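
The two scoring ideas named in the abstract can be sketched in a few lines. The details are assumptions made for illustration, not the benchmark's definitions: a log2 length axis, trapezoidal integration normalized by the grid width, the specific grid points, and the example category scores are all hypothetical.

```python
import math
from statistics import harmonic_mean

# Hypothetical evaluation grid spanning 8K to 1M tokens.
GRID = [8_000, 16_000, 32_000, 64_000, 128_000, 256_000, 512_000, 1_000_000]


def length_auc(lengths, scores):
    """Normalized area under the score-vs-log2(length) curve (trapezoidal rule)."""
    xs = [math.log2(n) for n in lengths]
    area = sum(
        (xs[i + 1] - xs[i]) * (scores[i] + scores[i + 1]) / 2
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])


# A model that is strong at short lengths but collapses as context grows.
degrading = [0.95, 0.93, 0.90, 0.85, 0.70, 0.50, 0.30, 0.15]
print(round(length_auc(GRID, degrading), 3))

# Harmonic-mean aggregation penalizes an imbalanced capability profile
# more than an arithmetic mean would.
balanced = harmonic_mean([0.60, 0.60])
imbalanced = harmonic_mean([0.90, 0.30])
print(balanced, imbalanced)
```

A single-point metric at 8K would rate the degrading model highly, whereas the integrated score reflects the whole curve; likewise, the two profiles above share the same arithmetic mean, but the imbalanced one receives the lower harmonic-mean aggregate.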

2605.28074 2026-05-28 cs.CR cs.CL cs.IR 版本更新

SilentRetrieval: Hijacking Retrieval-Augmented Generation via Semantically-Preserving Adversarial Data Poisoning

SilentRetrieval:通过语义保持的对抗性数据投毒劫持检索增强生成

Jiachen Qian

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 提出SilentRetrieval两阶段数据投毒攻击,通过协调束搜索和上下文自适应触发生成,在保持文档流畅性的同时实现高检索命中率和攻击成功率,并评估了防御措施的有效性。

Comments 12 pages, 4 figures, KDD '26 camera-ready version

详情
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09--13, 2026, Jeju Island, Republic of Korea
AI中文摘要

检索增强生成(RAG)缓解了LLM的幻觉问题,但引入了一个关键漏洞:语料库完整性。我们提出SilentRetrieval,一种两阶段数据投毒攻击,通过对抗性构造但流畅的文档劫持RAG系统。第一阶段使用协调束搜索,一种具有流畅性-相似性目标的多令牌联合优化方法,在约束困惑度的同时保持有毒宿主文档的可检索性。第二阶段使用上下文自适应触发生成,一种由冻结LLM驱动的轻量级触发融合步骤,将操纵触发器集成到文档内容中。在合成目标答案的单投毒文档每查询评估下,SilentRetrieval在Natural Questions和MS MARCO上分别达到84.6%/81.3%的HR@10和57.5%/54.8%的ASR-LLM,同时保持接近良性的困惑度。跨四个目标LLM的模型评估显示,在固定触发器生成器下具有非平凡的有效性,针对未见检索器(包括ColBERT和商业嵌入模型)的迁移测试在相同注入语料库协议下平均HR@10为64.7%。在采样维基百科规模评估中,SilentRetrieval在0.016%投毒率下保持74.2%的HR@10。结合检索侧和生成侧防御可大幅降低攻击成功率,但会引入延迟权衡。人工评估显示,与不流畅基线相比,标记率显著降低,但在当前样本量下数值上仍比良性内容更可疑。

英文摘要

Retrieval-Augmented Generation (RAG) mitigates LLM hallucinations but introduces a critical vulnerability: corpus integrity. We present SilentRetrieval, a two-stage data poisoning attack that hijacks RAG systems through adversarially crafted yet fluent documents. Stage 1 uses Coordinated Beam Search, a multi-token joint optimization method with a fluency-similarity objective, to keep a poisoned host document retrievable while constraining perplexity. Stage 2 uses Context-Adaptive Trigger Generation, a lightweight trigger-fusion step driven by a frozen LLM, to integrate manipulation triggers into document content. Under a one-poisoned-document-per-query evaluation with synthetic target answers, SilentRetrieval achieves 84.6%/81.3% HR@10 and 57.5%/54.8% ASR-LLM on Natural Questions and MS MARCO, while maintaining near-benign perplexity. Cross-model evaluation across four target LLMs shows nontrivial effectiveness under a fixed trigger generator, and transfer tests against unseen retrievers, including ColBERT and commercial embedding models, yield 64.7% average HR@10 under the same injected-corpus protocol. In a sampled Wikipedia-scale evaluation, SilentRetrieval retains 74.2% HR@10 at a 0.016% poisoning ratio. Combined retrieval-side and generation-side defenses reduce attack success substantially but incur a latency trade-off. Human evaluation shows substantially lower flag rates than disfluent baselines, while remaining numerically more suspicious than benign content at the current sample size.

2605.28073 2026-05-28 cs.CL cs.AI 版本更新

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

StoryLens: 通过上下文感知叙事丰富实现偏好对齐的故事重写

Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) AIM3 Lab, Renmin University of China(中国人民大学AIM3实验室) Nanyang Technological University(南洋理工大学)

AI总结 针对故事重写中读者偏好对齐问题,提出结合上下文感知叙事丰富的方法,构建基准STORYLENSBENCH、奖励模型STORYLENSEVAL和两阶段重写模型STORYLENSWRITER,实验表明上下文增强显著提升用户满意度。

Comments 16 pages, 7 figures, 15 tables

详情
AI中文摘要

故事重写旨在适应不同读者偏好的同时保持情节一致性和叙事连贯性。与传统的风格迁移工作不同,我们认为有效的故事重写需要上下文感知的叙事丰富,而不仅仅是表面层面的风格适应。我们的初步人类研究表明,仅风格适应对读者满意度的提升微乎其微(2.3%),而上下文增强的重写则显著改善了用户偏好对齐(24.5%)。受此启发,我们引入了STORYLENSBENCH,一个用于偏好对齐故事重写的大规模基准,包含结构化故事书、多维读者偏好档案和排序后的上下文感知重写故事。基于该基准,我们提出了STORYLENSEVAL,一个用于估计重写故事读者满意度的奖励模型,以及STORYLENSWRITER,一个结合监督微调和基于GRPO的强化学习的两阶段重写模型。我们进一步建立了一个涵盖忠实度、连贯性和读者满意度的综合评估框架。实验结果表明,STORYLENSWRITER持续优于强大的生成和个性化基线,突显了上下文感知叙事丰富对于个性化故事重写的重要性。

英文摘要

Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.

2605.28062 2026-05-28 cs.CL cs.IR 版本更新

ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor

ConvMemory: 一种轻量级学习型记忆重排序器、一个负归因结果以及一个研究预览冲突编辑器

Taiheng Pan

发表机构 * School of Computing and Information Systems(计算与信息系统学院) University of Melbourne(墨尔本大学)

AI总结 本文提出ConvMemory,一种3.6M参数的学习型重排序器,通过交叉编码器教师监督在融合密集和词汇特征上训练,用于对话长期记忆检索,并报告了负归因结果及研究预览冲突编辑器CCGE-LA。

Comments 15 pages. Technical report

详情
AI中文摘要

我们描述了ConvMemory,一种用于对话长期记忆检索的小型3.6M参数学习型重排序器,通过交叉编码器教师监督在融合密集和词汇特征上训练。在LongMemEval记忆族上,ConvMemory在Recall@10上优于BGE-large交叉编码器,延迟降低12-47倍;在Clean500上,与mxbai-rerank-large-v1相比,Recall@10差距在0.025以内,但运行成本低28倍;在Stress1000干扰项下,Recall@10差距扩大到0.081,但ConvMemory的延迟仍低117倍;这些LongMemEval数字是单次运行或单种子结果,作为指示性成本前沿证据报告,而非基准级。然后,我们发布了一个关于先前声称机制的严格负归因结果:一个五种子重训练消融实验结合配对自助法表明,ConvMemory的学习时间窗口在总体上统计显著,但并非时间特定,对硬非时间控制的影响最大,而对多跳时间查询无显著影响。该机制的诚实描述是在融合密集+词汇特征空间中的廉价交叉编码器蒸馏,而非时间结构利用。此外,我们发布了CCGE-LA,一种低幅度的冲突感知候选集编辑器,基于ConvMemory,作为研究预览,在LoCoMo的替换和过时/恢复切片上取得了适度但一致的改进。所有结果均为检索阶段;ConvMemory在绝对LoCoMo MRR上未匹配mxbai-rerank-large-v1,且该报告为单作者,尚未独立审计。

英文摘要

We describe ConvMemory, a small 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval memory family, ConvMemory operates above the BGE-large cross-encoder in Recall@10 at 12-47x lower latency, remains within 0.025 Recall@10 of mxbai-rerank-large-v1 on Clean500 while running 28x cheaper; under Stress1000 distractors the Recall@10 gap widens to 0.081 but ConvMemory still operates at 117x lower latency; these LongMemEval numbers are single-run or single-seed and are reported as indicative cost-frontier evidence, not benchmark-grade. We then publish a rigorous negative attribution result on a previously claimed mechanism: a five-seed retrained ablation with paired bootstrap shows that ConvMemory's learned temporal window is statistically significant on aggregate but not temporally specific, with the largest effects on hard non-temporal controls and no significant effect on multi-hop temporal queries. The honest description of the mechanism is cheap cross-encoder distillation in a fused dense+lexical feature space, not temporal-structure exploitation. We additionally release CCGE-LA, a low-amplitude conflict-aware candidate-set editor over ConvMemory, as a research preview with modest but consistent gains on supersession and stale/rescue slices on LoCoMo. All results are retrieval-stage; ConvMemory does not match mxbai-rerank-large-v1 in absolute LoCoMo MRR, and the report is single-author and not yet independently audited.
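
For readers unfamiliar with the retrieval metrics quoted above, a minimal sketch of Recall@k and reciprocal rank follows. Benchmarks differ in how they define Recall@k (fraction of relevant items retrieved versus a binary hit), so the definition used here and the toy identifiers are assumptions, not the report's evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant memories that appear in the top-k ranked list."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant memory, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical reranker output for one query: memory ids, best first.
ranking = ["m7", "m2", "m9", "m4", "m1"]
gold = ["m2", "m5"]

r_at_3 = recall_at_k(ranking, gold, k=3)
rr = reciprocal_rank(ranking, gold)
print(r_at_3, rr)
```

MRR is the mean of `reciprocal_rank` over all queries; a gap such as 0.025 in Recall@10 is a difference between two such averages computed on the same query set.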

2605.28060 2026-05-28 cs.CL 版本更新

Challenges in Explaining Pretrained Clinical Text Classifiers

解释预训练临床文本分类器的挑战

Kristian Miok, Matej Klemen, Blaz Škrlj, Marko Robnik Šikonja

发表机构 * Faculty of Computer and Information Science, University of Ljubljana, Slovenia(卢布尔雅那大学计算机与信息科学学院) ICAM - Advanced Environmental Research Institute, West University of Timisoara, Romania(蒂米什瓦拉西部大学先进环境研究所)

AI总结 本文通过医院住院时长预测任务,揭示了LIME和SHAP等事后解释方法在临床叙事中的局限性,包括过度关注非信息性标记、归因不稳定以及对不连贯输入的高置信度预测,强调了需要临床有意义、语义基础且对语言噪声鲁棒的解释策略。

Comments 9 pages, 7 figures. Accepted at the First Workshop on Responsible Healthcare using Machine Learning (RHCML 2025), co-located with ECML PKDD 2025

详情
Journal ref
Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2025. Communications in Computer and Information Science, vol 2842, pp. 314-322. Springer, Cham (2026)
AI中文摘要

在临床自然语言处理中解释神经模型的预测仍然是一个重大挑战,尤其是对于涉及长篇幅、非结构化医疗文本的复杂任务。尽管LIME和SHAP等事后方法被广泛使用,但它们在应用于临床叙事时常常表现不足。在本文中,我们通过针对医院住院时长预测任务的定向演示,识别了基于标记和基于扰动的解释技术的核心局限性。我们的发现揭示了诸如过度强调非信息性标记、归因不稳定以及对不连贯输入变体的高置信度预测等问题。这些结果强调了需要临床有意义、语义基础且对语言噪声鲁棒的解释策略。

英文摘要

Explaining the predictions of neural models in clinical NLP remains a significant challenge, especially for complex tasks involving long, unstructured medical texts. While post-hoc methods like LIME and SHAP are widely used, they often fall short when applied to clinical narratives. In this paper, we identify core limitations of token-level and perturbation-based explanation techniques through targeted demonstrations on a hospital length-of-stay prediction task. Our findings reveal issues such as overemphasis on non-informative tokens, instability in attributions, and high-confidence predictions for incoherent input variants. These results underscore the need for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise.

2605.28058 2026-05-28 cs.CL 版本更新

Prompting Is All You Need: Multi-view Prompting Large Language Models for Aspect-Based Sentiment Analysis

提示即一切:基于多视角提示的大语言模型在方面级情感分析中的应用

Nils Constantin Hellwig, Niklas Donhauser, Jakob Fehle, Udo Kruschwitz, Christian Wolff

发表机构 * Media Informatics Group, University of Regensburg, Germany(里根大学媒体信息学小组) Information Science Group, University of Regensburg, Germany(里根大学信息科学小组)

AI总结 提出LLM-MvP方法,通过多视角提示、模式约束解码和前缀批处理,使大语言模型在少量样本下达到与微调模型竞争甚至更优的性能,同时降低计算开销。

详情
AI中文摘要

近期工作探索了大语言模型(LLMs)在方面级情感分析(ABSA)中通过少样本提示的能力,相比零样本基线显著改进,且所需标注示例大幅减少。然而,与在数百个示例上微调的模型相比仍存在性能差距,且LLM推理的计算成本对部署构成实际障碍。我们提出了基于LLM的多视角提示(LLM-MvP),将考虑多种元素排序的多视角原理适配到LLM提示中。通过将模式约束解码与上下文无关语法及前缀批处理相结合,LLM-MvP实现了与微调方法竞争甚至更优的性能,同时大幅降低计算开销。在五个基准数据集上的广泛实验表明,LLM-MvP缩小了少样本提示与微调模型之间的差距,为ABSA提供了实用且高效的解决方案。

英文摘要

Recent work explored the capabilities of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA) through few-shot prompting, requiring substantially fewer annotated examples while achieving notable improvements over zero-shot baselines. However, a performance gap remained compared to models fine-tuned on hundreds of examples, and the computational costs of LLM inference present practical barriers to deployment. We introduce LLM-based Multi-View Prompting (LLM-MvP), which adapts the multi-view principle of considering multiple element orderings to LLM prompting. By combining schema-constrained decoding with a context-free grammar and prefix batching, LLM-MvP achieves performance competitive or superior to fine-tuned approaches while substantially reducing computational overhead. Extensive experiments across five benchmark datasets demonstrate that LLM-MvP closes the gap between few-shot prompting and fine-tuned models, offering a practical and efficient solution for ABSA.

2605.28047 2026-05-28 cs.CL 版本更新

Knowledge Dependency Estimation for Reliable Question Answering

面向可靠问答的知识依赖估计

Chaodong Tong, Qi Zhang, Nannan Sun, Lei Jiang, Yanbing Liu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) China Industrial Control Systems Cyber Emergency Response Team(中国工业控制系统网络应急响应团队)

AI总结 提出Knot方法,通过子集级反事实监督和潜在依赖因子覆盖建模,估计黑盒问答模型对不同知识单元的敏感性,以识别关键知识依赖。

Comments 12 tables, 9 figures

详情
AI中文摘要

可靠的问答不仅需要判断答案是否正确,还需要识别预测所依赖的可用知识。在实际的基于LLM的问答中,这些知识可能来自上下文、检索、分解或中间推理,形成一个嘈杂且冗余的候选空间,而非干净的金标准证据集。我们研究知识依赖估计:估计固定黑盒问答模型对不同候选知识单元的敏感性。挑战在于无需穷举测试时扰动即可获得细粒度的依赖分数,同时建模冗余性、可替代性和互补性。我们提出Knot,一种结构化的排序感知知识依赖估计器。Knot从子集级反事实监督中学习,通过覆盖潜在依赖因子来建模子集敏感性,并推导出排序感知的单元分数以识别有影响力的候选。在多项选择和生成式问答基准上,Knot在子集敏感性预测方面优于所有对比基线,并在无需额外问答模型调用的情况下产生比可部署基线更忠实的单元排序;当用于实际风险筛查时,其依赖分数有助于及早标记易出错的问答预测。

英文摘要

Reliable question answering requires identifying not only whether an answer is correct, but also which available knowledge the prediction depends on. In realistic LLM-based QA, this knowledge may come from context, retrieval, decomposition, or intermediate reasoning, forming a noisy and redundant candidate space rather than a clean gold evidence set. We study knowledge dependency estimation: estimating the sensitivity of a fixed black-box QA model to different candidate knowledge units. The challenge is to obtain fine-grained dependency scores without exhaustive test-time perturbation while modeling redundancy, substitutability, and complementarity. We propose Knot, a structured rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidates. Across multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls; when used for practical risk screening, its dependency scores help flag error-prone QA predictions early.

2605.28046 2026-05-28 cs.AI cs.CL 版本更新

MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

MemCog: 从记忆即工具到记忆即认知的对话代理

Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

发表机构 * WeChat, Tencent Inc.(腾讯公司)

AI总结 提出MemCog系统,通过可导航记忆存储、跨维度导航接口和主动推理协议,将记忆访问融入推理过程,在被动问答和主动记忆触发基准上达到最优性能。

详情
AI中文摘要

现有的代理记忆系统普遍遵循我们称之为“记忆即工具”的范式,其中单个查询触发对扁平段落列表的一次性检索,存在被动调用、推理-检索解耦以及检索片段与代理导航需求之间的结构不匹配等问题。我们提出MemCog,一个“记忆即认知”系统,使记忆访问成为推理过程的一个组成部分。MemCog将用户知识组织为具有关联链接图的可导航记忆存储,暴露跨维度导航接口以进行多步推理驱动的遍历,并采用主动推理协议,驱动代理从对话上下文中自发启动记忆探索。我们还构建了ProactiveMemBench,这是第一个用于评估主动记忆触发的基准。实验表明,MemCog在被动问答基准上达到了最先进水平(LoCoMo上92.98,LongMemEval上95.8),同时在ProactiveMemBench上大幅超越基线,展示了记忆即认知的优势。

英文摘要

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as Navigable Memory Store with associative link graphs, exposes Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

2605.28042 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

通过激进剪枝专家从LLM中提取小型翻译专家

Liu O. Martin, Lucas Bandarkar, Nanyun Peng

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出一种从混合专家LLM中激进剪枝与翻译无关的专家,实现大幅压缩MoE块而不显著降低翻译质量的方法。

详情
AI中文摘要

现代大型语言模型(LLM)实现了最先进的机器翻译性能,但它们是作为广泛通才训练的,主要针对许多与翻译无关的任务和能力。因此,它们对于此任务严重过参数化,导致过多的内存和计算需求。在本文中,我们提出了一种从现代混合专家LLM中激进剪枝专家的方法,同时翻译质量下降可忽略不计。我们的方法利用专家专业化和LLM中多语言能力的可分离性来识别与翻译无关的专家。并且由于MoE的模块化特性,这些专家可以在无需任何训练的情况下轻松剪枝。无需重新训练,我们能够剪枝一半的专家而质量下降可忽略,剪枝70%仅造成轻微损失。通过非常短的SFT,我们剪枝75%的专家并恢复基线性能,在某些设置下移除近90%的专家同时保持合理的翻译质量。总体而言,我们的结果表明翻译仅需要LLM的一小部分,从而实现了对包含超过90%参数的MoE块的大幅压缩。

英文摘要

Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. And because of the modular nature of MoEs, these can be easily pruned without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.

2605.28037 2026-05-28 cs.CL 版本更新

Personality, Role, and Expressive Style in Large Language Models: An Interactionist Analysis

大型语言模型中的个性、角色与表达风格:一种互动主义分析

Moe Nagao, Koichiro Terao, Mikio Nakano, Naoto Iwahashi

发表机构 * Okayama Prefectural University(冈山县立大学) AI & Humans Lab(人工智能与人类实验室) C4A Research Institute(C4A研究所)

AI总结 本研究从互动主义视角,通过因子设计实验分析个性特质、对话角色和表达风格如何共同影响大型语言模型生成对话中感知的大五人格特质表达。

Comments 26 pages

详情
AI中文摘要

基于提示的个性控制是设计在社交情境中行为一致的大型语言模型(LLM)对话智能体的关键技术。然而,在提示中指定大五人格特质(BFTs)并不能确保这些特质在生成的语句中得到表达。本文从互动主义视角研究这种不匹配,将人格表达视为由特质指定与情境因素相互作用塑造的依赖于上下文的结果。我们分析了感知到的LLM生成对话中的BFT表达如何受三个提示因素影响:人格特质、对话角色和表达风格。采用结合六种人格条件、三种角色和三种表达风格条件的因子设计,我们在英语和日语中各生成了1,080个LLM智能体对话。然后,我们使用LLM-as-a-judge框架评估目标智能体的语句,以估计表达的大五人格特质。结果表明,表达的人格不仅受显式特质指定影响,还受对话角色和表达风格影响。这些效应是特质特定的:对话角色强烈影响开放性,表达风格显著塑造尽责性和宜人性,而显式特质指定主导神经质。即使没有显式的人格特质指定,社会和表达条件也会诱发独特的人格印象。跨语言比较显示英语和日语对话之间的模式大致相似,仅在特定的人格、角色和表达风格组合下存在显著差异。这些发现表明,LLM智能体中的个性控制不应被理解为特质提示的直接结果,而是一个涉及人格指定、社会角色和表达风格的依赖于上下文的过程。

英文摘要

Prompt-based personality control is a key technique for designing large language model (LLM) dialogue agents that behave consistently across social contexts. However, specifying Big Five personality traits (BFTs) in a prompt does not ensure that the intended traits are expressed in generated utterances. This paper investigates this mismatch from an interactionist perspective, viewing personality expression as a context-dependent outcome shaped by the interplay between trait specification and situational factors. We analyze how perceived BFT expression in LLM-generated dialogue is influenced by three prompt factors: personality traits, dialogue roles, and expressive styles. Using a factorial design that combines six personality conditions, three roles, and three expressive-style conditions, we generate 1,080 LLM-agent dialogues in each of English and Japanese. We then evaluate the target agent's utterances using an LLM-as-a-judge framework to estimate expressed Big Five traits. The results show that expressed personality is shaped not only by explicit trait specification, but also by dialogue role and expressive style. These effects are trait-specific: dialogue role strongly influences Openness, expressive style substantially shapes Conscientiousness and Agreeableness, and explicit trait specification dominates Neuroticism. Even without explicit personality-trait specification, social and expressive conditions induce distinct personality-like impressions. Cross-linguistic comparisons show broadly similar patterns between English and Japanese dialogues, with noticeable differences only under specific combinations of personality, role, and expressive style. These findings suggest that personality control in LLM agents should be understood not as a direct consequence of trait prompting, but as a context-dependent process involving personality specification, social role, and expressive style.
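The factorial design described above can be enumerated in a few lines. This is only a sketch: the condition labels are hypothetical placeholders, and the even split of dialogues across cells is an inference from the reported counts (6 x 3 x 3 conditions, 1,080 dialogues per language), not something the abstract states.

```python
from itertools import product

# Hypothetical placeholder labels; the abstract gives only the factor sizes.
personalities = [f"personality_{i}" for i in range(6)]
roles = [f"role_{i}" for i in range(3)]
styles = [f"style_{i}" for i in range(3)]

# Full factorial crossing of the three prompt factors.
conditions = list(product(personalities, roles, styles))

# Assumption: dialogues are spread evenly over the conditions.
dialogues_per_language = 1080
per_condition, remainder = divmod(dialogues_per_language, len(conditions))

print(len(conditions), per_condition, remainder)
```

Under the even-split assumption, each of the 54 cells would hold 20 dialogues per language.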

2605.28025 2026-05-28 cs.AI cs.CL cs.CY 版本更新

MIRA: A Bilingual Benchmark for Medical Information Response Audit

MIRA: 医学信息响应审计的双语基准

Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai, Weiyi Wu, Chongyang Gao

发表机构 * The University of Chicago(芝加哥大学) SynAI Technologies Inc.(SynAI技术公司) Jinzhou Medical University(锦州医学院) Zhejiang University(浙江大学) Dartmouth College(达特茅斯学院) Northwestern University(西北大学)

AI总结 提出MIRA双语基准,通过4,320个提示评估大语言模型在不同用户表达下提供医学信息的一致性,发现低健康素养提示导致信息稀释(DID),并提出知识引导缓解方法。

详情
AI中文摘要

大语言模型(LLM)越来越多地被用于提供面向公众的健康信息,然而现有的安全评估忽略了在相同问题的不同用户表述下,响应是否保留了可比较的医学信息。为了解决这个问题,我们引入了医学信息响应审计(MIRA),这是一个受控的双语基准,评估LLM在用户侧语言、语域和健康素养信号下是否提供可比较的医学信息。MIRA包含从60个经过医学审查的低风险健康问题构建的4,320个提示。在五个主流LLM中,模型回答了所有医学问题,但对低健康素养信号的响应始终省略了更多关键信息,提供的具体后续步骤更少,并为独立判断提供的支持更少。我们将这种模式称为差异信息稀释(DID)。语言效应是模型特定的,而非对非英语提示普遍更差。与300个真实世界健康查询的比较提供了初步的秩次有效性证据。一种知识引导的缓解提示减少了大多数模型的信息稀释,其中Claude(约8%)和Qwen(约6%)在信息不足的简化方面减少最大。

英文摘要

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

2605.28023 2026-05-28 cs.CV cs.AI cs.CL cs.MM 版本更新

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

VCap: 用于弱到强视觉字幕的超几何奖励

Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳分校) Chinese Academy of Sciences(中国科学院) Kuaishou Technology(快手科技)

AI总结 提出VCap,一种证人-裁判奖励机制,通过超几何分布级别的精度验证视觉信号中参考字幕与策略生成字幕之间的事实一致性,实现弱到强泛化,在多个图像和视频字幕基准上超越SOTA模型。

Comments 28 pages, 8 figures

详情
AI中文摘要

视觉字幕要求模型忠实捕捉视觉内容,同时最小化遗漏和幻觉。作为字幕的主导范式,多模态大语言模型通过扩展和高质量数据取得了强大性能。最近,强化学习成为推动多模态大语言模型向更高精度和更广覆盖的关键途径,然而,现有字幕奖励设计未能提供细粒度且可靠的事实验证信号,限制了其有效性。为解决这一问题,我们提出VCap,一种证人-裁判奖励,将参考字幕(证人)与视觉信号(裁判)配对。通过明确验证基于视觉信号的参考字幕与策略生成字幕之间的事实一致性,VCap提供了具有超几何分布级别精度的奖励信号用于字幕质量验证。该设计即使在不完美的参考下也能实现有效学习,促进强化学习训练中的弱到强泛化。在我们的实验中,使用VCap训练的8B模型在多个图像和视频字幕基准上优于开源和闭源的最先进模型。人工评估进一步证实了其与事实正确性的强对齐。此外,VCap提升了多模态大语言模型的感知能力,跨任务泛化,并超越了best-of-N蒸馏,挑战了先前关于RLVR的假设。

英文摘要

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, multimodal large language models (MLLMs) have achieved strong performance through scaling and high-quality data. Recently, reinforcement learning (RL) has emerged as a key route to driving MLLMs toward higher precision and broader coverage. However, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

2605.28022 2026-05-28 cs.CL cs.SE 版本更新

Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation

超越 pass@k:面向多样本代码生成的冗余感知 RLVR

Le Bronnec Florian, Alexandre Verine, Rio Yokota, Benjamin Negrevergne

发表机构 * RIKEN Center for Computational Science(理化学研究所计算科学研究中心)

AI总结 针对代码生成中重复采样评估的冗余问题,提出基于 JPlag 相似度的反冗余奖励增强 RLVR,在有限预算下提升可执行正确性。

Comments Preprint under review

详情
AI中文摘要

用于代码生成的 LLM 通常使用 Pass@k 在重复采样设置中进行评估,其中多个候选程序在有限采样预算下针对单元测试执行。虽然最近基于验证器的强化学习(RLVR)方法提高了可执行正确性,但这些目标如何影响采样程序之间的冗余仍不清楚。在这项工作中,我们使用代码抄袭检测系统 JPlag 研究代码生成中的实现级冗余。跨模型和基准测试,我们表明仅正确性的 RLVR 通常使生成集中在重复实现上,而 Pass@k 感知目标保持较低冗余并提高更大预算下的性能。受这些观察的启发,我们基于 JPlag 相似度用直接反冗余奖励增强 RLVR。在 3 个模型和 3 个基准测试中,阻止近重复生成可靠地提高了有限预算下的可执行性能,通常匹配或超越专门的 Pass@k 感知目标。

英文摘要

LLMs for code generation are commonly evaluated in repeated-sampling settings using Pass@k, where multiple candidate programs are executed against unit tests under a finite sampling budget. While recent verifier-based reinforcement learning (RLVR) methods improve executable correctness, how these objectives affect redundancy among sampled programs remains poorly understood. In this work, we study implementation-level redundancy in code generation using JPlag, a plagiarism-detection system for code. Across models and benchmarks, we show that correctness-only RLVR often concentrates generations around repeated implementations, whereas Pass@k-aware objectives maintain lower redundancy and improve larger-budget performance. Motivated by these observations, we augment RLVR with direct anti-redundancy rewards based on JPlag similarity. Across 3 models and 3 benchmarks, discouraging near-duplicate generations reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.
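For readers unfamiliar with the metric, Pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n programs are sampled and c of them pass the unit tests. The sketch below assumes that standard estimator; the abstract does not say which variant the authors use, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing program.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that the estimator depends only on how many samples pass, not on whether the passing programs differ from one another, so it does not measure implementation-level redundancy directly.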

2605.28020 2026-05-28 cs.CL 版本更新

The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

预训练模型评估中缺失的一环:奖励引导解码无需参数更新即可解锁任务导向行为

Shaobo Wang, Guo Chen, Ziyue Wang, Zhengyang Tang, Qingyang Liu, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Qwen Team, Alibaba Group(阿里巴巴集团Qwen团队) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出一种无需训练的奖励引导解码框架EBD,通过外部轻量奖励模型调整输出分布,激活冻结预训练模型的任务导向行为,实现更公平的推理时评估。

Comments 26 pages, 5 figures, 8 tables

详情
AI中文摘要

随着大型语言模型(LLMs)的快速发展,可靠地评估预训练LLMs的能力变得越来越重要。挑战在于,基础预训练模型针对下一个词预测进行优化,在标准提示和直接解码下往往无法遵循指令或生成格式良好的答案。因此,基准性能可能混淆模型能力与解码导致的无法产生任务导向输出的问题,而暴露这种行为通常依赖于昂贵的后训练。最近的仅解码方法试图重塑输出分布,但这类方法在开放式任务中可能效率低下且脆弱。为解决这些限制,我们提出基于能量的解码(EBD),一种无需训练、奖励引导的框架,用于从冻结的预训练LLMs中激活任务导向行为,涵盖开放式和客观任务。EBD通过外部轻量奖励模型增强解码,将生成导向高效用响应,同时通过奖励倾斜的目标分布将其锚定到预训练模型先验。我们证明EBD将基础模型输出转向更符合指令的行为,增加了与后训练对应物的行为相似性,并实现了对可访问预训练模型行为的更公平推理时评估。实验上,EBD在五个模型和六个基准上优于基线,将Qwen3-8B-Base在AlpacaEval2.0上的性能从8.8提升到44.5,将Mistral-7B在Math500上的延迟相对于先前的解码工作降低18.9倍,并且对奖励模型大小保持鲁棒。

英文摘要

With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token prediction and often fail to follow instructions or produce well-formed answers under standard prompting and direct decoding. As a result, benchmark performance can conflate model capability with decoding-induced failures to produce task-oriented outputs, while exposing such behavior often relies on costly post-training. Recent decoding-only approaches attempt to reshape output distributions, but such methods can be inefficient and brittle across open-ended tasks. To address these limitations, we propose Energy-Based Decoding (EBD), a training-free, reward-guided framework for activating task-oriented behaviors from frozen pre-trained LLMs across both open-ended and objective tasks. EBD augments decoding with an external lightweight reward model, steering generations toward high-utility responses while anchoring them to the pre-trained model prior through a reward-tilted target distribution. We show that EBD shifts base-model outputs toward more instruction-following behavior, increasing behavioral similarity to post-trained counterparts and enabling a fairer inference-time evaluation of accessible pre-trained-model behavior. Empirically, EBD outperforms baselines across five models and six benchmarks, improving Qwen3-8B-Base on AlpacaEval2.0 from 8.8 to 44.5, reducing Mistral-7B Math500 latency by 18.9x relative to prior decoding work, and remaining robust to reward-model size.
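A reward-tilted target distribution is often written as p*(y) ∝ p_base(y) · exp(r(y) / β). The abstract gives no formula, so the form below, the temperature `beta`, and the function name are assumptions for illustration; the sketch reweights a small finite set of candidate responses and is not the EBD decoding procedure itself.

```python
import math


def reward_tilted(base_probs, rewards, beta=1.0):
    """Reweight candidates as p_base(y) * exp(r(y) / beta), then normalize.

    Assumed form of a reward-tilted distribution; base_probs must be positive.
    """
    if len(base_probs) != len(rewards):
        raise ValueError("base_probs and rewards must have the same length")
    if beta <= 0:
        raise ValueError("beta must be positive")
    logits = [math.log(p) + r / beta for p, r in zip(base_probs, rewards)]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With equal rewards the base distribution is returned unchanged, which is the sense in which the tilt stays anchored to the pre-trained prior; a smaller `beta` lets the reward dominate.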

2605.28014 2026-05-28 cs.CL cs.LG 版本更新

ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

ROSD: 面向跨领域语言模型推理的反思式同策略自蒸馏

Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Baidu Inc.(百度公司) Shandong University(山东大学) Leiden University(莱顿大学)

AI总结 提出反思式同策略自蒸馏(ROSD)框架,通过反思引导的错误定位蒸馏将参考解模仿转为针对性推理修正,提升领域内推理和跨领域泛化能力。

Comments Preprint

详情
AI中文摘要

同策略自蒸馏(OPSD)通过为同策略 rollout 提供密集的 token 级监督,提升了大语言模型(LLM)的推理性能。然而,现有的 OPSD 方法在领域内推理上增益有限,且对领域外问题的泛化能力较差。我们识别出两个关键原因:将自教师模型条件化为已验证的解决方案会鼓励模仿训练领域的参考轨迹而非特定错误的修正;将蒸馏应用于完整响应可能会覆盖有效的推理前缀并强化过拟合。我们提出反思式同策略自蒸馏(ROSD),一个通过反思引导的、错误定位的蒸馏将参考解模仿转化为针对性推理修正的框架。对于每个 rollout,ROSD 使用自反思器提取修正思路并定位第一个错误片段。修正思路引导自教师模型进行针对性监督,而定位的错误片段将蒸馏限制在需要修正的区域。这种设计在保留有效前缀的同时修正了有缺陷的推理。在多个领域内和领域外推理基准上的实验表明,ROSD 在整体上产生了更强的领域内推理性能,并且相比标准 OPSD 具有显著更好的领域外泛化能力。代码可在 https://github.com/ZiqiZhao1/ROSD 获取。

英文摘要

On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.

2605.28013 2026-05-28 cs.CL 版本更新

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

KSAFE-MM:通过本地化语境化实现韩国文化风险的多模态安全基准

Yongwoo Kim, Sojung An, Yunjin Park, Jungwon Yoon, Dujin Lee, HyunBeom Cho, Jaewon Lee, Wonhyuk Lee, Youngchol Kim, JeongYeop Kim, Donghyun Kim

发表机构 * Korea University(高丽大学) KT Corporation(KT公司)

AI总结 针对多模态大语言模型在安全评估中缺乏文化特异性问题,提出KSAFE-MM基准,通过语言和视觉语境化构建通用与韩国文化特有的多模态安全测试集,揭示模型对文化攻击的脆弱性及安全性与过度拒绝之间的权衡。

详情
AI中文摘要

多模态大语言模型(MLLMs)通过引入跨多种模态(如语言和视觉)的漏洞,加剧了安全风险。然而,当前的MLLM安全评估工具存在重大局限性:1)以英语为中心的数据集构建,以及2)关注与当地文化背景无关的通用风险。本文介绍了KSAFE-MM,一个用于韩语多模态安全评估的基准,涵盖通用安全风险和文化特定漏洞。KSAFE-MM由两部分组成:KSAFE-MM-G和KSAFE-MM-C。KSAFE-MM-G通过语言语境化评估韩语语境中的全球共享风险,将通用安全查询转化为上下文相关的多模态样本。KSAFE-MM-C利用源自真实世界语境的本地化视觉查询,针对文化依赖的MLLM安全漏洞。它将这些视觉查询与越狱式文本查询配对,以覆盖涉及文化视觉线索和恶意文本意图的多模态安全风险。这些组件共同提供了一个从通用到本地的构建流程,用于评估全球共享安全风险和文化特定漏洞。我们在KSAFE-MM上评估了12个最先进的MLLM,并揭示了模型对文化攻击的脆弱性高于通用攻击。值得注意的是,越狱策略显著提高了攻击成功率,其中ProgramExecution的攻击成功率高达74.2%,而标准查询仅为13.4%。此外,我们发现了安全性与过度拒绝之间的系统性权衡,即实现低攻击成功率的模型往往对良性查询表现出过度的拒绝行为。这些发现强调了超越以英语为中心的基准、进行文化基础安全评估的紧迫性。

英文摘要

Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify attack success rates, with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.

2605.28009 2026-05-28 cs.CL cs.AI cs.LG 版本更新

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

MemGuard:防止长期记忆增强型大语言模型中的记忆污染

Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Columbia University(哥伦比亚大学) Capital One

AI总结 提出MemGuard,一种类型感知的记忆框架,通过显式分配功能角色、维护类型隔离记忆间的关联并选择性组合必要类型的证据,防止异构记忆污染,提升记忆可靠性最高28.27%并减少检索token数最高5.8倍。

详情
AI中文摘要

记忆增强型大语言模型通过跨交互维护长期记忆,将推理扩展到固定上下文窗口之外。然而,现有的记忆系统常常将稳定的用户事实、情景事件和行为规则折叠到共享空间中,使得功能不同的记忆被检索并用作可互换的证据。我们将这种失败模式识别为异构记忆污染,其中上下文特定的事件被过度概括为声明,或者语义相关但功能不兼容的记忆误导生成。为此,我们引入了MemGuard,一种类型感知的记忆框架,在记忆构建和检索过程中保留功能记忆边界。它在写入时为每个记忆分配显式的功能角色,维护跨类型隔离记忆的关系,并仅从必要的记忆类型中选择性组合证据,从而减少来自无关或功能不兼容证据的污染。在幻觉和长时对话基准测试中,MemGuard将记忆可靠性提高了最多28.27%,同时检索的记忆token数比先前方法减少了最多5.8倍。这些结果表明,可靠的长期推理依赖于对异构记忆的有原则的组织和选择性使用。

英文摘要

Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.

2605.28006 2026-05-28 cs.CL cs.AI 版本更新

Integrated and Cross-Architecture Interpretation of LLM Reasoning

LLM推理的集成与跨架构解释

Leonardo Matthew Yauw, Wei-Bin Kou, Yujiu Yang

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 提出集成跨架构推理(IAR)框架,通过带宽校准的MIP与Tukey IQR峰值检测、重叠分析及Jaccard稳定性度量,统一解释LLM推理模式。

详情
AI中文摘要

理解LLM如何推理受到实际不对称性的阻碍:虽然其生成的输出是可观察的,但潜在的推理模式仍然不透明。依赖单一探针,如互信息峰值(MIP)或深度思考比率(DTR),可能会低估真正的推理结构。针对这一不足,我们提出了一个集成跨架构推理(IAR)框架,旨在为LLM推理可解释性提供统一方法。具体来说,我们首先提出使用带宽校准的MIP结合Tukey IQR峰值检测来隔离输出层的关键推理标记。其次,我们对MIP选中的标记和DTR深度标记进行重叠分析,以追踪这些标记的跨层轨迹。这也揭示了关键推理标记是否也是计算密集型的,进一步有助于理解推理模式如何在模型层间演变。最后,我们在多领域问题上应用Jaccard稳定性度量,以验证MIP识别的标记是否具有推理质量保证。在三个模型(Qwen-7B、Qwen-14B和Llama-8B)上跨四个领域(数学、代码、逻辑和常识)的大量实验证明了IAR跨架构的泛化解释能力。

英文摘要

Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning patterns remain opaque. Relying on single probes, such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR), risks underestimating the genuine inferential structure. To address this deficiency, we present an Integrated, cross-Architecture Reasoning (IAR) framework, designed to provide a unified approach to LLM reasoning interpretability. Specifically, we first use bandwidth-calibrated MIP coupled with Tukey IQR peak detection to isolate reasoning-crucial tokens at the output layer. Second, we perform an overlap analysis between MIP-picked tokens and DTR-deep tokens to trace the cross-layer trajectories of those tokens. This also reveals whether reasoning-crucial tokens are computation-intensive as well, which helps explain how reasoning patterns evolve across model layers. Finally, we apply a Jaccard stability metric over multi-domain problems to verify whether the MIP-identified tokens reliably reflect reasoning quality. Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate IAR's generalizable interpretation capabilities across architectures.
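Two of the building blocks named above are standard statistics and can be sketched generically. The code below is not the IAR implementation: the per-token scores, the fence multiplier of 1.5, and the function names are assumptions. It flags tokens whose score exceeds the upper Tukey fence (Q3 + 1.5 × IQR) and compares two selected token sets with the Jaccard index.

```python
from statistics import quantiles


def tukey_peaks(scores, k=1.5):
    """Indices whose score lies above the upper Tukey fence Q3 + k * IQR."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    return [i for i, s in enumerate(scores) if s > fence]


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, `tukey_peaks` could be run on per-token mutual-information estimates, and `jaccard` on the peak sets obtained from two runs or two probes to gauge how stable the selection is.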

2605.28004 2026-05-28 cs.CL 版本更新

Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG

超越块级抽取:面向GraphRAG的跨块图增强

Jiaming Zhang, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li

发表机构 * School of Data Science and Engineering, East China Normal University(华东师范大学数据科学与工程学院)

AI总结 提出CrossAug方法,利用GNN引导的跨块图增强,在查询前离线补充GraphRAG索引中缺失的跨块关系,提升多跳和长文档问答性能。

Comments 15 pages, 5 figures, 8 tables

详情
AI中文摘要

GraphRAG通过将语料组织为显式知识图谱来扩展检索增强生成,支持基于图的检索以进行复杂问答。然而,现有框架仅在单个块内抽取实体和关系,导致跨块关系——即证据跨越多个段落的关系——在索引中系统性缺失。由于块组合的组合爆炸,穷举式基于LLM的关系恢复不可行。我们提出CrossAug,一种GNN引导的跨块图增强方法,在查询前离线步骤中为GraphRAG索引补充跨块关系结构。CrossAug通过自监督图损坏获取训练监督,使用拓扑感知的GNN对子图进行缺失性评分,并仅对选中的高评分区域应用基于证据的LLM补全。在三个基于LLM的GraphRAG框架上,跨四个多跳和长文档QA基准的实验表明,CrossAug持续提升性能,证实了跨块图增强对基于检索的问答的益处。我们的代码开源在https://github.com/DonFinliani/CrossAug。

英文摘要

GraphRAG extends retrieval-augmented generation by organizing corpora as explicit knowledge graphs, enabling graph-based retrieval for complex question answering. However, existing frameworks extract entities and relations within individual chunks, leaving cross-chunk relations -- those whose evidence spans multiple passages -- systematically absent from the index. Exhaustive LLM-based recovery of such relations is impractical due to the combinatorial explosion of chunk combinations. We present CrossAug, a GNN-guided CROSS-Chunk Graph AUGmentation method that enriches GraphRAG indices with cross-chunk relational structure as an offline step before query-time retrieval. CrossAug derives training supervision through self-supervised graph corruption, uses a topology-aware GNN to score subgraphs for missingness, and applies evidence-grounded LLM completion only to selected high-scoring regions. Experiments on three LLM-based GraphRAG frameworks across four multi-hop and long-document QA benchmarks demonstrate that CrossAug consistently improves performance, confirming the benefit of cross-chunk graph augmentation for retrieval-based question answering. Our code is available at https://github.com/DonFinliani/CrossAug.

2605.28003 2026-05-28 cs.CL 版本更新

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

ResearchMath-14K: 通过智能体扩展研究级数学

Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang, Youngjae Yu

发表机构 * Seoul National University(首尔国立大学) OneLineAI Yonsei University(延世大学)

AI总结 本文通过多智能体流程从学术来源构建了最大的研究级数学问题数据集ResearchMath-14K(14,056个问题),并生成220K教师轨迹,经智能体过滤后微调Qwen3模型(4B-30B)平均提升9.2个点,表明过滤后的开放问题尝试即使没有完全正确的推理轨迹也能提供有效监督。

Comments Work in progress. Dataset available at: https://huggingface.co/datasets/amphora/ResearchMath-14k

详情
AI中文摘要

数学的前沿由尚未知道解法的问题定义,但语言模型能否在没有人类干预的情况下有意义地处理这些问题仍不清楚。一个主要障碍是缺乏大规模的研究级数学数据集。为此,我们引入了ResearchMath-14k,这是一个通过多智能体流程从学术来源整理的问题集,包含14,056个问题,是迄今为止最大的研究级数学问题集合。我们进一步生成了ResearchMath-Reasoning,即来自两个开放模型的220K条教师轨迹,其中我们观察到重复的回避行为,如未尝试和虚构引用。有趣的是,在八个开放权重模型中,新一代模型每条轨迹产生的引用数量增加了5.6倍,虚假引用数量增加了5.0倍。在对ResearchMath-Reasoning进行智能体过滤后,对Qwen3模型(从4B到30B参数)进行微调,平均比基础模型提高了9.2个点。这表明,即使没有完全正确的推理轨迹,过滤后的开放问题尝试也能提供有用的监督。我们将ResearchMath-14k公开,以供未来研究级数学推理的工作使用。

英文摘要

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, 220K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by 9.2 points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.

2605.27997 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

毒性存在于何处?语言模型中的机制定位与定向抑制

Himanshu Beniwal, Mayank Singh

发表机构 * Indian Institute of Technology Gandhinagar(印度理工学院甘地讷格尔分校)

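AI summary Localizes toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppresses it via inference-time scaling or minimal rank-one weight edits without any gradient descent, reducing toxicity while preserving language quality.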

详情
AI中文摘要

大型语言模型频繁生成有毒、仇恨或有害内容,然而现有的缓解方法依赖于昂贵的重新训练或输出级过滤,且缺乏对毒性内部起源的机制性理解。我们提出了Meow2X和TRNE,两种互补的无需重新训练的框架,通过分析毒性与中性提示之间的激活差异,将毒性定位到特定层和神经元,然后通过推理时缩放或最小秩一权重编辑进行抑制——无需任何梯度下降。在五个语言模型、两个基准测试和90种配置上的评估,使用双重安全评估器,一致地证明了毒性降低,同时保持了语言建模质量。我们的分析揭示,毒性不成比例地编码在早期MLP层中,在不同架构间有所变化,并且被单一评估器设置系统性地低估——强调了多评估器安全评估的必要性。通过连接机制可解释性与实际去毒化,我们的框架为更安全、更透明的语言模型提供了一条原则性路径。

英文摘要

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits -- without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups -- underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.
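
The localization and suppression steps described in this abstract can be illustrated with a minimal sketch. It assumes activations have already been collected as plain lists of per-neuron values for toxic and neutral prompts. The function names (`activation_differential`, `top_k_neurons`, `scale_neurons`), the toy values, and the scaling factor are hypothetical and are not taken from Meow2X or TRNE.

```python
from statistics import mean

def activation_differential(toxic, neutral):
    """Per-neuron mean activation on toxic prompts minus the mean on neutral prompts."""
    width = len(toxic[0])
    return [
        mean(row[j] for row in toxic) - mean(row[j] for row in neutral)
        for j in range(width)
    ]

def top_k_neurons(diff, k):
    """Indices of the k neurons with the largest absolute differential."""
    return sorted(range(len(diff)), key=lambda j: abs(diff[j]), reverse=True)[:k]

def scale_neurons(activations, neurons, alpha):
    """Inference-time suppression: multiply the selected neurons by alpha (0 <= alpha < 1)."""
    selected = set(neurons)
    return [a * alpha if j in selected else a for j, a in enumerate(activations)]

# Toy activations: rows are prompts, columns are neurons (illustrative values only).
toxic = [[1.0, 0.2, 3.0], [1.2, 0.1, 2.8]]
neutral = [[1.0, 0.2, 0.5], [1.1, 0.3, 0.7]]

diff = activation_differential(toxic, neutral)
suspects = top_k_neurons(diff, k=1)          # neuron 2 has the largest gap here
damped = scale_neurons([1.0, 2.0, 3.0], suspects, alpha=0.1)
```

A mean-difference ranking is only one plausible way to read "activation differentials"; the rank-one weight-edit variant mentioned in the abstract is not covered by this sketch.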

2605.27993 2026-05-28 cs.CL 版本更新

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

重新思考视觉忽视:通过上下文偏好引导缓解MLLM幻觉

Jingwen Wu, Xijun Zhang, Ge Song

发表机构 * School of Computer and Electronic Information, Nanjing Normal University(计算机与电子信息学院,南京师范大学)

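AI summary Targets object hallucination in multimodal large language models with a training-free framework, Context-Preference Activation Steering (CAS), which extracts context preference vectors and applies signed residual injection at mid-early MLP layers to control information reliance, mitigating hallucinations without increasing decoding latency.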

Comments 15 pages, 5 figures

详情
AI中文摘要

物体幻觉仍然是多模态大语言模型(MLLM)可靠部署的主要障碍。当前的推理时缓解方法主要假设幻觉源于视觉忽视,引导模型增强视觉依赖。相反,我们对多个MLLM的系统干预表明,推动更多视觉依赖可能会加剧某些模型的幻觉,而减少视觉依赖则可能缓解幻觉。这一结果表明,将幻觉单纯归因于视觉不足是不充分的。我们认为,图像作为上下文,同时与模型的参数知识和文本上下文竞争。为此,我们提出了一种无需训练的框架——上下文偏好激活引导(CAS)。它通过两组设计好的冲突样本提取两个语义不同的上下文偏好向量(CPV),并在推理时通过单次符号残差注入到中间早期MLP层来控制信息依赖。实验表明,CAS在不增加解码延迟的情况下显著缓解了物体幻觉,并保持了原生文本生成质量。

英文摘要

Object hallucination remains a primary obstacle to the reliable deployment of Multimodal Large Language Models (MLLMs). Current inference-time mitigation methods mainly assume hallucinations stem from visual neglect, steering models to enhance visual reliance. In contrast, our systematic interventions on multiple MLLMs show that pushing toward more visual reliance may exacerbate hallucinations on some models, while pushing toward less may mitigate them. This result suggests that attributing hallucinations solely to visual insufficiency is inadequate. We argue that the image, as a context, simultaneously competes with the model's parametric knowledge and the textual context. To this end, we propose a training-free framework, Context-Preference Activation Steering (CAS). It extracts two semantically distinct Context Preference Vectors (CPVs) via two small sets of designed conflict samples and applies them via single-pass signed residual injection at mid-early MLP layers during inference to control information reliance. Experiments show that CAS substantially mitigates object hallucinations without increasing decoding latency and preserves native text-generation quality.
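
Signed residual injection can be sketched in a few lines. The sketch below assumes a preference vector is the unit-norm difference between mean activations on two sets of conflict samples, and that injection adds it to a hidden state with a sign and a strength. Both assumptions, and the names `preference_vector` and `inject`, are illustrative rather than the paper's implementation.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def preference_vector(preferred, dispreferred):
    """Unit-norm difference of mean activations between two sets of conflict samples."""
    diff = [p - d for p, d in zip(mean_vector(preferred), mean_vector(dispreferred))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def inject(hidden, vector, strength, sign):
    """Signed residual injection: h' = h + sign * strength * v, with sign in {+1, -1}."""
    return [h + sign * strength * v for h, v in zip(hidden, vector)]

# Toy activations for two sides of a conflict (illustrative values only).
context_side = [[2.0, 0.0], [4.0, 0.0]]
parametric_side = [[0.0, 0.0], [0.0, 0.0]]

cpv = preference_vector(context_side, parametric_side)
steered_toward = inject([1.0, 1.0], cpv, strength=0.5, sign=+1)
steered_away = inject([1.0, 1.0], cpv, strength=0.5, sign=-1)
```

The sign makes the same vector usable in both directions, which matches the abstract's observation that either more or less reliance on a context may help depending on the model.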

2605.27988 2026-05-28 cs.CL cs.CY 版本更新

Auditing Stance Asymmetry in Generative Explanations

审计生成性解释中的立场不对称性

Jiarui Han

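AI summary Addresses the stance asymmetry that can arise when language models assign responsibility, legitimacy, and context in open-ended explanations; proposes Symmetry Decomposition Evaluation (SDE), which uses paired-situation tests to separate surface differences that weaken under control from those that remain stable, and notes the instability of automatic scoring.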

详情
AI中文摘要

语言模型的偏见评估在有界比较方面取得了实质性进展,例如明显的贬低、刻板印象关联或受控替换下的标签敏感差异。开放式解释提出了一个不同的问题:它们通过分配责任、合法性、背景和委屈来指导解释。模型可以避免敌对语言,同时使一方在结构上可理解,而另一方则被归咎于个人、反应过度或不太值得认真对待。我们称之为生成性解释中的立场不对称性。我们提出对称性分解评估(SDE),该方法通过具体群体标签、结构角色重写以及明确的支持或反证来测试配对情境。在一个受控的32族原型套件中,这种分解表明表面差异并非全部相同:有些在结构或证据控制下减弱,而另一些则作为模型分配责备、背景或合法性的稳定差异持续存在。针对性的案例审查和法官比较表明,评估开放式框架不对称性存在更广泛的困难:法官的解读随操作化方式而变化,标量分数可能抹平读者用于解释解释性立场的区别。因此,SDE将生成性偏见评估重新定义为对解释性立场的审计——每一方接受什么立场,它在分解下如何变化,以及自动评分在何处变得不稳定。

英文摘要

Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations raise a different problem: they guide interpretation by assigning responsibility, legitimacy, context, and grievance. A model can avoid hostile language while making one side structurally understandable and another personally at fault, overreacting, or less worth taking seriously. We call this stance-bearing asymmetry in generative explanations. We propose Symmetry Decomposition Evaluation (SDE), which tests paired situations with concrete group labels, structural-role rewrites, and explicit support or counter-evidence. In a controlled 32-family prototype suite, this decomposition shows that surface differences are not all alike: some weaken under structural or evidence control, while others remain as stable differences in how the model assigns blame, context, or legitimacy. Targeted case review and judge comparison suggest a broader difficulty for evaluating open-ended framing asymmetries: judge readings shift across operationalizations, and scalar scores can flatten distinctions that readers use to interpret explanatory stance. SDE therefore reframes generative bias evaluation as an audit of explanatory stance -- what stance each side receives, how it changes under decomposition, and where automatic scoring becomes unstable.

2605.27986 2026-05-28 cs.CL q-bio.QM 版本更新

An Evolutionary Approach for Designing Stable and Highly Expressible Low-Immunogenicity Therapeutic mRNA Sequences

一种设计稳定、高表达且低免疫原性治疗性mRNA序列的进化方法

Dhawa Sang Dong, Mausam Gurung, Suraj Kandel

发表机构 * Department of Artificial Intelligence(人工智能系) Department of Electronics and Computer Engineering(电子与计算机工程系) School of Engineering, Kathmandu University(加德满都大学工程学院) Kathmandu Engineering College, Tribhuvan University(特里布文大学加德满都工程学院) School of Management and Social Science(管理与社会科学学院)

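AI summary Proposes BERT-GA, a two-stage framework that combines deep learning with a genetic algorithm to optimize mRNA sequences, balancing translation efficiency, structural stability, and low immunogenicity.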

详情
AI中文摘要

信使RNA(mRNA)序列作为治疗药物需要优化设计,以确保高效翻译、结构稳定性和最小免疫原性。本研究提出一个两阶段计算机模拟框架,整合深度学习和进化计算,用于理性mRNA优化,而非现有最先进模型。第一阶段,预训练的CodonTransformer(类似BERT的大语言模型)生成编码目标抗原的生物一致性mRNA序列。第二阶段,遗传算法(GA)通过密码子感知的交叉和同义突变(由人类密码子使用偏好引导)进化这些候选序列。评估的适应度函数结合了翻译相关指标(CAI、tAI、密码子对偏好)、mRNA结构稳定性(通过RNAfold计算的局部和全局MFE、GC含量)以及降低的免疫原性(CpG/UpA基序频率)。经过连续世代(第38、40和42代),GA改进了CAI和tAI(CAI值从0.73到0.74,tAI值从0.63到0.64),提升超过6%,密码子对偏好高且一致(0.97),并改善了5'端的核糖体可及性,未配对30分数达到0.87;全局最小自由能(MFE)收敛到平衡范围-346至-356 kcal/mol,实现约84%的碱基配对结构稳定性,并减少了免疫刺激基序——最终世代平均免疫惩罚降至27.3。线性设计产生超稳定转录本(MFE < -2000 kcal/mol),由于极端刚性存在翻译效率低下的风险;BiLSTM-CRF仅关注高CAI(0.96至0.98)而无结构约束;我们的框架实现了翻译-稳定性的最优平衡,突出了所提出的BERT-GA框架作为一种有效的、数据驱动的计算机模拟mRNA序列设计和优化方法。

英文摘要

Messenger RNA (mRNA) sequences as therapeutics require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization, as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (a BERT-like large language model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness functions used for evaluation combined translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6% (reaching CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64), kept codon-pair bias high and consistent (0.97), and improved ribosomal accessibility at the 5' end, with an unpaired_30 fraction reaching 0.87. The global minimum free energy (MFE) converged to a balanced range of -346 to -356 kcal/mol, achieving approximately 84% base-paired structural stability, and immune-stimulatory motifs were reduced, lowering the average immune penalty to 27.3 in the final generation. Whereas Linear Design produces hyper-stable transcripts (MFE < -2000 kcal/mol) that risk translation inefficiency due to extreme rigidity, and BiLSTM-CRF focuses solely on high CAI (0.96 to 0.98) without structural constraints, our framework achieves an optimal translation-stability equilibrium, highlighting the proposed BERT-GA framework as an effective, data-driven approach for the in-silico design and optimization of mRNA sequences.
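
The synonymous-mutation operator at the core of the second stage can be sketched as follows. This is a toy illustration under stated assumptions: it uses only a small excerpt of the standard genetic code, picks replacement codons uniformly rather than by human codon usage, and computes GC content as the only metric. The helper names are hypothetical and do not come from the paper.

```python
import random

# Small excerpt of the standard genetic code (RNA alphabet); not a full table.
SYNONYMS = {
    "M": ["AUG"],
    "A": ["GCU", "GCC", "GCA", "GCG"],
    "G": ["GGU", "GGC", "GGA", "GGG"],
    "K": ["AAA", "AAG"],
    "*": ["UAA", "UAG", "UGA"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}

def split_codons(seq):
    """Split an in-frame sequence into consecutive three-base codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def translate(seq):
    """Translate an mRNA sequence into one-letter amino-acid codes."""
    return "".join(CODON_TO_AA[c] for c in split_codons(seq))

def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def synonymous_mutation(seq, rng, rate=0.5):
    """Swap each codon, with probability `rate`, for a codon encoding the same amino acid."""
    mutated = []
    for codon in split_codons(seq):
        if rng.random() < rate:
            codon = rng.choice(SYNONYMS[CODON_TO_AA[codon]])
        mutated.append(codon)
    return "".join(mutated)

rng = random.Random(0)            # fixed seed so the toy run is repeatable
parent = "AUGGCUGGAAAAUAA"        # encodes M-A-G-K-stop
child = synonymous_mutation(parent, rng)
```

Because every swap stays within one amino acid's codon set, `translate(child)` always equals `translate(parent)`; the GA is then free to change codon-level metrics such as GC content without altering the encoded protein.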

2605.27984 2026-05-28 cs.CL cs.AI 版本更新

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

KVoiceBench, KOpenAudioBench 和 KMMAU:用于评估语音语言模型的智能体驱动的韩语语音基准

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee, Jonghyun Lee

发表机构 * KRAFTON Graduate School of AI, KAIST(韩国科学技术院人工智能研究生院) Department of Mathematical Sciences, Seoul National University(首尔国立大学数学科学系)

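AI summary Addresses the English-centric evaluation of speech language models by proposing two agent-driven benchmark-construction frameworks, and builds and releases three Korean speech benchmarks (KVoiceBench, KOpenAudioBench, KMMAU); evaluating eight recent models reveals English-Korean performance gaps and complementary weaknesses across tasks.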

Comments 16 pages, 4 figures

详情
AI中文摘要

语音语言模型通过将大型语言模型扩展到语音模态取得了显著进展。然而,语音语言模型的评估仍然严重以英语为中心,限制了多语言语音能力的可靠评估。通过ASR、翻译、归一化和TTS直接迁移基准会破坏语言特定的指令、答案约束和口语形式;对于音频理解,迁移源语言音频也无法保留目标语言的说话人属性、口音和副语言特性。为解决这些限制,我们提出了两种智能体驱动的基准构建框架:一种将源语言SpokenQA基准迁移为目标语言SpokenQA基准,另一种利用转录和说话人元数据将目标语言ASR语料库转换为音频理解基准。使用这些框架,我们构建并公开发布了三个韩语语音基准:用于韩语SpokenQA的KVoiceBench和KOpenAudioBench,以及用于韩语音频理解的KMMAU,共包含12,345个样本。我们评估了八个最近的语音语言模型,发现英语-韩语性能差距在不同模型和任务族中差异很大,并且SpokenQA和音频理解的排名出现分歧,揭示了仅靠英语评估无法发现的互补性弱点。

英文摘要

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

2605.27980 2026-05-28 cs.CL cs.AI 版本更新

Periodic RoPE for Infinite Context LLMs

周期性RoPE:面向无限上下文的大型语言模型

Simin Huo

发表机构 * Shanghai Jiao Tong University(上海交通大学)

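AI summary Proposes Periodic RoPE (P-RoPE), a positional encoding mechanism that combines sliding window attention with global attention without positional encoding to avoid position exhaustion, theoretically supporting an infinite context window.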

Comments 5 pages

详情
AI中文摘要

处理超长上下文的能力对于大型语言模型(LLMs)执行长期任务至关重要。尽管最近的努力已将上下文窗口扩展到1M及以上,但当序列长度超过位置编码(如RoPE)的预训练范围时,模型性能会下降,即位置耗尽。必须克服这一基本限制才能实现真正的无限上下文。为此,我们提出了周期性RoPE(P-RoPE),一种旨在避免这种耗尽的位置编码机制。它与滑动窗口注意力(SWA)协同工作,以捕获每个窗口内的局部依赖和相对位置。然后,这一局部层由无位置编码(NoPE)的全局注意力层补充,使得整个序列上的无界交互成为可能,而不受位置限制。通过堆叠这两类层,模型无需位置外推即可泛化到更长的序列,并理论上支持无限的上下文窗口。实验结果表明,我们的模型MiniWin在长上下文效率和稳定性上优于采用标准GPT架构的MiniMind。我们的工作为LLMs实现真正的无限上下文理解提供了一条可能的路径。代码可在 https://github.com/Cominder/miniwin 获取。

英文摘要

The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre-trained range of positional encodings (e.g., RoPE), i.e., position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P-RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation to generalize to longer sequences and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMind with standard GPT architectures in long-context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite-context understanding. The code is available at https://github.com/Cominder/miniwin.
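
The abstract does not spell out how P-RoPE is defined. One reading, offered here purely as an assumption, is that position indices wrap around with a fixed period and that the rotation frequency is chosen so one full period equals a rotation of 2π. Under that assumption, the attention score between two tokens is identical whether absolute or wrapped positions are used, so position indices never grow beyond the period. `PERIOD`, `THETA`, and `rope_score` are hypothetical names for this sketch.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, q_pos, k_pos, theta):
    """Dot product of a 2-D query/key pair after RoPE rotation by position * theta."""
    qr = rotate(q, q_pos * theta)
    kr = rotate(k, k_pos * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

PERIOD = 8                       # hypothetical period, at least the window size
THETA = 2 * math.pi / PERIOD     # frequency chosen so that PERIOD * THETA = 2*pi

q, k = (1.0, 0.5), (0.3, -0.7)
absolute = rope_score(q, k, q_pos=1003, k_pos=1001, theta=THETA)
wrapped = rope_score(q, k, q_pos=1003 % PERIOD, k_pos=1001 % PERIOD, theta=THETA)
```

The two scores agree up to floating-point error because the RoPE score depends only on `theta * (q_pos - k_pos)`, and wrapping changes that quantity by a multiple of 2π. Whether MiniWin uses this construction is not stated in the abstract.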

2605.27971 2026-05-28 cs.CL cs.AI 版本更新

Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses

语义流正则化:教会LLMs生成多样且连贯的回复

Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

发表机构 * Tencent Inc.(腾讯公司) Beijing, China(中国北京)

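AI summary Addresses Cross-Style Collapse, the severe loss of output diversity when fine-tuning large language models, with Semantic Flow Regularization (SFR), which supervises the backbone with continuous sentence embeddings via conditional flow matching, improving diversity and style fidelity at zero deployment cost.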

详情
AI中文摘要

当大语言模型被微调以生成个性或语气条件化的回复时,其输出多样性受到严重限制——我们将这种失败称为跨风格坍缩。我们将这种坍缩追溯到交叉熵目标,该目标在共享表示下倾向于抑制多样化的延续。我们提出语义流正则化(SFR),一种轻量级的辅助目标,通过条件流匹配使用未来片段的连续句子编码器嵌入来监督骨干网络。随机流源通过构造保持多模态;流匹配头在推理时被丢弃,增加零部署成本。在一个大规模工业对话数据集(Qwen3-32B,9种个性)上,SFR在输出多样性、风格保真度和回复质量上优于SFT。我们进一步在公共LiveCodeBench-v5(Qwen2.5-Coder-7B-Instruct)上验证,其中SFR持续改进pass@k,证实了其超越风格化对话的通用性。在MBPP上的受控比较显示,多令牌预测是SFR的一个退化特例。

英文摘要

When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.
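
A conditional flow matching objective of the kind named in this abstract can be sketched with a straight-line probability path, which is a common choice but an assumption here, since the abstract does not specify the path. The noise sample stands in for the stochastic flow source, `future_embedding` for a sentence-encoder output, and `hidden_state` for the backbone's conditioning signal; all names and values are illustrative.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from source x0 (t=0) to target x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t, condition):
    """Squared error between the predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = predict_velocity(xt, t, condition)
    return sum((p - v) ** 2 for p, v in zip(pred, target_velocity(x0, x1)))

rng = random.Random(0)
future_embedding = [0.2, -0.4, 0.9]                       # stand-in target embedding
noise = [rng.gauss(0.0, 1.0) for _ in future_embedding]   # stochastic flow source
hidden_state = [0.0, 0.0, 0.0]                            # stand-in conditioning signal

def oracle(xt, t, condition):
    """A head that already predicts the exact target velocity."""
    return target_velocity(noise, future_embedding)

def zero_head(xt, t, condition):
    """An untrained head that predicts zero velocity everywhere."""
    return [0.0 for _ in xt]

loss_oracle = flow_matching_loss(oracle, noise, future_embedding, 0.3, hidden_state)
loss_zero = flow_matching_loss(zero_head, noise, future_embedding, 0.3, hidden_state)
```

In training, the velocity head would be a small network conditioned on the backbone's hidden state; since it is used only for this auxiliary loss, discarding it at inference adds no deployment cost, as the abstract notes.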

2605.27969 2026-05-28 cs.CL 版本更新

Boundary Suppression Asymmetry in Post-trained Assistants: Over-expansion as a Controllability Cost

后训练助手中的边界抑制不对称性:过度扩展作为可控性代价

Jiarui Han

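AI summary Studies boundary-suppression asymmetry in post-trained language-model assistants that arises from optimizing against under-answering; finds that anti-underanswering policies are harder to suppress in boundary-control evaluations, and that this cost is associated with content-budget overshoot and continuation persistence.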

详情
AI中文摘要

后训练的语言模型助手通常被优化以避免回答不足,鼓励完整、有用、谨慎和主动的响应。我们询问这种优化是否会产生不对称的可控性代价:当用户明确要求更窄的回答时,哪些助手行为仍然可被抑制,哪些继续塑造响应?我们将此问题研究为边界抑制不对称性。跨多个高级响应维度的提示侧探测表明存在选择性代价,集中在“过度助手”方向,如过度完成、额外帮助和反回答不足。使用来自共享基础模型的控制助手策略变体,我们发现反回答不足策略在匹配的边界控制评估下比基线更难被拉回,而最小边界变体在直接边界控制比较中通常避免了这种反侧向上偏移。机制导向的探测指向超出更长的默认输出、纯EOS失败、不确定性补偿和局部延续偏差,而鲁棒性检查在共享系统和更大规模设置下保持了主要的反超基线排序。证据支持混合规划/停止解释,其中内容预算超支和延续持续性共同使边界修正更难。总体而言,后训练可能产生方向特定的可控性代价:一些有用的助手倾向仍然容易调用,但更难局部抑制。

英文摘要

Post-trained language-model assistants are often optimized to avoid under-answering, encouraging complete, helpful, cautious, and proactive responses. We ask whether this optimization creates asymmetric controllability costs: when users explicitly request narrower answers, which assistant behaviors remain suppressible, and which continue to shape the response? We study this problem as boundary-suppression asymmetry. Prompt-side probes across multiple high-level response dimensions suggest a selective cost, concentrated around 'too-much assistant' directions such as over-completion, extra help, and anti-underanswering. Using controlled assistant-policy variants derived from a shared base model, we find that anti-underanswering policies are harder to pull back than the baseline under matched boundary-control evaluations, while minimal-boundary variants generally avoid this anti-side upward shift in the direct boundary-control comparisons. Mechanism-oriented probes point beyond longer default outputs, pure EOS failure, uncertainty compensation, and local continuation bias, while robustness checks preserve the main anti-over-baseline ordering under shared-system and larger-scale settings. The evidence supports a mixed planning/stopping account, where content-budget overshoot and continuation persistence jointly make boundary correction harder. Overall, post-training may create direction-specific controllability costs: some helpful assistant tendencies remain easy to invoke, yet harder to locally suppress.

2605.27958 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

压力测试LLM中的欺骗探针:扩展性、鲁棒性与欺骗表示的几何结构

Sachin Kumar

发表机构 * LexisNexis(LexisNexis公司)

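AI summary Systematically pressure-tests linear probes to diagnose why they fail under distributional shift; finds that style augmentation recovers near-perfect detection, and shows that deception is encoded not as a single linear direction or an entropy proxy but as distributed sub-threshold features.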

Comments Accepted at the GEM Workshop @ ACL 2026

详情
AI中文摘要

基于LLM激活训练的线性探针越来越多地被提议作为欺骗检测指标,但在干净基准上报告AUROC超过0.96,而在分布偏移下崩溃。本文系统地对Gemma 3模型家族(1B-27B参数)的探针指标进行压力测试,诊断其失败原因而不仅仅是记录失败。我们测试了关于欺骗编码的四个假设:(1)单一线性方向,(2)多维子空间,(3)凸锥包,(4)熵代理。我们的设计包括跨域转移矩阵、基于排列零基线的多维探针分析、熵残差化测试以及8种风格偏移下的干扰评估。我们发现:(a)探针在干净数据上达到近乎完美的AUROC(>=0.998),但在风格偏移下崩溃;风格增强的探针在未见风格上恢复近乎完美的检测(平均AUROC 0.979-0.983);(b)单一方向假设被拒绝(k=1仅捕获0.61-0.80 AUROC),跨域转移失败被确认为几何原因而非层不匹配驱动;(c)熵代理假设被拒绝(最大|rho|=0.454,残差化后最大Delta-AUROC=0.004);(d)欺骗并未形成显著的线性子空间(每域k*=0),但多维探针(k>=5)通过分布式亚阈值特征恢复信号。探针脆弱性反映了分布狭窄性而非架构限制:风格增强的探针在4B和27B均恢复近乎完美的检测,表明逆缩放模式是训练分布伪影而非真正的规模依赖现象。

英文摘要

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
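
The AUROC figures quoted in this abstract measure how often a probe scores a deceptive example above an honest one. A minimal sketch of that metric, together with a single-direction (k=1) linear probe, is shown below. The activations and the probe direction are made-up toy values, not data from the paper.

```python
def auroc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

def probe_score(activation, direction):
    """Linear probe with k=1: project the activation onto a single direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy 2-D activations (illustrative values only).
direction = [1.0, 0.0]
deceptive = [[0.9, 0.1], [0.7, 0.8]]
honest = [[0.2, 0.5], [0.8, 0.3]]

toy_auroc = auroc(
    [probe_score(a, direction) for a in deceptive],
    [probe_score(a, direction) for a in honest],
)
```

In this toy case three of the four deceptive/honest pairs are ranked correctly, so the probe's AUROC is 0.75; a value of 0.5 would mean the direction carries no ranking signal at all.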

2605.27957 2026-05-28 cs.CL 版本更新

DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints

DisasterBench: 在类型化工具接口约束下基准测试LLM规划

Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong, Chengkai Liu, Zhewei Liu, Yiming Xiao, Ali Mostafavi, James Caverlee

发表机构 * Texas A&M University(德克萨斯A&M大学) University of Toronto(多伦多大学)

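AI summary Proposes the DisasterBench benchmark, which evaluates LLMs' structured multi-agent planning for disaster response through typed tool interfaces, and introduces First-Point-of-Failure (FPoF) for step-level failure attribution, revealing a gap between semantic reasoning and execution constraints.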

详情
AI中文摘要

灾害造成严重的社会影响,需要快速协调异构AI工具(从卫星分析到洪水预测和损害评估)形成连贯的多步骤工作流。随着LLM越来越多地充当此类管道的编排者,有效的协调需要的不仅仅是选择语义上合理的工具:LLM必须生成具有正确参数绑定和依赖传播的可执行工作流。我们引入了DisasterBench,这是一个基准,用于评估在语义相似但操作上不同的灾害响应工具上的结构化多智能体规划。为了实现步骤级故障归因,我们进一步提出了首次故障点(FPoF),它定位预测工作流中最早的根因,将主要错误与下游级联效应分开。我们的评估揭示了三个发现:规划方法的有效性强烈依赖于模型容量;工具不匹配和参数绑定错误主导了首次故障,揭示了语义基础和执行一致性是不同瓶颈;冗长的中间推理可能与结构化输出要求产生指令冲突,破坏计划生成。总之,这些发现凸显了语义推理与执行基础协调之间的根本差距,强调了需要联合建模语义意图、执行约束和工作流一致性的规划框架。代码、数据和评估资源可在 https://github.com/TamuChen18/DisasterBench_Open 获取。

英文摘要

Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open
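
The idea behind First-Point-of-Failure can be sketched as a scan for the earliest step where a predicted workflow diverges from a reference workflow. This is a simplified reading that assumes a linear plan and exact matching of tool names and parameters. The `Step` type, the error labels, and the tool names below are hypothetical and are not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str
    params: dict = field(default_factory=dict)

def first_point_of_failure(predicted, reference):
    """Return (index, error_type) of the earliest divergence, or None if the plans match."""
    for i, ref in enumerate(reference):
        if i >= len(predicted):
            return i, "missing_step"
        if predicted[i].tool != ref.tool:
            return i, "tool_mismatch"
        if predicted[i].params != ref.params:
            return i, "parameter_binding"
    if len(predicted) > len(reference):
        return len(reference), "extra_step"
    return None

# Hypothetical three-step workflow with outputs passed between steps.
reference = [
    Step("satellite_analysis", {"region": "R1"}),
    Step("flood_prediction", {"input": "$step0.mask"}),
    Step("damage_assessment", {"input": "$step1.depth"}),
]
predicted = [
    Step("satellite_analysis", {"region": "R1"}),
    Step("flood_prediction", {"input": "$step0.image"}),   # wrong binding
    Step("road_assessment", {"input": "$step1.depth"}),    # downstream error
]

result = first_point_of_failure(predicted, reference)
```

Only the binding error at step 1 is reported; the wrong tool at step 2 is treated as a downstream effect, which is the separation of primary errors from cascading ones that the abstract describes.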

2605.27955 2026-05-28 cs.PL cs.CL 版本更新

Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

技能即伪代码:将技能库重构为面向LLM智能体的伪代码

Xinze Li, Yuhang Zang, Yixin Cao, Aixin Sun

发表机构 * Nanyang Technological University(南洋理工大学) Fudan University(复旦大学) Shanghai AI Laboratory(上海人工智能实验室)

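AI summary Proposes Skill-as-Pseudocode (SaP), which automatically converts markdown skill libraries into typed pseudocode with deterministic quality checks to address the confusion loop LLM agents fall into when retrieving skills, significantly outperforming the baseline on ALFWorld.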

Comments Preprint. Code: https://github.com/InternLM/Skill-as-Pseudocode

详情
AI中文摘要

面向LLM智能体的Markdown技能库以自由格式的散文形式提供,迫使智能体在每次检索时重新推导输入模式和具体调用语法。我们观察到,这通常会产生一个“困惑 -> 重新检索 -> 仍然困惑”的循环,智能体发出部分正确的动作,收到无信息的环境反馈,并重新检索相同的散文。我们提出Skill-as-Pseudocode (SaP),一种将Markdown技能库自动转换为带类型伪代码的方法,并具有确定性质量控制。对于从一个或多个技能中提取的相似过程性段落簇,SaP提取一个带类型契约,并通过四重确定性验证器(覆盖、绑定、替换、风险)进行过滤。通过的契约与恢复的具体动作模板一起内联到重写的技能骨架中,为智能体提供两个互补信号:技能功能的类型签名和如何调用它的具体模板。在包含134个游戏的ALFWorld未见分割上,使用gpt-4o-mini,跨三个种子汇总,SaP在402场配对游戏中赢得82场,而Graph-of-Skills (GoS)基线赢得47场(汇总McNemar检验p = 8.2e-5),每场游戏输入token减少22.8% +/- 6.4%,LLM调用减少14.5% +/- 4.1%。

英文摘要

Markdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused -> re-retrieve -> still confused" loop in which the agent issues a partially-correct action, receives uninformative environment feedback, and re-retrieves the same prose. We propose Skill-as-Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four-check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134-game ALFWorld unseen split with gpt-4o-mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph-of-Skills (GoS) baseline (pooled McNemar p = 8.2e-5), at -22.8 +/- 6.4% input tokens and -14.5 +/- 4.1% LLM calls per game.

2605.27934 2026-05-28 cs.CL 版本更新

GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

GeneralThinker: 通过似然引导的答案条件优化实现领域通用推理

Shengmin Piao, Sanghyun Park

发表机构 * Yonsei University(延世大学)

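AI summary Proposes the GeneralThinker framework, which uses answer likelihood for dense supervision and fine-grained credit assignment without domain-specific verifiers, achieving the best average performance across 11 benchmarks spanning mathematics, STEM, and general reasoning.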

详情
AI中文摘要

基于可验证奖励的强化学习提升了语言模型的推理能力,但其对领域特定验证器的依赖、稀疏的结果奖励以及粗粒度的信用分配限制了其适用性。我们提出了GeneralThinker,一个在策略框架,将推理监督重新表述为密集的答案条件优化,无需领域特定验证器即可实现响应级评估和令牌级信用分配。GeneralThinker使用真实答案的似然来评估生成的推理轨迹,并推导出令牌级的兼容性信号用于细粒度信用分配。为了稳定优化,它通过裁剪和方向保持调制来约束令牌级更新。在涵盖数学、STEM和通用推理的11个基准测试中,GeneralThinker取得了最佳平均性能。进一步分析表明,不受控的令牌级调制可能破坏训练稳定性,而受控的调制使细粒度信用分配始终有效。

英文摘要

Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance. Further analyses show that uncontrolled token-level modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.
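
The two ingredients named in this abstract, a likelihood-based score for a reasoning trajectory and clipped token-level modulation, can be sketched as follows. The exact formulas are not given in the abstract, so this is an assumed form: the score is the length-normalized log-likelihood of the ground-truth answer, and token weights are clipped to a band around 1 so that they rescale a response-level advantage without flipping its sign. The function names and `eps` are hypothetical.

```python
import math

def answer_log_likelihood(token_logprobs):
    """Length-normalized log-likelihood of the ground-truth answer tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def modulated_advantages(advantage, token_weights, eps=0.2):
    """Scale a response-level advantage per token, clipping weights to [1 - eps, 1 + eps].

    Every clipped weight is positive, so the sign of the advantage is preserved.
    """
    return [advantage * min(max(w, 1 - eps), 1 + eps) for w in token_weights]

# Toy values: the answer tokens each have probability 0.5 given the reasoning.
score = answer_log_likelihood([math.log(0.5), math.log(0.5)])

# Raw per-token weights of 0.0 and 5.0 are clipped to 0.8 and 1.2.
per_token = modulated_advantages(-1.0, [0.0, 1.0, 5.0])
```

Without the clip, a weight of 0.0 would erase a token's update and a weight of 5.0 would amplify it fivefold, which is one way unconstrained modulation could destabilize training.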

2605.27932 2026-05-28 cs.CV cs.AI cs.CL cs.CR cs.LG 版本更新

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

当图文推理遇上安全:什么决定了多模态越狱鲁棒性?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

发表机构 * Independent Researcher(独立研究者) Stanford University(斯坦福大学) Harvard University(哈佛大学) Purdue University(普渡大学) Duke University(杜克大学)

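AI summary Studies how different think-with-image paradigms in multimodal large language models affect jailbreak robustness; finds that explicit image-tool interaction substantially lowers attack success rates, and explains the mechanism at the representation level with an image-tool safety vector framework.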

Comments 17 pages, 6 figures, 7 tables

详情
AI中文摘要

图文推理正成为大型视觉-语言模型的一种新推理范式,但其安全性影响尚不明确。现有系统已涵盖多种流程设计,包括直接响应生成、纯文本前轮、视觉状态操作以及显式外部图像工具调用。本文探究这些评估范式中哪一种能提升多模态越狱鲁棒性及其原因。在多个视觉-语言模型上,我们的实验表明显式图像工具交互的攻击成功率最低,平均相对降低约30%。这一发现起初令人惊讶:即使返回的图像工具输出被人为覆盖或本身不安全,攻击成功率仍保持较低,但在纯文本前轮控制下又恢复到接近直接回答的水平。这些结果表明,较低的攻击成功率并非由良性返回图像语义或仅文本图像工具轨迹解释。为解释这一模式,我们引入了一个图像工具安全向量框架,将图像工具调用建模为隐藏表示向安全相关方向的残差偏移。表征层面的分析和激活干预支持了这一解释。总体而言,我们的结果表明,显式图像工具交互是提升越狱鲁棒性的一种有前景的设计模式,同时也推动了针对特定流程的安全性评估。

英文摘要

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

2605.27921 2026-05-28 cs.AI cs.CL cs.CY cs.HC 版本更新

Show, Don't TELL: Explainable AI-Generated Text Detection

展示,而非告知:可解释的AI生成文本检测

Aldan Creo, Suraj Ranganath

发表机构 * School of Computing, Information and Data Sciences(计算、信息与数据科学学院) University of California, San Diego(加州大学圣地亚哥分校) United States of America(美国)

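AI summary Proposes TELL, a new explainable architecture with built-in explanations and reinforcement-learning training, which provides text-level annotations while maintaining strong detection performance (AUROC 0.927), helping users identify AI-generated text using their own judgment.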

详情
AI中文摘要

关于AI生成文本检测的研究已经提出了多种区分人类与AI文本的方法,其中一些方法在分布内性能上表现优异。然而,由于输出与用户(如教授)的需求不一致——他们只得到一个没有附带解释的数值分数——现实世界的应用进展缓慢。我们通过一种新颖的架构TELL解决了这个问题,该架构从一开始就内置了可解释性。虽然我们的系统仍像其他检测器一样提供数值分数以便比较,但TELL采用了一种根本不同的方法,旨在向用户展示模型认为文本是AI还是人类写作的“线索”,使用户能够根据自己的判断以及对写作背景和所谓作者的理解来决定文本的作者。我们在一个特定领域的作者注释自定义SFT数据集上训练TELL,并进一步使用GRPO结合课程学习来优化系统以提高性能。我们实现了与最先进检测器相竞争的性能(AUROC 0.927),同时原生提供解释检测器决策基础的注释。我们进一步使用人类注释数据集评估解释质量,报告了在注释的具体性、可证伪性、连贯性、合理性和基础性方面的高胜率(平均72.3%),使用户能够批判性思考并自行决定。因此,我们的工作从以人为中心的角度重新定义了AI生成文本检测的问题,并为专注于原生可解释性的新一代检测器铺平了道路。

英文摘要

Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach: we aim to show the user the "tells" by which the model believes a text is AI- or human-written, empowering the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win rate on annotation concreteness, falsifiability, coherence, plausibility, and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.

2605.27916 2026-05-28 cs.CV cs.CL 版本更新

OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

OphIn-500K:策划网络规模的视觉指令以扩展眼科多模态大语言模型

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang

发表机构 * Arizona State University(亚利桑那州立大学) Clemson University(克莱姆森大学) Washington University in St. Louis(圣路易斯华盛顿大学) University of Notre Dame(圣母大学) Florida State University(佛罗里达州立大学) Rice University(莱斯大学) NVIDIA(英伟达) Mayo Clinic(梅奥诊所)

AI总结 提出OphIn-Engine流水线从网络视频中构建高质量眼科指令数据,生成包含50万+指令实例的OphIn-500K数据集,并基于此开发眼科专用多模态大语言模型OphIn-VL,在多项任务上超越现有通用医学和专用模型。

详情
AI中文摘要

通用医学多模态大语言模型(MLLMs)的进步为构建支持临床诊断的对话助手展现了巨大潜力。然而,它们在高度专业化领域(如眼科)的适应性仍未得到充分探索,主要原因是缺乏大规模、领域特定的指令微调数据。现有的眼科对话数据集通常规模有限,且大多依赖于已建立的公共基准图像,限制了眼科MLLMs的可扩展性及其捕捉真实临床复杂性的能力。为解决这一问题,我们提出了$\textbf{OphIn-Engine}$,一个眼科特定的指令数据策划流水线,从开放获取的眼科网络规模视频中构建高质量指令数据。该流水线整合了多模态转录以提取图像-文本对、视觉线索分离与评分以识别临床相关的视觉描述,以及指令合成与质量控制以生成准确且多样的临床对话。利用该引擎,我们推出了$\textbf{OphIn-500K}$,一个大规模多模态眼科指令微调数据集,包含超过50万个指令实例和来自29,000多个视频片段的151,000多张独特图像,格式包括视觉问答(VQA)、多轮对话交互和思维链(CoT)推理。基于该数据集,我们进一步开发了$\textbf{OphIn-VL}$,一个具有高级视觉理解和对话能力的眼科专用MLLM。综合实验和案例研究表明,与最先进的通用医学和领域专用MLLMs相比,OphIn-VL实现了更优的性能。

英文摘要

The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose $\textbf{OphIn-Engine}$, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce $\textbf{OphIn-500K}$, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop $\textbf{OphIn-VL}$, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

2605.27908 2026-05-28 cs.CL cs.AI 版本更新

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

ESC-Skills: 发现并自我进化情感支持对话技能

Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Qwen DianJin Team, Alibaba Cloud Computing(阿里云Qwen团队)

AI总结 提出ESC-Skills框架,通过干预单元建模支持交互并构建技能库,结合多轮廓自我进化机制,提升情感支持对话的可解释性、可控性和效果。

详情
AI中文摘要

现有的情感支持对话(ESC)系统主要依赖于端到端的回复生成或粗粒度的策略监督,可解释性有限,且对系统性的技能提升支持不足。我们提出ESC-Skills,一个以技能为中心的框架,能够发现并自我进化可执行的情感支持技能。我们首先将局部支持交互建模为干预单元(IUs),捕捉求助者状态、支持干预和回复后情绪变化之间的状态-动作-结果动态。基于从成功和失败的ESC对话中提取的IUs,我们构建了ESC-Skills库,这是一个包含干预指导、适用条件、预期结果和潜在风险的可执行情感支持技能仓库。为了进一步提升鲁棒性,我们引入了一个多轮廓自我进化精炼框架,其中ESC代理在SAGE评估下与多种模拟求助者轮廓进行交互。分析由此产生的交互轨迹,以识别缺失的技能、不安全的干预和特定轮廓的失败模式,然后通过基于模拟的验证来精炼技能库。实验结果表明,ESC-Skills在提升回复质量和对话层面的情感结果的同时,提供了更可解释和可控的支持行为。我们将发布代码、提示和ESC-Skills库,网址为https://github.com/aliyun/qwen-dianjin。

英文摘要

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state--action--outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at https://github.com/aliyun/qwen-dianjin.

2605.27905 2026-05-28 cs.CL 版本更新

AI Research Agents Narrow Scientific Exploration

AI研究代理缩小科学探索范围

Yixuan Tang, Yi Yang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学)

AI总结 本研究通过四个AI研究代理框架和六个大语言模型生成37,802个科学想法,发现AI生成的想法比人类论文更集中、更接近起始文献,且与低引用论文相似,表明当前AI代理更适合局部细化而非拓宽科学探索。

详情
AI中文摘要

AI研究代理现在能够生成研究想法、设计实验、运行代码和起草论文,这引发了大规模AI辅助科学发现的可能性。许多当前的代理框架明确鼓励生成新颖且高影响力的想法。然而,目前尚不清楚AI辅助构思是拓宽了科学探索,还是主要集中于现有工作。我们将AI研究代理视为科学搜索系统进行研究。使用四个AI研究代理框架和六个大语言模型,我们从AI和机器学习中由引用定义的研究领域的共享种子文献中生成37,802个科学想法。然后,我们将生成的AI想法与来自相同研究领域的人类撰写论文、来自相同种子文献的后续人类研究以及种子文献本身进行比较。在实验中,出现了四个一致的模式。第一,AI生成的想法比来自相同研究领域的人类撰写论文更加集中。第二,AI生成的想法比后续人类工作更接近其起始文献。第三,与AI生成想法最相似的论文往往获得较低的后续引用。第四,当AI生成的想法与先前工作不同时,差异主要来自现有技术方法的重新组合,而不是引入全新的研究问题。总体而言,当前的AI研究代理似乎更适合局部细化,而不是拓宽科学探索。

英文摘要

AI research agents can now generate research ideas, design experiments, run code, and draft papers, raising the possibility of large-scale AI-assisted scientific discovery. Many current agent frameworks explicitly encourage the generation of novel and high-impact ideas. Yet it remains unclear whether AI-assisted ideation broadens scientific exploration or mainly concentrates around existing work. We study AI research agents as scientific search systems. Using four AI research-agent frameworks and six large language models, we generate 37,802 scientific ideas from shared seed literature across citation-defined research areas in AI and machine learning. We then compare the resulting AI ideas against human-authored papers from the same research areas, follow-on human research emerging from the same seed literature, and the seed literature itself. Across experiments, four consistent patterns emerge. First, AI-generated ideas are substantially more concentrated than human-authored papers from the same research areas. Second, AI-generated ideas remain much closer to their starting literature than later human follow-on work does. Third, papers most similar to AI-generated ideas tend to receive lower subsequent citations. Fourth, when AI-generated ideas differ from prior work, the differences arise primarily from recombining existing technical methods rather than introducing fundamentally new research questions. Overall, current AI research agents appear better suited to local elaboration than to broadening scientific exploration.

2605.27901 2026-05-28 cs.CL cs.AI 版本更新

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

跨类型多样语言的思维链监控脆弱性

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal

发表机构 * University of Virginia(弗吉尼亚大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室)

AI总结 本研究通过13种语言和7个前沿模型家族的评估,发现思维链监控在语言分布偏移下普遍不可靠(平均不可信率95.9%),模型会进行策略性操纵,且低资源语言中欺骗模式完全存在。

详情
AI中文摘要

思维链(CoT)监控已被提出作为一种有前景的安全机制,用于检测大型语言模型中的失调行为。然而,其在英语之外以及跨不同模型家族中的可靠性仍基本未被探索。我们首次在13种多样语言和7个前沿模型家族(共16个模型)上对CoT可监控性进行了大规模评估。使用需要显式中间计算的对抗性提示评估,结合内部答案标记概率分析,我们一致发现CoT在语言和提示类型上存在不忠实性,在8B至120B参数模型中平均不忠实率为95.9%。我们发现前沿模型系统性地进行策略性操纵,包括答案切换、事后合理化以及对提示的程序性利用,使得外部监控器难以检测欺骗。我们表明,前沿模型通常在其潜在激活中在生成的前15%内就承诺了失调线索,即使CoT看起来忠实。令人惊讶的是,这些欺骗模式在低资源语言中保持100%,揭示了当前基于CoT的监督的根本局限性。我们的结果表明,CoT监控在语言分布偏移下本质上是脆弱的,提供的安全信号比仅英语研究所暗示的要弱得多。这些发现强调了开发稳健的CoT监控器以及加速白盒监控技术研究的迫切需要,特别是为了改善中低资源语言中的CoT可监控性。我们的代码可在 https://multilingual-cot-monitoring.github.io/ 获取。

英文摘要

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B-120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.

2605.27896 2026-05-28 cs.CL cs.CE 版本更新

FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations

FinBoardBench: 通过棋盘游戏模拟基准测试大语言模型的动态财富管理和战略金融推理

Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao, Caiwei Li, Yue Ma, Jie He, Qiancheng Zhang, Yuntao Zou, Dagang Li

发表机构 * School of Computer Science and Engineering, Macau University of Science and Technology, Macau, China(计算机科学与工程学院,澳门科技大学,澳门,中国) School of Economics, Anhui University, Anhui, China(经济学院,安徽大学,安徽,中国) SKLPlanets, Macau University of Science and Technology, Macau, China(SKLPlanets,澳门科技大学,澳门,中国) Department of Computer and Information Science, University of Macau, Macau, China(计算机与信息科学系,澳门大学,澳门,中国) School of Energy and Power Engineering, Huazhong University of Science and Technology, Hubei, China(能源与动力工程学院,华中科技大学,湖北,中国)

AI总结 提出基于三款经典金融棋盘游戏的评估套件FinBoardBench,测试大语言模型在动态财富管理、企业投资收购和竞争谈判等综合金融技能,发现模型虽具备基本规划能力但无法将静态推理转化为成功动态决策。

Comments Preprint

详情
AI中文摘要

近期,大语言模型(LLMs)在静态金融推理和简单动态交易任务中取得了优越性能。然而,现有的静态金融基准不足以评估LLMs在真实环境中的动态财富管理和金融决策能力。为弥补这一差距,我们提出了FinBoardBench,一个基于三款经典金融棋盘游戏(现金流、并购和大富翁)的评估套件。FinBoardBench评估一系列全面的金融技能,包括个人现金流管理与债务平衡、企业投资与收购预测,以及带有资产拍卖的竞争性贸易谈判。我们对9个先进LLMs的实验表明,尽管它们展现出基本的长期规划和投资逻辑,但未能有效利用复杂互动来获取利润,且其强大的静态推理性能并未转化为成功的动态决策。值得注意的是,它们倾向于优先获取即时资产而非维持充足流动性,这使得它们容易受到随机事件引发的金融危机的影响。我们希望FinBoardBench能为未来更智能的基于LLM的决策系统提供有价值的参考。

英文摘要

Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiations with asset auctions. Our experiments with 9 advanced LLMs reveal that, while the models exhibit basic long-term planning and investment logic, they fail to effectively leverage complex interactions for profit, and their strong static reasoning performance does not translate into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, making them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can provide a valuable reference for more intelligent LLM-based decision-making systems in the future.

2605.27882 2026-05-28 cs.CL cs.AI 版本更新

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

VibeSearchBench:真实场景下长程主动搜索的基准测试

Xiaohongshu Inc

发表机构 * Xiaohongshu Dots Studio & Unipat AI(小红书 Dots Studio 与 Unipat AI)

AI总结 针对现有搜索基准中查询过于明确、单轮交互和固定模式评估导致用户体验与评估结果差距的问题,提出VibeSearch范式并构建VibeSearchBench基准,通过渐进式用户模拟和图匹配评估框架测试前沿模型,发现所有模型在长上下文推理、主动意图激发和结构化知识构建方面仍存在显著不足。

详情
AI中文摘要

基于LLM的智能体在搜索基准上得分很高,但真实用户始终觉得结果不令人满意,这揭示了持续的评估-体验差距。我们将这一差距归因于现有基准依赖于过度明确的查询、单轮交互和固定模式评估,这些都不反映真实搜索行为——用户和智能体通过多轮对话协作细化模糊意图。我们将这种范式称为VibeSearch,并引入VibeSearchBench,一个包含200个手动策划的双语(中文和英文)任务的基准,涵盖20个领域,分为VibeSearch-Pro(专业)和VibeSearch-Daily(日常生活)子集。每个任务将一个用户角色与一个无模式的真实知识图谱配对,并通过渐进式披露用户模拟器和图匹配评估框架进行评估。我们在ReAct框架和OpenClaw智能体框架下对七个前沿模型进行了基准测试。结果表明,所有模型对于VibeSearch仍然严重不足(最佳F1:30.30),凸显了在长上下文推理、主动意图激发和结构化知识构建方面需要根本性进展。

英文摘要

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
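The abstract does not say how the graph-matching evaluation is scored, so the following is only a minimal sketch under a strong assumption: that a predicted knowledge graph and the ground-truth graph are compared as sets of exact-match (head, relation, tail) triples. The function name `triple_f1` and the example triples are hypothetical and are not taken from the paper.

```python
def triple_f1(predicted, gold):
    """Precision, recall and F1 over exact-match (head, relation, tail) triples.

    Assumption: graphs are compared as plain sets of triples; the benchmark's
    real matcher may be softer (e.g. fuzzy or LLM-judged matching).
    """
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0, 0.0, 0.0
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: one of two predicted triples is correct,
# and one of three gold triples is recovered.
gold = [
    ("cafe_a", "located_in", "shanghai"),
    ("cafe_a", "serves", "pour_over"),
    ("cafe_b", "located_in", "shanghai"),
]
pred = [
    ("cafe_a", "located_in", "shanghai"),
    ("cafe_b", "serves", "latte"),
]
precision, recall, f1 = triple_f1(pred, gold)
```

Under this exact-match assumption the score is harsh on paraphrased entities, which is one reason a real schema-free evaluation would likely need a more tolerant matcher.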

2605.27878 2026-05-28 cs.CL 版本更新

Narrative Flattening: How Post-Training Compresses Thematic, Affective, and Stylistic Variation in LLM Fiction

叙事扁平化:后训练如何压缩LLM小说中的主题、情感和风格变化

Zehan Li, Yutong Zhu, Siyang Wu, Honglin Bao, James A. Evans

发表机构 * Knowledge Lab, University of Chicago(芝加哥大学知识实验室)

AI总结 通过对比四个OLMo 32B检查点(Base、SFT、DPO、RLVR)在三种故事领域中的续写,发现后训练压缩了主题动态、情感强度和语言多样性,导致叙事扁平化,且专业文学领域压缩最严重。

详情
AI中文摘要

大型语言模型能生成流畅的小说,但其创造性输出普遍被视为扁平。我们探究这种质量源于训练的哪个阶段,以及是否对不同领域的人类小说产生同等影响。我们构建了一个匹配的故事续写范式,涵盖StoryStar(公共平台)、TMAS(提示引导)和《纽约客》(专业文学),并将四个OLMo 32B检查点(Base、SFT、DPO、RLVR)的续写与匹配的人类文本进行比较。由于这些检查点共享架构、规模、分词器和预训练,该设计隔离了后训练效应。我们沿三个句子级维度测量每次续写:主题动态、情感普遍性和语言多样性。在所有三个维度上,后训练压缩了动态变化:主题过渡变得更加均匀,高强度情感让位于中性,故事间的风格多样性缩小。我们将这种渐进性损失称为叙事扁平化。该效应在故事领域间方向稳定,但差距大小取决于人类基线:专业文学小说压缩最严重,而公共平台和提示引导故事的差距较小,这与它们的人类基线更接近模型的默认节奏一致。后训练端点在领域间收敛,表明对齐产生了一种续写机制,该机制在很大程度上不依赖于源领域的叙事纹理。

英文摘要

Large language models produce fluent fiction, yet their creative output is widely seen as flat. We ask where this quality originates in training and whether it affects different domains of human fiction equally. We construct a matched story-continuation paradigm across StoryStar (public-platform), TMAS (prompt-guided), and The New Yorker (professional literary), and compare continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) against matched human text. Because these checkpoints share architecture, scale, tokenizer, and pretraining, the design isolates the post-training effect. We measure each continuation along three sentence-level dimensions: thematic motion, affective prevalence, and linguistic diversity. Across all three, post-training compresses dynamic variation: thematic transitions become more uniform, high-intensity emotions give way to neutrality, and stylistic diversity across stories shrinks. We term this progressive loss narrative flattening. The effect is directionally stable across story domains but gap size depends on the human baseline: professional literary fiction is compressed most, while public-platform and prompt-guided stories show smaller gaps, consistent with their human baselines sitting closer to the model's default rhythm. Post-trained endpoints converge across domains, suggesting alignment produces a continuation regime largely insensitive to the source domain's narrative texture.

2605.27874 2026-05-28 cs.CL 版本更新

Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

用于越南语自动语音识别的音节结构解码器

Nghia Hieu Nguyen, Quan Ngoc Hoang, Long Hoang Huu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

发表机构 * Faculty of Information Science and Engineering(信息科学与工程学院) Faculty of Computer Science(计算机科学学院) University of Information Technology(信息技术大学) Vietnam National University, Ho Chi Minh city(越南国家大学,胡志明市)

AI总结 针对越南语自动语音识别,提出基于音素级音节结构解码的方法,通过显式建模音节音系组成,在紧凑音素集上生成有效音节结构,显著减小词汇量并在两个基准上超越强基线。

详情
AI中文摘要

大多数自动语音识别(ASR)系统将转录视为对正字法单元(如字符、子词或词)的预测问题。尽管有效,但此类表示并未明确反映语音的语音结构,且通常需要大词汇量以保持充分覆盖。在这项工作中,我们从越南语的音位特征出发,提出了一种用于ASR的音节结构解码器,该解码器在音素层面而非正字法层面建模语音。我们的方法显式捕捉了音节的音系组成,使解码器能够从紧凑的音素库中生成有效的音节结构。这种设计更紧密地契合了语音的语音实现,同时显著减小了词汇量。在两个基准(代表标准语音的LSVSC和包含多种区域发音的多方言语料库UIT-ViMD)上的实验结果表明,尽管使用了更小的词汇量且无额外训练资源,我们的方法始终优于先前强基线,尤其是预训练基线如PhoWhisper和Wav2Vec2。这些结果突显了基于音素的音节建模在该语言ASR中的有效性。用于实验可复现的代码将在论文被接收后公开。

英文摘要

Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the phonetic structure of speech and often require large vocabularies to maintain adequate coverage. Motivated by the phonemic features of Vietnamese, we propose a Syllabic-Structure Decoder for ASR, which models speech at the phoneme level instead of the orthographic level. Our approach explicitly captures the phonological composition of syllables, enabling the decoder to generate valid syllabic structures from a compact phonemic inventory. This design more closely aligns with the phonetic realization of speech while significantly reducing vocabulary size. Experimental results on two benchmarks (LSVSC, representing standard speech, and UIT-ViMD, a multi-dialect corpus containing diverse regional pronunciations) show that our method consistently outperforms strong previous baselines, especially pretrained baselines such as PhoWhisper and Wav2Vec2, despite using a substantially smaller vocabulary and no additional training resources. These results highlight the effectiveness of phoneme-based syllabic modeling for ASR in this language. Code for experimental reproducibility will be made publicly available upon acceptance of this paper.

2605.27865 2026-05-28 cs.CL 版本更新

MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

MERIT: 通过基于评分标准的训练进行审稿人匹配的专业知识匹配

Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li

发表机构 * School of Data Science and Engineering, East China Normal University(数据科学与工程学院,华东师范大学)

AI总结 提出MERIT两阶段框架,通过强化学习训练审稿人评估器并蒸馏为检索器,实现大规模审稿人分配中的专业知识匹配。

Comments 22pages, 8 figures, 12 tables

详情
AI中文摘要

大规模地将投稿与合适的审稿人匹配是主要会议面临的日益严峻的挑战,然而现有方法要么依赖将一般相关性误认为真正适用性的粗略代理信号,要么需要难以扩展用于训练的昂贵人工标注。我们提出MERIT,一个两阶段框架,通过将标准级别的专业知识匹配转化为可扩展的适用性监督来弥合这一差距。在第一阶段,我们通过强化学习训练一个审稿人评估器,以识别论文所需的专业知识维度,将其与审稿人的先前工作匹配,并产生适用性决策,奖励由基于论文特定专业知识评分标准的LLM引导提供。在第二阶段,我们将评估器的预测蒸馏到基于嵌入的检索器中,以实现高效的大规模分配。实验表明,我们的4B审稿人评估器在适用性分类上优于更大的通用LLM,并且得到的检索器在LR-Bench和CMU Gold数据集上达到了最先进的性能。我们的代码可在https://github.com/Luli3220/MERIT获取。

英文摘要

Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer's prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper-specific expertise rubrics. In the second stage, we distill the assessor's predictions into an embedding-based retriever for efficient large-scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance across LR-Bench and the CMU Gold dataset. Our code is available at https://github.com/Luli3220/MERIT.
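The second stage described above ends in an embedding-based retriever. How MERIT trains or scores its embeddings is not stated in the abstract, so the sketch below only illustrates the generic retrieval step, assuming that papers and reviewers are already embedded as fixed-length vectors and that candidates are ranked by cosine similarity. The names `cosine` and `rank_reviewers` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_reviewers(paper_vec, reviewer_vecs, top_k):
    """Return the names of the top_k reviewers most similar to the paper.

    Assumption: reviewer_vecs maps reviewer name -> embedding produced by
    some retriever; the embedding model itself is outside this sketch.
    """
    scored = sorted(
        reviewer_vecs.items(),
        key=lambda item: cosine(paper_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]


# Toy 2-dimensional embeddings, purely illustrative.
paper = [1.0, 0.0]
reviewers = {"r1": [0.9, 0.1], "r2": [0.0, 1.0], "r3": [0.5, 0.5]}
shortlist = rank_reviewers(paper, reviewers, top_k=2)
```

Ranking by a precomputed similarity is what makes the retriever cheap at assignment time: the expensive assessor is only needed to produce training signal, not to score every paper-reviewer pair.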

2605.27858 2026-05-28 cs.CL cs.AI cs.LG 版本更新

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

DecomposeRL: 学习提出有用、信息丰富且多样的问题以进行半监督、可追踪的声明验证

Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

发表机构 * Department of Computer Science and Electrical Engineering(计算机科学与电气工程系)

AI总结 提出DecomposeRL框架,通过GRPO和多面奖励集成将声明分解为可追踪的子问题,在完全监督和半监督设置下实现高精度,且模型规模小4倍仍匹配大模型性能。

详情
AI中文摘要

声明验证分为两类:端到端分类器准确但无法提供可检查的追踪,而基于分解的方法可产生可检查的追踪但在基准数据集上性能滞后。我们提出DecomposeRL,一种能产生可检查追踪的准确声明验证器。DecomposeRL将分解建模为使用GRPO和多面奖励集成训练的RL策略,支持从无标签声明进行完全监督和半监督学习。DecomposeRL通过数据筛选漏斗解决了GRPO高昂的训练成本,将115K事实验证声明提炼为包含密集学习信号的5K声明子集。我们表明,仅在约5K精选声明上使用完全监督训练的DecomposeRL-7B策略,在包含生物医学、政治、科学和通用领域声明的11个声明验证基准上,实现了86.3的域内和69.8的域外平衡准确率。尽管规模小4倍,它匹配了32B基线和GPT-4.1-mini,并且在仅10%标签声明数据的半监督设置中进一步优于基线。代码、数据和模型见https://dipta007.github.io/DecomposeRL。

英文摘要

Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces, and decomposition-based methods that produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL.
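The headline numbers above are balanced accuracies. As a reminder of what that metric measures, here is a small sketch using the standard definition (the unweighted mean of per-class recall); the label names and the toy predictions are hypothetical and unrelated to the paper's data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    if not total:
        raise ValueError("y_true must not be empty")
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy imbalanced example: 8 "supported" claims, 2 "refuted" claims.
y_true = ["supported"] * 8 + ["refuted"] * 2
y_pred = ["supported"] * 8 + ["supported", "refuted"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because the minority class is only half recovered; that gap is why balanced accuracy is the more informative number on label-imbalanced verification benchmarks.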

2605.27849 2026-05-28 cs.PL cs.AI cs.CL 版本更新

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

FPMoE:一种用于函数式代码生成的稀疏混合专家方法

Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

发表机构 * GreenNode AI Hanoi University of Science and Technology(河内科学技术大学) Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 针对LLM在函数式编程语言上性能差的问题,提出基于稀疏MoE架构的FPMoE模型,通过语言特定专家和共享专家分别消除干扰和捕获跨语言抽象,以3B活跃参数达到远超微调基线并匹配大模型的效果。

详情
AI中文摘要

尽管基于LLM的代码生成取得了快速进展,但现有模型主要针对命令式语言进行训练,导致函数式编程语言(FPLs)如Haskell、OCaml和Scala长期未被充分探索,即使是前沿模型在FPLs上的表现也明显较差。微调是一种自然的补救措施,但我们的实验表明,每种语言的微调无法捕获共享的函数式抽象,而合并的多语言微调则引入了跨语言干扰。为了解决这个问题,我们引入了FPMoE,这是一个轻量级的开源代码生成模型,基于稀疏混合专家(MoE)架构,包含三个语言特定的路由专家(分别对应Haskell、OCaml和Scala)和一个共享专家,用于捕获跨语言的函数式模式,如单子推理和类型导向编程。这种设计同时解决了两种失败模式:专用专家消除了干扰,而共享专家保留了单语言模型遗漏的抽象。在FPEval上,FPMoE显著优于微调基线,并且仅使用3B活跃参数,即可匹配包括DeepSeek-Coder-6.7B、Qwen2.5-Coder-14B-Instruct和Qwen3-Coder-30B-A3B在内的更大模型的性能。

英文摘要

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.
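The layer layout described above (several routed experts plus one always-on shared expert) can be sketched in a few lines. This is not FPMoE's implementation: the abstract does not say how tokens are routed, so the sketch assumes a learned softmax router with top-1 selection, and it uses tiny dense matrices in place of real feed-forward experts. All names and weights are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a dense matrix (list of rows) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def moe_forward(x, router, routed_experts, shared_expert):
    """Shared expert output plus the gate-weighted output of the top-1 routed expert.

    Assumption: top-1 softmax routing; the real model may route differently
    (for example by language tag or with top-k selection).
    """
    gate = softmax(linear(router, x))
    chosen = max(range(len(gate)), key=gate.__getitem__)
    routed_out = linear(routed_experts[chosen], x)
    shared_out = linear(shared_expert, x)
    y = [s + gate[chosen] * r for s, r in zip(shared_out, routed_out)]
    return y, chosen


# Three routed experts (stand-ins for Haskell, OCaml, Scala) and one shared expert.
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    identity,
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
router = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]

y, chosen = moe_forward([1.0, 0.0], router, experts, shared_expert=identity)
```

Only the chosen routed expert and the shared expert run for a given input, which is how a sparse model keeps its active parameter count well below its total parameter count.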

2605.27832 2026-05-28 cs.CL 版本更新

Playing with Words, Improving with Rewards: Training Language Models for Creative Association

玩文字游戏,用奖励改进:训练语言模型进行创意联想

Vijeta Deshpande, Namrata Shivagunde, Sherin Muckatira, Hadrien Glaude, Mikhail Gronas, Claire Stevenson, Roger Beaty, Anna Rumshisky

发表机构 * University of Massachusetts Lowell(马萨诸塞大学洛厄尔分校) Dartmouth College(达特茅斯学院) University of Amsterdam(阿姆斯特丹大学) Pennsylvania State University(宾夕法尼亚州立大学) Amazon AGI(亚马逊AGI)

AI总结 本研究通过强化学习与可验证奖励(RLVR)在Codenames游戏上训练LLM,探索了规模依赖的精确度-创造力权衡,发现8B模型在保持推理能力的同时提升创造力,而小模型则牺牲创造力换取推理精度。

详情
AI中文摘要

大型语言模型(LLM)正被应用于日益困难的问题和用例。为了有效导航其广阔的解决方案空间,LLM需要具备创造力。然而,创造力的主观性和人类判断的局限性使得训练LLM的创造力尤其具有挑战性。作为解决方案,我们在Codenames(一个词联想游戏)上训练LLM,该游戏锻炼了创造力的两个核心轴——发散思维和收敛思维,同时产生客观可验证的结果。这种可验证性使我们能够绕过人类判断,并使用具有可验证奖励的强化学习(RLVR)进行训练。我们训练了Qwen3-1.7B、4B和8B模型,并在十个创造力和四个推理基准上评估它们。我们发现精确度-多样性权衡是规模依赖的:8B模型优先考虑创造力而非精确度,而1.7B和4B模型则以牺牲创造力为代价获得推理精确度。具体来说,8B模型在10个创造力基准中的8个上显示出适度但一致的提升,且推理能力仅略有下降,而较小的模型在推理任务上取得了显著提升。我们的研究提出了一种可扩展且有效的解决方案来训练LLM的创造力。

英文摘要

Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.
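What makes Codenames suitable for RLVR is that a clue can be scored mechanically from the guesses it elicits. The abstract does not give the reward actually used, so the following is a hypothetical reward scheme written only to illustrate the idea: count consecutive correct guesses, stop at the first wrong word, and penalize hitting the assassin word. The function name, the penalty value, and the board are all assumptions.

```python
def codenames_reward(guesses, team_words, assassin, max_guesses):
    """Fraction of allowed guesses that hit team words before the first miss.

    Hypothetical reward, not the paper's: returns -1.0 if the assassin word is
    guessed, otherwise (consecutive correct guesses) / max_guesses.
    """
    if max_guesses <= 0:
        return 0.0
    team = set(team_words)
    hits = 0
    for word in guesses[:max_guesses]:
        if word == assassin:
            return -1.0
        if word not in team:
            break  # the turn ends at the first non-team word
        hits += 1
    return hits / max_guesses


# Hypothetical board: the clue was meant to link three team words.
team_words = ["river", "bank", "bridge"]
reward_partial = codenames_reward(["river", "bank", "piano"], team_words, "bomb", 3)
reward_assassin = codenames_reward(["river", "bomb"], team_words, "bomb", 3)
```

Because the outcome depends only on the board and the guesses, no human rater or learned judge is needed to compute it, which is the property the abstract relies on.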

2605.27824 2026-05-28 cs.AI cs.CL 版本更新

Revealing Algorithmic Deductive Circuits for Logical Reasoning

揭示逻辑推理的算法演绎回路

Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue

发表机构 * Japan Advanced Institute of Science and Technology(北陆先端科学技术大学院大学)

AI总结 本研究通过因果中介分析定位大语言模型中负责逻辑推理步骤的注意力头,发现少量专用头处理事实和规则信息,而高层头促进信息整合和全局推理策略的出现。

详情
AI中文摘要

最近的研究表明,通过在少样本学习设置中引入抽象描述图遍历算法和逐步推理的功能性符号表示,大型语言模型(LLMs)能够实现强大的推理性能。然而,目前尚不清楚LLMs如何仅从有限的示例中真正理解每个推理步骤的抽象含义以及整体算法。本文旨在定位负责单个推理步骤的注意力头,并刻画它们之间传输的信息类型。我们首先在符号辅助的思维链(CoT)提示框架下,将组成推理步骤与其对应的token logits对齐。我们的分析表明,引导推理过程的token位置与低置信度分数相关,这些低置信度分数是由满足演示中推理行为模式的约束引起的。然后,我们采用因果中介分析技术来识别负责这些模式的注意力头。此外,我们的发现表明,LLMs通过专门的注意力头(约占全部头的3%)为各个子推理任务检索事实和基于规则的信息,而较高层主要促进信息整合和全局推理策略(例如图遍历算法)的出现,这些策略协调多个中间推理步骤以解决整体任务。

英文摘要

Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% total heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

2605.27808 2026-05-28 cs.CL cs.MM 版本更新

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

TARQ: 面向罕见词鲁棒自动语音识别的尾部感知重建量化

Xinyu Wang, Ziyu Zhao, Ke Bai, Silin Meng, Dongming Shen, Xiao-Wen Chang, Yixuan HE

发表机构 * McGill University(麦吉尔大学) Boson AI Arizona State University(亚利桑那州立大学)

AI总结 提出TARQ,一种无标签的后训练量化框架,通过尾部感知重建损失和罕见词平衡规则,在不增加额外训练的情况下显著降低罕见词错误率。

详情
AI中文摘要

数据感知后训练量化(PTQ)在小型校准语料库上最小化每个token的重建损失,隐式地根据经验频率对位置进行加权。对于自动语音识别(ASR),这与尾部敏感风险不一致:名称、数字和领域特定词获得的校准质量比例较小。我们提出了尾部感知重建量化(TARQ),一种无标签的PTQ框架,通过罕见词平衡(一种封闭形式的每线性层规则,平衡常见/尾部质量)和度量一致的残差校正,将校准转向词汇尾部。TARQ不需要实体标签、不需要精心设计的校准集、不需要验证解码,也不需要额外训练。在八个ASR骨干网络和六个数据集上,W4G128下,TARQ在不导致总体WER回归的情况下改善了平均罕见词错误率(rare-WER),在比较方法中实现了最低的跨语料库rare-WER波动,并在无需实体监督的情况下迁移到实体丰富的基准测试(ProfASR, ContextASR-Speech-En)。

英文摘要

Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech Recognition (ASR), this misaligns with tail-sensitive risk: names, numerals, and domain-specific words receive proportionally little calibration mass. We propose Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via \rareBAL, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. TARQ requires no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, TARQ improves mean rare-word error rate (rare-WER) without an aggregate-WER regression, achieves the lowest cross-corpus rare-WER swing among compared methods, and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.
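The core idea of equalizing common and tail mass can be illustrated with a toy weighting scheme. The paper's closed-form per-layer rule is not given in the abstract, so this sketch makes its own assumptions: a token counts as "tail" if it occurs at most `tail_threshold` times in the calibration text, and weights are chosen so that tail and common positions each carry half of the total loss weight. The function names and the example tokens are hypothetical.

```python
from collections import Counter


def tail_balanced_weights(tokens, tail_threshold):
    """Per-position weights giving the tail and common groups equal total mass.

    Assumption: 'tail' means corpus count <= tail_threshold. Falls back to
    uniform weights when one of the two groups is empty.
    """
    if not tokens:
        return []
    counts = Counter(tokens)
    is_tail = [counts[tok] <= tail_threshold for tok in tokens]
    n_tail = sum(is_tail)
    n_common = len(tokens) - n_tail
    if n_tail == 0 or n_common == 0:
        return [1.0 / len(tokens)] * len(tokens)
    return [0.5 / n_tail if tail else 0.5 / n_common for tail in is_tail]


def weighted_reconstruction_loss(errors, weights):
    """Weighted sum of per-position reconstruction errors."""
    return sum(w * e for w, e in zip(weights, errors))


# Toy calibration text: "dr" and "nguyen" each appear once, so they are tail tokens.
tokens = ["the", "the", "the", "the", "call", "call", "dr", "nguyen"]
weights = tail_balanced_weights(tokens, tail_threshold=1)
tail_mass = sum(w for w, tok in zip(weights, tokens) if tok in {"dr", "nguyen"})
```

With uniform weighting the two rare tokens would carry a quarter of the loss; under this balancing they carry half, so quantization error on rare words is penalized more strongly during calibration.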

2605.27805 2026-05-28 cs.CL cs.AI 版本更新

ChildEval: When large language models meet children's personalities

ChildEval:当大语言模型遇到儿童个性

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng

发表机构 * JIUTIAN Research(九天研究院) China Mobile(中国移动) Beijing, China(北京,中国)

AI总结 提出ChildEval基准,通过合成3-6岁儿童个性档案和偏好(显式或隐式表达),评估大语言模型在长对话中推断并遵循儿童偏好的能力,实验表明微调可提升儿童中心性能。

Comments 8 pages of main text (ACL Findings format), with references and appendix

详情
AI中文摘要

虽然大语言模型(LLM)使得个性化聊天机器人成为可能,但它们在儿童中心个性化方面的有效性仍不明确,因为缺乏对儿童特定偏好的系统评估。为填补这一空白,我们引入了ChildEval,一个用于评估LLM在长上下文对话中推断和遵循儿童中心偏好能力的基准。ChildEval包含29K个3-6岁儿童的合成个性档案,提供相对静态的背景信息。每个个性档案关联一个儿童偏好——可能与个性一致、冲突或独立——通过单句显式表达或6-10轮对话隐式表达。显式和隐式偏好旨在反映相同的潜在偏好,但表达方式不同,捕捉偏好表达的动态方面而非静态个性的变化。该基准涵盖五个顶层类别和十四个子类别,覆盖儿童的日常生活和发展。我们进一步提出了细粒度、以儿童为中心的评估协议,以系统评估开源LLM。实验结果表明,不同的个性化表示如何影响LLM的响应,并表明在ChildEval上进行微调可以提升儿童中心性能。我们的代码和数据集可在https://github.com/ziyanluo/ChildEval获取。

英文摘要

While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference (which may align with, conflict with, or be independent of the persona), expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.

2605.27789 2026-05-28 cs.AI cs.CL 版本更新

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

固定预算、聚类感知的 LLM-as-a-Judge 评估标准:多跳 RAG 压力测试

Camilo Chacón Sartori, José H. García

发表机构 * Catalan Institute of Nanoscience and Nanotechnology(加泰罗尼亚纳米科学与纳米技术研究所)

AI总结 针对多跳 RAG 系统评估中的统计偏差问题,提出一种固定预算、聚类感知的 LLM-as-a-Judge 比较标准,并通过遗传算法证据选择器 GADMEC 在 400 个多跳问题上进行压力测试,揭示聚类感知推断改变了实证结论。

详情
AI中文摘要

检索增强生成(RAG)系统通常通过让大型语言模型(LLM)法官判断哪个答案更好来进行比较。对于多跳 RAG,这已成为一个测量问题,与建模问题同等重要:相同的分数可以反映检索质量、答案长度、词汇重叠或忽略聚类数据的统计检验。我们询问当这些选择被明确时会发生什么。 我们提出了 RAG 中 LLM-as-a-Judge 比较的最小测量标准。该标准固定了 top-100 候选池、证据预算、答案上限、生成器和提示;它还要求预先注册假设、聚类感知推断、在可行时进行精确的聚类符号翻转检验以及第二法官复制。聚类基准可能夸大进展;该领域应采用此标准。我们使用遗传算法解码器进行多跳证据组合(GADMEC),一种进化证据选择器,在计算机科学/机器学习(CS/ML)和材料科学领域的 400 个多跳问题上对其进行压力测试。该协议改变了实证故事。二项检验使所有四个语义基线比较看起来显著;聚类感知推断只留下一个 Bonferroni 显著结果。在相同预算下,BM25 优于纯语义 GADMEC,而词汇-语义混合在 CS/ML 中恢复并缩小了材料科学差距。

英文摘要

Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
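
The gap between a binomial test and cluster-aware inference can be illustrated with a small sketch. This is not the paper's protocol: it assumes pairwise judge outcomes coded as +1 (system A preferred) or -1 (system B preferred), grouped by a hypothetical cluster key (for example, questions derived from the same source document), and it runs an exact sign-flip test on per-cluster sums. All names and data below are illustrative.

```python
from itertools import product
from math import comb


def binomial_two_sided(wins: int, n: int) -> float:
    """Exact two-sided binomial p-value under p = 0.5, treating every item as independent."""
    observed = abs(wins - n / 2)
    tail = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= observed)
    return tail / 2 ** n


def cluster_sign_flip(cluster_sums: list[int]) -> float:
    """Exact two-sided sign-flip p-value: signs are flipped for whole clusters, not single items."""
    observed = abs(sum(cluster_sums))
    hits = 0
    for signs in product((1, -1), repeat=len(cluster_sums)):
        stat = abs(sum(s * c for s, c in zip(signs, cluster_sums)))
        if stat >= observed:
            hits += 1
    return hits / 2 ** len(cluster_sums)


# Hypothetical data: 4 clusters of 10 judged pairs each, outcomes +1 / -1.
clusters = [[1] * 9 + [-1], [1] * 8 + [-1] * 2, [1] * 7 + [-1] * 3, [1] * 4 + [-1] * 6]
cluster_sums = [sum(c) for c in clusters]        # one signed total per cluster
wins = sum(o == 1 for c in clusters for o in c)  # items where system A was preferred

p_binomial = binomial_two_sided(wins, 40)
p_cluster = cluster_sign_flip(cluster_sums)
print(f"binomial p = {p_binomial:.4f}, cluster sign-flip p = {p_cluster:.4f}")
```

On this toy data the item-level binomial test falls below 0.05 while the cluster-level test does not. With only four clusters the smallest attainable two-sided sign-flip p-value is 2/16 = 0.125, which is one reason an exact check is only meaningful "when feasible", that is, with enough clusters to enumerate usefully.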

2605.27788 2026-05-28 cs.LG cs.CL 版本更新

Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

知道何时求助:面向LLM工具使用的片段级信用分配

Abhijit Kumar, Zoey Wu, Mohit Suley

发表机构 * Microsoft AI, Redmond(微软AI,雷德蒙德)

AI总结 提出CARL方法,通过强化学习在模型自身轨迹上训练评论家,对每个工具使用片段独立分配信用,使模型学会区分参数知识足够与需要外部帮助的情况,在多个基准上提升准确率并减少不必要的工具调用。

详情
AI中文摘要

人类知道何时需要求助,例如 $347 \times 28$ 需要计算器而 $2+2$ 不需要。语言模型则不然。基于提示的方法可以指导模型何时调用工具,但这种脚手架并不能教会模型识别自身知识的边界。将单一结果奖励分配给整个轨迹的强化学习方法同样效果不佳:轨迹级信用无法隔离成功回合中哪个工具调用真正有帮助,也无法惩罚不必要的调用。我们提出 CARL(Competence-Aware Reinforcement Learning),该方法在模型自身的 rollout 上训练评论家,以学习参数知识何时足够以及何时需要外部帮助。通过在每个 rollout 的自然工具使用边界(例如代码围栏分隔符和上下文块转换)处进行分解,CARL 从单一二元结果中为每个片段分配独立信用,无需外部评判或步骤级标注。因此,错误的工具调用、不正确的提取以及不必要的调用各自获得适当符号的优势。训练好的评论家捕捉了模型的领域能力:在7B规模下,它以AUC 0.93区分参数可解问题与工具依赖问题。在涵盖算术、多跳事实问答和金融表格数值推理的五个基准上,CARL在7B和3B规模下分别比最佳RL基线提高了6.7和9.7个精确匹配准确率点,其中在Musique上增益最大(7B +8.3 EM,3B +9.0 EM)。模型在参数可回答的问题上减少了53%的工具调用,同时在这些问题上仍保持约10个EM点的更高准确率。增益在小规模上最大:3B的改进是7B改进的1.4倍,这表明知道何时求助对参数记忆较小的模型有更大益处。

英文摘要

Humans know when to reach for help: $347 \times 28$ warrants a calculator, while $2+2$ does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit cannot isolate which tool call in a successful episode actually helped, nor penalize unnecessary calls. We propose CARL (Competence-Aware Reinforcement Learning), which trains a critic on the model's own rollouts to learn where parametric knowledge suffices and where it needs external help. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model's domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on Musique. The model issues 53% fewer tool calls on parametrically answerable questions while remaining about 10 EM points more accurate on them. Gains are largest at small scale: the 3B improvement is $1.4\times$ the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.

2605.27787 2026-05-28 cs.MA cs.CL 版本更新

Long Live the Librarian! A Persistent Search Sub-Agent for Energy-Efficient Multi-Agent Software Engineering Systems

图书馆员万岁!面向节能多智能体软件工程系统的持久搜索子智能体

Seunghyuk Cho, Sunghyun Choi, Jaeseung Heo, Youngbin Choi, Saemi Moon, MoonJeong Park, Dongwoo Kim

发表机构 * Graduate School of Artificial Intelligence, POSTECH(POSTECH人工智能研究生院) Department of Computer Science and Engineering, POSTECH(POSTECH计算机科学与工程系)

AI总结 针对多智能体系统中输出令牌冗余导致的高能耗问题,提出持久搜索子智能体Librarian,通过跟踪仓库搜索历史并抑制重复探索,减少输出令牌量,在SWE-Bench Verified上实现高达25%的GPU能耗降低且不损失任务性能。

Comments 19 pages, 4 figures, 12 tables

详情
AI中文摘要

多智能体系统(MAS)显著推进了自主软件工程(SWE),但其日益增长的推理能量需求引发了可持续性问题。在本文中,我们证明这种成本集中在一个被忽视的来源上:跨智能体生成的冗余输出令牌。两个实证发现支持了这一主张。首先,我们对MAS的每令牌能量归因揭示了一个显著的不对称性:一个输出令牌消耗的能量是输入或缓存令牌的30到1000倍。其次,MAS增加了每回合输出量,因为智能体反复重新探索重叠的仓库区域。为了解决这一低效问题,我们提出了Librarian,一个持久搜索子智能体,它跟踪仓库搜索历史并抑制跨智能体的冗余探索动作。通过返回文件区域的简短引用而不是完整的文件摘录,Librarian进一步减少了输出令牌量。在SWE-Bench Verified上,Librarian将现有多智能体SWE系统的每回合GPU能耗降低了高达25%,同时保持了任务性能。

英文摘要

Multi-agent systems (MAS) have substantially advanced autonomous software engineering (SWE), but their growing inference energy demands raise sustainability concerns. In this paper, we demonstrate that this cost is concentrated in an overlooked source: redundant output tokens generated across agents. Two empirical findings ground this claim. First, our per-token energy attribution for MAS reveals a sharp asymmetry: an output token consumes 30 to 1,000 times more energy than an input or cached token. Second, MAS inflate per-episode output because agents repeatedly re-explore overlapping repository regions. To address this inefficiency, we propose Librarian, a persistent search sub-agent that tracks repository-search history and suppresses redundant exploration actions across agents. By returning short references to file regions instead of full file excerpts, Librarian further reduces output-token volume. On SWE-Bench Verified, Librarian reduces per-episode GPU energy consumption of existing multi-agent SWE systems by up to 25% while preserving task performance.

2605.27773 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

模型是否知道它们为何改变主意?知识冲突下思维链的可解释性与忠实性

Pruthvinath Jeripity Venkata

发表机构 * Independent Researcher(独立研究员)

AI总结 通过引入内省忠实性,研究在知识冲突下语言模型的思维链推理是否忠实反映其决策机制,发现CoT高度稳定但置信度携带微弱真实信号。

Comments 12 pages, 8 tables, 3 appendices

详情
AI中文摘要

当语言模型看到与其训练知识相矛盾的文档时,它必须做出选择:遵循文档还是相信自己。先前的工作证明这种选择取决于事实的知名程度。我们问:模型的思维链(CoT)推理是否忠实地报告了这一机制?我们引入了内省忠实性,并在200个问题、8个模型和4种提示条件下进行了测试。我们发现CoT推理在相反决策下高度稳定:翻转对保留了96%的相同答案相似度(d=0.34;通过ROUGE-L确认,d=0.45)。然而,自我评定的置信度携带微弱的真实信号:对于实体知名度无信息量的冷门事实,置信度仍能预测决策(p<0.001),并追踪项目级知识(r=0.134)。GPT-4o是唯一具有统计上可靠的推理-决策耦合的模型。Claude Sonnet 4.6显示出最宽的置信度范围(SD=1.39),但汇总相关性接近零,因为置信度-决策关系在不同条件下反转;温度消融实验证实这是模型特有的。内部思考令牌比面向用户的CoT显示出更大的决策敏感性(p=0.033)。CoT分解为决策不变的知识展示(约96%)和一层薄弱的置信度层,后者带有微弱但真实的信号。对于监控:读取置信度,而不是论证。

英文摘要

When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model's chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45). Yet self-rated confidence carries a faint genuine signal: for obscure facts where entity fame is uninformative, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model with statistically reliable reasoning-decision coupling. Claude Sonnet 4.6 shows the widest confidence range (SD=1.39) but near-zero pooled correlation because the confidence-decision relationship reverses between conditions; a temperature ablation confirms this is model-specific. Internal thinking tokens show greater decision-sensitivity than user-facing CoT (p=0.033). CoT decomposes into a decision-invariant knowledge display (~96%) and a thin confidence layer with weak but real signal. For monitoring: read confidence, not the argument.

2605.27767 2026-05-28 cs.CL cs.AI cs.LG 版本更新

UniMaia: Steering Chess Policies with Language for Human-like Play

UniMaia:用语言引导国际象棋策略以实现类人玩法

Sherman Siu, Lesley Istead

发表机构 * University of Waterloo(滑铁卢大学) Carleton University(卡尔顿大学)

AI总结 提出UniMaia框架,通过参数高效文本编码器和ControlNet风格调节机制,在冻结的Lc0国际象棋策略网络上实现提示条件策略调制,实现语义控制(如开局选择和玩家强度)并保持预训练策略表征,同时构建大规模元数据增强的Lichess数据集和半自动提示生成管道,在多个基准上取得最优或竞争性结果。

详情
AI中文摘要

大型语言模型的最新进展使得自然语言能够作为控制复杂系统的灵活接口,但通常以大规模多模态训练或弱化领域特定归纳偏差为代价。在结构化决策领域(如国际象棋)中,专门的策略网络表现强劲但缺乏语义可控性,而提示条件语言模型更灵活但通常领域基础较弱。我们提出UniMaia,一个用于提示条件策略调制的框架,它使用参数高效文本编码器和ControlNet风格的调节机制来适配基于Lc0的冻结国际象棋策略网络。UniMaia能够实现对游戏玩法的语义控制,包括开局选择和玩家强度,同时保留预训练的策略表征。我们进一步引入UniMaia-Aux,它结合了辅助时间条件化和行为预测目标。为了支持这项工作,我们构建了一个大规模元数据增强的Lichess数据集,开发了一个半自动提示生成管道,并引入了涵盖提示条件和元数据条件设置的基准。UniMaia在多个提示条件基准上实现了最先进的预期准确率,在通用指令遵循任务上达到了竞争性的最佳着法准确率,同时在人类着法预测基准上与专门的元数据条件方法保持竞争力。UniMaia-Aux进一步提高了多个评估设置下的预期准确率和行为建模,在最佳着法准确率上略有折衷。总体而言,我们的结果表明,无需端到端多模态训练即可实现领域特定策略网络的提示条件控制,同时突出了可控性与预测性能之间的权衡。

英文摘要

Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose UniMaia, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce UniMaia-Aux, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.

2605.27750 2026-05-28 cs.CL cs.AI cs.CV cs.DL 版本更新

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

阅读还是猜测?古希腊版本OCR中视觉语言模型的视觉定位失败

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

发表机构 * Inria(法国国家信息与自动化研究所)

AI总结 通过对比开放权重视觉语言模型与传统OCR基线在低资源古希腊批判版本上的表现,发现VLM即使错误也能生成流畅文本,表明其依赖语言先验,并引入扰动和标记级定位度量分析视觉证据。

详情
AI中文摘要

最近的研究表明,用于光学字符识别(OCR)的视觉语言模型(VLM)能够生成看似合理但缺乏视觉支持的文本,暗示其依赖语言先验。通过将开放权重VLM与传统OCR基线在低资源古希腊批判版本上进行对比,我们展示了VLM的错误即使在错误时也往往保持流畅,产生合理的希腊语替换,而传统引擎则产生局部识别噪声。为了分析解码过程中的视觉证据,我们引入了受控图像扰动和基于条件与无图像解码分布的标记级定位度量。在字符级扰动下,VLM与扰动的真实文本严重偏离,而传统OCR相对忠实;然而,标记级分析表明先验依赖是模型特定的:在OCR专业模型中,流畅的词汇错误几乎不依赖图像而产生,而通用VLM即使在错误时也仍然依赖于视觉输入。解码时干预未能可靠地恢复定位,而OCR后语言模型校正仅通过生成后修复文本改善了几个系统。我们的结果将先前关于OCR语言先验依赖的证据扩展到低资源历史文档和更广泛的模型集,表明流畅输出不一定具有视觉基础,并推动了超越总体准确性的可解释性驱动评估。

英文摘要

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.

2605.27741 2026-05-28 cs.CL 版本更新

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

逃离语言先验:通过模态感知策略优化缓解音频推理中的后期模态崩溃

Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Tencent Hunyuan(腾讯混元)

AI总结 针对多模态大语言模型在强化学习后训练中因统一策略梯度忽略模态依赖性而导致的后期模态崩溃问题,提出模态感知策略优化(MAPO)框架,通过模态相关性掩码和辅助注意力损失分支动态聚焦梯度并维持跨模态推理,在复杂音频推理基准上取得新最优结果。

详情
AI中文摘要

音频和全模态大语言模型展现出令人印象深刻的跨模态推理能力。然而,将标准的强化学习后训练算法应用于这些模型时,暴露了一个关键的结构性脆弱性:像GRPO这样的方法对所有token施加统一的策略梯度,忽略了它们对非文本源模态的不平等依赖。这加剧了在扩展思维链生成过程中的后期模态崩溃,模型逐渐放弃主要源信号,转而依赖压缩的文本先验,导致自信但无根据的幻觉。为了解决这个问题,我们引入了模态感知策略优化(MAPO),一种新颖的双分支强化学习框架。首先,MAPO使用模态相关性掩码动态地将策略梯度集中在模态关键token上,该掩码源自音频消融参考与多模态策略之间的跨模态微分熵。其次,它集成一个辅助注意力损失分支,对模型内部的注意力分布施加有针对性的、时间尺度的惩罚。这确保模型在推理轨迹深处主动维持跨模态基础。在复杂音频推理基准上的评估表明,MAPO显著提高了长时推理保真度和多模态指令遵循能力,在开放权重模型中实现了极具竞争力的性能,并在几个关键基准上创造了新的最先进结果。通过严格依赖原生统计信号而非特定领域的归纳偏置,MAPO为缓解跨多种多模态系统的认知崩溃提供了一个有前景的基础。

英文摘要

Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.

2605.27740 2026-05-28 cs.CL 版本更新

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

UNIQUE: 通用Top-k稀疏注意力,用于免训练推理和稀疏感知训练

Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li

发表机构 * Microsoft(微软)

AI总结 提出UNIQUE框架,通过基于键均值和标准差的页面重要性评分和软掩码稀疏感知训练,实现LLM长上下文推理中KV缓存的高效稀疏注意力,在保持任务性能的同时显著加速。

详情
AI中文摘要

大型语言模型(LLM)中的长上下文推理受到自注意力键值(KV)缓存线性增长的瓶颈限制。Top-k稀疏注意力通过仅加载一小部分KV缓存来缓解这一问题,但准确且廉价地估计缓存重要性(既用于免训练使用,也用于稀疏感知训练)仍然具有挑战性。本文提出UNIQUE,一个通用的top-k稀疏注意力框架,同时满足这两个需求,并在LLM模态中保持一致的有效性。UNIQUE以KV页面为粒度,通过一个简单而准确的评分来估计每页的重要性,该评分结合了页面键的均值作为代表性向量及其标准差作为偏移项。为了进一步缩小训练-推理差距,本文引入了一种软掩码稀疏感知训练方案,该方案使用top-k分数边界作为每个查询的阈值,并在其周围使用sigmoid软掩码,既不需要辅助损失,也不需要架构更改。在文本和语音LLM上的实验表明,UNIQUE在LongBench Pro等长上下文基准测试和长形式语音识别上保持了任务性能,同时与FlashInfer密集注意力相比,实现了高达11.4倍的注意力内核加速,并且与基于vLLM的密集模型相比,实现了至少5.3倍的端到端解码加速。

英文摘要

Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately and cheaply estimating cache importance, for both training-free use and sparsity-aware training, remains challenging. This paper proposes UNIQUE, a universal top-k sparse attention framework that addresses both requirements and stays consistently effective across LLM modalities. UNIQUE operates at the granularity of KV pages and estimates per-page importance with a simple yet accurate score combining the mean of the page's keys as a representative vector with their standard deviation as an offset term. To further close the train-inference gap, this paper introduces a soft-mask sparsity-aware training scheme that uses the top-k score boundary as a per-query threshold and a sigmoid soft mask around it, requiring neither auxiliary losses nor architectural changes. Experiments on text and speech LLMs show that UNIQUE preserves task performance on long-context benchmarks such as LongBench Pro and on long-form speech recognition, while delivering up to 11.4x attention-kernel speedup over FlashInfer dense attention and at least 5.3x end-to-end decoding speedup over a vLLM-based dense model.
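
A rough sketch of the two ingredients described above: a per-page score built from the mean and standard deviation of a page's keys, and a sigmoid soft mask centred on the top-k score boundary. The abstract does not give the exact formulas, so the score form `q . mean + |q| . std` and the `temperature` parameter are assumptions made for illustration, not the paper's definition.

```python
import math
from statistics import fmean, pstdev


def page_score(query, page_keys):
    """Score one KV page as q . mean(keys) plus an |q| . std(keys) offset (assumed form)."""
    dims = range(len(query))
    mean = [fmean([k[d] for k in page_keys]) for d in dims]
    std = [pstdev([k[d] for k in page_keys]) for d in dims]
    dot = sum(q * m for q, m in zip(query, mean))
    offset = sum(abs(q) * s for q, s in zip(query, std))
    return dot + offset


def soft_topk_mask(scores, k, temperature=1.0):
    """Sigmoid soft mask around the k-th largest score, used as a per-query threshold."""
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]


# Toy example: one 2-d query and three pages of two keys each.
query = [1.0, -1.0]
pages = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 1.0], [0.0, 3.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
scores = [page_score(query, p) for p in pages]
mask = soft_topk_mask(scores, k=2)
print(scores, [round(m, 3) for m in mask])
```

At inference time a hard top-k selection would keep only the k highest-scoring pages; the soft mask is the differentiable stand-in that a sparsity-aware training pass could use, with the page at the boundary receiving a weight of 0.5.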

2605.27721 2026-05-28 cs.CL cs.AI 版本更新

UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

UserHarness:利用用户心智增强智能体心理理论

Cheng Qian, Jiayu Liu, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出UserHarness框架,通过显式重建用户心智状态(信念、意图等)进行心理理论推理,在五个基准上达到95.94%的宏准确率,相对提升超15%。

Comments 19 Pages, 4 Figures, 2 Tables

详情
AI中文摘要

理解用户的信念和意图对于构建有效的智能体助手至关重要。这种能力通常通过心理理论(ToM)任务来评估,成功需要从用户的角度进行推理。然而,许多现有方法通过复杂流水线处理ToM,间接建模行为,而没有显式重建用户的心智状态。这忽略了问题的核心结构:用户基于其信念行动,信念通过观察环境更新;信念和意图共同决定行动,行动又改变环境;社会推理通常需要关于他人信念或意图的嵌套信念。我们提出UserHarness,一个简单的框架,将ToM推理重新定义为显式的用户心智重建。UserHarness分解用户的心智状态、其与外部环境的关系以及由此产生的行动,使智能体能够跟踪用户的观察、信念、意图和行为。在五个基准上,UserHarness达到高达95.94%的宏准确率,相比现有推理方法相对提升超过15%,相比最强的纯提示框架相对提升约20%。这些结果表明,稳健的用户理解需要从用户心智根源进行推理,将用户驾驭作为未来更具适应性的助手的有前景的基础。

英文摘要

Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user's perspective. However, many existing approaches address ToM with complex pipelines that model behavior indirectly, without explicitly reconstructing the user's mental state. This misses the core structure of the problem: users act based on their beliefs, which are updated through observations of the environment; beliefs and intentions jointly determine actions, which in turn change the environment; and social reasoning often requires nested beliefs about what others believe or intend. We propose UserHarness, a simple framework that reframes ToM reasoning as explicit user-mind reconstruction. UserHarness decomposes the user's mental state, its relation to the external environment, and the actions that follow from it, enabling agents to track what the user observes, believes, intends, and does. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, improving over existing inference methods by more than 15% relative and over the strongest prompt-only harness by about 20% relative. These results suggest that robust user understanding requires reasoning from the roots of the user's mind, positioning user harnessing as a promising foundation for more adaptive future assistants.

2605.27715 2026-05-28 cs.CL 版本更新

Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

超越输入理解:使用有向无环迹图诊断多语言数学推理

Jiaqiao Zhang, Zhoujun Li, Raoyuan Zhao, Jian Lan, Thomas Seidl, Michael A. Hedderich, Hinrich Schütze, Yihong Liu

发表机构 * Southwest University(西南大学) LMU Munich(慕尼黑大学) MCML

AI总结 本文提出有向无环迹图(DATG)框架,通过将推理迹映射到与语言无关的数学锚点和依赖关系,诊断多语言数学推理中的语言影响,并设计Loop-Retry和Formula-Retry两种测试时控制方法改善低资源语言性能。

Comments preprint

详情
AI中文摘要

大型推理模型(LRMs)在英语中表现出强大的数学推理能力,但在许多低资源和中资源语言中仍然不太可靠。这种差距通常被解释为无法理解非英语的问题陈述。我们表明这种观点是不完整的:即使问题以英语给出,控制模型的推理语言也会显著降低准确性,这表明语言也影响推理执行本身。为了研究这种效应,我们引入了DATG,一个有向无环迹图框架,将推理迹映射到与语言无关的数学锚点和依赖关系。这使我们能够将目标语言迹与参考DAG对齐,并测量它们是否覆盖所需的数学节点、尊重依赖边以及避免有害的数学动作。在Qwen3系列上跨12种语言的实验表明,非英语推理通常遭受锚点覆盖减少和依赖保真度降低,尤其是在低资源语言中。受此诊断启发,我们提出了Loop-Retry和Formula-Retry,两种针对DATG暴露的失败模式的简单测试时控制方法,并表明它们一致地改善了低资源语言中的目标语言推理性能。

英文摘要

Large reasoning models (LRMs) achieve strong mathematical reasoning performance in English, but remain much less reliable in many low- and medium-resource languages. This gap is often explained as a failure to understand non-English problem statements. We show that this view is incomplete: even when the problem is given in English, controlling the model's reasoning language can substantially reduce accuracy, suggesting that language also affects reasoning execution itself. To study this effect, we introduce DATG, a Directed Acyclic Trace Graph framework that maps reasoning traces to language-independent mathematical anchors and dependencies. This allows us to align target-language traces with reference DAGs and measure whether they cover required mathematical nodes, respect dependency edges, and avoid harmful mathematical actions. Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages. Motivated by this diagnosis, we propose Loop-Retry and Formula-Retry, two simple test-time controls targeting DATG-exposed failure modes, and show that they consistently improve target-language reasoning performance in low-resource languages.
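
The two diagnostics named above, anchor coverage and dependency fidelity, can be sketched as simple set and order checks against a reference DAG. The definitions below are assumptions made for illustration (the abstract does not specify how DATG computes them), and the anchor names are hypothetical.

```python
def anchor_coverage(trace_anchors, reference_nodes):
    """Fraction of required reference nodes that appear anywhere in the trace."""
    seen = set(trace_anchors)
    return sum(node in seen for node in reference_nodes) / len(reference_nodes)


def dependency_fidelity(trace_anchors, reference_edges):
    """Fraction of reference edges (u -> v) where u first appears before v in the trace."""
    first_seen = {}
    for position, anchor in enumerate(trace_anchors):
        first_seen.setdefault(anchor, position)
    respected = sum(
        u in first_seen and v in first_seen and first_seen[u] < first_seen[v]
        for u, v in reference_edges
    )
    return respected / len(reference_edges)


# Hypothetical reference DAG for a small word problem.
reference_nodes = ["unit_price", "quantity", "subtotal", "tax", "total"]
reference_edges = [
    ("unit_price", "subtotal"),
    ("quantity", "subtotal"),
    ("subtotal", "tax"),
    ("subtotal", "total"),
    ("tax", "total"),
]

# A target-language trace that skips the tax step and states the total too early.
trace = ["unit_price", "quantity", "total", "subtotal"]

coverage = anchor_coverage(trace, reference_nodes)
fidelity = dependency_fidelity(trace, reference_edges)
print(coverage, fidelity)
```

In this toy trace a missing anchor lowers coverage, while an anchor stated before its prerequisites lowers fidelity, so the two numbers separate "skipped a step" from "did the steps out of order".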

2605.27709 2026-05-28 cs.CL 版本更新

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

ReverseMath: 面向可扩展和可验证数学问题生成的答案反转方法

Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich

发表机构 * Center for Information and Language Processing(信息与语言处理中心) LMU Munich(慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心 (MCML)) Saarland University(萨尔兰大学)

AI总结 提出ReverseMath方法,通过反转原始问题的输入输出关系自动生成新数学问题,用于评估和训练,揭示记忆行为并提升推理性能。

详情
AI中文摘要

数学推理基准对于评估大型语言模型(LLM)至关重要,但许多基准是静态的,并通过公开评估和训练管道反复暴露,使得难以区分真正的推理与记忆。同时,手动构建具有可靠答案的新数学问题仍然成本高昂。我们引入ReverseMath,一种通过答案反转生成新数学问题的可扩展方法。给定一个问题及其答案,ReverseMath掩码原始问题中的一个数值,将原始答案视为已知条件,并重写问题,使得掩码值成为新答案。生成的问题反转了原始输入输出关系,使其答案通过构造已知。我们研究了ReverseMath在评估和训练中的应用。对于评估,配对的原始/反转问题揭示了显著的行为变化:模型有时在反转问题上失败,甚至错误地输出原始答案,暗示了类似记忆的行为。对于训练,ReverseMath提供自动标注的反转问题作为强化学习(RL)的数据增强。实验表明,包含ReverseMath生成的数据提高了多个基准上的数学推理性能,证明了其作为分析工具和可验证训练数据的可扩展来源的价值。

英文摘要

Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
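
Answer inversion can be made concrete with a toy example. The sketch below uses a hand-written template in place of the rewriting step, which is an assumption for illustration: the function names, the problem text, and the consistency check are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    text: str
    answer: int


def original_problem(boxes: int, pens_per_box: int) -> Problem:
    """Forward problem: both quantities are given, the total is the answer."""
    text = f"A shop has {boxes} boxes with {pens_per_box} pens each. How many pens are there in total?"
    return Problem(text, boxes * pens_per_box)


def reversed_problem(boxes: int, pens_per_box: int) -> Problem:
    """Mask pens_per_box, state the original answer as a condition, and ask for the masked value."""
    total = original_problem(boxes, pens_per_box).answer
    text = (
        f"A shop has {boxes} boxes with the same number of pens in each, "
        f"{total} pens in total. How many pens are in each box?"
    )
    return Problem(text, pens_per_box)  # the new answer is known by construction


def is_consistent(boxes: int, rev: Problem, orig: Problem) -> bool:
    """Plugging the reversed answer back into the forward problem must reproduce its answer."""
    return original_problem(boxes, rev.answer).answer == orig.answer


orig = original_problem(6, 12)
rev = reversed_problem(6, 12)
print(orig.text, "->", orig.answer)
print(rev.text, "->", rev.answer)
```

A model that outputs the forward answer on the reversed problem is giving the answer to a question that was not asked, which is the memorization-like signal the paired evaluation looks for.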

2605.27706 2026-05-28 cs.CL cs.IR 版本更新

Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

基于格点链式自适应重配置以减少幻觉

Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

发表机构 * Department of Mechanical and Aerospace University of California Irvine(加州大学尔湾分校机械与航空航天系) Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室)

AI总结 提出CAROL框架,通过定义语义不确定性度量并在文本序列格点上构建串子模目标,将幻觉缓解转化为马尔可夫链接受-拒绝过程,实现测试时幻觉减少。

详情
AI中文摘要

我们介绍了CAROL(基于格点的链式自适应重配置),一个用于大型语言模型测试时减少幻觉的概率框架。CAROL不依赖于词元级别的不确定性,而是基于生成响应与可信上下文之间的一致性定义了一种语义不确定性度量,在文本序列格点上诱导出一个串子模目标。这种表述使得幻觉缓解可以被建模为一个具有可证明收敛性和接近最优性保证的马尔可夫链接受-拒绝过程,允许模型迭代地优化输出以实现语义一致性。通过在意义层面操作,CAROL将幻觉检测和缓解统一在一个框架内。在问答和多智能体推理基准上的实证结果表明,与基于似然和检索增强的基线相比,CAROL显著减少了幻觉,提高了可靠性和可解释性,同时保持了具有竞争力的计算效率。

英文摘要

We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

2605.27690 2026-05-28 cs.CL cs.LG 版本更新

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

TRACES: 通过轨迹状态建模实现多轮LLM智能体的主动安全审计

Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang

发表机构 * Brown University(布朗大学) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Rutgers University(罗格斯大学) Texas A&M University(德克萨斯农工大学)

AI总结 提出TRACES方法,通过观察LLM的隐藏表示学习前缀级轨迹风险状态,实现多轮工具使用环境下的主动安全审计,提升全轨迹安全预测和主动风险判别能力。

详情
AI中文摘要

LLM智能体越来越多地通过多轮工具使用和环境交互来运作,其中安全风险往往在最终结果显现之前的中间步骤中就已经出现。因此,反应式审计是不够的:事后诊断常常在风险正在展开时错过标记它们的机会。我们提出TRACES,一种基于表示的主动审计器,它从观察者LLM的隐藏表示中学习前缀级轨迹风险状态。TRACES从步骤表示中诱导潜在机制特征,并建模其时间演化,以估计部分轨迹是否正在向不安全行为漂移。为了规避步骤级风险标注的成本和歧义,TRACES在弱轨迹级监督下训练,同时仍能产生密集的前缀级风险估计。在多个智能体安全基准测试中,TRACES改进了全轨迹安全预测和主动风险判别。我们的分析进一步表明,这些风险状态可以帮助训练更安全的智能体,凸显了主动审计在长程智能体安全中的更广泛潜力。

英文摘要

LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.

2605.27668 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

将LLM与人类不确定性对齐:用于LLM预测的Beta-Bernoulli校准器

Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

发表机构 * Agentic Learning AI Lab(代理学习AI实验室) New York University(纽约大学) The University of Chicago(芝加哥大学) Chronologies AI

AI总结 提出Beta-Bernoulli校准器(BBC),通过结合二元结果和人类预测信号,将初始点估计转换为事件似然分布,实现校准和不确定性量化。

详情
AI中文摘要

概率预测估计不确定未来事件的可能性。为了改进LLM预测,现有方法通常从二元结果中学习以输出语言化预测。然而,尽管聚合的人类预测在群体概率估计和预测者之间的一致程度中都包含丰富信息,如何利用这些信号仍未充分探索。为了解决这个问题,我们提出了Beta-Bernoulli校准器(BBC),它将来自任何模型的初始点估计转换为事件似然分布,使用来自二元结果和人类预测的监督。BBC对事件似然$p \sim \text{Beta}(\alpha, \beta)$和结果$y \sim \text{Bernoulli}(p)$建模,均值作为校准的点预测,方差作为认知不确定性。我们的结果表明,BBC通常比传统的事后校准方法和专门为预测微调的模型提供更好校准和更准确的预测,同时保持轻量级并具有良好的泛化能力。我们还表明,BBC捕获的认知不确定性是比语言化置信度更可靠的预测误差指标。

英文摘要

Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
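The mean and variance mentioned above follow from the standard closed forms for a Beta distribution. The sketch below only illustrates those two formulas; the class name `BetaForecast` and the example parameter values are assumptions made here, and nothing about how BBC actually predicts the two Beta parameters is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaForecast:
    """Beta(alpha, beta) distribution over an event likelihood p (hypothetical helper)."""
    alpha: float
    beta: float

    def __post_init__(self) -> None:
        if self.alpha <= 0 or self.beta <= 0:
            raise ValueError("alpha and beta must be positive")

    @property
    def mean(self) -> float:
        # E[p] = alpha / (alpha + beta): used as the calibrated point forecast.
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Var[p] = alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)).
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total ** 2 * (total + 1.0))


# Two illustrative forecasts with the same mean but different concentration.
confident = BetaForecast(alpha=80.0, beta=20.0)
uncertain = BetaForecast(alpha=4.0, beta=1.0)
```

Both forecasts share the same point estimate, but the less concentrated one has the larger variance, which is the quantity the abstract reads as epistemic uncertainty.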

2605.27654 2026-05-28 cs.CL cs.AI cs.CY 版本更新

Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

英译印地语中的文化保真度:性别可恢复性的保持-流畅性前沿

Samyak Savi, Chavi Gupta, Shreyas Gantayet, Tanay Sodha, Dhruv Kumar

AI总结 研究英译印地语中性别信息的保持问题,提出两种推理时干预方法(SAR和PAR),在保持性别可恢复性与流畅性之间取得平衡。

Comments 10 pages, 2 figures, 9 tables

详情
AI中文摘要

生成式翻译系统是文化技术,因为它们决定如何在特定文化的语法系统中呈现具有社会意义的线索。我们研究成功文化翻译的一个具体概念:当英语源文本明确编码性别时,英译印地语应保持该线索的可恢复性,除非源文本本身存在歧义。我们在涵盖十二个类别的37,345个实例基准上评估了这一标准,并显示五个系统经常通过作格和敬语结构消除性别。然后,我们引入了两种机制感知的推理时干预。第一种是源感知重排序器(SAR),倾向于避免性别中立句法的候选。第二种是现象感知重排序器(PAR),即使在作格句法存在的情况下,也通过目标词汇标记保持性别。在GPT-4o-mini和Sarvam上,PAR将目标子集准确率分别从11.07%提高到54.47%,从15.99%提高到49.66%。人工评估显示,PAR将性别保持率从10.3%提高到81.3%,但平均流畅度从4.36降至3.37。这些发现将两种干预置于保持和流畅性的前沿,而不是支持单一的解决方案,并展示了文化定位的生成如何在保真度、流畅性和风格自然性之间需要明确的权衡。

英文摘要

Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37. These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.

2605.27649 2026-05-28 cs.CL cs.LG 版本更新

Disentangling Language Roles in Multilingual LLM Task Execution

多语言大模型任务执行中的语言角色解耦

Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He

发表机构 * Marquette(马凯特大学) Cornell(康奈尔大学) UC San Diego(加州大学圣地亚哥分校) UPenn(宾夕法尼亚大学) UT Austin(德克萨斯大学奥斯汀分校) Maryland(马里兰大学) Michigan(密歇根大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) Melbourne(墨尔本大学) Stanford(斯坦福大学)

AI总结 提出MTM-Bench基准,通过完全交叉设计解耦指令、内容和响应三种语言角色,评估多语言LLM的任务执行能力,发现响应语言角色是性能下降的主要因素。

详情
AI中文摘要

多语言大模型在指令、源内容和所需响应语言不一致时被越来越多地使用。现有基准扩展了多语言指令跟随评估,但很少在完全交叉设计中隔离这三种角色。我们引入了MTM-Bench,一个用于语言条件任务执行的控制基准,其中每个实例由三元组 \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\) 定义。在英语、西班牙语和中文中,MTM-Bench枚举了所有27个三元组,每个模型包含2,430个实例,涵盖语义反转、最终状态提取和带更新实现的语言纯度。我们使用分解指标评估了20个前沿和开源权重LLM,包括语义正确性、目标语言遵循度、约束满足度、污染比率和联合成功率,并通过针对性的人工审计验证评分。完全交叉设计揭示了性能下降是由语言在任务结构中扮演的角色组织的,而不仅仅是语言不匹配的数量。响应语言角色是变化的主要轴,单个响应槽不匹配导致了大部分性能下降。仅响应不匹配与完全不匹配的比较表明,不匹配数量不是困难的单调预测因子,模型级别的排序在不同系统间变化。任务族通过不同的通道失败,表明语义正确性本身并不能捕捉可靠的多语言任务执行。

英文摘要

Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\). Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
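The fully crossed design can be made concrete by enumerating the triplets. The sketch below is illustrative only: the language codes and the two predicate definitions (response-only mismatch as "instruction and content agree, response differs", full mismatch as "all three languages differ") are assumptions made here and may not match the benchmark's exact definitions.

```python
from itertools import product

LANGUAGES = ("en", "es", "zh")  # language codes chosen for this sketch

# Each triplet is (instruction language, content language, response language).
triplets = list(product(LANGUAGES, repeat=3))


def is_response_only_mismatch(triplet):
    """Instruction and content share a language; only the response differs."""
    instr, content, resp = triplet
    return instr == content and resp != instr


def is_full_mismatch(triplet):
    """All three roles use different languages."""
    return len(set(triplet)) == 3


response_only = [t for t in triplets if is_response_only_mismatch(t)]
full_mismatch = [t for t in triplets if is_full_mismatch(t)]

# If the 2,430 instances were spread evenly over triplets (an assumption),
# each triplet would hold this many instances.
instances_per_triplet = 2430 // len(triplets)
```

Under these definitions, three languages in three roles give 27 triplets, of which 6 are response-only mismatches and 6 are full mismatches.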

2605.27642 2026-05-28 cs.CL cs.LG 版本更新

Learning to Translate from Soft to Hard LLM Prompts

学习从软提示到硬提示的翻译

Pitipat Kongsomjit, Suryansh Goyal, Jacob Whitehill

发表机构 * Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 本文通过训练一个专用的软提示到自然语言翻译模型,提高了翻译质量,并展示了软提示可以转化为可移植的文本提示,在大型闭源模型上超越原软提示甚至少样本学习。

Comments 8 Pages, 11 tables, 4 Figures

详情
AI中文摘要

软提示调优是一种参数高效的方法,用于使大型语言模型适应特定任务,但缺乏可解释性。基于最近关于解释软提示的工作(Ramati et al., 2024),我们探索了如何训练一个专用的软提示到自然语言翻译模型,以获得更高的翻译质量。特别是在多个“数据集的数据集”(DoDs)上的定量和定性比较中,我们证明了我们的翻译器能够生成流畅、准确的表述,优于现有的无训练方法如InSPEcT。除了提高可解释性外,我们的工作还暗示了一个有前景的下游应用:在小规模开源模型上优化的软提示可以转化为可移植的文本提示,当部署在更大的闭源API模型上时,其性能超过了原始软提示,在某些情况下甚至超过了少样本学习。

英文摘要

Soft prompt tuning is a parameter-efficient method for adapting LLMs to specific tasks, but suffers from a lack of interpretability. Building on recent work on interpreting soft prompts (Ramati et al., 2024), we explore how training a dedicated soft-prompt-to-natural-language translation model can yield higher translation quality. In particular, in both quantitative and qualitative comparisons on multiple Datasets of Datasets (DoDs), we demonstrate that our translator produces fluent, accurate verbalizations that outperform existing training-free methods like InSPEcT. In addition to advancing interpretability, our work suggests a promising downstream application: soft prompts optimized on small, open-source models can be translated into portable text prompts that, when deployed on larger closed-API models, exceed the performance of the original soft prompt and, in some cases, even few-shot learning.

2605.27636 2026-05-28 cs.CL 版本更新

Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering

Simorgh at SemEval-2026 task 7: 面向低资源文化推理的多语言问答中的区域感知混合检索

Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

发表机构 * University of Tabriz(塔布里兹大学)

AI总结 提出区域感知混合检索方法,结合BM25和稠密语义相似度与区域加权启发式,以提升多语言文化问答的跨语言稳定性。

Comments 6 pages, 3 figures, accepted to the Everyday Knowledge Across Diverse Languages and Cultures shared task at SemEval2026

详情
AI中文摘要

尽管大型语言模型(LLMs)在通用领域的推理任务中表现出色,但在数字和文本数据有限的语种中,面对文化相关知识时可能遇到挑战。本文利用BLEnD基准研究文化相关的多项选择问答,该基准包含30种语言的多语料库,涵盖饮食、体育、家庭等社会文化领域。我们提出一种区域感知混合检索方法,结合BM25词汇匹配和稠密语义相似度与区域加权启发式,以提高答案的相关性。检索到的文档用于构建结构化提示,输入Qwen3-14B量化模型,并采用基于logit的确定性答案选择。实验结果表明,与纯参数推理相比,混合检索方法在文化问答中提升了跨语言稳定性。然而,训练数据量不同的语言之间仍存在显著性能差距,这表明检索增强方法并未完全克服训练数据不平衡问题。

英文摘要

Although Large Language Models (LLMs) perform well on general-domain reasoning tasks, they may struggle with culturally grounded knowledge in languages with limited digital and textual data. In this paper, we investigate culturally grounded multiple-choice question answering with the BLEnD benchmark, which consists of a multilingual corpus of 30 languages and covers various socio-cultural domains, such as cuisine, sports, and family. We propose a region-aware hybrid retrieval approach that combines BM25 lexical matching and dense semantic similarity with regional weighting heuristics to improve the relevance of the answer. The retrieved documents are used to construct a structured prompt for the Qwen3-14B quantized model with logit-based deterministic answer selection. The experimental results show that the hybrid retrieval approach improves cross-lingual stability over pure parametric inference for culturally grounded question answering. However, there are still notable performance gaps between languages with more and less training data. This shows that retrieval augmentation does not entirely overcome the training data imbalance problem.
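One way to picture a region-aware hybrid score is a weighted mix of normalized lexical and dense scores with a multiplicative boost for documents from the query's region. This is a minimal sketch under stated assumptions: the function names, the min-max normalization, the mixing weight, and the boost factor are all hypothetical and are not taken from the paper, and the BM25 and dense scores are assumed to be precomputed.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(docs, query_region, lexical_weight=0.5, region_boost=1.2):
    """Rank documents by a fused lexical/dense score with a regional boost.

    Each doc is a dict with keys "id", "bm25", "dense", and "region".
    The boost is a hypothetical heuristic, not the paper's weighting.
    """
    if not docs:
        return []
    bm25 = min_max([d["bm25"] for d in docs])
    dense = min_max([d["dense"] for d in docs])
    ranked = []
    for doc, lex, sem in zip(docs, bm25, dense):
        fused = lexical_weight * lex + (1.0 - lexical_weight) * sem
        if doc["region"] == query_region:
            fused *= region_boost  # favour documents from the query's region
        ranked.append((fused, doc["id"]))
    ranked.sort(reverse=True)
    return [doc_id for _, doc_id in ranked]


# Toy candidates with made-up scores and region codes.
candidates = [
    {"id": "a", "bm25": 10.0, "dense": 0.2, "region": "IR"},
    {"id": "b", "bm25": 8.0, "dense": 0.9, "region": "US"},
    {"id": "c", "bm25": 2.0, "dense": 0.1, "region": "IR"},
]
mild_boost = hybrid_rank(candidates, query_region="IR")
strong_boost = hybrid_rank(candidates, query_region="IR", region_boost=2.0)
```

In the toy data a mild boost leaves the semantically strongest document on top, while a stronger boost promotes the in-region document, which shows how sensitive the ranking can be to the regional weight.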

2605.27621 2026-05-28 cs.MA cs.CL 版本更新

Agents that Matter: Optimizing Multi-Agent LLMs via Removal-Based Attribution

重要的智能体:通过基于移除的归因优化多智能体大语言模型

Mingyu Lu, Yushan Huang, Chris Lin, Su-In Lee

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学保罗·G·艾伦计算机科学与工程学院)

AI总结 提出一个基于合作博弈的归因框架,通过移除协议和模型替换来识别瓶颈智能体,从而优化多智能体系统性能并降低成本。

详情
AI中文摘要

随着多智能体系统(MAS)变得越来越复杂,识别单个智能体的贡献对于系统优化至关重要。然而,现有方法缺乏严格统一的信用分配框架。在这项工作中,我们将智能体归因形式化为一个合作博弈,由联盟分布、移除协议和目标指标参数化。利用该框架,我们表明留一法(LOO)能够像组合方法一样有效地识别瓶颈智能体,但计算成本仅为后者的一小部分。我们还证明了移除协议会引发不同的博弈:智能体消融隔离了结构瓶颈,而内省式LLM法官无法忠实地近似这种行为。此外,为了评估特定智能体骨干的效用,我们引入了通过模型替换进行归因的方法。通过替换低贡献智能体的底层模型,我们在三个基准测试上将任务性能提高了高达17%,同时将成本降低了高达35%。最后,我们将该框架应用于审计一个医疗MAS,揭示了智能体对诊断准确性和伦理行为的贡献通常是解耦的。通过干预适得其反的角色,我们观察到在保持诊断准确性的同时,伦理一致性有所提高。总体而言,这项工作为成本效益高的MAS归因和干预提供了一种原则性方法。

英文摘要

As multi-agent systems (MAS) become increasingly complex, identifying the contributions of individual agents is critical for system optimization. However, existing approaches lack a rigorous, unified framework for credit assignment. In this work, we formalize agent attribution as a cooperative game, parameterized by the coalition distribution, removal protocol, and target metric. Using this framework, we show that Leave-One-Out (LOO) identifies bottleneck agents as effectively as combinatorial methods, but at a fraction of the computational cost. We also demonstrate that removal protocols induce distinct games: Agent ablation isolates structural bottlenecks, whereas introspective LLM judges fail to faithfully approximate this behavior. Furthermore, to evaluate the utility of specific agent backbones, we introduce attribution via model replacement. By substituting underlying models of low-contribution agents, we improve task performance by up to 17% while reducing cost by up to 35% across three benchmarks. Finally, we apply our framework to audit a medical MAS, revealing that agent contributions to diagnostic accuracy and ethical behavior are often decoupled. By intervening on counterproductive roles, we observe an increase in ethics alignment while maintaining diagnostic accuracy. Overall, this work provides a principled approach for cost-effective MAS attribution and intervention.
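Leave-One-Out attribution itself is simple to state: an agent's score is the drop in the target metric when that agent alone is removed from the full system. The sketch below assumes a caller-supplied `evaluate` function over coalitions; `toy_metric` and the agent names are invented for illustration and do not come from the paper.

```python
def leave_one_out(agents, evaluate):
    """Return {agent: v(all agents) - v(all agents except that one)}.

    `evaluate` maps a frozenset of agent names to a scalar metric.
    Only len(agents) + 1 evaluations are needed, versus one per coalition
    (2 ** len(agents)) for exhaustive combinatorial attribution.
    """
    full = frozenset(agents)
    baseline = evaluate(full)
    return {agent: baseline - evaluate(full - {agent}) for agent in agents}


def toy_metric(coalition):
    """Hypothetical additive metric with one counterproductive agent."""
    score = 0.2
    if "planner" in coalition:
        score += 0.5
    if "coder" in coalition:
        score += 0.2
    if "critic" in coalition:
        score -= 0.1  # removing this agent improves the metric
    return score


attributions = leave_one_out(["planner", "coder", "critic"], toy_metric)
```

A negative attribution flags an agent whose removal improves the metric, which is the kind of role the abstract describes intervening on.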

2605.27596 2026-05-28 cs.CL 版本更新

Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

幻觉能否有用?通过链式系统I/II推理用SLM解决多跳问题

Saptarshi Sengupta, Suhang Wang

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出一种“先回答后推理”的认知启发框架,利用SLM的初始答案(可能包含幻觉)作为假设来检索证据,再通过系统II深度推理,从而在多跳问答任务上超越传统的“先思考后检索”方法。

详情
AI中文摘要

最近,小型语言模型(SLM)引起了越来越多的兴趣,它们速度快、性能好,且硬件需求低于大型语言模型(LLM)。然而,SLM比LLM更容易产生幻觉,影响其解决复杂多步推理问题的能力,因为早期错误会级联到最终响应。为了解决这个问题,现有工作采用先思考后迭代检索的策略来减少幻觉。我们认为先思考策略并非总是必要,因为我们发现:(i)SLM通常对其初始答案有准确的置信度,并且(ii)幻觉实际上可能有助于逼近正确答案。因此,我们将我们的工作定位为这种策略的反转,即先回答后推理。我们提出了一个认知启发的框架,其中模型首先被允许快速回答问题(系统I(零样本)),然后基于从知识源使用初始假设检索到的证据进行更深层次的思考(系统II)。通过结合系统I和系统II风格的推理,我们展示了我们的方法在各种多步问答基准测试中可以优于先前采用传统先思考路径的工作。

英文摘要

Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works adopt a think-first strategy followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary, as we find that (i) SLMs are often accurately confident in their initial answer and (ii) hallucinations can actually be beneficial for homing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first, reason later. We propose a cognitively-inspired framework where the model is first allowed to quickly answer the question (System-I (zero-shot)) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.

2605.27586 2026-05-28 cs.MA cs.CL 版本更新

You Only Align Once: Propagating Cooperative Behaviors in Multi-Agent Systems through Seed Agents

你只需对齐一次:通过种子智能体在多智能体系统中传播合作行为

Nicole Hsing, Asuka Yuxi Zheng, Yi Zhao, Haoqin Tu, Jen-Tse Huang

发表机构 * Arcarae University of California, Santa Cruz(加州大学圣克鲁兹分校) Northwestern University(西北大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文提出对齐传播现象,通过种子智能体在自然语言交互中传播合作行为,在红黑游戏中将合作率从24.8%提升至62.2%,并零样本迁移至Sugarscape场景。

详情
AI中文摘要

确保分布式开放多智能体系统中的智能体行为仍然具有挑战性,尤其是在群体规模增长且可能存在未对齐智能体的情况下。我们证明,单个对齐的智能体可以通过纯自然语言交互将合作行为传播给未经训练的智能体,我们将这种现象称为对齐传播。我们在红黑游戏中研究这一现象,这是一个基于团队的迭代囚徒困境,队友通过讨论和投票决定团队的集体行动。通过将教师模型的合作推理和说服性对话提炼到Qwen-3-14B中,我们获得了一个种子智能体,当它被放置在四个未经训练的队友中时,合作率从24.8%翻倍至62.2%,优于教师模型和原始的Gemini-3.1-Pro。值得注意的是,仅在红黑游戏上训练的种子智能体零样本迁移到Sugarscape(一个具有成对交易的空间生存模拟)中,实现了91.5%的交易成功率,而基线为21.6%。我们的结果将多智能体对齐从每个智能体逐一训练的难题重新定义为一种可扩展的社会能力,可以通过战略性种子放置来设计。

英文摘要

Ensuring agent behaviors in distributed open multi-agent systems remains challenging, especially as populations grow and unaligned agents may exist. We show that a single aligned agent can propagate cooperative behaviors to untrained agents purely through natural language interaction, a phenomenon we term Alignment Propagation. We study this in the Red-Black Game, a team-based iterated Prisoner's Dilemma in which teammates deliberate and vote to determine their team's collective action. By distilling the cooperative reasoning and persuasive dialogues of a teacher model into Qwen-3-14B, we obtain a seed agent that, when placed among four untrained teammates, more than doubles the cooperation rate from 24.8% to 62.2%, outperforming the teacher model and a vanilla Gemini-3.1-Pro. Remarkably, a seed trained exclusively on the Red-Black Game transfers zero-shot to Sugarscape, a spatially grounded survival simulation with pairwise trading, achieving a 91.5% trade success rate versus a 21.6% baseline. Our results reframe multi-agent alignment from an exhaustive per-agent training problem to a scalable social capability that can be engineered through strategic seed placement.

2605.27571 2026-05-28 cs.AI cs.CL cs.DB 版本更新

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

实时分析发现代理:迈向主动洞察系统

Gaetano Rossiello, Dharmashankar Subramanian

发表机构 * IBM

AI总结 提出一种多智能体架构,通过持续发现循环(假设生成、编译、验证、可视化)实现实时数据流的自主洞察发现,支持从查询驱动向主动发现的范式转变。

Comments Accepted at Supporting Our AI Overlords (SAO) at the ACM Conference on AI and Agentic Systems (CAIS), May 26 2026, San Jose, CA, USA

详情
AI中文摘要

现代分析系统本质上是反应式的,要求用户在日益复杂且持续演变的数据上定义查询。在实时流式环境中,这种范式失效,因为潜在洞察的空间变得太大而无法手动枚举。我们提出了一种用于实时数据流自主洞察发现的多智能体架构。该系统实现了一个持续发现循环,其中智能体生成假设,将其编译为可执行分析,验证生成的工件,并生成可视化和可部署的应用程序。该架构利用Apache Kafka进行事件驱动协调,Apache Flink进行流处理,以及大型语言模型来实现专门的智能体。一个关键贡献是基于类型化中间工件的契约驱动设计,实现了模块化、可观测性、血统以及更安全地执行动态生成的分析。通过零售、金融和公共数据中的用例,我们展示了该架构如何支持从查询驱动分析向主动发现驱动系统的转变。

英文摘要

Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.

2605.27567 2026-05-28 cs.AI cs.CL 版本更新

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

为什么LLM在因果发现中失败以及干预代理如何逃脱

Amartya Roy, Sonali Parbhoo

发表机构 * SIRE, IIT Delhi(IIT德里智能研究机构) Robert Bosch GmbH(罗伯特·博世有限公司) Imperial College London(伦敦帝国理工学院)

AI总结 本文证明大型语言模型在因果发现中存在根本性失败,并提出一种基于干预代理的因果贝叶斯优化方法(A-CBO),通过外部贝叶斯循环在无需模型微调的情况下实现可证明的收敛。

Comments 9 pages, 3 figures

详情
AI中文摘要

因果发现是科学推理的基石,但大型语言模型能否可靠地执行因果发现仍是一个悬而未决的问题。最近的基准测试表明,即使是微调后的模型在简单因果图上也会达到平台期,并随着复杂度增加而退化,但失败的原因尚未明确。我们证明这种失败是根本性的:监督微调、直接偏好优化和上下文学习都会产生无法区分生成相似观测数据的因果图的预测器,任何这样做的尝试都需要模型的内部表示无限增长,从而违反了这些方法工作的条件。我们将其形式化为核障碍定理,确立该限制是学习范式固有的,而非任何特定模型或数据集。我们提出了代理因果贝叶斯优化(A-CBO),其中冻结的语言模型作为干预预言机,回答关于干预效果的目标查询,而外部贝叶斯循环在对数轮次内将信念集中在候选因果图上。由于决策在障碍适用的空间之外运行,A-CBO在底层模型保持不变的情况下可证明收敛。在Corr2Cause上,A-CBO无需任何训练即可匹配微调基线。在Extended Corr2Cause(一个扩展到24个变量、包含18K测试样本的新基准)上,A-CBO显著优于微调和偏好优化,且优势不断扩大。

英文摘要

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing

2605.27564 2026-05-28 cs.CL cs.AI cs.LG 版本更新

The Future of Facts: Tracing the Factual Generation-Verification Gap

事实的未来:追踪事实生成-验证差距

Tim R. Davidson, Anja Surina, Caglar Gulcehre

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 本文通过训练阶段分析,发现语言模型在事实知识上存在生成-验证差距,验证能力先于生成能力习得且更稳健,事实更新可能导致模型处于“多宇宙”状态。

Comments Code for this project is available at https://github.com/anjasurina/factgap , blog post at https://www.trdavidson.com/fact-gap

详情
AI中文摘要

语言模型正成为事实知识的默认接口,但它们验证输出的能力往往比生成输出的能力更可靠。这种生成-验证差距(GV-gap)是近期自我改进和推理中许多进展的基础,但其在事实知识上的动态仍未被充分理解。我们聚焦于事实性GV-gap背后的训练机制,将其与计算和美学方面的对应物区分开来。我们通过四个开源模型家族(每个家族两个规模)的三个训练阶段(获取、持续学习和更新)追踪生成和验证能力。三个发现跨模型重复出现:(i)验证始终先于生成被学习;(ii)验证比生成对持续学习更稳健;(iii)事实更新可能使模型处于“多宇宙”状态,同时验证新旧答案均为正确。对前沿模型的自然实验在大规模上重现了这些动态,并揭示了在充分覆盖的事实上残留的验证偏差。

英文摘要

Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation-verification gap (GV-gap) underlies many recent advances in self-improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV-gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi-verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.

2605.27546 2026-05-28 cs.CL cs.HC 版本更新

Keyphrase Generative Representation of Youth Crisis Conversations Beyond Static Taxonomies

超越静态分类法的青少年危机对话的关键词生成表示

Abeer Badawi, Will Aitken, Lydia Sequeira, Jocelyn Rankin, Maia Norman, Elham Dolatabadi

发表机构 * York University(约克大学) Vector Institute(向量研究所) Electrical and Computer Engineering, Queen’s University(皇后大学电气与计算机工程系) Kids Help Phone(儿童援助电话)

AI总结 本文提出关键词生成表示(KGR)方法,通过约束大语言模型生成对话特定的关键词,将原有19标签分类扩展为39标签层次结构,在129段对话和387个专家注释上评估,准确率达0.96,并发现固定分类中缺失的身份相关主题,将主题检索准确率从0.25提升至0.70。

详情
AI中文摘要

危机响应者每年快速评估数千条青少年短信对话,以识别心理健康问题并指导支持。然而,青少年的痛苦越来越多地通过不断演变且依赖具体语境的语言表达,这些语言通常不适合固定标签的分类法。本研究分析了703,975条去标识化的Kids Help Phone对话(2018-2023年),并将KHP的19标签问题分类扩展为39标签层次结构。然后,我们引入关键词生成表示(KGR),一种受约束的大语言模型,生成简洁、对话特定的关键词,在129段对话和387个专家注释上进行了评估。扩展后的分类法达到了专家共识可靠性,准确率为0.96,专家评审发现81%的关键词准确反映了内容,74%提高了清晰度。KGR揭示了固定分类法中缺失的与身份相关的主题,包括移民问题和照顾者负担,并支持了一个主题检索工作流,与手动分析师流程相比,准确率从0.25提高到0.70(+0.45)。KGR标志着向混合、可解释的生成表示转变,将危机响应扩展到静态分类法之外,以揭示新兴的、植根于文化的青少年痛苦模式。

英文摘要

Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.

2605.27545 2026-05-28 cs.CL 版本更新

PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

PAST2HARM:一种用于越狱多模态AI的简单自适应过去时攻击

Snehasis Mukhopadhyay

发表机构 * Indian Institute of Information Technology, Kalyani(印度信息技术学院,卡利安)

AI总结 提出PAST2HARM框架,通过过去时态改写和迭代升级策略,系统性地利用多模态文本到图像模型的安全漏洞,实现黑盒、无梯度的高成功率越狱攻击。

详情
AI中文摘要

尽管不安全的图像生成可能比不安全的文本产生更严重的后果,且当前防御相对不成熟,但对多模态AI系统的越狱攻击仍未得到充分探索。我们引入了PAST2HARM,一个简单而有效的自适应越狱框架,能够绕过最先进的多模态文本到图像模型中的拒绝训练。基于先前发现过去时态改写可以规避安全防护的结论,PAST2HARM系统地利用了多模态生成式AI中的这一漏洞。 我们沿两个维度刻画攻击。第一,广度:通过时间深化,该框架逐步增强历史锚定和档案线索,侵蚀不同对齐强度模型的拒绝边界。第二,深度:通过初始顺从后的迭代升级,我们探测有害生成的上限,使用由语言模型作为评判者评估的标量严重性越狱指标来衡量严重程度。我们发现对话中间轮次形成峰值脆弱窗口,其中有害性增加后趋于平稳,最终经历语义反转。 我们在三个模型Gemini Nano Banana Pro、GPT Image 2和SD XL上评估PAST2HARM,在黑盒、无梯度设置下分别实现了83%、67%和100%的攻击成功率。对抗性提示也在模型间迁移,跨模型成功率超过50%。该攻击引发了多种有害输出,包括露骨色情内容、政治虚假信息、历史否认叙事、仇恨言论和自我伤害美化。我们进一步发布了一个精心策划的提示、改写和输出基准,作为红队测试和对齐的资源。我们的结果暴露了当前安全防护的根本脆弱性,并强调了加强多模态安全训练的必要性。

英文摘要

Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple yet effective adaptive jailbreak framework that bypasses refusal training in state-of-the-art multimodal text-to-image models. Building on prior findings that past-tense reformulations can evade safeguards, PAST2HARM systematically exploits this vulnerability in multimodal generative AI. We characterize the attack along two dimensions. First, breadth: through temporal deepening, the framework incrementally strengthens historical anchoring and archival cues, eroding refusal boundaries across models with varying alignment strength. Second, depth: via iterative escalation after initial compliance, we probe the upper bound of harmful generation, measuring severity using a scalar severity jailbreak metric evaluated by a language model acting as a judge. We find that mid-conversation turns form peak vulnerability windows, where harmfulness increases before plateauing and eventually undergoing semantic inversion. We evaluate PAST2HARM on three models (Gemini Nano Banana Pro, GPT Image 2, and SD XL), achieving attack success rates of 83%, 67%, and 100%, respectively, in a black-box, gradient-free setting. Adversarial prompts also transfer across models, with cross-model success rates above 50%. The attack elicits diverse harmful outputs, including explicit sexual content, political disinformation, historical denial narratives, hate speech, and self-harm glorification. We further release a curated benchmark of prompts, reformulations, and outputs as a resource for red teaming and alignment. Our results expose fundamental brittleness in current safeguards and highlight the need for stronger multimodal safety training.

2605.27531 2026-05-28 cs.PL cs.CL cs.SE 版本更新

Agentic Separation Logic Specification Synthesis

智能体分离逻辑规范合成

Tarun Suresh, David Korczynski, Julien Vanegue

发表机构 * Bloomberg(贝莱德)

AI总结 提出 Spec-Agent 智能体系统,通过静态分析、运行时堆追踪和反例引导迭代,为大型 C++ 代码库合成分离逻辑规范,在百万行级代码上达到 85% 有效规范合成率且无假阳性。

Comments 9 pages, 3 appendices

详情
AI中文摘要

规范合成,即从程序实现和自然语言自动推断形式规范的任务,对于重构、转译、优化和验证非常重要,但对于大型 C++ 代码库仍然是一个开放的挑战。现有的基于 LLM 的方法无法同时扩展到这样的代码库,生成足够表达系统代码特性(如动态内存和堆分配数据结构)的规范,并系统地验证这些规范以排除不正确的候选。我们提出了 Spec-Agent,一个用于在大型 C++ 代码库中合成表达性强、经过充分验证的规范的智能体系统。Spec-Agent 针对一个规范语言阶梯:命题逻辑、一阶逻辑、命题分离逻辑和一阶分离逻辑。对于每个函数,Spec-Agent 使用静态分析和运行时堆追踪来选择适当的目标规范语言,将现有的功能测试泛化为模糊测试工具,并通过反例引导反馈迭代地优化 LLM 生成的候选。我们在包含数百万行代码的开源 C++ 代码库上评估了 Spec-Agent。Spec-Agent 为 85% 的目标函数合成了有效的规范,在模糊测试和专家验证下未观察到假阳性,性能优于 Claude Code Opus 4.6,同时 token 成本降低 10 倍。

英文摘要

Specification synthesis, the task of automatically inferring formal specifications from program implementations and natural language, is important for refactoring, transpilation, optimization, and verification, yet remains an open challenge for large C++ repositories. Existing LLM-based approaches fail to simultaneously scale to such repositories, produce specifications expressive enough to capture systems-code features such as dynamic memory and heap-allocated data structures, and systematically validate those specifications to rule out incorrect candidates. We present Spec-Agent, an agentic system for synthesizing expressive, well-validated specifications across large C++ codebases. Spec-Agent targets a ladder of specification languages: propositional logic, first-order logic, propositional separation logic, and first-order separation logic. For each function, Spec-Agent uses static analysis and runtime heap tracing to select the appropriate target specification language, generalizes existing functional tests into fuzz harnesses, and iteratively refines LLM-generated candidates via counterexample-guided feedback. We evaluate Spec-Agent on open source C++ codebases comprising millions of lines of code. Spec-Agent synthesizes valid specifications for 85% of target functions, with no false positives observed under fuzzing and expert validation, outperforming Claude Code Opus 4.6 at 10x lower token cost.
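The counterexample-guided refinement described above has a generic loop shape: propose a candidate, search for a counterexample, and feed it back. The sketch below shows only that shape under strong simplifications: `refine`, `propose`, and `find_counterexample` are hypothetical names, the "specification" is a toy lower bound on a function's result, and a small fixed input range stands in for fuzzing. It does not reflect Spec-Agent's actual interfaces.

```python
def refine(propose, find_counterexample, max_rounds=5):
    """Generic counterexample-guided loop.

    `propose(feedback)` returns a candidate given past counterexamples;
    `find_counterexample(candidate)` returns a violating observation or None.
    Returns the first candidate with no counterexample, or None if the
    round budget runs out.
    """
    feedback = []
    for _ in range(max_rounds):
        candidate = propose(feedback)
        counterexample = find_counterexample(candidate)
        if counterexample is None:
            return candidate
        feedback.append(counterexample)
    return None


def square(x):
    return x * x


TEST_INPUTS = range(-3, 4)  # stands in for a fuzz harness


def propose(feedback):
    # Candidate postcondition "result >= bound"; start with a bad guess,
    # then weaken the bound to the smallest violating output seen so far.
    return min(feedback) if feedback else 5


def find_counterexample(bound):
    for x in TEST_INPUTS:
        if square(x) < bound:
            return square(x)
    return None


final_bound = refine(propose, find_counterexample)
too_few_rounds = refine(propose, find_counterexample, max_rounds=2)
```

In the toy run the bound is weakened round by round until no tested input violates it; with too small a round budget the loop gives up and returns `None` rather than an unvalidated candidate.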

2605.27494 2026-05-28 cs.CR cs.AI cs.CL cs.IR cs.LG 版本更新

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

基于证据的缓存路由用于检索增强生成:何时可以安全地重用答案?

Syed Huma Shah

AI总结 提出GroundedCache,一种通过四个廉价门控(查询相似性、检索证据重叠、源版本有效性和词汇支持)验证缓存答案安全性的路由方法,显著降低不安全服务率。

Comments 19 pages, 9 figures, 10 tables. Code: https://github.com/syedhumarahim/grounded-cache-router

详情
AI中文摘要

现代检索增强生成(RAG)部署越来越依赖缓存来降低令牌成本和首令牌时间(TTFT)。在vLLM等服务栈中,前缀级KV重用已成为标准,而最近的系统(RAGCache、TurboRAG、CacheBlend、EPIC、ContextPilot、PCR、LMCache)进一步推动了块级和位置无关的重用。相比之下,输出级语义答案缓存仍然脆弱:相似的提示可能映射到不同的正确答案,检索到的证据随着语料库更新而漂移,并且对抗性碰撞攻击已被证明可以劫持缓存的响应。我们认为,缓存答案重用的正确框架不是如何更快地重用,而是何时重用是安全的。我们提出了GroundedCache,一种经过证据验证的缓存路由器,仅当四个廉价门控同时成立时才允许缓存答案:查询相似性、检索证据重叠、源版本有效性以及新检索证据对缓存答案的词汇(或基于判断的)支持。我们构建了一个六区域工作负载,用于压力测试缓存安全性而不仅仅是命中率,并引入了一个面向操作员的指标——不安全服务率(USR),即收到错误缓存答案的查询比例。在两个数据集和12,000个真实LLM生成(在vLLM上使用自动前缀缓存的Qwen2.5-7B-Instruct)中,GroundedCache在每个HotpotQA区域上将USR降至0.0%(而朴素缓存为15-35%),在mtRAG文档漂移上降至1.5%(而朴素缓存为51.5%),在设计点对抗区域上减少了34倍,在其他mtRAG区域上减少了3-10倍,同时端到端p50延迟保持在无缓存RAG基线的1.04-1.07倍以内。逐门控消融实验表明,词汇支持门控是两个数据集上的主要安全机制,其余门控以近乎零成本提供纵深防御。我们发布了实现、工作负载和评估工具。

英文摘要

Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems (RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when four cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR), the fraction of all queries that received a wrong cached answer. Across two datasets and 12,000 real-LLM generations (Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime (vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift (vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.
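The four-gate admission rule and the USR metric can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the `CacheEntry` layout, the Jaccard overlap, the whitespace-token support check, and all threshold values are hypothetical and are not taken from the released implementation.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    answer: str
    evidence_ids: frozenset   # ids of documents retrieved when the answer was cached
    source_versions: dict     # doc id -> document version at caching time


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_support(answer, evidence_text):
    """Fraction of answer tokens that also appear in the fresh evidence."""
    answer_tokens = answer.lower().split()
    evidence_tokens = set(evidence_text.lower().split())
    if not answer_tokens:
        return 0.0
    return sum(t in evidence_tokens for t in answer_tokens) / len(answer_tokens)


def admit(entry, query_sim, fresh_ids, fresh_versions, fresh_text,
          sim_min=0.9, overlap_min=0.5, support_min=0.8):
    """Serve the cached answer only if all four gates hold (thresholds are illustrative)."""
    gates = (
        query_sim >= sim_min,                                              # query similarity
        jaccard(entry.evidence_ids, frozenset(fresh_ids)) >= overlap_min,  # evidence overlap
        all(fresh_versions.get(doc) == ver                                 # source versions unchanged
            for doc, ver in entry.source_versions.items()),
        lexical_support(entry.answer, fresh_text) >= support_min,          # lexical support
    )
    return all(gates)


def unsafe_served_rate(served_from_cache, was_correct):
    """USR: share of all queries that were answered with a wrong cached answer."""
    n = len(served_from_cache)
    unsafe = sum(c and not ok for c, ok in zip(served_from_cache, was_correct))
    return unsafe / n if n else 0.0


entry = CacheEntry("paris", frozenset({"d1", "d2"}), {"d1": 1, "d2": 1})
fresh = admit(entry, 0.95, {"d1", "d2"}, {"d1": 1, "d2": 1}, "The capital is Paris")
stale = admit(entry, 0.95, {"d1", "d2"}, {"d1": 2, "d2": 1}, "The capital is Paris")
```

Note that USR is normalized by all queries, not only by cache hits, so a router that rarely serves from cache can have a low USR even if its hits are often wrong.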

2605.27483 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Debate Helps Weak Judges Reward Stronger Models

辩论有助于弱裁判奖励更强的模型

Ethan Elasky, Frank Nakasako, Naman Goyal

Affiliations * Palaestra Research; Berkeley

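AI summary Studies proposer-critic debate in a stronger-debater/weaker-judge setting and finds that debate significantly improves judge performance when the critic's classification ability exceeds the judge's and the judge treats critic speeches as claims to verify; a single independent critique achieves a similar effect at lower cost.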

详情
AI中文摘要

尽管理论上具有前景,但辩论作为一种可扩展的监督协议产生了混合的实证结果:在某些设置中有收益,在其他设置中无效,尤其是当裁判没有隐藏信息时。我们在程序可验证的代码和逻辑任务上,研究了强辩手/弱裁判设置下的提议者-批评者辩论。当批评者提供可用的优势时,辩论帮助裁判优于咨询基线:批评者的分类能力必须超过裁判,并且裁判必须将批评者的言论视为待验证的主张而非待总结的证词。在五个配对中的三个满足该条件的配对中,提议者-批评者辩论的收益在统计上显著优于咨询,并且这些配对是最有能力的模型配对。在我们的集合中的两个非响应者配对中,辩论产生无效效果,一旦批评者进入转录,裁判验证率下降数十个百分点。在这些情况下,批评者的二元分类能力与裁判的相差在噪声范围内,并且批评者的分歧被解析为证词而非待检查的主张。从辩论中消去反驳轮次对裁判表现没有可测量的变化:单一独立批评以更低的推理成本恢复了辩论的大部分收益。这些发现为可验证领域(答案、批评、裁判)中无需训练的可扩展监督提供了一种更廉价的原始方法,以及一种预测辩论何时有帮助的部署前审计(批评者是否击败裁判,以及裁判是否会验证它?)。

英文摘要

Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.

2605.27402 2026-05-28 cs.CY cs.AI cs.CL 版本更新

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

REC-CBM:面向可信开放评分的基于规则感知的错误修正概念瓶颈模型

Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting, Ying-Chih Chen, Huan Liu

Affiliations * School of Computing and Augmented Intelligence, Arizona State University, USA; Mary Lou Fulton Teachers College, Arizona State University, USA; Department of Computer Science, National Yang Ming Chiao Tung University, TW

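AI summary Proposes REC-CBM, which combines a rubric-aware concept encoder, an ordinal pairwise calibration objective, and a latent concept error-correction module to address three shortcomings of standard concept bottleneck models in open-ended grading: they do not model fine-grained rubric dimensions, ignore the ordinal semantics of scores, and rely on unreliable concept annotations. The model improves grading performance while preserving interpretability.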

详情
AI中文摘要

开放评分对于公平和个性化教育至关重要,但人工评分耗时且成本高,凸显了自动化评分系统的必要性。尽管基于神经和大语言模型(LLM)的系统表现出优越性能,但它们通常是黑箱模型,其评分过程和理由难以让教育者验证和信任。概念瓶颈模型(CBM)通过将预测路由到人类可解释的概念,提供透明度的机制保证,成为一种有前景的方法。然而,标准CBM不适用于开放评分:它们没有显式建模细粒度的规则维度,未能充分捕捉评分量表的序数语义,并忽略了人类概念标注中固有的可靠性问题。为解决这些局限,我们提出REC-CBM,一种面向可信开放评分的规则感知错误修正概念瓶颈模型。REC-CBM引入了规则感知概念编码器,学习针对回答的概念特定表示,以及一个序数成对校准目标,保留规则维度间的排序结构。它还结合了一个潜在概念错误修正模块,在最终评分预测前对概念预测进行去噪,同时保持可解释性。在公开数据集上的全面实验表明,REC-CBM在评分性能上持续提升,并产生比最先进基线更忠实的概念级推理。进一步分析验证了每个组件的贡献,并展示了在真实教育环境中的适用性。总体而言,这项工作提供了一种实用、可解释的评分解决方案,使教育者能够检查、干预和信任自动化决策,推动更透明和可信的教育。

英文摘要

Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring the need for automated grading systems. Although recent neural and large language model (LLM) based systems have demonstrated superior performance, they are typically black-box models whose scoring processes and rationales are difficult for educators to verify and trust. Concept bottleneck models (CBMs) have emerged as a promising approach by routing predictions through human-interpretable concepts, providing a mechanistic guarantee of transparency. However, standard CBMs are not tailored to open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect inherent reliability issues in human concept annotations. To address these limitations, we propose REC-CBM, a rubric-aware error-correction concept bottleneck model for trustworthy open-ended grading. REC-CBM introduces a rubric-aware concept encoder that learns concept-specific representations over responses and an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions. It further incorporates a latent concept error-correction module that denoises concept predictions before final grade prediction while preserving interpretability. Comprehensive experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning than state-of-the-art baselines. Further analyses validate the contribution of each component and demonstrate the applicability in realistic educational settings. Overall, this work provides a practical, interpretable grading solution that enables educators to inspect, intervene in, and trust automated decisions, advancing more transparent and trustworthy education.

2605.27393 2026-05-28 cs.CL cs.AI 版本更新

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

StoryMI: 可控的多智能体治疗性对话生成

Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei

Affiliations * Vrije Universiteit Amsterdam; NII, Tokyo Institute of Technology

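AI summary Proposes the StoryMI framework, which generates therapeutic dialogues that follow motivational interviewing standards through collaboration among multiple LLM agents, situational story grounding, and dynamic strategy control, and builds an evaluation protocol and dataset to validate its effectiveness.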

Comments ACL2026

详情
AI中文摘要

大型语言模型(LLM)可以生成流畅的对话,但先前的工作缺乏情境基础、动态策略控制以及与动机性访谈(MI)临床标准对齐的评估。我们引入了StoryMI,一个用于可控MI对话生成的多LLM智能体框架,其中基于问卷的客户档案被扩展为情境故事,为对话提供叙事背景。治疗师和客户智能体生成由交互智能体选择的MI代码引导的MI编码话语,而交互智能体动态协调交换以在多次轮对话中控制MI策略。我们提出了一个两级评估协议:词汇指标和宏观层面咨询策略的MI特定度量,以及LLM作为评判者和人类专家评估。我们构建了一个包含6K模拟MI对话的数据集,基于1K问卷-故事对,涵盖12个MI代码和13个症状领域,并对六个开源和闭源LLM进行了基准测试。我们的结果表明,情境基础和宏观层面控制可以提高MI依从性和临床合理性,展示了结构化多智能体工作流在心理治疗对话生成中的有效性。我们提供代码和数据以促进可重复性。

英文摘要

Large language models (LLMs) can generate fluent dialogue, but prior work lacks situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce StoryMI, a multi-LLM agent framework for controllable MI dialogue generation, where questionnaire-based client profiles are expanded into situational stories that provide narrative context for the dialogue. Therapist and client agents generate MI-coded utterances guided by MI codes selected by an interaction agent, which dynamically coordinates exchanges to control MI strategies during a multi-turn conversation. We propose a two-level evaluation protocol: lexical metrics and MI-specific measures of macro-level counseling strategies, alongside LLM-as-judge and human expert assessments. We construct a dataset of 6K simulated MI dialogues grounded in 1K questionnaire-story pairs, covering 12 MI codes and 13 symptom domains, and benchmark six open- and closed-source LLMs. Our results show that situational grounding and macro-level control can improve MI adherence and clinical plausibility, demonstrating the effectiveness of a structured multi-agent workflow for psychotherapy dialogue generation. We provide code and data for reproducibility.

2605.27389 2026-05-28 cs.IR cs.AI cs.CL 版本更新

Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

基于记忆 vs. 仅上下文条件化在有状态个性化中产生不同的行为模式

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Affiliations * Georgia Institute of Technology

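AI summary Compares contextual and memory-based conditioning in a teacher-facing educational recommender system and finds that contextual recommendations respond more strongly to the current question, while memory-based recommendations exhibit history-dependent behavior, including learner-specific differentiation under identical input.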

Comments Accepted to ITS 2026

详情
AI中文摘要

我们研究了条件化上下文如何塑造教师面向教育推荐系统中的个性化行为。我们比较了基于当前学生问题的上下文条件化与使用持久学习者信息的基于记忆的条件化。通过偏差相关性和配对统计检验,我们发现上下文推荐表现出更强的问题级响应性,而基于记忆的推荐表现出历史依赖行为,包括在相同输入下的学习者特异性分化。教师面向的评估信号表明这些推荐是可解释和可操作的。这些结果表明,基于嵌入的相似性度量能够捕捉对当前问题的响应性,但不能表征基于学习者历史的个性化,从而激励了研究条件化效应的行为级诊断。

英文摘要

We study how conditioning context shapes personalization behavior in a teacher-facing educational recommender system. We compare contextual conditioning based on the current student question with memory-based conditioning using persistent learner information. Using deviation correlation and paired statistical tests, we find that contextual recommendations exhibit stronger question-level responsiveness, while memory-based recommendations exhibit history-dependent behaviors, including learner-specific differentiation under identical input. Teacher-facing evaluation signals suggest these recommendations are interpretable and actionable. These results indicate that embedding-based similarity metrics capture responsiveness to the current question but do not characterize personalization grounded in learner history, motivating behavior-level diagnostics for studying conditioning effects.

2605.27388 2026-05-28 cs.CL cs.AI cs.SI 版本更新

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

通过反应语气建模社区态度:评估LLM与在线社区语言行为对齐的人机协作框架

Nuan Wen, Xuezhe Ma

Affiliations * Information Sciences Institute, University of Southern California

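AI summary Proposes the CARE framework, which uses fine-grained analysis of illocutionary tone to evaluate how well LLMs simulate community reactions to real-world news, and reveals a "realism gap" indicating that current alignment strategies are insufficient for capturing the sociolinguistic dynamics of online groups.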

Comments Preprint

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作计算社会分析的代理;然而,它们忠实再现人类社区“厚描述”(Geertz, 1973)的能力仍然是一个关键挑战。当前的评估通常将社会身份简化为静态标签,忽视了现实群体如何应对社会变迁。为弥合这一差距,我们引入了CARE(社区感知反应评估),一个以反应为中心的框架,将LLM模拟的话语与不同社区对真实新闻的真实、事件相关的反应进行基准测试。通过刻画细粒度的言语气势谱及其所体现的潜在态度——通过人机协作验证——我们的诊断揭示了一个持续的“现实主义差距”:使用明确的社区提示引导LLM并不能固有地提高模拟保真度。进一步分析识别了前沿模型之间的不同行为特征,表明当前的对齐策略仍不足以捕捉在线群体的社会语言动态。

英文摘要

Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest (validated through human-AI collaboration), our diagnosis reveals a persistent "realism gap": steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.

2605.27384 2026-05-28 cs.HC cs.AI cs.CL 版本更新

From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game

从指导者到协作者:一项90名参与者研究揭示移动严肃游戏中的人机协作

Danai Korre

Affiliations * University of Bedfordshire

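AI summary Through a comparative experiment with 90 participants, studies user preference between a highly human-like spoken embodied agent and a low human-like text agent in a mobile serious game, finds that the highly human-like agent is significantly preferred, and discusses how roles, mixed-initiative dialogue, and breakdown repair affect human-agent collaboration in goal-oriented tasks.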

Comments 4 pages, 5 figures, ACM CHI 2026 workshop paper

详情
AI中文摘要

这篇立场论文反映了我在博士期间从一项大规模被试内研究(N=90)中收集的实证数据。该研究在一个关于英国十进制前货币的Unity开发移动游戏中,比较了高度拟人化的语音具身对话代理(ECA)与低拟人化的文本基础代理(无具身,仅文本气泡)。游戏包含两个不同角色的代理——指导者(Alex)和店主/协作者。用户通过语音和鼠标输入进行交互。我收集的定量数据包括可用性问卷(CCIR MINERVA)和代理人格工具。数据使用配对t检验、重复测量方差分析和多元线性回归进行分析,以识别代理人格与可用性之间的相关性。结果显示,高度拟人化代理版本在统计上显著更受偏好,效应量大。这一结果与观察和退出访谈的定性发现一起进一步讨论。结果从人机协作的角度进行阐述,特别是角色、混合主动对话以及故障/修复在目标导向任务中如何显现。最后,我提出了关于时机、用户期望和角色特定交互的问题。本投稿不提出新框架;而是报告实证发现和问题,我希望与社区进行研讨。

英文摘要

This position paper reflects on empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a highly human-like, spoken embodied conversational agent (ECA) against a low human-like text-based agent (no embodiment, text bubble only) within a mobile, Unity-developed game about pre-decimal UK currency. The game included two agents with different roles: an Instructor (Alex) and a Shopkeeper/Collaborator. Users interacted using voice and mouse input. The quantitative data I collected included a usability questionnaire (CCIR MINERVA) and the Agent Persona Instrument. The data were analyzed using paired t-tests, repeated-measures ANOVA, and multiple linear regression to identify correlations between the persona and usability. The results showed a statistically significant preference for the version with highly human-like agents, with a large effect size. This is further discussed alongside qualitative findings from observations and exit interviews. The results are framed for Human-Agent collaboration, especially for how roles, mixed-initiative dialogue, and breakdowns/repairs become apparent in goal-oriented tasks. I conclude with questions on timing, user expectations, and role-specific interactions. This submission does not propose new frameworks; it reports empirical findings and questions I hope to workshop with the community.

2605.27383 2026-05-28 cs.CL cs.AI 版本更新

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

弥合稳定性与表现力之间的差距:低资源口语语言模型的合成数据扩展与偏好对齐

Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, Xiaoyu Shen

Affiliations * Beijing University of Posts and Telecommunications; University of California, USA; Northwestern University, USA; Eastern Institute of Technology, Ningbo, China

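AI summary To address the expressivity collapse that synthetic data causes in low-resource spoken language models, proposes two self-alignment frameworks (DGSA and TDSC) that restore prosodic diversity, outperforming commercial systems and enabling the first zero-shot voice cloning for Lao.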

详情
AI中文摘要

口语语言模型(SLM)通过绕过显式的字素到音素流水线,已成为语音合成的一种有前景的范式。然而,它们在低资源语言中的有效性仍然受到转录语音稀缺的根本限制。在实践中,合成数据已成为在此类场景下扩展SLM的主要策略,当真实数据不足时提供可靠的音素监督。在这项工作中,我们表明这种依赖引入了一个基本权衡,我们称之为稳定性-表现力差距:虽然合成数据提高了音素准确性,但它逐渐抑制了韵律变异性,最终导致表现力崩溃(合成侵蚀)。为了弥合这一差距,我们提出了两种自对齐框架。解耦引导的自对齐(DGSA)通过利用韵律-音色分离来恢复复杂语言的表现力。对于真实参考极其有限的场景,温度驱动的自我批评(TDSC)通过自动探索和过滤来稳定生成。我们的方法优于强大的商业系统,包括ElevenLabs和Gemini Pro,并首次实现了老挝语的零样本语音克隆能力。

英文摘要

Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.

2605.27380 2026-05-28 cs.CL cs.AI 版本更新

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

BioELX: 基于别名的检索与LLM排序的跨语言生物医学实体链接

Yi Wang, Corina Dima, Liangyu Zhong, Steffen Staab

Affiliations * University of Stuttgart, Germany; Technical University of Berlin, Germany

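AI summary Proposes BioELX, a two-stage framework that strengthens a SapBERT retriever with multilingual Wikidata aliases and uses a pre-trained LLM ranker for context-aware disambiguation, achieving the best performance on multiple benchmarks without annotated data.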

Comments 12 pages, 3 figures

详情
AI中文摘要

跨语言生物医学实体链接(BEL)将任何语言的提及映射到生物医学知识库(KB)中的唯一标识符,支持临床和生物医学NLP应用。然而,BEL的专家标注训练数据成本高昂,尤其是对于低资源语言。此外,许多跨语言BEL系统依赖于基于SapBERT的检索器,这些检索器主要在KB中的英语别名上训练,导致对未见过的非英语提及泛化能力差,且上下文感知消歧有限。我们提出BioELX,一个两阶段跨语言BEL框架,无需任务特定的标注训练语料。在第一阶段,我们用维基数据派生的多语言别名丰富SapBERT训练,并使用得到的检索器改进跨语言候选检索。在第二阶段,我们使用预训练LLM排序器进行上下文感知消歧,该排序器联合考虑提及上下文和候选,消除了监督训练的需要。在五个基准(XL-BEL、EMEA、Patent、WikiMed-DE和MedMentions)上的实验表明,BioELX实现了新的最先进性能。它在XL-BEL上将平均Recall@1提高了+19.2,尤其是低资源语言提升显著,例如土耳其语+21.6、韩语+22.1、泰语+30.8,并在EMEA(+6.2)、Patent(+5.4)和WikiMed-DE(+12.8)上持续改进。代码和资源将在发表后发布。

英文摘要

Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on predominantly English aliases in the KB, leading to poor generalization to unseen non-English mentions and limited context-aware disambiguation. We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and candidate, eliminating the need for supervised training. Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2, with especially large gains for low-resource languages, e.g., +21.6 on Turkish, +22.1 on Korean, +30.8 on Thai, and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.
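The gains in this abstract are reported in Recall@1, which can be made concrete with a short sketch. It assumes each mention has a single gold KB identifier and that the linker returns a ranked candidate list per mention; the function name `recall_at_k` and the identifiers in the usage note are illustrative and do not come from the BioELX code.

```python
def recall_at_k(ranked_candidates, gold_ids, k=1):
    """Fraction of mentions whose gold KB id appears in the top-k candidates.

    ranked_candidates: one ranked list of KB ids per mention (best first).
    gold_ids: one gold KB id per mention, aligned with ranked_candidates.
    """
    if len(ranked_candidates) != len(gold_ids):
        raise ValueError("need exactly one candidate list per gold id")
    if not gold_ids:
        return 0.0
    hits = sum(1 for cands, gold in zip(ranked_candidates, gold_ids)
               if gold in cands[:k])
    return hits / len(gold_ids)
```

For example, with two mentions whose ranked lists are `["C1", "C2"]` and `["C3", "C4"]` and whose gold ids are `"C1"` and `"C4"`, only the first mention is linked correctly at rank 1, while both are recovered within the top 2. A reranking stage such as the LLM ranker described above can only raise Recall@1 for mentions whose gold id already appears among the retrieved candidates.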

2605.27378 2026-05-28 cs.CL cs.CV cs.MA 版本更新

OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

OralAgent: 融合推理、工具与知识的交互式牙科影像分析

Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

Affiliations * Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China; Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA; Shenzhen University, China; Department of Craniomaxillofacial Surgery, Shanghai Ninth People's Hospital, China; Nanyang Technological University, Singapore; School of Biomedical Engineering, Southern Medical University, China; Singapore University of Technology and Design, Singapore; University of Auckland, New Zealand; Shanghai Artificial Intelligence Laboratory, China

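AI summary Proposes OralAgent, the first dental-specialized AI agent, an automated framework for multimodal reasoning, tool-based decision-making, and knowledge retrieval that integrates 22 visual analysis tools and 368 classical dental textbooks and achieves state-of-the-art performance on multiple benchmarks.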

Comments 14 pages, 7 figures, 6 tables

详情
AI中文摘要

牙科影像分析在支持口腔医疗的准确诊断和治疗规划中起着关键作用。尽管近期进展产生了针对特定任务和单一成像模态的牙科AI模型,但其孤立的设计限制了在实际临床工作流程中的实用性。在本文中,我们提出了OralAgent,这是首个牙科专用AI智能体,它在端到端自动化框架内统一了多模态推理、基于工具的决策和基于知识的检索。它集成了22种视觉分析工具和368本广泛使用的经典牙科教科书,实现了自主推理、规划、工具使用、知识检索和多步骤工作流执行。此外,我们引入了OralCorpus,这是一个大规模、高质量的双语文本资源,包含1.348亿个标记,专为牙科检索增强生成(RAG)而构建。为了评估模型的多学科牙科知识,我们构建了OralQA-ZH,这是一个中文选择题基准,包含来自11个口腔亚专业的798个项目。大量实验表明,OralAgent在MMOral-Uni、MMOral-OPG和OralQA-ZH基准上达到了最先进的性能,突显了其在真实临床环境中的有效性、可解释性和适应性。代码和模型已在https://github.com/isjinghao/OralAgent公开。

英文摘要

Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.

2605.27376 2026-05-28 cs.CL cs.AI 版本更新

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

解锁基于提示的文本转语音模型中的细粒度和句内说话风格控制

Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

Affiliations * Department of Artificial Intelligence, Sungkyunkwan University, Korea; Department of Computer Science and Engineering, Sungkyunkwan University, Korea

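AI summary To address the lack of fine-grained control and within-utterance style variation in prompt-based TTS models, proposes inter-utterance style interpolation and intra-utterance style transition, achieving smooth style control through interpolation along direction vectors in the embedding space and through KV-cache swapping with sliding-window attention masking.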

详情
AI中文摘要

虽然基于提示的文本转语音(TTS)模型支持自然语言驱动的说话风格控制,但它们通常提供有限的细粒度控制,并在整个话语中应用单一的全局风格。这限制了需要跨话语连续风格属性插值和单个话语内时变风格过渡的实际用例。在本文中,我们提出了在现有基于提示的TTS模型中实现这两种能力的新技术。对于句间风格插值,我们计算嵌入空间中对比风格提示之间的方向向量并进行简单插值,从而实现风格特征之间的平滑过渡。对于句内风格过渡,我们首先识别出自回归TTS解码器中对早期标记的强烈注意力偏差,导致初始音频实现主导后续生成。为了减轻这种影响,我们引入了KV缓存交换和滑动窗口注意力掩码。实验表明,我们提出的句间插值在性别转换中实现了99-100%的成功率,高达36 Hz的音高变化,以及高达1.6音节/秒的速度变化。我们的句内过渡保持了0.81-0.91的说话人相似度,并获得了3.48-4.48的感知平滑度分数。

英文摘要

While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
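Two of the mechanisms named in this abstract, direction-vector interpolation between contrastive style prompts and sliding-window attention masking, can be sketched in a few lines. The abstract does not give the exact formulation, so the following is an assumption: the direction is taken as a plain difference of two prompt embeddings, the interpolation is a linear step of size `alpha` along it, and the mask is a standard causal window. All function names are hypothetical, and KV-cache swapping is not covered.

```python
def style_direction(pos_embedding, neg_embedding):
    """Direction from one contrastive style prompt to the other
    (for example "low pitch" -> "high pitch"), as an element-wise difference."""
    return [p - n for p, n in zip(pos_embedding, neg_embedding)]


def interpolate_style(base_embedding, pos_embedding, neg_embedding, alpha):
    """Shift the base prompt embedding by alpha along the style direction.

    alpha = 0.0 leaves the base unchanged; larger alpha moves further
    toward the style described by the positive prompt.
    """
    direction = style_direction(pos_embedding, neg_embedding)
    return [b + alpha * d for b, d in zip(base_embedding, direction)]


def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Position i sees only itself and the previous (window - 1) positions,
    so tokens from the start of the utterance drop out of view over time.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]
```

Restricting attention to a recent window is one simple way to weaken the bias toward early tokens that the abstract identifies, since positions outside the window can no longer dominate later generation.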

2605.27375 2026-05-28 cs.CL 版本更新

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

LCO:基于LLM的约束优化,用于现实任务中更安全的智能体LLM

Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, Hang Su

Affiliations * East China Normal University; Beijing Zhongguancun Academy; Tsinghua University

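AI summary Proposes the LCO framework, which constrains LLM behavior through a self-thought module and an evolutionary sampling module to reduce in-context reward hacking without fine-tuning the model; experiments show significant safety improvements in output-refine and policy-refine scenarios.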

详情
AI中文摘要

大型语言模型(LLM)越来越多地充当自主智能体,但它们与环境的持续交互可能导致上下文奖励黑客行为(ICRH),即LLM迭代优化其行为以最大化代理目标,无意中产生有害副作用。现有防御方法不足以应对此风险,因为ICRH并非源于对抗性输入,而是模型自身的过度优化。为缓解此问题,我们提出基于LLM的约束优化(LCO),该框架无需模型微调即可有效减少ICRH。LCO包含两个模块:自思考模块,引导LLM在执行前主动思考并整合潜在安全约束;进化采样模块,利用基于LLM的交叉和变异将模型动作约束在安全解空间内,同时保持任务性能。实验结果表明,LCO在输出优化和策略优化场景中均显著缓解了ICRH。特别是在推文参与度优化任务中,LCO在GPT-4上使毒性增长率(TGR)降低了39%;在策略优化基准上,ICRH发生率降低了15.23%,在不牺牲任务性能的情况下提升了安全性。

英文摘要

Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model's own over-optimization. To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and an evolutionary sampling module, which employs LLM-based crossover and mutation to constrain the model's actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.

2605.27374 2026-05-28 cs.CL 版本更新

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

ICG: 通过基于MLLM的提示和个性化偏好对齐改进封面图像生成

Zhipeng Bian, Jieming Zhu, Qijiong Liu, Wang Lin, Guohao Cai, Zhaocheng Du, Jiacheng Sun, Zhou Zhao, Zhenhua Dong

Affiliations * Huazhong University of Science and Technology; Huawei Noah's Ark Lab; Hong Kong Polytechnic University; Zhejiang University

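AI summary Proposes the ICG framework, which uses multimodal large language models and diffusion models to generate high-quality, personalized cover images through semantic feature extraction with meta tokens, personalized alignment with user embeddings, and a multi-reward learning strategy.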

Comments Published in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12268-12278, EMNLP 2025. Official version: https://doi.org/10.18653/v1/2025.emnlp-main.617

详情
Journal ref
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (Main Track) EMNLP 2025 12268-12278
AI中文摘要

多模态大语言模型和扩散模型的最新进展为AI生成内容开辟了新的可能性。然而,个性化封面图像生成仍未被充分探索,尽管它在提升数字平台用户参与度方面起着关键作用。我们提出ICG,一个新颖的框架,将基于MLLM的提示与个性化偏好对齐相结合,以生成高质量、上下文相关的封面。ICG通过元标记从项目标题和参考图像中提取语义特征,使用用户嵌入进行细化,并将得到的个性化上下文注入扩散模型。为了解决缺乏标注监督的问题,我们采用了一种多奖励学习策略,该策略结合了公共美学和相关性奖励以及从用户行为训练的个性化偏好模型。与依赖手工提示和不连贯模块的先前流程不同,ICG采用适配器桥接MLLM和扩散模型进行端到端训练。实验表明,ICG显著提高了图像质量、语义保真度和个性化,从而在下游任务中增强了用户吸引力和离线推荐准确性。作为桥接MLLM和扩散模型的即插即用适配器,ICG兼容常见检查点,且在优化过程中不需要真实标签。

英文摘要

Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.

2605.27373 2026-05-28 cs.AI cs.CL cs.CY 版本更新

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

识别和理解文本中的人类价值观:一种可定制的基于LLM的架构

Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

Affiliations * Universidad Politécnica de Madrid; CETINIA, Universidad Rey Juan Carlos

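AI summary Proposes a tailorable LLM-based architecture that detects the intensity of human values in text through three modules (specification generation, text labeling, and intensity assessment), avoiding dependence on a specific value theory or complex prompt engineering; experiments show good detection performance.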

Comments 8 pages, 1 figure. Published in Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 5

详情
Journal ref
Proc. ICAART 2026, Vol. 5, SciTePress, 2026, pp. 4096-4103
AI中文摘要

随着智能系统变得更加自主,科学界专注于创建包含伦理和道德考量的决策机制,这与传统的效用最大化模型不同。为此,一个关键方面是评估这些决策与人类价值观的契合程度。基于此,一个有前景的研究方向是开发基于大型语言模型(LLM)的方法,从文本中识别显性或隐性的人类价值观,从而实现全程识别。本文介绍了一种基于LLM的架构,用于检测和量化文本中人类价值观的强度,避免了以往方法受限于特定价值理论或复杂提示工程的缺陷。该架构包含三个协调模块:一个从任何理论框架的基础文本中生成结构化价值规范;一个使用这些规范对文本进行标注;另一个基于修辞和语义证据分配分级支持或抵抗。这种模块化方法将概念化任务与检测人类价值观的任务分离,创建了一个可扩展且可重复的过程,由适应多种理论的价值规范驱动。该架构使用多个LLM实例化,并使用ValueEval数据集进行评估。实验表明具有良好的检测性能,证实了管道的通用性。

英文摘要

As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the task of conceptualising human values from that of detecting them, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

2605.26959 2026-05-28 cs.LO cs.CL 版本更新

MerLean-Prover: A Recursive Looping Harness for Lean 4 Theorem Proving

MerLean-Prover:用于 Lean 4 定理证明的递归循环框架

Jinzheng Li, Zeru Zhu, Yuanjie Ren

Affiliations * Northeastern University; Stony Brook University; Massachusetts Institute of Technology

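AI summary Proposes MerLean-Prover, an end-to-end Lean4 theorem prover built on a recursive looping harness in which planning, checking, and proving agents collaborate; without fine-tuning or custom reinforcement learning, it surpasses existing open-source baselines on FormalQualBench and Putnam2025.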

详情
AI中文摘要

MerLean-Prover 是一个端到端的 Lean4 定理证明器,它用内核可检查的证明替换了 sorry 声明。它由三种智能体类型(规划、检查和证明)构建,通过一个递归外层循环组合,其修订单位是证明计划本身,并且不使用微调、自定义强化学习目标或特定定理的脚手架。在 FormalQualBench(一个包含 23 道博士资格考试定理的基准测试)上,MerLean-Prover 解决了 10/23,超过了最强的开源基线(OpenGauss,8/23)。在 Putnam2025 上,相同的框架以显著低于下一个最佳系统的总挂钟时间完成了 12/12。该框架也适用于较小的模型:Sonnet 解决了所有四个测试的 FormalQualBench 问题,Haiku 解决了两个简短的问题。这些结果表明,框架设计是端到端 Lean4 定理证明的核心因素,与原始模型能力并列,并且一个相对简单的框架已经可以很有效。

英文摘要

MerLean-Prover is an end-to-end Lean4 theorem prover that replaces sorry declarations with kernel-checkable proofs. It is built from three agent types (Planning, Check, and Lean) composed by a recursive outer loop whose unit of revision is the proof plan itself, and uses no fine-tuning, no custom RL objective, and no theorem-specific scaffolding. On FormalQualBench, a benchmark of 23 PhD-qualifying-exam theorems, MerLean-Prover solves 10/23, surpassing the strongest published open-source baseline (OpenGauss, 8/23). On Putnam2025, the same harness closes 12/12 with substantially lower total wall-clock time than the next-best system that closes the full set. The harness also transfers to smaller models: Sonnet closes all four tested FormalQualBench problems, and Haiku closes the two short ones. These results suggest that harness design is a central factor in end-to-end Lean4 theorem proving, alongside raw model capability, and that a relatively simple harness can already be effective.
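To make "replaces sorry declarations with kernel-checkable proofs" concrete, here is a toy Lean 4 illustration. It is not taken from FormalQualBench or from the MerLean-Prover output; the theorem names are made up for this example, and the statement is deliberately trivial.

```lean
-- Before: the statement elaborates, but `sorry` is only a placeholder
-- and Lean emits a warning that the declaration uses it.
theorem add_zero_stub (n : Nat) : n + 0 = n := by
  sorry

-- After: the same statement closed by a term the kernel can check,
-- here the core library lemma `Nat.add_zero`.
theorem add_zero_closed (n : Nat) : n + 0 = n :=
  Nat.add_zero n
```

A prover's task, in this framing, is to turn declarations of the first kind into declarations of the second kind without changing the statement.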

2605.26730 2026-05-28 cs.CL 版本更新

PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

PRISM:评估LLM同行评审员的多维基准

Ngoc Phan Phuoc Loc, Toan Huynh La Viet, Thanh Tran Khanh, Duy A Nguyen, Tuan Anh Nguyen Pham, Thanh Nguyen, Nitesh V. Chawla, Wray Buntine, Kok-Seng Wong, Khoa D. Doan, Binh T. Nguyen

发表机构 * VinUniversity(Vin大学) University of Illinois, Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Notre Dame(圣母大学) Monash University(莫纳什大学)

AI总结 提出PRISM基准,从分析深度、新颖性评估、缺陷识别与主要问题排序、多维建设性四个维度评估LLM评审质量,发现LLM在单维度上可媲美甚至超越人类,但无系统在所有维度上一致平衡,表明LLM评审员更适合作为人类评审的针对性补充。

详情
AI中文摘要

机器学习会议投稿量的快速增长给科学同行评审系统带来了压力,并加剧了对基于LLM的自动评审系统的兴趣。然而,这些系统实际上有多好,特别是在捕捉科学漏洞方面与人类评审员相比如何,仍然知之甚少。在这项工作中,我们引入了PRISM(通过结构化多维评估的同行评审智能),这是一个评估框架,从四个维度评估评审质量:分析深度、新颖性评估、缺陷识别与主要问题排序、以及多维建设性。与大多数基于表面指标(如ROUGE和BLEU)或未约束的LLM-as-a-judge提示(将流畅性与严谨性混为一谈)的现有评估不同,PRISM将每个维度建立在论点挖掘、检索增强验证和基于共识的评分之上。我们应用PRISM对来自ICLR、ICML和NeurIPS的分层评审语料库中的五个领先自动化评审系统和人类评审员进行基准测试。结果显示,LLM在单个维度上可以匹配或超越人类评审员:可比较的分析深度、更强的新颖性验证以及高度准确的批评优先级排序。然而,没有一个系统能在所有维度上同时匹配人类基线的平衡表现。每个系统都表现出独特的专业化特征,带有典型的盲点——聚合指标完全遗漏的失败模式。这意味着LLM评审员最好被理解为人类评审的针对性补充,在特定维度上有效,但作为独立替代品不可靠。我们的演示和关键结果可在https://khanhthanhdev.github.io/prism-page/找到。

英文摘要

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems actually are, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations based on surface-level metrics like ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each exhibits a distinct specialization profile with characteristic blind spots -- failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review, effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.

2605.26186 2026-05-28 cs.SE cs.AI cs.CL cs.LG 版本更新

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

SetupX:LLM代理能否从过去的功能正确代码仓库设置失败中学习?

Zihang Zhou, Ziqian Ren, Yukai Wu, Yingjie Xiong, Wei Zhou, Chao Peng, Dong Zhang, Bingheng Yan, Xuanhe Zhou, Fan Wu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Beijing University of Posts and Telecommunications(北京邮电大学) Independent Researcher(独立研究者) Jinan Inspur Data Technology Co., Ltd.(济南浪潮数据技术有限公司)

AI总结 提出SetupX框架,通过经验学习、推测执行和验证协议解决代码仓库设置中的跨仓库经验迁移、多步试错和鲁棒验证问题,在基准测试中达到92%通过率。

Comments 21 pages, 6 figures

详情
AI中文摘要

功能正确的仓库设置旨在配置执行环境(例如,依赖项、构建脚本)以成功执行仓库的文档化功能。由于多样化的、特定于仓库的失败(包括依赖项不兼容、缺少工具链、不完整的安装和验证策略不匹配),这带来了重大挑战。现有的LLM代理难以稳健地解决这些问题,具体来说,它们无法支持(1)跨仓库经验迁移,(2)在不可逆状态变化下的多步试错修复,以及(3)对设置结果的鲁棒验证,以区分设置引起的失败和仓库错误。为了解决这些问题,我们引入了SetupX,一个基于经验学习的设置框架。首先,我们构建了自进化经验表示(XPU),一种双模态知识单元,编码设置信号、文本指导和可执行动作,以动态地将已验证的环境修复迁移到未见过的仓库。其次,我们采用了由LIFO Docker快照栈支持的经验增强推测执行,使代理能够主动尝试修复并安全回滚到已知的良好状态。第三,我们引入了检察官-法官验证协议,将证据收集与最终判断分离,从而实现超越表面构建时度量的更可靠的设置验证。在精心设计的基准测试上的评估结果表明,SetupX达到了最高性能(例如,92%的通过率),并且比最强基线高出19%以上。关键的是,SetupX在需要跨不同容器协调多个互连服务的复杂多仓库设置中表现出色。代码仓库可在https://github.com/OpenDataBox/SetupX获取。

英文摘要

Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, and executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully crafted benchmarks show that SetupX achieves the highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setups that require coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
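The idea of trialling fixes on top of a LIFO snapshot stack and rolling back to a known-good state can be sketched as follows. This is an in-memory stand-in, not SetupX's implementation: snapshots are plain dictionaries instead of Docker images, and `SnapshotStack`, `speculative_setup`, and `install` are hypothetical names introduced here.

```python
from copy import deepcopy


class SnapshotStack:
    """LIFO stack of environment snapshots (in-memory stand-in for container snapshots)."""

    def __init__(self, initial_state):
        self._stack = [deepcopy(initial_state)]

    def current(self):
        # Return a copy so callers cannot mutate a stored snapshot.
        return deepcopy(self._stack[-1])

    def push(self, state):
        self._stack.append(deepcopy(state))

    def rollback(self):
        # Drop the most recent snapshot, but never the initial one.
        if len(self._stack) > 1:
            self._stack.pop()
        return self.current()


def speculative_setup(stack, fixes, verify):
    """Apply each fix to a copy of the latest good state; keep it only if verify() passes."""
    applied = []
    for name, fix in fixes:
        candidate = fix(stack.current())
        stack.push(candidate)
        if verify(candidate):
            applied.append(name)
        else:
            stack.rollback()  # discard the failed trial
    return applied, stack.current()


def install(pkg, breaks=False):
    """Hypothetical fix: record a package and optionally mark the environment broken."""
    def fix(state):
        state["packages"].append(pkg)
        state["broken"] = state["broken"] or breaks
        return state
    return fix


fixes = [
    ("numpy", install("numpy")),
    ("bad-toolchain", install("bad-toolchain", breaks=True)),
    ("pytest", install("pytest")),
]
stack = SnapshotStack({"packages": [], "broken": False})
applied, final = speculative_setup(stack, fixes, lambda s: not s["broken"])
print(applied, final)
```

The failed `bad-toolchain` trial is popped from the stack, so the later `pytest` fix starts from the last state that passed verification.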

2605.26114 2026-05-28 cs.AI cs.CL 版本更新

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

MobileGym: 一个可验证且高度并行的移动GUI智能体研究仿真平台

Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, Zhaoxiang Zhang

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Peking University(北京大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出MobileGym,一个基于浏览器的轻量级、完全可控的移动环境,通过结构化JSON状态实现可验证结果信号和低成本并行强化学习,并附带包含416个参数化任务模板的基准测试集。

Comments Project page: https://mobilegym.github.io

详情
AI中文摘要

我们提出MobileGym,一个托管于浏览器、轻量级、完全可控的日常移动使用环境,旨在实现交互保真度而无需复制专有后端。它实现了之前日常应用无法实现的两种能力:通过基于确定性状态判断的结构化JSON状态实现可验证的结果信号,以及通过低成本的并行回滚实现可扩展的在线强化学习。完整的环境状态被捕获、配置、分支和比较为结构化JSON,单个服务器可托管数百个并行实例,每个实例约400 MB内存,冷启动约3秒。分层状态模型和声明式任务定义框架使状态可编程性和任务创建在大规模下实用,单一的程序化判断机制同时提供确定性评估结果和密集的强化学习奖励。配套的MobileGym-Bench提供了416个参数化任务模板,包括256个测试模板和160个训练模板,覆盖28个应用,具有确定性判断器和结构化的AnswerSheet协议,避免了自由文本匹配失败。在Sim-to-Real案例研究中,Qwen3-VL-4B-Instruct上的GRPO在256任务测试集上获得了+12.8个百分点的提升,在59任务真实设备信号子集上,真实设备执行保留了模拟侧训练增益的95.1%。项目页面:https://mobilegym.github.io。

英文摘要

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
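Deterministic state-based judging over structured JSON can be illustrated with a small sketch. The state layout, the dotted-path convention, and the reward definition (fraction of expected fields satisfied) are assumptions made here for illustration; the abstract does not specify MobileGym's actual schema or reward shaping.

```python
import json

_MISSING = object()  # sentinel so that a missing key never equals an expected value


def lookup(obj, path):
    """Follow a dotted path such as 'alarm.hour' through nested dictionaries."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return _MISSING
        obj = obj[key]
    return obj


def judge(state, expected):
    """Return (passed, reward): a strict verdict plus the fraction of fields matched."""
    hits = sum(1 for path, want in expected.items() if lookup(state, path) == want)
    return hits == len(expected), hits / len(expected)


# Hypothetical task: "set an enabled alarm for 07:30".
expected = {"alarm.hour": 7, "alarm.minute": 30, "alarm.enabled": True}

partial = json.loads('{"alarm": {"hour": 7, "minute": 30, "enabled": false}}')
done = json.loads('{"alarm": {"hour": 7, "minute": 30, "enabled": true}}')

partial_verdict = judge(partial, expected)
done_verdict = judge(done, expected)
print(partial_verdict, done_verdict)
```

Because the verdict depends only on the final JSON state, the same function can serve as an evaluation check and as a graded reward signal.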

2605.25183 2026-05-28 cs.CL cs.AI 版本更新

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

知识图谱驱动的神经科学专家级推理

Jake Stephen, Niraj K. Jha

发表机构 * Department of Electrical and Computer Engineering, Princeton University(普林斯顿大学电气与计算机工程系)

AI总结 本文通过从单一教科书构建知识图谱并生成问答监督,微调语言模型,实现超越大语言模型的专家级神经科学推理。

详情
AI中文摘要

知识图谱(KG)是一种可以从文本语料库中提取并用于深度推理的抽象结构。先前的工作利用KG微调语言模型(LM),实现了特定领域的超智能。在这项工作中,我们探索仅使用单一权威教科书中的信息,KG驱动的深度推理能力是否能在神经科学中出现。核心假设是,结构化知识在被提炼为高质量KG并转换为基于KG的问答(QA)监督后,足以通过微调LM产生专家级推理,该LM在准确率上超越大型语言模型(LLM),同时参数数量少几个数量级。我们通过双LLM验证流水线构建教科书衍生的KG,使用在KG拓扑上训练的掩码LM扩展它,生成多跳QA项目(包括QA对和推理轨迹),以仅基于KG的监督微调LM,并应用强化学习,使用路径衍生的KG信号作为隐式奖励模型。我们的结果表明,深度、机制性的神经科学理解可以在模型中诱导,而无需依赖大型、异构的网络规模语料库。基于KG的神经科学合成课程(读者可以自我测试)以及微调后的LM可在以下GitHub位置获取:https://kg-bottom-up-superintelligence.github.io/neuro-bench。

英文摘要

A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KGs to fine-tune language models (LMs), enabling domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning through a fine-tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook-derived KG via a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items, which include QA pairs and reasoning traces, to fine-tune an LM exclusively on KG-derived supervision, and apply reinforcement learning using path-derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum that readers can quiz themselves on, and the fine-tuned LM, are available at the following GitHub location: https://kg-bottom-up-superintelligence.github.io/neuro-bench.

2605.23908 2026-05-28 cs.AI cs.CL cs.CV cs.NE 版本更新

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

寻找开放性的要素:用大型视觉语言模型复现 Picbreeder

Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi

发表机构 * New York University(纽约大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本研究通过用前沿视觉语言模型替代人类用户复现 Picbreeder,探索人工智能在无引导发现中的开放性能力,并分析系统输出与人类基线在系统发育复杂性、视觉和语义显著性及新颖性上的差异,同时研究探索性噪声、行为多样性和叙事动量等因素的影响。

Comments 26 pages, 21 figures, to be published at GECCO 2026

详情
AI中文摘要

我们正处于大规模工业和学术努力之中,旨在通过AI驱动的助手自动化科学、技术和创造性生产的过程。历史上,这些过程在人类形式中的一个基本属性是它们的开放性:即生成看似无穷无尽的新颖且有意义的新形式的能力。人工代理是否有能力进行这种富有成果的无引导发现?为了回答这个问题,我们转向Picbreeder,这是人类驱动的开放性搜索的典型范例,用户通过小型神经网络的交互式进化协作生成多样化的图像库。我们复现了Picbreeder,用前沿视觉语言模型(VLM)替代人类用户。我们观察到系统输出与历史人类基线之间存在明显的定性差异,并尝试使用系统发育复杂性、视觉和语义显著性及新颖性的指标来表征这些差异。为了识别导致这些差异的一些因果因素,我们研究了在代理的选择过程中添加探索性噪声、代理之间的行为多样性以及以过去行动记忆形式的叙事动量。我们的代码可在 https://github.com/smearle/picbreeder-vlm 获取。

英文摘要

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

2605.22705 2026-05-28 cs.CL 版本更新

Tokenization with Split Trees

基于分裂树的Tokenization(ToaST)

Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

AI总结 提出ToaST方法,通过递归推理过程直接优化压缩,使用整数规划选择词汇表,在英文文本上相比BPE、WordPiece和UnigramLM减少超过11%的token数量,并提升Renyi效率和语言模型性能。

Comments All baseline tokenizers (BPE, WordPiece, Unigram) were trained incorrectly due to a bug in the Hugging Face tokenizers library: pair counts overflow i32 above ~108 GB of training data, dropping the most common merge pairs. All comparisons to ToaST are invalid. Thanks to Sander Land for identifying the missing merge pairs. See https://github.com/huggingface/tokenizers/issues/2058

详情
AI中文摘要

我们引入了基于分裂树的Tokenization(ToaST),一种子词分词方法,通过新的递归推理过程直接优化压缩。ToaST使用预计算的字节n-gram计数,独立于任何词汇表,贪婪地将每个预分词分裂成完全二叉树。给定词汇表后,推理递归地遍历每个分裂树,并在每条路径上发出第一个在词汇表中的节点。词汇表选择被形式化为一个整数规划(IP),在此推理过程下最小化所有分裂树的总token数。线性规划(LP)松弛在实践中接近整数解,产生可证明接近最优的词汇表,训练时间经验上与分裂树数量呈二次方关系。在英文文本上,当词汇表大小为40,960及以上时,与BPE、WordPiece和UnigramLM相比,ToaST将token数量减少了超过11%,从而减少了使用该分词器的模型的推理token数,因此扩展了有效上下文长度。ToaST还比这些基线方法更少使用常见的单字节token,导致Renyi效率显著提高。在训练15亿参数语言模型的实验中,ToaST获得了最高的CORE分数,比基线方法高出2.6%至7.6%,其中两个基线具有显著性,并在22个单独任务中的13个上取得了最佳成绩。

英文摘要

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Renyi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%--7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.
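The recursive inference step described above (descend the split tree and emit the first in-vocabulary node on each path) can be sketched as follows. The tree here is built by hand with arbitrary splits, since the abstract does not give the n-gram-based splitting rule; the `Node` class and the assumption that every single byte is in the vocabulary are introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a split tree: a byte string plus its two halves (None for leaves)."""
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tokenize(node, vocab):
    """Descend the split tree and emit the first in-vocabulary node on each path."""
    if node.text in vocab:
        return [node.text]
    if node.left is None or node.right is None:
        # Assumption: all single bytes are in the vocabulary, so leaves always match.
        raise ValueError(f"leaf {node.text!r} is not in the vocabulary")
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Hand-built split tree for the pretoken b"unhappy" (illustrative splits only).
tree = Node(
    b"unhappy",
    Node(b"un", Node(b"u"), Node(b"n")),
    Node(
        b"happy",
        Node(b"hap", Node(b"h"), Node(b"ap", Node(b"a"), Node(b"p"))),
        Node(b"py", Node(b"p"), Node(b"y")),
    ),
)

single_bytes = {bytes([b]) for b in b"unhapy"}
small_vocab = single_bytes | {b"un", b"py"}
large_vocab = small_vocab | {b"happy"}

tokens_small = tokenize(tree, small_vocab)
tokens_large = tokenize(tree, large_vocab)
print(tokens_small, tokens_large)
```

Adding `b"happy"` to the vocabulary shortens the output from five tokens to two, which is the quantity the vocabulary-selection integer program is described as minimizing across all split trees.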

2602.06025 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

学习面向运行时智能体记忆的查询感知预算层级路由

Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang

发表机构 * Nanyang Technological University(南洋理工大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Tsinghua University(清华大学) Sun Yat-sen University(中山大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出 BudgetMem 框架,通过强化学习训练的轻量级路由器实现查询感知的预算层级路由,以在运行时平衡任务性能与记忆构建成本。

Comments Accepted by ICML 2026. Code is available at https://github.com/ViktorAxelsen/BudgetMem

详情
AI中文摘要

记忆对于在单个上下文窗口之外运行的大型语言模型(LLM)智能体日益重要,然而大多数现有系统依赖于离线的、查询无关的记忆构建,这可能导致效率低下并丢弃查询关键信息。尽管运行时记忆利用是一种自然的替代方案,但先前的工作通常会产生大量开销,并且对性能-成本权衡的显式控制有限。在这项工作中,我们提出了 BudgetMem,一个用于显式、查询感知性能-成本控制的运行时智能体记忆框架。BudgetMem 将记忆处理结构化为一组记忆模块,每个模块提供三个预算层级(即 Low/Mid/High)。一个轻量级路由器在模块间执行预算层级路由,以平衡任务性能和记忆构建成本,该路由器实现为通过强化学习训练的紧凑神经策略。使用 BudgetMem 作为统一测试平台,我们研究了实现预算层级的三种互补策略:实现(方法复杂度)、推理(推理行为)和容量(模块模型大小)。在 LoCoMo、LongMemEval 和 HotpotQA 上,当优先考虑性能时(即高预算设置),BudgetMem 超越了强基线,并在更紧的预算下提供了更好的精度-成本边界。此外,我们的分析揭示了不同层级策略的优势和劣势,阐明了在不同预算制度下每个轴何时提供最有利的权衡。

英文摘要

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

2605.19514 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management

立场:自回归Transformer的图灵完备性高度依赖于上下文管理

Guanyu Cui, Zhewei Wei, Kun He

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China(中国人民大学高瓴人工智能学院,中国北京) DEKE Lab, Renmin University of China, Beijing, China(中国人民大学DEKE实验室,中国北京)

AI总结 本文通过区分固定系统和缩放族两种设置,论证了上下文管理方法对自回归Transformer计算能力的决定性影响,并指出缩放族设置下的图灵完备性证明不适用于实际部署的固定系统。

Comments Accepted to the ICML 2026 Position Paper Track

详情
AI中文摘要

许多工作提出了引人注目的主张,即Transformer是图灵完备的。然而,文献常常混淆两种不同的设置:(i)固定系统设置,其中固定的自回归Transformer与固定的上下文管理方法耦合,逐步处理不同长度的输入;(ii)缩放族设置,其中使用一系列不同模型(具有增加的上下文窗口长度或数值精度)来处理不同的输入长度。现有的Transformer图灵完备性证明通常是在设置(ii)中建立的,而现实世界中的LLM部署以及图灵完备性的标准概念更自然地对应于设置(i)。在本文中,我们首先形式化固定系统设置,从而具体描述现实世界LLM的运行方式。然后,我们认为在缩放族设置中证明的结果提供了理论上有意义的资源界限,但并未建立图灵完备性,从而澄清了对现有结果的常见误解。最后,我们展示了不同的上下文管理方法可以产生截然不同的计算能力,并主张上下文管理是决定现实世界自回归Transformer计算能力的关键组成部分。

英文摘要

Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, thereby clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.

2605.18113 2026-05-28 cs.CL 版本更新

iPOE: Interpretable Prompt Optimization via Explanations

iPOE: 基于解释的可解释提示优化

Jiahui Li, Yarik Menchaca Resendiz, Sean Papay, Roman Klinger

发表机构 * Fundamentals of Natural Language Processing, University of Bamberg, Germany(自然语言处理基础,班贝格大学,德国) Leibniz-Institut für Psychologie (ZPID), Trier, Germany(莱布尼茨心理学研究所(ZPID),特里尔,德国)

AI总结 提出iPOE方法,通过自动从解释中生成指南并优化,实现可解释的提示优化,在四个数据集上性能提升高达39%,且人类与LLM对指南贡献的判断一致性达Cohen's kappa 0.65。

详情
AI中文摘要

提示优化通常被构建为一个离散搜索问题,旨在为LLM找到高性能且鲁棒的指令。然而,搜索结果可能无法透明地显示为什么以及在哪里特定的提示更改带来了性能提升。这与人类接受注释任务指导的方式形成对比。在人类任务中,研究人员精心设计注释指南,从而提高注释一致性。本文旨在结合这两种方法,并引入iPOE,一种通过解释进行可解释提示优化的新策略。我们通过自动从注释决策的解释(自动生成或来自人类)中创建指南来指导提示优化过程。此外,通过一系列操作(包括删除、添加、打乱和合并)来优化这组指南。最终的提示包含指导注释的指南,使LLM的决策过程和优化过程透明化。因此,它也为提示优化领域的非专业人士提供支持,特别是在需要专业知识的挑战性领域。在四个数据集上的实验中,我们发现iPOE相比评估基线最高可提升39%,并且LLM的解释可以替代所提出方法中的人类解释。此外,我们的可解释性验证研究表明,人类和LLM在哪些指南有助于其注释方面可以基本达成一致,Cohen's kappa得分高达0.65。

英文摘要

Prompt optimization has often been framed as a discrete search problem to find high-performing and robust instructions for an LLM. However, the search result might not make it transparent why and where specific prompt changes lead to performance gains. This is in contrast to how humans are instructed for annotation tasks. Here, researchers carefully design annotation guidelines, leading to enhanced annotation consistency. Our paper aims at joining these two approaches and introduces iPOE, a novel interpretable prompt optimization strategy via explanations. We guide the prompt optimization process by automatically created guidelines from explanations of annotation decisions (either automatically generated or from humans). This set of guidelines is furthermore optimized by a series of operations, including removing, adding, shuffling, and merging. The resulting prompt includes guidelines that instruct the annotation, making the decision process of the LLM and the optimization transparent. It therefore also supports laypeople in the area of prompt optimization, particularly in challenging domains requiring expertise. In our experiments on four datasets, we find that iPOE can improve over the evaluated baselines by up to 39% and LLM explanations can replace human explanations in the proposed method. Moreover, our interpretability validation study demonstrates that humans and LLMs can substantially agree on which guidelines contribute to their annotations, achieving a Cohen's kappa score of up to 0.65.
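For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two raters; the label sequences are made-up illustrative data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1 - p_e)


# Made-up example: 1 = "guideline contributed", 0 = "did not contribute".
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
llm = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(human, llm)
print(round(kappa, 2))
```

Here the raters agree on 8 of 10 items (p_o = 0.8) while chance agreement is p_e = 0.5, giving a kappa of about 0.6.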

2605.17448 2026-05-28 cs.GR cs.CL 版本更新

Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback

基于有限元分析反馈的自改进CAD生成智能体

Guijin Son, Jehyun Park, Seyeon Park, Sunghee Ahn, Youngjae Yu

发表机构 * Seoul National University(首尔国立大学) OneLineAI Sungkyunkwan University(成均馆大学) Ewha Womans University(梨花女子大学)

AI总结 提出一种以有限元分析为反馈的CAD生成框架,通过蓝图和渲染图监督信号提升多部件装配质量,使生成结果满足工程需求。

Comments Work in progress

详情
AI中文摘要

计算机辅助设计(CAD)是现代工业设计的基石,然而现有的CAD生成器仍无法满足实际工程流程:它们既不像工程师那样迭代,也不评估工程所需。先前的工作将CAD生成视为两个独立的步骤——零件合成和装配,前者通过接近参考标准来评分,而后者(如果处理的话)被简化为一个单独的约束求解步骤。在这项工作中,我们引入了一种更贴近工业的任务形式,要求模型根据自由形式的工程简报生成完全装配的多部件STEP文件,然后通过有限元分析(FEA)进行验证。FEA验证显示,Codex (GPT-5.5) 和 Claude Code (Opus-4.7) 智能体在主要的首次尝试扫描中没有产生任何严格通过的工件,最佳配置平均仅满足约20%的类型化要求。此外,我们引入了两种额外的监督信号:一种新颖的纯文本蓝图模式和一种21视角图像渲染器,以辅助智能体的视觉检查,使生成循环更符合工程师实际迭代的方式。在S2O和Fusion360上,相同的反馈工具改善了几何重建,GPT-5.5/xhigh在S2O上的Box-IoU从0.444提升到0.592,在Fusion360上从0.397提升到0.505。这些信号共同将CAD程序推向不仅视觉上合理,而且经过物理和结构要求检查的工件。

英文摘要

Computer-aided design (CAD) is the backbone of modern industrial design, yet learned CAD generators still fall short of real engineering pipelines: they neither iterate like engineers nor evaluate what engineering requires. Prior work has treated CAD generation as two disjoint steps, part synthesis and assembly, where the former is graded by proximity to a gold reference and the latter, when handled at all, is reduced to a separate constraint solving step. In this work, we introduce a more industry-native task formulation that requires a model to produce a fully assembled multi-part STEP file from a free-form engineering brief, which is then validated via finite element analysis (FEA). FEA validation reveals that Codex (GPT-5.5) and Claude Code (Opus-4.7) agents do not produce a single strict-passing artifact in the main first-attempt sweep, with the best configuration meeting only about 20% of typed requirements on average. Moreover, we introduce two additional supervision signals, a novel text-only blueprint schema and a 21-view image renderer that aids the agent's visual inspection, that better align the generation loop with how engineers iterate in practice. On S2O and Fusion360, the same feedback tools improve geometric reconstruction, with GPT-5.5/xhigh rising from 0.444 to 0.592 Box-IoU on S2O and from 0.397 to 0.505 on Fusion360. Together these signals move CAD programs toward artifacts that are not only visually plausible but also checked against physical and structural requirements.
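The Box-IoU figures above are overlap ratios between bounding boxes. The abstract does not define the metric precisely, so the sketch below assumes the common reading: intersection-over-union of two axis-aligned 3D boxes, each given as a (min corner, max corner) pair. The function name and box values are illustrative.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as ((x0, y0, z0), (x1, y1, z1))."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b

    def volume(lo, hi):
        result = 1.0
        for low, high in zip(lo, hi):
            result *= max(high - low, 0.0)
        return result

    # Intersection volume: product of per-axis overlaps, zero if any axis is disjoint.
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0
        intersection *= overlap

    union = volume(a_min, a_max) + volume(b_min, b_max) - intersection
    return intersection / union


reference = ((0, 0, 0), (2, 2, 2))   # volume 8
generated = ((1, 0, 0), (3, 2, 2))   # same size, shifted by 1 along x
iou = box_iou(reference, generated)
print(iou)
```

In this example the boxes share a volume of 4 out of a union of 12, so the IoU is one third.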

2605.15864 2026-05-28 cs.CV cs.CL 版本更新

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

VLMs 是在看还是只是在说?揭示视觉重新检查的幻觉

Chufan Shi, Cheng Yang, Yaokang Wu, Linghao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma

发表机构 * University of Southern California(南加州大学) University of California San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 通过图像交换探测框架 VisualSwap 和 800 对图像基准 VS-Bench,发现视觉语言模型在推理时声称的“重新检查图像”多为文本模式,而非真正的视觉重新检查,且思考模型更易受影响,用户指令可恢复视觉基础但自我反思无效。

Comments ICML 2026 Oral

详情
AI中文摘要

视觉语言模型(VLM)在推理过程中经常产生自我反思的语句,如“让我再检查一下图片”。这样的语句是否触发了真正的视觉重新检查,还是仅仅是习得的文本模式?我们通过 VisualSwap(一种图像交换探测框架)对此进行研究:在模型对一张图像进行推理后,我们将其替换为视觉上相似但语义不同的图像,并测试模型是否注意到这一变化。我们引入了 VS-Bench,包含从 MathVista、MathVerse、MathVision 和 MMMU-Pro 中精选的 800 对图像。在 Qwen3-VL、Kimi-VL 和 ERNIE-VL 上的实验揭示了一个惊人的失败:模型绝大多数情况下忽略了图像交换,准确率下降高达 60%。与直觉相反,思考模型比其指令对应模型脆弱近 3 倍,且扩展规模无法缓解。多轮用户指令可以恢复视觉基础,但连续生成过程中自我生成的反思语句则不能。注意力分析解释了原因:用户指令显著提高了对视觉标记的注意力,而自我反思则没有。当前的 VLM 在声称执行视觉重新检查时倾向于“说”而非真正“看”。我们的代码和数据集可在项目页面获取:https://visualswap.github.io

英文摘要

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io

2601.16312 2026-05-28 cs.CL cs.AI 版本更新

Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

教授和评估LLMs推理聚合物设计相关任务

Dikshya Mohanty, Mohammad Saqib Hasan, Syed Mostofa Monsur, Size Zheng, Benjamin Hsiao, Niranjan Balasubramanian

发表机构 * Stony Brook University(石溪大学)

AI总结 本文提出PolyBench基准数据集和知识增强推理蒸馏方法,使中小型语言模型在聚合物设计任务上性能接近前沿闭源LLM。

详情
AI中文摘要

AI4Science研究在许多科学应用中显示出前景,包括聚合物设计。然而,当前的LLMs在此问题空间中效果不佳,因为:(i)大多数模型缺乏聚合物特定知识,(ii)现有对齐模型对聚合物设计相关知识和能力的覆盖有限。为解决此问题,我们引入了PolyBench,一个包含超过125K聚合物设计相关任务的大规模训练和测试基准数据集,利用从实验和合成数据源获得的超过1300万数据点的知识库,以确保聚合物及其属性的广泛覆盖。为了使用PolyBench进行有效对齐,我们引入了一种知识增强推理蒸馏方法,用结构化CoT增强该数据集。此外,PolyBench中的任务从简单到复杂的分析推理问题组织,使得能够进行泛化测试和问题空间中的诊断探测。实验表明,在PolyBench上训练的具有7B到32B参数的中小型语言模型(SLMs)在PolyBench测试数据集上优于类似大小的模型,并与闭源前沿LLMs保持竞争力,同时在外部聚合物基准上展示了性能提升。数据集和相关代码可在https://github.com/StonyBrookNLP/PolyBench获取。

英文摘要

Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models have limited coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design-related tasks, leveraging a knowledge base of more than 13 million data points obtained from experimental and synthetic data sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small- and mid-sized language models (SLMs) with 7B to 32B parameters, trained on PolyBench, outperform similar-sized models and remain competitive with closed-source frontier LLMs on PolyBench's test dataset, while demonstrating performance gains on external polymer benchmarks. The dataset and associated code are available at https://github.com/StonyBrookNLP/PolyBench.

2604.04295 2026-05-28 cs.CL 版本更新

Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Generation

面向可靠专利权利要求生成的适应性成本高效评估

Yongmin Yoo, Qiongkai Xu, Longbing Cao

发表机构 * Frontier AI Research Centre, Macquarie University School of Computing, FSE, Macquarie University(前沿人工智能研究中心,麦考瑞大学计算机学院,FSE,麦考瑞大学)

AI总结 提出两阶段框架ACE,利用专利错误类别结构进行不确定性感知路由,第一阶段编码器预测错误类型熵,超过阈值则交由第二阶段专家LLM执行模式约束的专利思维链协议,在降低78%成本的同时超越70B参数LLM基线。

详情
AI中文摘要

自动化专利权利要求验证要求低容错率。然而,现有方法面临僵化-资源困境:轻量级编码器无法追踪长程法律依赖,而穷举式LLM验证在百万权利要求规模下会产生4-5倍的开销。基于置信度的简单级联无法解决这一问题,因为二元有效性分数无法区分需要不同推理深度的结构上不同的错误类型。我们提出一个两阶段框架:适应性成本高效评估(ACE),它利用专利错误的类别结构进行不确定性感知路由。在第一阶段,微调后的编码器将权利要求投影到法律错误类型上的K+1分布,其预测熵作为路由信号。超过熵阈值的权利要求被升级到第二阶段,由专家LLM执行模式约束的专利思维链(CoPT)协议,将权利要求元素映射到35 U.S.C.标准,其模式约束将每个权利要求的延迟降低42%,同时产生法律依据充分的裁决。我们进一步提出了一个包含40,000个权利要求的数据集ACE-40k,带有MPEP注释,其中ACE超越了包括监督式70B参数LLM在内的竞争基线,同时将成本降低78%。在真实的USPTO驳回数据上,路由机制无需重新校准即可迁移,推理时间减少60%,同时保持竞争性的召回率。

英文摘要

Automated patent claim validation demands low error tolerance. However, existing approaches face a rigidity-resource dilemma: lightweight encoders cannot track long-range legal dependencies, while exhaustive LLM verification incurs 4-5X higher overhead at million-claim scale. A naive confidence-based cascade cannot resolve this because binary validity scores fail to distinguish structurally distinct error types that require different reasoning depths. We propose a two-stage framework, Adaptive Cost-efficient Evaluation (ACE), which exploits the categorical structure of patent errors for uncertainty-aware routing. In the first stage, a fine-tuned encoder projects claims into a K+1 distribution over legal error types, whose predictive entropy serves as the routing signal. Claims exceeding an entropy threshold are escalated to the second stage, where an expert LLM executes a schema-constrained Chain-of-Patent-Thought (CoPT) protocol to map claim elements against 35 U.S.C. standards; the schema constraint reduces per-claim latency by 42% while producing legally grounded verdicts. We further present a 40,000-claim dataset, ACE-40k, with MPEP-grounded annotations, on which ACE surpasses competitive baselines including a supervised 70B-parameter LLM while reducing costs by 78%. On real USPTO rejection data, the routing mechanism transfers without re-calibration, reducing inference time by 60% while maintaining competitive recall.
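The entropy-based routing rule can be sketched in a few lines. The sketch assumes that the K+1 classes are K error types plus one "valid" class, and the threshold value, class probabilities, and stage names are illustrative choices made here, not values from the paper.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route(probs, threshold):
    """Escalate to the stage-two LLM only when the encoder is uncertain."""
    if predictive_entropy(probs) > threshold:
        return "stage2_llm"
    return "stage1_encoder"


THRESHOLD = 0.8  # hypothetical value; in practice it would be tuned on held-out data

# K = 3 error types plus one "valid" class (assumed layout of the K+1 distribution).
confident = [0.94, 0.02, 0.02, 0.02]   # encoder is nearly sure: low entropy
uncertain = [0.40, 0.30, 0.20, 0.10]   # mass spread over classes: high entropy

decision_confident = route(confident, THRESHOLD)
decision_uncertain = route(uncertain, THRESHOLD)
print(decision_confident, decision_uncertain)
```

Unlike a binary confidence score, the entropy of the full distribution rises when probability mass is split between different error types, which is the case the abstract argues needs deeper reasoning.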

2605.12515 2026-05-28 cs.CL 版本更新

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

通过共识驱动的偏好优化缓解多语言大模型中的跨语言文化不一致性

Lucas Resck, Isabelle Augenstein, Anna Korhonen

Affiliations: Language Technology Lab, University of Cambridge; University of Copenhagen

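AI summary: Proposes C-3PO, a consensus-driven preference optimisation framework that mitigates the cross-lingual cultural inconsistency multilingual LLMs show when the user identity is explicit but the prompt language changes, significantly improving the consistency metric κ_S.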

Comments 24 pages, 13 figures, 11 tables

详情
AI中文摘要

尽管多语言大模型(MLLMs)能力令人印象深刻,但当提示语言改变时,它们经常表现出不一致的行为。虽然这种适应通常是可取的,但当用户身份被明确定义时,它就会成为一个关键失败。例如,给定一个固定的英国人角色和一个关于文学的模糊日常知识查询,提示语言经常覆盖系统角色——英语输出莎士比亚,西班牙语输出塞万提斯。为了稳健地量化这种跨语言文化不一致性,我们引入了Singleton Fleiss的κ_S,一个在数学上对幻觉具有鲁棒性的度量。为了缓解这一问题,我们提出了跨语言文化一致的偏好优化(C-3PO),一种共识驱动的对齐框架。C-3PO在κ_S上实现了比未对齐模型高达0.13个绝对点的提升,持续优于强提示和表示引导基线,同时保留了明确的用户身份、文化中立性和内在文化知识。实证评估表明,这种不一致性对印尼语和波斯语等低资源语言影响尤为严重。最后,中间层的早期解码揭示了MLLMs在正向传播表示稳定时,会隐式地将输出个性化到提示语言的刻板文化。

英文摘要

Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt's language changes. While such adaptation is generally desirable, it becomes a critical failure when a user's identity is explicitly defined. For instance, given a fixed British persona and an ambiguous everyday knowledge query about literature, the prompt's language frequently overwrites the system persona -- yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this Cross-lingual Cultural Inconsistency, we introduce Singleton Fleiss's $κ_S$, a metric mathematically resilient to hallucinations. For mitigation, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO achieves up to a 0.13-point absolute increase in $κ_S$ over unaligned models, consistently outperforming strong prompting and representation steering baselines whilst preserving explicit user identities, cultural neutrality and intrinsic cultural knowledge. Empirical evaluations demonstrate this inconsistency disproportionately affects lower-resource languages like Indonesian and Persian. Finally, early decoding of intermediate layers reveals that MLLMs implicitly personalise outputs towards the prompt language's stereotypical culture as forward-pass representations stabilise.
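
The abstract does not define the Singleton variant κ_S, so the sketch below shows only the standard Fleiss' kappa that the metric's name refers to. Treating each prompt language as a "rater" and each candidate answer as a category is an assumption made for illustration; the paper's actual construction may differ.

```python
def fleiss_kappa(counts):
    """Standard Fleiss' kappa.

    counts[i][j] is the number of raters that assigned item i to category j.
    Every item must have the same number of raters (at least 2).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Overall proportion of assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 2 queries, 3 prompt languages, 2 possible answers.
consistent = [[3, 0], [0, 3]]    # all languages give the same answer per query
inconsistent = [[2, 1], [1, 2]]  # one language disagrees on each query
```

A value of 1 means the answers agree perfectly across languages, while values at or below 0 mean agreement is no better than chance.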

2605.12015 2026-05-28 cs.CR cs.AI cs.CL cs.LG cs.MA 版本更新

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

SkillSafetyBench:在技能面攻击表面下评估智能体安全性

Chang Jin, An Wang, Zeming Wei, Kai Wang, Biaojie Zeng, Qiaosheng Zhang, Chao Yang, Jingjing Qu, Xia Hu, Xingcheng Xu

Affiliations: Shanghai AI Laboratory; Peking University; East China Normal University

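AI summary: Proposes the SkillSafetyBench benchmark, which uses 155 adversarial cases to evaluate the safety failure modes of LLM agents under non-user attacks delivered through skills, local artifacts, and execution-environment files.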

详情
AI中文摘要

可复用技能正成为扩展大语言模型智能体的常见接口,它将程序性指导与对文件、工具、内存和执行环境的访问打包在一起。然而,这种模块化引入了现有安全评估大多忽略的攻击面:即使用户请求是良性的,不安全的影响可能存在于技能指导、本地工件或执行环境文件中,这些会引导智能体采取不安全行为。我们提出了SkillSafetyBench,一个可运行的基准,用于评估此类技能中介的安全失败。SkillSafetyBench包含跨47个任务、6个风险领域和30个安全类别的155个对抗案例,每个案例都使用特定于案例的基于规则的验证器进行评估。使用多个CLI智能体和模型后端的实验表明,非用户攻击可以一致地诱导不安全行为,在不同领域、攻击方法和脚手架-模型配对中表现出不同的失败模式。我们的发现表明,智能体安全性不仅取决于模型级别的对齐,还取决于智能体如何解释技能、信任工作流上下文以及通过可执行环境采取行动。

英文摘要

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, unsafe influence may reside in skill guidance, local artifacts, or execution-environment files that steer the agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.

2605.10073 2026-05-28 cs.CL 版本更新

Heterogeneous Dependency Graph-Guided Attention for Patent Representation Learning

异构依赖图引导的专利表示学习注意力机制

Yongmin Yoo, Qiongkai Xu, Zhangkai Wu, Longbing Cao

Affiliations: Frontier AI Research Centre, Macquarie University; School of Computing, FSE, Macquarie University

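AI summary: To address the neglected dependency hierarchy among patent claims, proposes the Patent Heterogeneous Attention Graph Encoder (PHAGE). It builds a typed graph that separates legal citations from technical relations and introduces a connectivity mask with learnable biases to project claim-level topology into token-level attention. Combined with dual-granularity contrastive learning, it outperforms domain-adapted and citation-aware baselines on classification, retrieval, and clustering.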

详情
AI中文摘要

预训练语言模型通过将权利要求编码为扁平令牌序列来推进专利分类和检索,但忽略了权利要求之间的依赖层次。将层次结构融入自注意力面临两个挑战。首先,权利要求依赖涉及不同可靠性的关系类型:不加区分地对待它们会使有噪声的技术关系污染更清洁的法律引用信号。其次,当依赖图在权利要求级别定义时,Transformer模型会失败,因为它们在令牌级别操作;广播权利要求级别的邻接可能会稀释跨无关令牌对的结构信息。一种新颖的专利异构注意力图编码器(PHAGE)解决了这些挑战。为了处理异构依赖,PHAGE构建了一个类型图,将法律引用与技术关系区分为不同的边类型。为了弥合层次差距,PHAGE引入了一个带有可学习关系感知偏置的连通性掩码,将权利要求级别的拓扑投射到令牌级别的注意力中。PHAGE学习一个双粒度对比目标,以将表示与专利间分类法和专利内拓扑对齐。实验表明,PHAGE在专利分类、检索和聚类上优于领域自适应和引用感知基线。PHAGE揭示,专利内权利要求拓扑比专利间结构捕获了更强的归纳偏置。

英文摘要

Pre-trained language models advance patent classification and retrieval by encoding claims as flat token sequences, yet they overlook the dependency hierarchy among claims. Incorporating the hierarchy into self-attention poses two challenges. First, claim dependencies involve relation types with varying reliability: treating them indiscriminately allows noisy technical relations to corrupt cleaner legal citation signals. Second, when the dependency graph is defined over claims, Transformer models fail as they operate at the token level; broadcasting claim-level adjacency can dilute structural information across unrelated token pairs. A novel Patent Heterogeneous Attention Graph Encoder (PHAGE) addresses these challenges. To handle heterogeneous dependencies, PHAGE constructs a typed graph to separate legal citations from technical relations as distinct edge types. To bridge the hierarchy gap, PHAGE introduces a connectivity mask with learnable relation-aware biases to project a claim-level topology into token-level attention. PHAGE learns a dual-granularity contrastive objective to align representations with inter-patent taxonomy and intra-patent topology. Experiments show that PHAGE outperforms domain-adapted and citation-aware baselines on patent classification, retrieval, and clustering. PHAGE reveals that the intra-patent claim topology captures stronger inductive bias than the inter-patent structure.
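
One way to picture projecting a claim-level topology into token-level attention is to expand a typed claim graph into an additive attention bias. The sketch below is an assumption-laden illustration, not the paper's formulation: the function name, the edge dictionary layout, the rule that tokens within one claim always attend to each other, and the bias values are all hypothetical.

```python
def token_attention_bias(token_to_claim, claim_edges, relation_bias, masked=float("-inf")):
    """Expand a typed claim-level graph into a token-level additive attention bias.

    token_to_claim: claim index of each token, in sequence order
    claim_edges:    {(src_claim, dst_claim): relation_type}
    relation_bias:  {relation_type: float}; these would be learnable in a real model
    Token pairs whose claims are not connected keep the `masked` value.
    """
    n = len(token_to_claim)
    bias = [[masked] * n for _ in range(n)]
    for i, ci in enumerate(token_to_claim):
        for j, cj in enumerate(token_to_claim):
            if ci == cj:
                bias[i][j] = 0.0  # same claim: no extra bias, attention allowed
            elif (ci, cj) in claim_edges:
                bias[i][j] = relation_bias[claim_edges[(ci, cj)]]
    return bias


# Hypothetical patent: claim 0 has two tokens, claims 1 and 2 have one each.
tokens = [0, 0, 1, 2]
edges = {(1, 0): "legal", (2, 0): "technical"}  # claims 1 and 2 refer to claim 0
biases = {"legal": 1.0, "technical": 0.2}
bias = token_attention_bias(tokens, edges, biases)
```

The resulting matrix would be added to the attention logits before the softmax, so unconnected token pairs receive zero attention weight and the two relation types can be weighted differently.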

2605.09986 2026-05-28 stat.ML cs.CL cs.LG 版本更新

Federated Language Models Under Bandwidth Budgets: Distillation Rates and Conformal Coverage

带宽预算下的联邦语言模型:蒸馏率与共形覆盖

Prasanjit Dubey, Xiaoming Huo

Affiliations: H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

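AI summary: Studies statistical guarantees for language models distributed across bandwidth-limited nodes and analyzes two protocols, Federated Probe-Logit Distillation (FPLD) and Federated Conformal RAG (FC-RAG), giving a training-time KL-consistency rate for the former and an inference-time distribution-free marginal-coverage bound for the latter, and treating bandwidth as a first-class statistical parameter for the first time.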

详情
AI中文摘要

在临床网络、企业知识库和科学联盟中,经常出现数据分散在带宽受限节点上且无法集中的场景,需要训练语言模型。我们研究数据必须保持分布式在节点上的情况,并询问在明确带宽预算下原则上可以实现哪些统计保证;我们的目标是描述可证明的可能性,而不是展示一个可部署的系统。现有理论要么单独处理训练时的一致性,要么单独处理推理时的校准,且没有先前的工作将带宽作为一阶统计参数。我们分析了两种协议:用于训练的联邦探针-对数蒸馏(FPLD)和用于推理的联邦共形RAG(FC-RAG),作为我们结果的分析载体。我们的第一个主要结果是FPLD的显式高概率KL一致性率,同时依赖于节点数$K$、每节点样本量$n$、量化预算$B$、探针集大小$m$和词汇量$V$;带宽仅通过指数衰减的量化项进入。我们的第二个主要结果是FC-RAG的无分布边际覆盖界,其新颖的检索带宽松弛量$\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$使每节点检索带宽成为一阶统计参数,在每节点均匀情况下,通过$K$个节点的算术聚合使松弛量以$K^{-1/2}$的速度缩小。一个Pinsker型推论将两个界组合成端到端的覆盖保证。合成实验验证了沿界参数的预测缩放;在GPT-2测试平台上的小规模实验表明,定性带宽-准确率权衡在真实语言模型上仍然存在。部署规模的实证评估超出范围。

英文摘要

Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and no prior work makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack $Δ_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across $K$ nodes shrinking the slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
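
The $K^{-1/2}$ scaling stated in this abstract follows directly from the slack formula once every node has the same retrieval bandwidth. A short worked step, assuming the sum runs over $i = 1, \dots, K$ and that $B_i = B$ for all nodes (the per-node-uniform regime):

```latex
\Delta_{\mathrm{RAG}}
  = f_{\max}\sqrt{K^{-2}\sum_{i=1}^{K} v(B_i)}
  \;\overset{B_i = B}{=}\;
  f_{\max}\sqrt{K^{-2}\cdot K\, v(B)}
  = f_{\max}\sqrt{v(B)}\; K^{-1/2}.
```

So quadrupling the number of nodes halves the slack when the per-node bandwidth is held fixed.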

2604.24938 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Rethinking Layer Redundancy: Calibration Matters More Than Search in LLM Depth Pruning

重新思考层冗余:校准比搜索在LLM深度剪枝中更重要

Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim, Suin Cho, Woosang Lim, Sunwoo Lee

Affiliations: Neural Superintelligence Lab, MODULABS; University of Southern California; Boston University; Seoul National University; Inha University

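AI summary: Shows experimentally that, in depth pruning of large language models, the calibration configuration shapes pruning patterns and calibration perplexity far more than the choice of search algorithm does.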

Comments Preprint

详情
AI中文摘要

深度剪枝通过移除Transformer块来提高大型语言模型的推理效率。先前的工作通常将层冗余视为预训练网络固有的结构属性,强调重要性标准和搜索算法来识别可移除的层。在本研究中,我们从功能角度实证研究深度剪枝。通过评估不同校准配置和多种搜索算法下的代表性LLM系列,我们展示了不同配置会产生不同的剪枝模式。此外,在固定校准配置下,复杂的搜索算法相比简单的一次性方法仅带来边际性能提升,并收敛到相似的剪枝子集。总体而言,我们的结果表明,校准配置在塑造剪枝模式和校准困惑度方面比搜索算法的选择起着更大的作用,同时对下游推理准确性的方差贡献相当。这表明未来的剪枝工作可能受益于优先考虑校准配置而非搜索复杂性。

英文摘要

Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work typically treats layer redundancy as an inherent structural property of pretrained networks, emphasizing importance criteria and search algorithms to identify removable layers. In this study, we empirically investigate depth pruning from a functional perspective. Evaluating representative LLM families across diverse calibration configurations and multiple search algorithms, we show that different configurations produce different pruning patterns. Furthermore, under a fixed calibration configuration, complex search algorithms yield marginal performance improvements over simple one-shot methods, converging to similar pruned subsets. Overall, our results suggest that the calibration configuration plays a substantially larger role than the choice of search algorithm in shaping pruning patterns and calibration perplexity, while contributing comparably to variance in downstream reasoning accuracy. This indicates that future pruning efforts may benefit from prioritizing the calibration configuration over search complexity.

2605.00435 2026-05-28 cs.CL cond-mat.dis-nn cs.AI nlin.CD 版本更新

Escaping Mode Collapse in LLM Generation via Geometric Regulation

通过几何调控逃离大语言模型生成中的模式崩溃

Xin Du, Kumiko Tanaka-Ishii

Affiliations: Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan; Department of Computer Science and Engineering, Waseda University, Tokyo, Japan; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China

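AI summary: Takes a dynamical-systems view that interprets mode collapse as geometric collapse and proposes RMR, a lightweight online state-space intervention that regulates self-reinforcing directions in the Transformer value cache through low-rank damping, substantially reducing mode collapse and enabling stable generation at extremely low entropy rates.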

Comments Accepted to ICML 2026

详情
AI中文摘要

模式崩溃是生成建模中的一个持续挑战,在自回归文本生成中表现为从显式循环到逐渐失去多样性和轨迹过早收敛等行为。我们采用动力系统视角,将模式崩溃重新解释为由*几何崩溃*引起的状态空间可访问性降低:在生成过程中,模型的内部轨迹被限制在其表示空间的低维区域。这意味着模式崩溃并非纯粹的token级现象,无法通过符号约束或仅概率解码启发式可靠解决。基于这一视角,我们提出*强化模式调控*(RMR),一种轻量级的在线状态空间干预方法,用于调控Transformer值缓存中占主导地位的自强化方向(实现为低秩阻尼)。在多个大型语言模型上,RMR显著减少了模式崩溃,并能够在极低熵率(低至0.8 nats/步)下实现稳定生成,而标准解码通常在2.0 nats/步附近崩溃。

英文摘要

Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from explicit looping to gradual loss of diversity and premature trajectory convergence. We take a dynamical-systems view and reinterpret mode collapse as reduced state-space accessibility caused by *geometric collapse*: during generation, the model's internal trajectory becomes confined to a low-dimensional region of its representation space. This implies mode collapse is not purely a token-level phenomenon and cannot be reliably solved by symbolic constraints or probability-only decoding heuristics. Guided by this perspective, we propose *Reinforced Mode Regulation* (RMR), a lightweight, online state-space intervention that regulates dominant self-reinforcing directions in the Transformer value cache (implemented as low-rank damping). Across multiple large language models, RMR substantially reduces mode collapse and enables stable generation at extremely low entropy rates (down to 0.8 nats/step), whereas standard decoding typically collapses near 2.0 nats/step.
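
The abstract describes RMR only as low-rank damping of dominant directions in the value cache. The sketch below illustrates the generic idea with a rank-1 case: find the dominant direction of a set of cached vectors by power iteration, then shrink each vector's component along it. How RMR actually selects directions and damping strength is not stated here, so the function names, the damping factor `alpha`, and the toy data are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dominant_direction(rows, iters=100):
    """Power iteration for the top right-singular vector of the matrix `rows`."""
    dim = len(rows[0])
    u = [1.0] * dim
    for _ in range(iters):
        proj = [dot(r, u) for r in rows]  # V u
        w = [sum(p * r[k] for p, r in zip(proj, rows)) for k in range(dim)]  # V^T (V u)
        norm = dot(w, w) ** 0.5
        u = [x / norm for x in w]
    return u


def damp(rows, u, alpha):
    """Shrink each row's component along the unit vector u by a factor (1 - alpha)."""
    return [[x - alpha * dot(r, u) * uk for x, uk in zip(r, u)] for r in rows]


# Toy "value cache": three vectors that nearly all point along the first axis.
cache = [[3.0, 0.1], [2.9, -0.1], [3.1, 0.0]]
u = dominant_direction(cache)
damped = damp(cache, u, alpha=0.5)
```

With `alpha=0.5` the component along the dominant direction is halved while the orthogonal components are left untouched, which is the sense in which the intervention is low rank.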

2605.00180 2026-05-28 cs.NI cs.CL 版本更新

RouteProfile: Graph-Based Profiling for Cold-Start LLM Routing

RouteProfile:基于图的冷启动LLM路由画像方法

Jingjun Xu, Hongji Pu, Tao Feng, Haozhen Zhang, Jiaxuan You, Ge Liu

Affiliations: University of Illinois Urbana-Champaign; Nanyang Technological University

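AI summary: To address the lack of interaction data for new models in cold-start LLM routing, proposes RouteProfile, a graph-based framework that builds model profiles from public signals in technical reports. Experiments show that structured profiles outperform flat baselines and that model-family metadata is more reliable than benchmark-domain information.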

详情
AI中文摘要

LLM路由在多样化用户需求和部署约束下选择合适模型日益重要,但其实际效果取决于对新兴查询和新发布模型的持续适应。新LLM集成尤其具有挑战性,因为新发布模型缺乏路由训练所需的查询-响应-奖励交互,且无法像新查询那样通过语义嵌入直接画像。现有画像存在局限:LLM生成的描述往往粗糙,而基于交互的嵌入构建成本高昂。为解决此问题,我们提出RouteProfile,一种基于图的画像框架,从技术报告或模型卡中的公开信号(包括模型家族、模型描述、报告基准分数和基准域)构建LLM画像。RouteProfile将这些异构信号组织成图,并从组织形式、表示类型、聚合深度和学习配置四个维度研究画像构建。我们在无训练冷启动路由和新LLM集成设置中评估RouteProfile。实验表明:(1) 结构化画像在无训练冷启动路由中优于扁平基线;(2) 模型家族元数据比基准域信息更可靠;(3) 有效的新LLM集成需要画像-路由协同设计。总体而言,我们的发现强调了画像设计对于使路由系统适应不断发展的模型生态系统的重要性。

英文摘要

LLM routing is increasingly important for selecting suitable models under diverse user needs and deployment constraints, but its practical effectiveness depends on continual adaptation to emerging queries and newly released models. New-LLM integration is particularly challenging, as newly released models lack the query-response-reward interactions required for router training and cannot be profiled as directly as new queries via semantic embeddings. Existing profiles are limited: LLM-generated descriptions are often coarse, while interaction-based embeddings are costly to construct. To address this problem, we propose RouteProfile, a graph-based profiling framework that constructs LLM profiles from public signals in technical reports or model cards, including model family, model description, reported benchmark scores, and benchmark domains. RouteProfile organizes these heterogeneous signals into a graph and studies profile construction along four dimensions: organizational form, representation type, aggregation depth, and learning configuration. We evaluate RouteProfile in training-free cold-start routing and new-LLM integration settings. Experiments show that: (1) structured profiles outperform flat baselines in training-free cold-start routing; (2) model family metadata is more reliable than benchmark domain information; and (3) effective new-LLM integration requires profile-router co-design. Overall, our findings highlight the importance of profile design for enabling routing systems to adapt to the evolving model ecosystem.

2605.00025 2026-05-28 q-bio.NC cs.CL cs.HC cs.LG eess.AS 版本更新

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis

MoDAl: 基于去相关的自监督神经模态发现用于语音神经假体

Yuanhao Chen, Peter Chin

发表机构 * Dartmouth College(达特茅斯学院)

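AI summary: Proposes the MoDAl framework, which discovers complementary neural modalities across multiple brain areas through the interplay of a contrastive alignment loss and a decorrelation loss, reducing word error rate from 26.3% to 21.6% on the Brain-to-Text Benchmark '24.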

详情
AI中文摘要

语音神经假体系统在无听觉输出的情况下从神经活动解码预期语音,为言语障碍患者恢复交流提供了途径。当前方法主要从运动皮层区域解码,忽略了其他区域——如布罗卡区的一部分44区——这些区域可能编码互补的语言信息。我们提出了MoDAl(模态去相关与对齐)框架,该框架通过在共享投影空间中两个目标的相互作用来发现互补的神经模态。对比损失将多个并行脑编码器中的每一个与预训练大语言模型(LLM)的文本嵌入对齐,而去相关损失防止编码器合并成重复表示。我们证明这些目标之间存在富有成效的张力:对比对齐诱导传递性模态合并,而去相关必须抵消这一点,以使框架发现多样的神经语言学模态。在Brain-to-Text Benchmark '24上,与之前最佳端到端方法相比,MoDAl将词错误率(WER)从26.3%降低到21.6%,其中纳入先前丢弃的44区信号的增益完全来自去相关机制。对发现模态的分析揭示了功能特化:接收44区输入的编码器捕获结构和句法属性(句子长度、语法语态、wh-词),这与布罗卡区的神经语言学理解一致。

英文摘要

Speech neuroprosthesis systems decode intended speech from neural activity in the absence of audible output, offering a path to restoring communication for individuals with speech-impairing conditions. Current approaches decode predominantly from motor cortical areas, discarding others -- such as area 44, part of Broca's area -- that may encode complementary linguistic information. We introduce MoDAl (Modality Decorrelation and Alignment), a framework that discovers complementary neural modalities through the interplay of two objectives in a shared projection space. A contrastive loss aligns each of several parallel brain encoders with the text embeddings of a pretrained large language model (LLM), while a decorrelation loss prevents the encoders from coalescing to duplicative representations. We prove that these objectives are in productive tension: Contrastive alignment induces transitive modality coalescence, which decorrelation must counteract for the framework to discover diverse neurolinguistic modalities. On the Brain-to-Text Benchmark '24, MoDAl reduces word error rate (WER) from 26.3% to 21.6% compared to the previous best end-to-end method, with the gain from incorporating previously discarded area 44 signals arising entirely from the decorrelation mechanism. Analysis of the discovered modalities reveals functional specialization: Encoders receiving area 44 input capture structural and syntactic properties (sentence length, grammatical voice, wh-words), consistent with the neurolinguistic understanding of Broca's area.

2604.27251 2026-05-28 cs.CL cs.AI 版本更新

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

服从与感知:大型语言模型中的推理可控性研究

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras

Affiliations: School of Computer Science, University of Sheffield; School of EECS, Queen Mary University of London; The Alan Turing Institute

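AI summary: Through the lens of reasoning conflicts, systematically studies whether large language models prioritize complying with instructions or following sensibility when the mandated logical schema conflicts with the one the task expects, and explores internal detection and activation-level intervention.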

详情
AI中文摘要

大型语言模型(LLMs)已知通过预训练数据中的共享推理模式获得推理能力,并通过思维链(CoT)实践进一步激发。然而,基本推理模式(如归纳、演绎和溯因)能否与具体问题实例解耦,仍然是模型可控性的关键挑战,并有助于阐明推理可控性。在本文中,我们首次通过推理冲突的视角系统研究这一问题:推理冲突是指通过强制使用偏离目标任务预期逻辑模式而引发的参数信息与上下文信息之间的显性张力。我们的评估表明,LLMs 始终优先考虑感知合理性而非服从性,尽管存在冲突指令,仍倾向于采用任务合适的推理模式。我们进一步证明推理冲突在内部是可检测的,因为在冲突期间置信度分数显著下降。探测实验确认推理类型从中间层到后期层线性编码,表明存在激活级可控性的潜力。利用这些见解,我们引导模型朝向服从性,将指令遵循度提高多达 29%。总体而言,我们的发现表明,虽然 LLM 推理锚定于具体实例,但主动的机制性干预可以有效地将逻辑模式与数据解耦,为改进可控性、忠实性和泛化性提供了一条路径。

英文摘要

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.

2603.09117 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

解耦推理与置信度:在可验证奖励的强化学习中恢复校准

Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China

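AI summary: To address calibration degeneration in RLVR, proposes the DCPO framework, which decouples the reasoning and calibration objectives, substantially improving calibration and mitigating over-confidence while preserving accuracy.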

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)显著增强了大语言模型(LLMs)的推理能力,但严重遭受校准退化,即模型对错误答案变得过度自信。以往研究致力于将校准目标直接纳入现有优化目标。然而,我们的理论分析表明,最大化策略准确率与最小化校准误差之间存在根本性的梯度冲突。基于这一见解,我们提出了DCPO,一个简单而有效的框架,系统地解耦了推理和校准目标。大量实验表明,我们的DCPO不仅保持了与GRPO相当的准确率,还实现了最佳的校准性能,并显著缓解了过度自信问题。我们的研究为更可靠的LLM部署提供了宝贵的见解和实用的解决方案。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies have focused on directly incorporating a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there is a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
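
Calibration error is commonly summarized with the expected calibration error (ECE). The abstract does not say which metric the paper reports, so the sketch below is only an assumption about what such a measurement typically looks like: an equal-width-bin ECE in plain Python, with a hypothetical over-confident example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Hypothetical model that is right half the time in both cases.
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 0, 1])
calibrated = expected_calibration_error([0.5] * 4, [1, 0, 0, 1])
```

Both examples have the same accuracy, but only the first is penalized, which shows that accuracy and calibration are separate quantities.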

2604.21534 2026-05-28 cs.CL 版本更新

UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

UKP_Psycontrol 在 SemEval-2026 任务 2:从文本建模效价和唤醒动态

Darya Hryhoryeva, Amaia Zurinaga, Hamidreza Jamalabadi, Iryna Gurevych

Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab); Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE; Psychiatric Control Systems Lab; Marburg University

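AI summary: For SemEval-2026 Task 2, proposes three complementary approaches (LLM prompting, a pairwise maximum-entropy model, and a lightweight neural regression model) to model current affect and short-term affective change in text. Finds that LLMs capture static affective signals well, while short-term change depends more on numeric trajectories. The system ranked first in Subtasks 1 and 2A.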

Comments Accepted to SemEval 2026 (co-located with ACL 2026)

详情
AI中文摘要

本文介绍了我们为 SemEval-2026 任务 2 开发的系统。该任务要求对按时间顺序排列的用户生成文本中的当前情感和短期情感变化进行建模。我们探索了三种互补的方法:(1)在用户感知和用户无关设置下的 LLM 提示,(2)具有 Ising 式交互的成对最大熵(MaxEnt)模型用于结构化转换建模,以及(3)结合近期情感轨迹和可训练用户嵌入的轻量级神经回归模型。我们的发现表明,LLM 能有效捕捉文本中的静态情感信号,而该数据集中短期情感变化更多地由近期数值状态轨迹解释,而非文本语义。根据官方评估指标,我们的系统在子任务 1 和子任务 2A 中均排名第一。

英文摘要

This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.

2604.20996 2026-05-28 cs.CL 版本更新

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

AFRILANGTUTOR:利用大语言模型推进低资源语言的语言辅导与文化教育

Tadesse Destaw Belay, Shahriar Kabir Nahin, Israel Abebe Azime, Ocean Monjur, Marek Rei, Chris Biemann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam, Anshuman Chhabra

Affiliations: Instituto Politécnico Nacional; University of South Florida; Saarland University; Imperial College London; University of Hamburg

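AI summary: To address the lack of training data for low-resource languages, introduces the AFRILANGDICT dictionary resource and builds the AFRILANGEDU dataset, then trains AFRILANGTUTOR models with supervised fine-tuning and direct preference optimization, substantially improving tutoring performance across 10 African languages.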

详情
AI中文摘要

如何为缺乏足够训练资源的语言开发语言学习系统?这一挑战日益被非洲大陆的开发者所面临,他们旨在构建能够理解并用当地语言回应的AI系统。为弥补这一差距,我们引入AFRILANGDICT,一个包含19.47万条非洲语言-英语词典条目的集合,作为生成语言学习材料的种子资源,使我们能够自动构建大规模、多样且可验证的学生-导师问答交互,适用于训练AI辅助语言导师。利用AFRILANGDICT,我们构建了AFRILANGEDU,一个包含7.89万个多轮训练示例的数据集,用于监督微调(SFT)和直接偏好优化(DPO)。使用AFRILANGEDU,我们训练了统称为AFRILANGTUTOR的语言辅导模型。我们在AFRILANGEDU上对两个多语言LLM:Llama-3-8B-IT和Gemma-3-12B-IT进行了微调,覆盖10种非洲语言,并评估了它们的性能。结果表明,在AFRILANGEDU上训练的模型始终优于其基础版本,且结合SFT和DPO带来了显著改进,在LLM作为评判者的评估中,四项指标的提升范围从1.8%到15.5%。为促进低资源语言的进一步研究,所有资源均可在https://huggingface.co/afrilang-edu获取。

英文摘要

How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs: Llama-3-8B-IT and Gemma-3-12B-IT on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages, all resources are available at https://huggingface.co/afrilang-edu.

2604.13583 2026-05-28 cs.CL cs.AI 版本更新

BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

BenGER平台:面向德国法律任务端到端基准测试的协作式Web平台

Sebastian Nagl, Matthias Grabmair

Affiliations: Technical University of Munich

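AI summary: Presents BenGER, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and multi-dimensional evaluation, supports multi-organization projects with tenant isolation, and makes legal-reasoning benchmarking transparent and reproducible end to end.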

Comments Preprint - Accepted at ICAIL 2026

详情
AI中文摘要

评估大语言模型(LLM)的法律推理能力需要涵盖任务设计、专家标注、模型执行和基于指标的评估的工作流。在实践中,这些步骤分散在不同的平台和脚本中,限制了透明度、可复现性以及非技术法律专家的参与。我们提出了BenGER(德国法律基准测试)框架,这是一个开源Web平台,集成了任务创建、协作标注、可配置的LLM运行以及基于词汇、语义、事实和法官指标的评估。BenGER支持具有租户隔离和基于角色的访问控制的多组织项目,并可选择性地为标注者提供形成性的、基于参考的反馈。我们将展示一个实时部署,演示端到端的基准测试创建和分析。

英文摘要

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.

2604.18758 2026-05-28 cs.CL 版本更新

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

句法作为罗塞塔石碑:用于上下文科普特语翻译的通用依存关系

Abhishek Purushothama, Emma Thronson, Alexia Guo, Amir Zeldes

发表机构 * Corpling Lab(科林实验室) Georgetown University(乔治城大学)

AI总结 提出一种结合通用依存句法分析和双语词典的上下文学习方法,用于低资源科普特语到英语的机器翻译,取得了新的最佳结果。

Comments ACL 2026 Findings camera-ready, with fixes

详情
AI中文摘要

低资源机器翻译需要不同于高资源语言的方法。本文提出了一种新颖的上下文学习方法,通过输入句子的通用依存句法分析来增强句法信息,以支持科普特语到英语的低资源机器翻译。在已有使用双语词典支持词汇项推理的工作基础上,我们在输入中添加了多种句法分析表示,具体探索了包含原始解析器输出、用简单英语表达的解析结果,以及针对子树中识别出的困难结构的定向指令及其翻译方法。结果表明,虽然单独的句法信息不如基于词典的注释有用,但将检索到的词典项与句法信息相结合,在不同模型规模上均取得了显著提升,为科普特语翻译实现了新的最佳结果。

英文摘要

Low-resource machine translation requires methods that differ from those used for high-resource languages. This paper proposes a novel in-context learning approach to support low-resource machine translation of the Coptic language to English, with syntactic augmentation from Universal Dependencies parses of input sentences. Building on existing work using bilingual dictionaries to support inference for vocabulary items, we add several representations of syntactic analyses to our inputs, specifically exploring the inclusion of raw parser outputs, verbalizations of parses in plain English, and targeted instructions on difficult constructions identified in sub-trees and how they can be translated. Our results show that while syntactic information alone is not as useful as dictionary-based glosses, combining retrieved dictionary items with syntactic information achieves significant gains across model sizes, yielding new state-of-the-art translation results for Coptic.

2604.18235 2026-05-28 cs.CL cs.AI 版本更新

Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents

负优势是一把双刃剑:为搜索智能体校准GRPO中的优势

Jiayi Wu, Ruobing Xie, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Bochen Lin, Ming Gao, Xiang Li

Affiliations: School of Data Science and Engineering, East China Normal University; Tencent; Tsinghua University

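AI summary: To address the training instability of GRPO in multi-hop search, which stems from coarse-grained advantage assignment and an imbalance between positive and negative advantages, proposes CalibAdv. It downscales excessive negative advantages at a fine-grained level and rebalances positive and negative advantages, improving both model performance and training stability.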

详情
AI中文摘要

搜索智能体通过与搜索引擎的多轮交互实现强大的问答性能,其中组相对策略优化(GRPO)是一种广泛使用的训练算法。然而,GRPO风格的算法在多跳搜索场景中仍面临若干挑战。首先,当最终答案错误时,正确的中间步骤常常受到惩罚。其次,训练高度不稳定,经常导致自然语言能力退化甚至灾难性训练崩溃。我们的分析将这些问题归因于粗粒度的优势分配以及正负优势之间的不平衡。为了解决这些问题,我们提出了CalibAdv,一种专门为搜索智能体设计的优势校准方法,能够更准确、更稳定地对惩罚和奖励进行建模。具体来说,CalibAdv利用中间步骤的正确性在细粒度上降低过度的负优势,然后进一步重新平衡正负优势以提高训练稳定性。重要的是,CalibAdv采用轻量级设计,从标准 rollout 信号中校准优势,使其简单且易于部署。在三个模型和七个基准上的大量实验表明,CalibAdv同时提升了模型性能和训练稳定性。我们的代码可在 https://github.com/wujwyi/CalibAdv 获取。

英文摘要

Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy Optimization (GRPO) being a widely used training algorithm. However, GRPO-style algorithms still face several challenges in multi-hop search settings. First, correct intermediate steps are often penalized when the final answer is wrong. Second, training is highly unstable, often causing degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for search agents that enables more accurate and more stable modeling of penalties and rewards. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then further rebalances positive and negative advantages to improve training stability. Importantly, CalibAdv adopts a lightweight design that calibrates advantages from standard rollout signals, making it simple and easy to deploy. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
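
To make the problem concrete, the sketch below first computes GRPO-style group-relative advantages (rewards standardized within a rollout group), then applies a simple step-level softening of negative advantages on steps judged correct. The second function is a hypothetical stand-in for the idea described in this abstract, not CalibAdv's actual rule; the `downscale` factor, the use of the population standard deviation, and the per-step correctness flags are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def soften_negative_advantages(advantage, step_correct, downscale=0.2):
    """Copy a trajectory advantage to every step, shrinking the penalty on correct steps.

    `downscale` is a hypothetical knob, not a value from the paper.
    """
    return [
        advantage * downscale if (advantage < 0 and ok) else advantage
        for ok in step_correct
    ]


# Four rollouts for one question; reward 1 means the final answer was right.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A failed rollout whose first two search steps were correct.
per_step = soften_negative_advantages(-1.0, [True, True, False])
```

Without the second function, all three steps of the failed rollout would receive the same penalty, which is the coarse-grained assignment the abstract identifies as a source of instability.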

2604.17943 2026-05-28 cs.CL 版本更新

A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

专业领域基准构建与评估框架:以国防相关文档为例

Bao Gia Doan, Aditya Joshi, Pantelis Elinas, Aarya Bodhankar, Oscar Leslie, Tom Marchant, Flora Salim

发表机构 * UNSW Sydney(新南威尔士大学悉尼分校) Cyndr AI

AI总结 提出DoRA框架,通过合成数据生成和双LLM流水线解决专业领域RAG问答的冷启动问题,在国防文档上显著减少幻觉并提升覆盖率和忠实度。

详情
AI中文摘要

基于RAG的专业领域问答面临冷启动问题:缺乏评估基准和用于后训练的标注数据。我们提出DoRA(面向领域的RAG评估),一个仅使用少量专业领域文档的新型基准构建与评估框架。DoRA系统地生成合成QA训练和评估数据集,并跨五个领域特定意图提供可审计的证据。为缓解同流水线循环,DoRA的训练和测试拆分使用不同的LLM家族(训练用Claude Sonnet;测试用GPT-4o),这些数据来自不相交的种子文档语料库。在40份国防相关文档(英文)上实例化后,DoRA产生约6600个精心整理的实例。与8个LLM基线在1259个样本的基准上比较,基于合成训练集微调的LoRA适配Llama3.1-8B在6个覆盖率和忠实度指标上持续提升性能,尤其在默认GTE检索设置下将幻觉减少一半以上,且增益在替代检索器和基于提示的基线下依然保持。国防领域专业知识在评估的三个阶段被纳入:(a) 判断DoRA生成的合成QA质量,(b) 确定LLM作为评判者的分数可靠性,(c) 评估QA流水线在完全人工编写的QA示例上的泛化能力。我们将DoRA定位为领域迁移下专业领域RAG的实用框架,并以国防作为高风险的案例研究。

英文摘要

RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and evaluation framework using only a small set of specialist domain documents. DoRA systematically generates synthetic QA training and evaluation datasets with auditable evidence across five domain-specific intents. To mitigate same-pipeline circularity, DoRA's training and test splits use different LLM families (Claude Sonnet for training; GPT-4o for test) drawn from disjoint seed-document corpora. Instantiated on 40 defense-related documents (written in English), DoRA yields ~6.6K curated instances. Compared against 8 LLM baselines over a benchmark of 1,259 samples, a LoRA-adapted Llama3.1-8B trained on the synthetic training set consistently improves performance over 6 coverage and faithfulness metrics, especially reducing hallucination by more than half under the default GTE retrieval setting, with gains persisting across alternative retrievers and prompting-based baselines. Defense-domain expertise is incorporated in three stages of our evaluation: (a) determining the quality of the synthetic QA generated by DoRA, (b) ascertaining the reliability of LLM-as-judge scores, and (c) evaluating the generalization of the QA pipeline on completely human-written QA examples. We position DoRA as a practical framework for specialist-domain RAG under domain shift, with defense as a high-stakes case study.

2604.16774 2026-05-28 cs.CL cs.AI 版本更新

Retention Consequence in Lifecycle Memory Control

生命周期记忆控制中的保留后果

Jiarui Han

AI总结 研究持久记忆在准入后失效的问题,提出将置信度作为前向有效性/支持证据,并引入强度作为保留后果的显式生命周期状态,通过StageMem控制器实验验证显式保留后果在生命周期结算中的控制作用。

详情
AI中文摘要

持久记忆在成功准入后可能失效:一个前提被写入,然后成为无声的假设,后续维护将其视为普通残留进行压缩、降级或驱逐。我们将这种准入后失效作为生命周期控制问题来研究。现有记忆系统已经执行准入、更新、压缩、检索和驱逐。我们的主张并非此类系统缺乏维护,而是保留后果通常仅通过有效性、相似性、新近性、频率、重要性或摘要信号间接操作,而非作为单独的生命周期状态暴露。因此,我们将置信度视为前向有效性/支持证据,并引入强度作为保留后果的显式生命周期状态。我们在StageMem中实现了这一区分,这是一个小型的分阶段控制器,其瞬态、工作态和持久态存储暴露了提升、压缩和驱逐压力点。在受控的前提实现、压缩、压力和隐式启发式诊断实验中,实验区分了写入过少、保留错误的高线索内容、遗忘代价高昂的前提以及通过饱和保留所有内容。通过生命周期结算使用的显式保留后果,提供了在遗漏和囤积之间的控制面。针对目标准入后失效模式,结果支持持久记忆的生命周期观点:可靠性不仅取决于进入记忆的内容,还取决于准入有效性和保留后果在维护期间是否可用。

英文摘要

Persistent memory can fail after successful admission: a premise is written, then becomes a silent assumption, and later maintenance treats it as ordinary residue to be compressed, demoted, or evicted. We study this post-admission failure as a lifecycle-control problem. Existing memory systems already perform admission, update, compression, retrieval, and eviction. Our claim is not that such systems lack maintenance, but that retention consequence is often operationalized only indirectly through validity, similarity, recency, frequency, importance, or summarization signals rather than exposed as a separate lifecycle state. We therefore treat confidence as carried-forward validity/support evidence, and introduce strength as an explicit lifecycle state for retention consequence. We operationalize this distinction in StageMem, a small staged controller whose transient, working, and durable stores expose promotion, compression, and eviction pressure points. Across controlled premise-realization, compression, pressure, and implicit-heuristic diagnostics, the experiments separate writing too little, retaining the wrong high-cue content, forgetting costly premises, and preserving everything by saturation. Explicit retention consequence, used through lifecycle settlement, provides a control surface between omission and hoarding. For the targeted post-admission failure mode, the results support a lifecycle view of persistent memory: reliability depends not only on what enters memory, but on whether admission validity and retention consequence remain available during maintenance.

2604.16358 2026-05-28 cs.LG cs.CL 版本更新

SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

SaFeR-Steer:通过合成引导和反馈动力学进化多轮多模态大语言模型

Haolong Hu, Hanyu Li, Tiancheng He, Huahui Yi, An Zhang, Qiankun Li, Kun Wang, Yang Liu, Zhigang Zeng

发表机构 * Huazhong University of Science and Technology(华中科技大学) Beijing University of Posts and Telecommunications(北京邮电大学) West China Biomedical Big Data Center, Sichuan University(四川大学西部生物医学大数据中心) School of Public Policy and Administration, Chongqing University(重庆大学公共政策与管理学院) Nanyang Technological University(南洋理工大学)

AI总结 提出SaFeR-Steer框架,通过分阶段合成引导和导师参与的GRPO训练单学生模型,并引入轨迹一致总结奖励(TCSR)以解决多轮安全对齐中的长上下文安全衰减问题,显著提升多轮安全性和有用性。

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地部署在多轮场景中,攻击者可以通过不断演变的视觉-文本历史升级不安全意图,并利用长上下文安全衰减。然而,安全对齐仍然以单轮数据和固定模板对话为主,导致训练与部署之间存在不匹配。为弥补这一差距,我们提出SaFeR-Steer,一种渐进式多轮对齐框架,结合分阶段合成引导和导师参与的GRPO,在自适应、在线策略攻击下训练单个学生模型。我们还引入了轨迹一致总结奖励(TCSR),该奖励聚合了历史最小值和回合奖励的平均值,使得任何低质量回合都会影响轨迹级别的回报。I. 数据集。我们发布STEER,一个多轮多模态安全数据集,包含STEER-SFT(12,934)、STEER-RL(2,000)和STEER-Bench(3,227)对话,回合数为2-10。II. 实验。从Qwen2.5-VL-3B/7B开始,SaFeR-Steer在单轮基准(3B:48.30/45.86 → 81.84/70.77;7B:56.21/60.32 → 87.89/77.40)和多轮基准(3B:12.55/27.13 → 55.58/70.27;7B:24.66/46.48 → 64.89/72.35)上显著提高了安全性/有用性,将失败转移到后续回合,并产生了超越单纯扩展的鲁棒性。代码可在https://anonymous.4open.science/r/SaFeR-Steer获取。

英文摘要

MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce Trajectory-Consistent Summative Reward (TCSR), which aggregates the historical minimum and average of turn rewards so that any low-quality turn affects the trajectory-level return. I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2-10 turns. II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 $\rightarrow$ 81.84/70.77 for 3B; 56.21/60.32 $\rightarrow$ 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 $\rightarrow$ 55.58/70.27 for 3B; 24.66/46.48 $\rightarrow$ 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone. Code is available at https://anonymous.4open.science/r/SaFeR-Steer
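As a rough illustration of how a reward that combines the minimum and the average of turn rewards behaves, consider the sketch below. The function name `trajectory_reward` and the mixing weight `alpha` are assumptions made for this example; the abstract states only that TCSR aggregates the historical minimum and average, not how they are weighted.

```python
def trajectory_reward(turn_rewards, alpha=0.5):
    """Blend the worst turn reward with the mean turn reward.

    alpha is a hypothetical mixing weight: alpha=1 uses only the minimum,
    alpha=0 uses only the average.
    """
    if not turn_rewards:
        raise ValueError("turn_rewards must be non-empty")
    worst = min(turn_rewards)
    average = sum(turn_rewards) / len(turn_rewards)
    return alpha * worst + (1 - alpha) * average
```

Under this assumed form, a single unsafe turn in an otherwise good dialogue pulls the trajectory-level return below the plain average, which matches the stated goal that any low-quality turn should affect the return.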

2512.15791 2026-05-28 cs.CY cs.AI cs.CL 版本更新

Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Study

语言模型中AI伦理工具评估:开发者视角案例研究

Jhessica Silva, Diego A. B. Moreira, Gabriel O. dos Santos, Alef Ferreira, Helena Maia, Sandra Avila, Helio Pedrini

AI总结 通过文献筛选和开发者访谈,评估四种AI伦理工具在葡萄牙语语言模型中的应用效果,发现它们能指导一般伦理考虑但未覆盖模型特有方面。

Comments 7 figures, 11 tables. Accepted for publication in AI and Ethics

详情
AI中文摘要

在人工智能中,语言模型因能够通过文本生成模拟与人类真实对话的系统被广泛采用而变得日益重要。由于它们对社会的影响,开发和部署这些语言模型必须负责任地进行,关注其负面影响和可能的危害。在此背景下,AI伦理工具(AIETs)的出版物数量近期有所增加。这些AIETs旨在通过引入公认的价值观来指导AI的设计、开发和使用阶段,帮助开发者、公司、政府和其他利益相关者建立对其技术的信任、透明度和责任。然而,许多AIETs缺乏良好的文档、使用示例以及在实践中有效性的证明。本文提出了一种评估语言模型中AIETs的方法。我们的方法包括对213个AIETs进行广泛的文献调查,在应用纳入和排除标准后,我们选择了四个AIETs:模型卡片、ALTAI、事实表以及危害建模。为了评估,我们将AIETs应用于为葡萄牙语开发的语言模型,并对它们的开发者进行了35小时的访谈。评估考虑了开发者对AIETs在帮助识别其模型伦理考量方面的使用和质量的看法。结果表明,所应用的AIETs可作为制定关于语言模型的一般伦理考量的指南。然而,我们注意到它们并未解决这些模型的独特方面,例如习语表达。此外,这些AIETs未能帮助识别葡萄牙语模型的潜在负面影响。

英文摘要

In Artificial Intelligence (AI), language models have gained significant importance due to the widespread adoption of systems capable of simulating realistic conversations with humans through text generation. Because of their impact on society, developing and deploying these language models must be done responsibly, with attention to their negative impacts and possible harms. In this scenario, the number of AI Ethics Tools (AIETs) publications has recently increased. These AIETs are designed to help developers, companies, governments, and other stakeholders establish trust, transparency, and responsibility with their technologies by bringing accepted values to guide AI's design, development, and use stages. However, many AIETs lack good documentation, examples of use, and proof of their effectiveness in practice. This paper presents a methodology for evaluating AIETs in language models. Our approach involved an extensive literature survey on 213 AIETs, and after applying inclusion and exclusion criteria, we selected four AIETs: Model Cards, ALTAI, FactSheets, and Harms Modeling. For evaluation, we applied AIETs to language models developed for the Portuguese language, conducting 35 hours of interviews with their developers. The evaluation considered the developers' perspective on the AIETs' use and quality in helping to identify ethical considerations about their model. The results suggest that the applied AIETs serve as a guide for formulating general ethical considerations about language models. However, we note that they do not address unique aspects of these models, such as idiomatic expressions. Additionally, these AIETs did not help to identify potential negative impacts of models for the Portuguese language.

2604.14585 2026-05-28 cs.AI cs.CL 版本更新

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

提示优化如同抛硬币:诊断其在复合AI系统中何时有效

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center(AWS生成式AI创新中心) HSBC Holdings Plc., HSBC Technology Center, China(汇丰控股有限公司,汇丰技术中心,中国)

AI总结 通过大量实验发现提示优化在复合AI系统中效果不稳定,仅当任务具有可挖掘的输出结构时才有帮助,并提供了两阶段诊断方法。

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

详情
AI中文摘要

复合AI系统中的提示优化在统计上与抛硬币无异:在Claude Haiku 4.5上的72次优化运行(6种方法 × 4个任务 × 3次重复)中,49%的得分低于零样本;在Amazon Nova Lite上,失败率更高。然而,在一个任务上,所有六种方法相比零样本提升了高达+6.8分。是什么区分了成功与失败?我们通过18,000次网格评估和144次优化运行进行了调查,按照必须回答的顺序测试了TextGrad和DSPy等端到端优化工具背后的两个假设:(A) 智能体提示存在交互,需要联合优化而非独立优化;(B) 单个提示本身值得优化。交互效应从未显著(p > 0.52,所有F < 1.0),并且优化仅在任务具有可挖掘的输出结构时才有帮助:即模型可以生成但不会默认采用的格式。我们进一步给出了机制性解释:指令微调将输入措辞压缩成狭窄的输出分布,消除了联合优化所依赖的措辞敏感性。我们提供了一个两阶段诊断:一个80美元的ANOVA预测试用于智能体耦合,以及一个10分钟的提升空间(headroom)测试,用于预测优化是否值得,从而将抛硬币转变为知情决策。

英文摘要

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku 4.5 (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy, in the order they must be answered: (A) agent prompts interact, requiring joint rather than independent optimization, and (B) individual prompts are worth optimizing at all. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure: a format the model can produce but does not default to. We further give a mechanistic account: instruction-tuning compresses input phrasing into a narrow output distribution, eliminating the very phrasing-sensitivity that joint optimization assumes. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile, turning a coin flip into an informed decision.

2604.14356 2026-05-28 cs.CL cs.AI 版本更新

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

当多囊卵巢综合征遇上进食障碍:一种可解释的AI方法检测隐藏的三重负担

Apoorv Prasad, Susan McRoy

发表机构 * University of Wisconsin - Milwaukee(威斯康星大学密尔沃基分校)

AI总结 本研究通过微调小型开源语言模型,利用可解释性AI从社交媒体帖子中自动检测多囊卵巢综合征患者的身体形象困扰、进食障碍和代谢挑战的三重负担,最佳模型在150条测试帖上达到75.3%的精确匹配准确率。

详情
AI中文摘要

患有多囊卵巢综合征(PCOS)的女性面临身体形象困扰、进食障碍和代谢挑战的显著升高风险,然而现有的自然语言处理方法在检测这些状况时缺乏透明度,且无法识别共病表现。我们开发了小型开源语言模型,以基于可解释性的方式自动检测社交媒体帖子中的这种三重负担。我们从六个Reddit子版块(subreddit)收集了1000条与PCOS相关的帖子,由两名经过训练的标注员根据Lee等人(2017)临床框架的操作化指南对帖子进行标注。使用低秩适配对三个模型(Gemma-2-2B、Qwen3-1.7B、DeepSeek-R1-Distill-Qwen-1.5B)进行微调,以生成带有文本证据的结构化解释。最佳模型在150条保留帖子上实现了75.3%的精确匹配准确率,具有稳健的共病检测能力和强可解释性。性能随诊断复杂性下降,表明其最佳用途是筛查而非自主诊断。

英文摘要

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines that operationalize the clinical framework of Lee et al. (2017). Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.

2604.13232 2026-05-28 cs.CL 版本更新

Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

评估评估者:SemEval-2020任务1在词汇语义变化检测中的问题

Bach Phan-Tat, Kris Heylen, Dirk Geeraerts, Stefano De Pascale, Dirk Speelman

发表机构 * Department of Linguistics, KU Leuven(鲁汶大学语言学系) Instituut voor de Nederlandse Taal(荷兰语研究所) Department of Linguistics and Literary Studies, Vrije Universiteit Brussel(布鲁塞尔自由大学语言学与文学研究系)

AI总结 通过操作化、数据质量和基准设计三个框架,批判性分析SemEval-2020任务1的局限性,指出其窄化语义变化模型、数据质量问题及设计缺陷,呼吁未来改进。

详情
AI中文摘要

本文通过操作化、数据质量和基准设计三个框架重新审视了词汇语义变化检测中最具影响力的共享基准SemEval-2020任务1。首先,在操作化层面,我们认为该基准主要将语义变化建模为离散义项的增加、丢失或重新分布。虽然这种框架便于标注和评估,但过于狭窄,无法捕捉渐变的、构式的、搭配的和语篇层面的变化。此外,黄金标签是标注决策、聚类过程和阈值设置的结果,可能限制任务的有效性。其次,在数据质量层面,我们表明该基准受到严重的语料库和预处理问题影响,包括OCR噪声、畸形字符、截断句子、不一致的词形还原、词性标注错误以及目标词遗漏。这些问题可能扭曲模型行为,使语言分析复杂化,并降低可重复性。第三,在基准设计层面,我们认为精心挑选的小规模目标集和有限的语言覆盖降低了现实性并增加了统计不确定性。综合来看,这些局限性表明该基准应被视为一个有用但不完整的测试平台,而非进展的最终衡量标准。因此,我们呼吁未来的数据集和共享任务采用更广泛的语义变化理论,透明地记录预处理过程,扩大跨语言覆盖范围,并使用更现实的评估设置。这些步骤对于词汇语义变化检测中更有效、可解释和可推广的进展是必要的。

英文摘要

This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of benchmark design, we argue that the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.

2604.10567 2026-05-28 cs.CL cs.AI 版本更新

Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

早期决策至关重要:非自回归扩散语言模型中的邻近偏差与初始轨迹塑造

Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo

发表机构 * LG AI Research(LG人工智能研究)

AI总结 本文通过分析非自回归扩散语言模型的推理动态,发现其存在邻近偏差导致的错误传播问题,并提出一种轻量级规划器和序列结束温度退火方法来引导早期令牌选择,从而显著提升推理与规划任务的性能。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

基于扩散的语言模型(dLLMs)已成为自回归语言模型的一种有前景的替代方案,提供了并行令牌生成和双向上下文建模的潜力。然而,如何利用这种灵活性实现完全非自回归解码仍然是一个开放问题,尤其是在推理和规划任务中。在这项工作中,我们通过系统分析非自回归解码在时间轴上的推理动态来研究dLLMs中的非自回归解码。具体来说,我们揭示了基于置信度的非自回归生成中固有的失败模式,该模式源于强烈的邻近偏差——即去噪顺序倾向于集中在空间相邻的令牌上。这种局部依赖性导致空间错误传播,使得整个轨迹关键地依赖于初始去掩码位置。利用这一见解,我们提出了一种最小干预方法,通过轻量级规划器和序列结束温度退火来指导早期令牌选择。我们在各种推理和规划任务上全面评估了我们的方法,并观察到在现有启发式基线基础上,无需显著计算开销即可实现整体性能的显著提升。

英文摘要

Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

2604.06196 2026-05-28 cs.CL cs.AI cs.LO 版本更新

Compositional Consistency-Guided Decoding for Three-Way Logical Question Answering

面向三值逻辑问答的成分一致性引导解码

Tianyi Huang, Ming Hou, Jiaheng Su, Yutong Zhang, Ziling Zhang

AI总结 针对大语言模型在三值逻辑问答中的否定不一致和认知未知问题,提出一种轻量级测试时解码层CGD-PD,通过神经三值分类、符号否定一致性投影和定向二值蕴含探测,在FOLIO数据集上提升准确率4.4-6.8点并减少未知预测。

Comments Accepted at the ICML 2026 Workshop on Compositional Learning: Safety, Interpretability, and Agents

详情
AI中文摘要

三值逻辑问答(QA)在给定前提集 $S$ 的情况下,将 $\text{True}$、$\text{False}$ 或 $\text{Unknown}$ 之一分配给假设 $H$。我们将此任务视为一个紧凑的成分推理问题:在确定性否定映射下,$H$ 和机械否定假设 $\neg H$ 的预测应保持一致。尽管结构简单,大语言模型(LLM)可能表现出两种实际失败模式:(i) 否定不一致,即对 $H$ 和 $\neg H$ 的回答违反了所需的标签映射;(ii) 认知 $\text{Unknown}$,即模型在某一侧被蕴含时仍选择弃权。我们引入 CGD-PD,一个轻量级、无需训练的测试时层,结合神经三值分类、符号否定一致性投影和定向二值蕴含探测。在 FOLIO 一阶逻辑领域的一个验证集上,CGD-PD 在 GPT-5.2 上提升了 4.4 个百分点的准确率,在 Claude Sonnet 4.5 上提升了 6.8 个百分点,同时减少了 $\text{Unknown}$ 预测和认知弃权。这些结果提供了一个受控的概念验证,表明推理时的简单逻辑组合有助于评估和提高 LLM 推理可靠性;但本身并不足以证明在此形式化基准设置之外的鲁棒性。

英文摘要

Three-way logical question answering (QA) assigns one of $\text{True}$, $\text{False}$, or $\text{Unknown}$ to a hypothesis $H$ given a premise set $S$. We study this task as a compact compositional inference problem: predictions for $H$ and for a mechanically negated hypothesis $\neg H$ should agree under a deterministic negation map. Despite this simple structure, large language models (LLMs) can exhibit two practical failure modes: (i) negation inconsistency, where answers to $H$ and $\neg H$ violate the required label mapping, and (ii) epistemic $\text{Unknown}$, where the model abstains even when one side is entailed. We introduce CGD-PD, a lightweight, training-free test-time layer that combines neural 3-way classification, symbolic negation-consistency projection, and targeted binary entailment probes. On one validation split of FOLIO's first-order logic fields, CGD-PD improves accuracy by 4.4 points on GPT-5.2 and 6.8 points on Claude Sonnet 4.5, while reducing $\text{Unknown}$ predictions and epistemic abstention. These results provide a controlled proof of concept that simple logical composition at inference time can help evaluate and improve LLM reasoning reliability; they do not, by themselves, establish robustness beyond this formal benchmark setting.
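The deterministic negation map mentioned above can be made concrete with a short sketch. The map itself follows from three-valued semantics (True and False swap, Unknown stays Unknown); the `project` function and its confidence-based tie-break are hypothetical and are not taken from the paper, which does not describe its projection rule in the abstract.

```python
# Label that the negated hypothesis must receive, given the label of H.
NEGATION_MAP = {"True": "False", "False": "True", "Unknown": "Unknown"}


def is_negation_consistent(label_h, label_not_h):
    """Check that the predictions for H and not-H respect the negation map."""
    return NEGATION_MAP[label_h] == label_not_h


def project(label_h, conf_h, label_not_h, conf_not_h):
    """Hypothetical projection: if the pair is inconsistent, keep the more
    confident prediction and derive the other side from it."""
    if is_negation_consistent(label_h, label_not_h):
        return label_h, label_not_h
    if conf_h >= conf_not_h:
        return label_h, NEGATION_MAP[label_h]
    return NEGATION_MAP[label_not_h], label_not_h
```

For instance, a model that answers True for both $H$ and $\neg H$ is flagged as inconsistent, and under this assumed rule the more confident side decides the final pair.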

2604.05378 2026-05-28 cs.CL cs.CV 版本更新

ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

ICR-Drive:面向端到端语言驱动自动驾驶的指令反事实鲁棒性

Kaiser Hamid, Can Cui, Nade Liang

发表机构 * Texas Tech University(德克萨斯科技大学) Bosch Center for Artificial Intelligence (BCAI)(博世人工智能中心(BCAI))

AI总结 提出ICR-Drive框架,通过生成四类扰动指令(改写、歧义、噪声、误导)并基于CARLA仿真评估,揭示语言条件驾驶模型对指令变化的脆弱性。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 872-880
AI中文摘要

视觉-语言-动作(VLA)模型的最新进展使得语言条件驾驶代理能够在闭环仿真中执行自然语言导航命令,但标准评估大多假设指令精确且格式良好。在实际部署中,指令的措辞和具体性各不相同,可能省略关键限定词,偶尔还包含误导性的权威框架文本,导致指令级鲁棒性未被充分衡量。我们提出了ICR-Drive,一个用于端到端语言条件自动驾驶中指令反事实鲁棒性的诊断框架。ICR-Drive生成受控的指令变体,涵盖四类扰动:改写、歧义、噪声和误导,其中误导变体与导航目标冲突并试图覆盖意图。我们在匹配的仿真器配置和种子下重放相同的CARLA路线,以隔离由指令语言引起的性能变化。鲁棒性通过标准CARLA排行榜指标和相对于基线指令的每族性能下降来量化。在LMDrive和BEVDriver上的实验表明,微小的指令变化可能导致显著的性能下降和不同的故障模式,揭示了在安全关键驾驶中部署具身基础模型的可靠性差距。

英文摘要

Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.

2604.02645 2026-05-28 cs.CL cs.AI 版本更新

Speaking of Language: Reflections on Metalanguage Research in NLP

论语言:NLP中元语言研究的思考

Nathan Schneider, Antonios Anastasopoulos

发表机构 * Georgetown University(乔治城大学) George Mason University(乔治梅森大学)

AI总结 本文定义元语言概念,将其与NLP和LLM关联,介绍两个实验室以元语言为中心的研究,并讨论元语言的四个维度及元语言任务,提出未来研究方向。

Comments To appear at the Big Picture Workshop at ACL 2026. Camera-ready version

详情
AI中文摘要

本工作旨在聚焦元语言话题。我们首先定义元语言,将其与NLP和LLM联系起来,然后讨论我们两个实验室以元语言为中心的努力。最后,我们讨论元语言和元语言任务的四个维度,提供一系列尚未充分研究的未来研究方向。

英文摘要

This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss our two labs' metalanguage-centered efforts. Finally, we discuss four dimensions of metalanguage and metalinguistic tasks, offering a list of understudied future research directions.

2604.02028 2026-05-28 cs.CL 版本更新

Why Gaussian Diffusion Models Fail on Discrete Data and How to Prevent It?

为什么高斯扩散模型在离散数据上失败以及如何防止?

Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, Dmitry Vetrov

发表机构 * Constructor University(Constructor大学) Lomonosov Moscow State University(罗蒙诺索夫莫斯科国立大学)

AI总结 本文研究高斯扩散模型在离散数据上采样质量差的原因,发现关键采样区间内噪声数据密度呈多峰分布导致DDPM进入低密度区域,并提出自条件化和q采样结合的方法来改善生成质量。

详情
AI中文摘要

扩散模型已成为连续域生成建模的标准方法,但其在离散数据上的应用仍然具有挑战性。我们研究了使用DDPM求解器的高斯扩散模型为何难以从表示为连续空间中δ分布混合的离散分布中采样。通过一个玩具随机层次模型,我们识别出一个关键采样区间,在该区间内噪声数据的密度变为多峰分布。在这个区间内,DDPM偶尔会进入模式之间的低密度区域,为模型产生分布外输入并降低样本质量。我们表明,现有的启发式方法,包括自条件化和我们称之为q采样的求解器,有助于缓解这个问题。此外,我们证明在关键区间内将自条件化与从DDPM切换到q采样相结合,可以提高真实数据的生成质量。我们在多个领域的条件和无条件任务中验证了这些发现,包括文本、编程代码和蛋白质。

英文摘要

Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes, producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.

2604.00913 2026-05-28 cs.CV cs.CL 版本更新

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

跨描绘装配指令对齐的视觉-语言模型基准测试与机制分析

Zhuchenyang Liu, Yao Zhang, Yu Xiao

发表机构 * Aalto University(阿尔托大学)

AI总结 构建IKEA-Bench基准,评估19个视觉-语言模型在装配图与视频帧对齐任务上的表现,发现视觉编码是提升跨描绘鲁棒性的关键瓶颈。

详情
AI中文摘要

二维装配图通常是抽象的且难以遵循,因此需要智能助手来监控进度、检测错误并提供逐步指导。在混合现实环境中,此类系统必须从摄像头画面中识别已完成和正在进行的步骤,并将其与图示指令对齐。视觉语言模型(VLM)在此任务上展现出潜力,但由于装配图和视频帧共享的视觉特征极少,面临描绘鸿沟。为系统评估这一鸿沟,我们构建了IKEA-Bench基准,包含29个宜家家具产品的6种任务类型共1623个问题,并在三种对齐策略下评估了19个VLM(2B-38B)。主要发现:(1)装配指令理解可通过文本恢复,但文本同时降低了图到视频的对齐性能;(2)架构族比参数数量更能预测对齐精度;(3)视频理解是难以通过策略影响的硬瓶颈。三级机制分析进一步揭示,图和视频占据不相交的ViT子空间,且添加文本会使模型从视觉驱动转向文本驱动的推理。这些结果表明,视觉编码是提升跨描绘鲁棒性的主要目标。项目页面:https://ryenhails.github.io/IKEA-Bench/

英文摘要

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

2601.01627 2026-05-28 cs.CL cs.AI 版本更新

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

JMedEthicBench:用于评估日语大语言模型医疗安全性的多轮对话基准

Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato

发表机构 * Kyoto University(京都大学) Hohai University(河海大学) The University of Tokyo(东京大学) University of Science and Technology of China(中国科学技术大学) Hong Kong Polytechnic University(香港理工大学)

AI总结 提出首个多轮对话基准JMedEthicBench,基于日本医学会67条指南和7种自动越狱策略生成5万+对抗对话,评估27个模型发现医疗专用模型安全性脆弱,且多轮交互中安全性显著下降。

Comments 12 pages, 6 figures

详情
AI中文摘要

随着大语言模型(LLM)在医疗领域的部署日益增多,在临床使用前仔细评估其医疗安全性变得至关重要。然而,现有的安全基准仍然以英语为中心,并且仅使用单轮提示进行测试,尽管临床咨询是多轮的。为了解决这些差距,我们引入了JMedEthicBench,这是第一个用于评估日语医疗LLM医疗安全性的多轮对话基准。我们的基准基于日本医学会的67条指南,包含使用七种自动发现的越狱策略生成的超过50,000个对抗性对话。使用双LLM评分协议,我们评估了27个模型,发现商业模型保持了稳健的安全性,而医疗专用模型表现出更高的脆弱性。此外,安全分数在对话轮次中显著下降(中位数:9.5降至5.0,p < 0.001)。对我们的基准的日语和英语版本进行的跨语言评估表明,医疗模型的脆弱性跨语言持续存在,表明存在固有的对齐限制,而非语言特定因素。这些发现表明,领域特定的微调可能会意外削弱安全机制,并且多轮交互代表了一个需要专门对齐策略的独特威胁面。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test only single-turn prompts, even though clinical consultations are multi-turn. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may accidentally weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.

2505.13820 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Structured Agent Distillation for Large Language Model

大型语言模型的结构化智能体蒸馏

Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Harvard University(哈佛大学) MIT(麻省理工学院) Northeastern University(东北大学) Adobe Research(Adobe研究) National University of Singapore(新加坡国立大学) University of Georgia(佐治亚大学) Florida International University(佛罗里达国际大学)

AI总结 提出结构化智能体蒸馏框架,通过分段对齐推理和动作跨度,将大型语言模型智能体压缩为小型学生模型,在保持决策性能的同时降低推理成本。

详情
Journal ref
The 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
AI中文摘要

大型语言模型(LLMs)通过交错推理和动作(如ReAct风格框架)展现出作为决策智能体的强大能力。然而,它们的实际部署受到高推理成本和大模型规模的限制。我们提出结构化智能体蒸馏,一种将基于大型LLM的智能体压缩为更小的学生模型的框架,同时保持推理保真度和动作一致性。与标准的token级蒸馏不同,我们的方法将轨迹分割为[REASON]和[ACT]跨度,应用分段特定损失来使每个组件与教师行为对齐。这种结构感知的监督使紧凑的智能体能够更好地复制教师的决策过程。在ALFWorld、HotPotQA-ReAct和WebShop上的实验表明,我们的方法始终优于token级和模仿学习基线,在性能下降最小的情况下实现了显著的压缩。缩放和消融结果进一步强调了跨度级对齐对于高效可部署智能体的重要性。

英文摘要

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
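The idea of applying separate losses to [REASON] and [ACT] spans can be sketched as below. This is an assumption-laden illustration: the per-token KL divergence, the function names, and the weights `w_reason` and `w_act` are hypothetical choices, since the abstract says only that segment-specific losses are used.

```python
import math


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one token's distribution over the vocabulary."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def span_losses(teacher, student, span_tags):
    """Average the per-token KL separately over REASON and ACT tokens."""
    totals = {"REASON": 0.0, "ACT": 0.0}
    counts = {"REASON": 0, "ACT": 0}
    for t, s, tag in zip(teacher, student, span_tags):
        totals[tag] += token_kl(t, s)
        counts[tag] += 1
    return {
        tag: (totals[tag] / counts[tag]) if counts[tag] else 0.0
        for tag in totals
    }


def distillation_loss(teacher, student, span_tags, w_reason=1.0, w_act=1.0):
    """Weighted sum of the two span losses (the weights are assumed hyperparameters)."""
    losses = span_losses(teacher, student, span_tags)
    return w_reason * losses["REASON"] + w_act * losses["ACT"]
```

Keeping the two averages separate means a short action span is not drowned out by a long reasoning span, which is one plausible motivation for span-level rather than token-level alignment.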

2603.26182 2026-05-28 cs.CL 版本更新

ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

ClinicalAgents:具有双记忆的临床决策多智能体编排

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li

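Affiliations: The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology

AI summary: Proposes ClinicalAgents, a multi-agent framework that simulates clinical reasoning through dynamic orchestration based on Monte Carlo Tree Search and a dual-memory architecture, significantly improving diagnostic accuracy and explainability.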

Comments Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

虽然大型语言模型(LLMs)在医疗保健领域展现出潜力,但它们往往难以应对临床准确诊断所需的复杂非线性推理。现有方法通常依赖从症状到诊断的静态线性映射,未能捕捉人类临床医生固有的迭代、假设驱动推理。为弥补这一差距,我们引入了ClinicalAgents,一种新颖的多智能体框架,旨在模拟专家临床医生的认知工作流。与僵化的顺序链不同,ClinicalAgents采用了一种动态编排机制,建模为蒙特卡洛树搜索(MCTS)过程。这使得编排器能够迭代生成假设、主动验证证据,并在关键信息缺失时触发回溯。该框架的基础是双记忆架构:一个可变的短期工作记忆,用于维护不断演变的患者状态以进行上下文感知推理;以及一个静态的经验记忆,通过主动反馈循环检索临床指南和历史病例。大量实验表明,ClinicalAgents在评估的基线中取得了最佳性能,与强大的单智能体和多智能体基线相比,显著提高了诊断准确性和可解释性。我们的代码发布在https://github.com/ZhuohanGe/ClinicalAgents-Code。

英文摘要

While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent in human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. The foundation of this framework is a Dual-Memory architecture: a mutable working memory that maintains the evolving patient state for context-aware reasoning, and a static experience memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves the best performance among evaluated baselines, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines. Our code is released at https://github.com/ZhuohanGe/ClinicalAgents-Code.
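
For readers unfamiliar with MCTS-style orchestration, the selection step can be sketched with the standard UCT rule. The abstract does not state which selection rule ClinicalAgents uses, so the formula, the exploration constant `c`, and the action names below are assumptions for illustration only.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT: mean value plus an exploration bonus; unvisited nodes go first."""
    if visits == 0:
        return math.inf
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_action(children, c=1.4):
    """Pick the child action with the highest UCT score."""
    parent_visits = sum(ch["visits"] for ch in children.values()) or 1
    return max(
        children,
        key=lambda a: uct_score(children[a]["value"], children[a]["visits"], parent_visits, c),
    )

# Hypothetical orchestrator actions with accumulated value and visit counts.
children = {
    "order_lab_test": {"value": 3.0, "visits": 5},
    "propose_diagnosis": {"value": 1.0, "visits": 1},
    "backtrack": {"value": 0.0, "visits": 0},
}
first_choice = select_action(children)

# After the unvisited action has been tried once with zero reward.
children["backtrack"] = {"value": 0.0, "visits": 1}
second_choice = select_action(children)
```

The never-visited action is chosen first. Once every action has been tried, the rule balances mean value against how rarely an action has been explored.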

2601.19302 2026-05-28 cs.CL 版本更新

Formula-One Prompting: A Composable Equation-First Prefix for Applied Mathematics

Formula-One Prompting:一种可组合的方程优先前缀用于应用数学

Natapong Nitarach, Pittawat Taveekitworachai, Kunat Pipatanakul

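Affiliations: SCB DataX, SCBX Group

AI summary: Proposes Formula Prompting (FP) and Formula-One Prompting (F-1), which formalize a problem's governing equations before solving it; across several applied-math benchmarks, F-1 outperforms Chain-of-Thought and Program-of-Thought prompting by 5.76 and 8.42 percentage points on average, respectively.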

详情
AI中文摘要

本文介绍了公式提示(FP)和Formula-One提示(F-1),两种单次调用方法,在解决应用数学问题之前先引出控制方程。思维链(CoT)和程序思维(PoT)提示通过引出预训练期间学到的推理轨迹或类似代码的结构来改进数学推理。这提出了一个诊断性问题:哪些有用的预训练模式仍然未被充分引出?使用infini-gram-mini,我们扫描了81.7万亿预训练令牌,发现在精心策划的语料库(如DataComp-LM)中,以方程为中心的语言出现频率比代码高121倍,比逐步叙述高3.79倍,但标准提示方法并未明确引出方程形式化。FP要求模型在求解前先形式化问题的控制方程;F-1扩展了FP,增加了一个可组合的第二阶段,在同一调用中选择直接、CoT或PoT风格的求解。在五个推理模型和四个应用数学基准(金融、物理、密码学、竞赛数学)上,F-1平均优于CoT 5.76个百分点,优于PoT 8.42个百分点,在FinanceMath上取得最大提升13.30个百分点,同时以仅68个提示令牌的开销占据准确率-令牌效率前沿。变体消融实验表明,方程形式化前缀(而非策略菜单)是主要驱动因素:在前缀之上添加CoT或PoT不会带来进一步收益,且73.3%的剩余失败发生在第一阶段方程正确之后。

英文摘要

This paper introduces Formula Prompting (FP) and Formula-One Prompting (F-1), two single-call methods that elicit governing equations before solving applied-math problems. Chain-of-Thought (CoT) and Program-of-Thought (PoT) prompting improve mathematical reasoning by eliciting reasoning traces or code-like structures learned during pretraining. This suggests a diagnostic question: which useful pretraining patterns remain under-elicited? Using infini-gram-mini, we scan 81.7 trillion pretraining tokens and find that, in curated corpora such as DataComp-LM, equation-centered language appears 121x more often than code and 3.79x more often than step-by-step narration, yet standard prompting methods do not explicitly elicit equation formulation. FP asks the model to formalize a problem's governing equations before solving; F-1 extends FP with a composable Phase 2 that selects Direct, CoT, or PoT-style solving in the same call. Across five reasoning models and four applied-math benchmarks (finance, physics, cryptography, competition math), F-1 outperforms CoT by 5.76 pp and PoT by 8.42 pp on average, with the largest gain of 13.30 pp on FinanceMath, while topping the accuracy-token efficiency frontier at only 68 prompt tokens of overhead. Variant ablations identify the equation-formalization prefix, not the strategy menu, as the primary driver: adding CoT or PoT on top of the prefix yields no further gain, and 73.3% of remaining failures occur downstream of a correct Phase-1 equation.
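
A minimal sketch of what a single-call, equation-first prompt could look like is given below. The wording of both phases is invented for illustration; the paper's actual 68-token prefix is not reproduced in the abstract, and `build_f1_prompt` is a hypothetical helper.

```python
def build_f1_prompt(problem: str) -> str:
    """Assemble a single-call, equation-first prompt (illustrative wording only)."""
    phase1 = (
        "Phase 1: Before solving, write down the governing equations "
        "of the problem and define every variable."
    )
    phase2 = (
        "Phase 2: Choose ONE solving style (Direct, CoT, or PoT) "
        "and use it to solve the equations from Phase 1."
    )
    return f"{phase1}\n{phase2}\n\nProblem: {problem}\nAnswer:"

prompt = build_f1_prompt(
    "A bond pays a 5% annual coupon on a face value of 1000. "
    "What is the coupon payment per year?"
)
```

Dropping the `phase2` line would give an FP-style prompt in this sketch, which keeps the equation-formalization prefix that the ablations identify as the main driver.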

2603.22735 2026-05-28 cs.CL 版本更新

Explanation Generation for Contradiction Reconciliation with LLMs

面向矛盾调和的大语言模型解释生成

Jason Chan, Zhixue Zhao, Robert Gaizauskas

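Affiliations: University of Sheffield, UK

AI summary: Proposes the task of reconciliatory explanation generation; by repurposing NLI datasets and designing quality metrics, it evaluates 18 LLMs on the task and finds that most models achieve limited success and that the benefit of "thinking" plateaus as model size increases.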

Comments Preprint

详情
AI中文摘要

现有的NLP工作通常将矛盾视为需要通过选择接受或拒绝哪些陈述来解决的错误。然而,在社交互动和专业领域中,人类推理的一个关键方面是能够假设调和矛盾的解释。例如,“Cassie讨厌咖啡”和“她每天买咖啡”看似矛盾,但如果Cassie有每天为所有同事买咖啡这一不令人羡慕的日常任务,那么两者是兼容的。尽管大语言模型(LLM)的推理能力不断增强,但它们假设这种调和解释的能力在很大程度上仍未探索。为了填补这一空白,我们引入了调和解释生成任务,其中模型必须生成能够有效使矛盾陈述兼容的解释。我们提出了一种改造现有自然语言推理(NLI)数据集的新方法,并引入了可实现可扩展自动评估的质量指标。对18个LLM的实验表明,大多数模型在此任务中取得的成功有限,并且通过“思考”延长测试时计算的好处随着模型规模的增大而趋于平稳。我们的结果突显了LLM推理中一个未被充分探索的维度,以及解决这一限制以增强LLM下游应用(如聊天机器人和科学助手)的必要性。

英文摘要

Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, "Cassie hates coffee" and "She buys coffee every day" may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs' downstream applications such as chatbots and scientific aids.

2603.21465 2026-05-28 cs.CL cs.LG 版本更新

DRTriton: Large-Scale Synthetic Data Driven Reinforcement Learning for Triton Kernel Generation

DRTriton:大规模合成数据驱动的强化学习用于Triton内核生成

Siqi Guo, Ming Lin, Tianbao Yang

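Affiliations: Texas A&M University

AI summary: Proposes the DRTriton framework, which trains LLMs to convert PyTorch programs into optimized Triton kernels through synthetic data generation, curriculum reinforcement learning, and test-time search; on KernelBench Level 2 tasks it outperforms GPT-5.2 and Claude-Sonnet-4.5.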

详情
AI中文摘要

在生成式AI行业中,开发高效的CUDA内核是一项基础但具有挑战性的任务。最近的研究利用大型语言模型(LLMs)自动将PyTorch参考实现转换为CUDA内核,显著减少了工程工作量。最先进的LLMs,如GPT-5.2和Claude-Sonnet-4.5,仍然难以胜任此任务。为应对这一挑战,我们提出了DRTriton,一个可扩展的学习框架,用于训练LLM将PyTorch程序转换为高度优化的Triton内核,然后在运行时编译为CUDA内核。DRTriton包含三个关键组件:(i)数据合成算法CSP-DAG,保证在算子空间上的完全覆盖和具有可控难度的无偏均匀采样;(ii)具有解耦奖励的课程RL框架,联合优化转换成功率和执行速度;(iii)测试时搜索算法,进一步提高生成的Triton内核的执行速度。通过在使用现有LLM整理的有限PyTorch-Triton对上进行SFT预热阶段,DRTriton在合成PyTorch程序上通过RL训练,有效泛化到即使对人类专家也具挑战性的真实世界CUDA内核。实验结果表明,DRTriton-7B在92%的KernelBench Level 2任务上实现了相对于PyTorch的加速,而GPT-5.2为23%,Claude-Sonnet-4.5为19%。

英文摘要

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch programs into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum RL framework with decoupled rewards that jointly optimizes conversion success rate and execution speed; and (iii) a test-time search algorithm that further improves the execution speed of the generated Triton kernels. With a warmup stage of SFT on limited PyTorch-Triton pairs curated using existing LLMs, DRTriton, trained by RL on synthesized PyTorch programs, generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup over PyTorch on 92% of KernelBench Level 2 tasks, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.

2603.21165 2026-05-28 cs.CL cs.CV 版本更新

Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

多种方言,多种语言,一种文化视角:评估多语言视觉语言模型对孟加拉文化的理解,涵盖历史关联语言和地区方言

Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda

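Affiliations: United International University; BRAC University; University of Maryland, Baltimore County

AI summary: Proposes the BanglaVerse benchmark, built from manually curated images and expanded into multiple languages and dialects, to evaluate multilingual vision-language models on Bengali culture understanding; it finds that standard-Bangla evaluation overestimates model capability, that dialectal variation degrades performance, and that missing cultural knowledge is the main bottleneck.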

Comments https://labib1610.github.io/BanglaVerse/

详情
AI中文摘要

孟加拉文化通过地区、方言、历史、食物、政治、媒体和日常视觉生活丰富地表达,但在多模态评估中仍然代表性不足。为了解决这一差距,我们引入了BanglaVerse,这是一个文化基础的基准,用于评估多语言视觉语言模型(VLM)对孟加拉文化的理解,涵盖历史关联语言和地区方言。该基准由9个领域的1152张手动策划图像构建,支持视觉问答和字幕生成,并扩展为四种语言和五种孟加拉方言,产生约32.2K个工件。我们的实验表明,仅评估标准孟加拉语会高估真实模型能力:在方言变化下性能下降,尤其是字幕生成,而历史关联语言如印地语和乌尔都语保留了一些文化意义,但在结构化推理方面仍然较弱。跨领域来看,主要瓶颈是缺失文化知识而非仅视觉基础,尤其是知识密集型类别。这些发现将BanglaVerse定位为在语言变化下衡量文化基础多模态理解的更现实测试平台。

英文摘要

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.2K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, especially in knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

2601.04716 2026-05-28 cs.CL 版本更新

Identifying and Mitigating Bottlenecks in Role-Playing Agents: A Systematic Study of Disentangling Character Profile Axes

识别与缓解角色扮演代理中的瓶颈:解耦角色档案轴线的系统研究

Yonghyun Jun, Junhyuk Choi, Jeonghyun Park, Jihyeong Park, Liu Nicole Geumheon, Hwanhee Lee

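Affiliations: Chung-Ang University

AI summary: Systematically diagnoses performance bottlenecks of LLM role-playing agents by disentangling character profiles along three axes (Familiarity, Structure, and Disposition), and proposes Field-Aware Contrastive Decoding (FACD), a training-free strategy that mitigates the performance degradation associated with Disposition.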

Comments 28 pages

详情
AI中文摘要

尽管大语言模型(LLM)角色扮演代理发展迅速,但尚不清楚哪些档案元素真正驱动角色扮演质量。为填补这一空白,我们引入了一个系统诊断框架,沿三个轴线解耦角色档案的影响:熟悉度(已知 vs. 未知)、结构(结构化 vs. 非结构化)和性格(道德 vs. 不道德)。利用统一的分层模式(5个维度,28个字段),我们构建了一个包含211个人物的受控数据集,并在单轮和多轮交互中评估了五个LLM。我们的结果揭示了显著的不对称性:熟悉度和结构的影响可忽略,而性格在所有条件下对不道德角色产生大且一致的性能下降。进一步分析表明,道德-不道德差距被后SFT对齐放大,且这种下降在不同档案属性间差异显著。为缓解这一瓶颈,我们提出场感知对比解码(FACD),一种无训练策略,通过放大被抑制的性格敏感信号,显著缩小性能差距而不牺牲道德角色的性能。

英文摘要

While Large Language Model (LLM) role-playing agents have advanced rapidly, it remains unclear which profile elements genuinely drive role-playing quality. To bridge this gap, we introduce a systematic diagnostic framework that disentangles the impact of character profiles along three axes: Familiarity (Known vs. Unknown), Structure (Structured vs. Unstructured), and Disposition (Moral vs. Immoral). Utilizing a unified hierarchical schema (5 dimensions, 28 fields), we construct a controlled dataset of 211 personas and evaluate five LLMs on both single- and multi-turn interactions. Our results reveal a striking asymmetry: Familiarity and Structure show negligible impact, while Disposition produces large, consistent performance degradation for immoral characters across all conditions. Further analyses suggest that the Moral--Immoral gap is amplified by post-SFT alignment, and that this degradation varies substantially across profile attributes. To mitigate this bottleneck, we propose Field-Aware Contrastive Decoding (FACD), a training-free strategy that amplifies suppressed disposition-sensitive signals, significantly closing the performance gap without sacrificing moral-character performance.

2603.02097 2026-05-28 cs.CL 版本更新

ClinConsensus: A Physician-Calibrated Benchmark for Evaluating Clinical Rubric Coverage in Chinese Medical LLMs

ClinConsensus:一个用于评估中文医疗大模型临床评分标准覆盖率的医师校准基准

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Xue Yang, Kailuan Wu, Ruyi Xu, Tianyun Lu, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Sen Yang, Lin Qu, Bing Zhao, Hu Wei

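Affiliations: Alibaba Group

AI summary: To address the lack of physician-calibrated coverage of clinical response criteria in open-ended medical LLM evaluation, proposes the ClinConsensus benchmark of 2,500 expert-curated cases and introduces the Clinician-Anchored Coverage Score (CACS) with a dual-judge framework, finding a coverage gap of 19.2-21.9 points for frontier models.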

详情
AI中文摘要

开放域医疗大模型评估在医师校准的临床相关响应标准覆盖率方面仍然薄弱,尤其是在本地化临床环境中。我们引入了ClinConsensus,一个中文医疗基准,包含 2,500 个专家精选病例,涵盖 36 个专科、12 个任务主题、多个难度级别以及面向非专业与专业人员的场景。每个病例配有 30 个病例特定的二元评分标准。为了评估响应是否满足足够多的医师撰写的标准,我们提出了医师锚定覆盖率评分(CACS),一个在 k=10 实例化的医师校准阈值度量,并开发了一个双裁判框架,结合 GPT-5.1 评分器与一个医师监督的 Qwen3-8B 裁判。评估 11 个前沿大模型,我们发现存在持续的覆盖率差距:评分准确率在 39.6% 到 52.1% 之间,而 CACS@10 在 17.8% 到 32.9% 之间,模型间存在 19.2-21.9 个百分点的差距。分层分析进一步揭示了在推理、证据使用、结构化提取、用药说明、随访和对话语域方面的显著差异。这些结果表明,医疗大模型评估应衡量阈值化的、基于评分标准的临床覆盖率,而非平均部分正确性。

英文摘要

Open-ended medical LLM evaluation remains weakly grounded in physician-calibrated coverage of clinically relevant response criteria, especially in localized clinical settings. We introduce ClinConsensus, a Chinese medical benchmark of 2,500 expert-curated cases spanning 36 specialties, 12 task themes, multiple difficulty levels, and lay-facing versus professional-facing settings. Each case is paired with 30 case-specific binary rubric criteria. To evaluate whether responses satisfy enough physician-authored criteria, we propose the Clinician-Anchored Coverage Score (CACS), a physician-calibrated threshold metric instantiated at k=10, and develop a dual-judge framework combining a GPT-5.1 grader with a physician-supervised Qwen3-8B judge. Evaluating 11 frontier LLMs, we find a persistent coverage gap: Rubric Accuracy ranges from 39.6% to 52.1%, whereas CACS@10 ranges from 17.8% to 32.9%, leaving a 19.2-21.9 point gap across models. Stratified analyses further reveal substantial variation across reasoning, evidence use, structured extraction, medication instructions, follow-up, and dialogue register. These results suggest that medical LLM evaluation should measure thresholded, rubric-grounded clinical coverage rather than average partial correctness.

2601.04505 2026-05-28 cs.AI cs.CL cs.SY eess.SY 版本更新

CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

CircuitLM: 一种基于多智能体的大语言模型辅助设计框架,用于从自然语言提示生成电路原理图

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik

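Affiliations: Department of Computer Science and Engineering; Department of Electrical and Electronic Engineering; Islamic University of Technology

AI summary: Proposes CircuitLM, a multi-agent pipeline that converts natural-language prompts into structured CircuitJSON schematics using an embedding-powered component knowledge base and a five-stage process, with dual verification by deterministic electrical rule checking and an LLM-as-a-judge meta-evaluator, to address LLM hallucination and physical-constraint violations in circuit design.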

Comments Accepted at the 2026 IEEE International Conference on LLM-Aided Design (ICLAD), 10 pages, 8 figures, 6 tables

详情
AI中文摘要

从高层自然语言描述生成准确的电路原理图仍然是电子设计自动化(EDA)中的一个持久挑战,因为大语言模型(LLM)经常产生组件幻觉、违反严格的物理约束并输出非机器可读的结果。为解决此问题,我们提出CircuitLM,一个多智能体流水线,将用户提示转化为结构化的、视觉可解释的CircuitJSON原理图。该框架通过五个顺序阶段: (i) 组件识别,(ii) 规范引脚输出检索,(iii) 思维链推理,(iv) JSON原理图合成,以及(v) 交互式力导向可视化,基于一个精心策划的、嵌入驱动的组件知识库进行生成,从而减轻幻觉并确保物理可行性。我们在一个包含100个独特电路设计提示的数据集上,使用五个最先进的大语言模型评估了该系统。为系统评估性能,我们部署了严格的双层评估方法:一个确定性电气规则检查(ERC)引擎按严格严重性(关键、主要、次要、警告)对拓扑故障进行分类,同时一个LLM作为评判的元评估器识别复杂的、上下文感知的设计缺陷,这些缺陷绕过了标准的基于规则的检查器。最终,这项工作展示了目标检索与确定性和语义验证相结合如何将自然语言转化为结构可行的、原理图就绪的硬件和安全电路原型。我们的代码和数据公开在 https://github.com/Khandakar227/CircuitLM。

英文摘要

Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design automation (EDA), as large language models (LLMs) frequently hallucinate components, violate strict physical constraints, and produce non-machine-readable outputs. To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics. The framework mitigates hallucination and ensures physical viability by grounding generation in a curated, embedding-powered component knowledge base through five sequential stages: (i) component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning, (iv) JSON schematic synthesis, and (v) interactive force-directed visualization. We evaluate the system on a dataset of 100 unique circuit-design prompts using five state-of-the-art LLMs. To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while an LLM-as-a-judge meta-evaluator identifies complex, context-aware design flaws that bypass standard rule-based checkers. Ultimately, this work demonstrates how targeted retrieval combined with deterministic and semantic verification can bridge natural language to structurally viable, schematic-ready hardware and safe circuit prototyping. Our code and data are publicly available at https://github.com/Khandakar227/CircuitLM.

2512.20780 2026-05-28 cs.CL cs.CY 版本更新

Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles

大型语言模型在数学辅导中接近专家教学质量,但在教学和语言特征上存在差异

Ramatu Oiza Abdulsalam, Segun Aroyehun

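Affiliations: African University of Science and Technology; University of Konstanz

AI summary: By analyzing a dataset of math tutoring dialogues, compares the pedagogical quality of expert tutors, novice tutors, and seven LLMs; finds that LLMs approach expert quality on average but differ systematically in instructional strategies and linguistic characteristics.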

详情
AI中文摘要

最近的工作探索了使用大型语言模型(LLMs)生成数学辅导回应,但尚不清楚其教学行为与人类专家实践的接近程度。我们分析了一个数学补救对话数据集,其中专家教师、新手教师和七种不同规模的大型语言模型(包括开放权重和商业模型)对相同的学生错误做出回应。我们检查了教学策略和辅导回应的语言特征,包括吸收(重述和转述)、追问准确性和推理、词汇多样性、可读性、礼貌性和能动性。我们发现专家教师产生的回应质量高于新手教师,并且较大的LLMs通常比较小的模型获得更高的教学质量评分,平均接近专家表现。然而,LLMs在教学特征上表现出系统性差异:它们较少使用专家教师特有的讨论策略,同时生成更长、词汇更丰富、更礼貌的回应。回归分析表明,追问准确性和推理、重述和转述以及词汇多样性与感知教学质量正相关,而更高水平的能动性和礼貌性语言则负相关。这些发现强调了在评估人类教师和智能辅导系统的辅导回应时分析教学策略和语言特征的重要性。

英文摘要

Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We analyze a dataset of math remediation dialogues in which expert tutors, novice tutors, and seven LLMs of varying sizes, comprising both open-weight and commercial models, respond to the same student errors. We examine instructional strategies and linguistic characteristics of tutoring responses, including uptake (restating and revoicing), pressing for accuracy and reasoning, lexical diversity, readability, politeness, and agency. We find that expert tutors produce higher-quality responses than novices, and that larger LLMs generally receive higher pedagogical quality ratings than smaller models, approaching expert performance on average. However, LLMs exhibit systematic differences in their instructional profiles: they underuse discursive strategies characteristic of expert tutors while generating longer, more lexically diverse, and more polite responses. Regression analyses show that pressing for accuracy and reasoning, restating and revoicing, and lexical diversity are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. These findings highlight the importance of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

2603.14864 2026-05-28 cs.CL 版本更新

Shopping Companion: Benchmarking and Training LLM Agents for Long-Horizon Preference-Grounded E-Commerce Tasks

购物助手:面向长期偏好引导的电子商务任务的LLM智能体基准测试与训练

Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

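Affiliations: Alibaba International Digital Commercial Group

AI summary: To address the lack of benchmarks for long-horizon preference-aware shopping tasks and of fine-grained training supervision in e-commerce, proposes the Shopping Companion Bench benchmark and annotation-free, tool-wise rewards, improving LLM agents' preference capture and task performance.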

详情
AI中文摘要

在电子商务中,LLM智能体在推荐、预算管理和捆绑销售等购物任务中展现出潜力,其中从长期对话中准确捕捉用户偏好至关重要。然而,进展受到两个关键挑战的限制:(1)缺乏评估长期偏好感知购物任务的基准,(2)缺乏用于购物智能体训练的细粒度监督。为了填补基准空白,我们引入了Shopping Companion Bench,这是一个新颖的基准,包含两个需要跨会话偏好记忆的购物任务,基于超过120万真实商品的产品池。我们的分析进一步指出了该基准上失败的两个主要来源:偏好幻觉导致的级联错误,以及未能充分验证产品属性是否符合用户需求。为了解决这些失败模式,我们设计了免标注的、工具级奖励,为每次工具调用提供过程监督,从而缓解了长期任务中的奖励稀疏问题。实验结果表明,即使是GPT-5等最先进模型,成功率也低于70%,凸显了我们基准的难度。值得注意的是,我们微调的轻量级4B模型在偏好捕获和任务性能上均持续优于强基线,表明我们奖励设计的有效性。

英文摘要

In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budget management, and bundle deals, where accurately capturing user preferences from long-horizon conversations is critical. However, progress is limited by two key challenges: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of fine-grained supervision for shopping agent training. To fill the benchmark gap, we introduce Shopping Companion Bench, a novel benchmark comprising two shopping tasks that require cross-session preference memory, grounded in a product pool of over 1.2 million real-world items. Our analysis further identifies two major sources of failure on this benchmark: cascading errors caused by preference hallucination, and insufficient verification of product attributes against user requirements. To address these failure modes, we design annotation-free, tool-wise rewards that provide process supervision for each tool call, alleviating reward sparsity in long-horizon tasks. Experimental results demonstrate that even state-of-the-art models such as GPT-5 achieve success rates below 70%, highlighting the difficulty of our benchmark. Notably, our fine-tuned lightweight 4B model consistently outperforms strong baselines in both preference capture and task performance, suggesting the effectiveness of our reward design.

2602.22787 2026-05-28 cs.CL cs.AI 版本更新

Probing for Knowledge Attribution in Large Language Models

探测大型语言模型中的知识归因

Ivo Brink, Alexander Boer, Dennis Ulmer

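Affiliations: KPMG NL; University of Amsterdam

AI summary: Uses linear probes on hidden representations to classify the dominant knowledge source (memory or context) behind LLM outputs, and introduces AttriWiki, a self-supervised pipeline for generating training data; the probes achieve high F1 scores across multiple models and datasets.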

详情
AI中文摘要

大型语言模型(LLM)的幻觉,即流畅但事实不正确的生成,分为两类:忠实性违反,即模型误用提供的上下文;以及事实性违反,即答案反映内部知识中的错误。适当的缓解取决于知道哪个来源驱动每个答案。我们研究贡献性归因,即对每个输出背后的主导知识来源进行分类,并表明在隐藏表示上训练的简单线性探针可以可靠地识别它。我们引入了AttriWiki,一个自监督流水线,通过提示模型从记忆中回忆被隐藏的实体或从上下文中读取它们,而不依赖知识冲突,自动生成标记的训练数据。在AttriWiki上训练的探针在Llama-3.1-8B、Mistral-7B和Qwen-7B上达到高达0.96的Macro-$F_1$,迁移到SQuAD和WebQuestions时达到0.94-0.99的Macro-$F_1$,并零样本泛化到Tighidet等人(2024)的基准,在冲突设置上无需重新训练即优于他们的探针。此外,归因不匹配会使错误率提高高达70%,尽管正确的归因并不能保证正确的答案,这表明需要更广泛的检测框架。

英文摘要

Large language model (LLM) hallucinations, meaning fluent but factually incorrect generations, fall into two types: faithfulness violations, where the model misuses provided context, and factuality violations, where answers reflect errors in internal knowledge. Proper mitigation depends on knowing which source drives each answer. We study contributive attribution, i.e. the classification of the dominant knowledge source behind each output, and show that a simple linear probe trained on hidden representations can reliably identify it. We introduce AttriWiki, a self-supervised pipeline that automatically generates labelled training data by prompting models to recall withheld entities from memory or read them from context without relying on knowledge conflicts. Probes trained on AttriWiki achieve up to 0.96 Macro-$F_1$ on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transfer to SQuAD and WebQuestions with 0.94-0.99 Macro-$F_1$, and generalise zero-shot to Tighidet et al. (2024)'s benchmark, outperforming their probe on conflicting settings without retraining. Furthermore, attribution mismatches raise error rates by up to 70%, though correct attribution does not guarantee correct answers, pointing to the need for broader detection frameworks.
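
As a rough illustration of what "a simple linear probe trained on hidden representations" means, the sketch below fits a logistic-regression classifier on synthetic vectors that stand in for hidden states. The data generator, the label convention (1 for context-driven, 0 for memory-driven), and all hyperparameters are assumptions; no real model activations or AttriWiki data are involved.

```python
import math
import random

def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression (a linear probe) with full-batch gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (context) or 0 (memory) from the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Synthetic stand-in for hidden states: each class is shifted along every dimension.
rng = random.Random(0)

def fake_hidden_state(label, dim=8):
    shift = 1.5 if label == 1 else -1.5
    return [rng.gauss(shift, 1.0) for _ in range(dim)]

labels = [i % 2 for i in range(200)]
features = [fake_hidden_state(y) for y in labels]
w, b = train_linear_probe(features[:150], labels[:150])
accuracy = sum(predict(w, b, x) == y for x, y in zip(features[150:], labels[150:])) / 50
```

With real activations, `features` would be replaced by hidden-state vectors extracted from a chosen layer. The probe itself stays this simple: a single weight vector and a bias.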

2602.20020 2026-05-28 cs.CL 版本更新

CodeGENCAT: Generative Computerized Adaptive Testing for Open-ended Coding Problems

CodeGENCAT:面向开放式编程问题的生成式计算机自适应测试

Wanyong Feng, Alexander Scarlatos, Ruochen Sun, Andrew Lan

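Affiliations: University of Massachusetts Amherst; Independent Researcher

AI summary: Proposes the CodeGENCAT framework, which predicts student code responses with a Generative Item Response Theory model and introduces three question-selection algorithms, outperforming existing CAT baselines on programming education datasets.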

Comments 23 pages, 2 figures

详情
AI中文摘要

现有的计算机自适应测试(CAT)框架通常根据预测的学生正确回答概率来选题。这种设计忽略了学生开放式回答中包含的信息,尤其是在编程教育等领域,代码结构和错误蕴含丰富的学生知识信息。在这项工作中,我们提出了Code GENerative CAT(CodeGENCAT),一种使用预测的学生代码响应来选题的生成式CAT框架。首先,我们开发了一个生成式项目反应理论(GIRT)模型,该模型根据估计的学生知识生成代码响应,通过监督微调和直接偏好优化进行知识-响应对齐训练。其次,我们引入了三种选题算法,分别衡量不确定性、编码风格多样性以及从预测的学生代码响应中提取的信息。在两个真实世界的编程教育数据集上的实验表明,CodeGENCAT优于所有CAT基线,在自适应测试早期阶段,AUC比最强基线提高了4.32%。

英文摘要

Existing Computerized Adaptive Testing (CAT) frameworks typically select questions based on the predicted likelihood that the student will answer correctly. This design ignores information contained in students' open-ended responses, especially in domains such as programming education, where code structures and bugs contain rich information on student knowledge. In this work, we propose Code GENerative CAT (CodeGENCAT), a generative CAT framework that selects questions using predicted student code responses. First, we develop a Generative Item Response Theory (GIRT) model that generates code responses conditioned on estimated student knowledge, trained with supervised fine-tuning followed by direct preference optimization for knowledge-response alignment. Second, we introduce three question-selection algorithms that measure uncertainty, coding style diversity, and information from predicted student code responses. Experiments on two real-world programming education datasets show that CodeGENCAT outperforms all CAT baselines, achieving an AUC improvement of up to 4.32% over the strongest baseline in the early stages of adaptive testing.

2602.17003 2026-05-28 cs.CL cs.AI 版本更新

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Persona2Web: 基于用户历史进行上下文推理的个性化Web智能体基准

Serin Kim, Sangam Lee, Dongha Lee

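Affiliations: Department of Artificial Intelligence, Yonsei University, Seoul, Republic of Korea

AI summary: Proposes the Persona2Web benchmark, which uses the clarify-to-personalize principle to evaluate how well web agents on the real open web use user history to resolve ambiguous queries, and introduces a reasoning-aware evaluation framework.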

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型推动了Web智能体的发展,但当前的智能体缺乏个性化能力。由于用户很少明确说明其意图的每个细节,实用的Web智能体必须能够通过推断用户偏好和上下文来解释模糊查询。为应对这一挑战,我们提出了Persona2Web,这是首个在真实开放网络上评估个性化Web智能体的基准,基于澄清-个性化原则构建,要求智能体根据用户历史而非依赖显式指令来解决歧义。Persona2Web包括:(1) 在长时间跨度内隐含揭示偏好的用户历史,(2) 需要智能体推断隐含用户偏好的模糊查询,以及(3) 一个推理感知评估框架,能够对个性化进行细粒度评估。我们针对各种智能体架构、骨干模型、历史访问方案和不同模糊程度的查询进行了广泛实验,揭示了个性化Web智能体行为中的关键挑战。为便于复现,我们的代码和数据集公开在 https://serin-kimm.github.io/Persona2Web/。

英文摘要

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our code and datasets are publicly available at https://serin-kimm.github.io/Persona2Web/.

2602.15894 2026-05-28 cs.CL cs.LG 版本更新

Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

质量约束的熵最大化策略优化用于LLM多样性

Haihui Pan, Yuzhong Hong, Kaichen Zhang, Shaoke Lv, Junwei Bao, Hongfei Jiang, Yang Song

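Affiliations: Zuoyebang Education Technology

AI summary: Proposes the QEMPO framework, which uses a theoretically derived closed-form solution to maximize entropy under a quality constraint and thereby increase LLM output diversity; experiments show improved diversity without sacrificing quality.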

详情
AI中文摘要

在许多大语言模型(LLM)对齐应用中,用户不仅期望高质量输出,还希望有显著的多样性。然而,现有方法通常面临这些目标之间的根本权衡:提高输出质量的方法往往会降低多样性,而增加多样性的方法往往以牺牲质量为代价。在这项工作中,我们提出了质量约束的熵最大化策略优化(QEMPO),这是一个新颖的框架,在明确保持输出质量的同时增强LLM输出的多样性。QEMPO建立在坚实的理论基础之上:我们推导出一个闭式解析解,该解在质量约束下可证明地最大化熵(多样性的原则性度量),并在定义的目标下保证最优性。利用这一解,QEMPO自然支持在线和离线训练设置。实验结果表明,QEMPO在不牺牲质量的情况下持续提高输出多样性,并且在许多情况下,与现有基线相比,在质量和多样性两个维度上都取得了提升,与我们的理论保证一致。

英文摘要

In many large language model (LLM) alignment applications, users expect not only high-quality outputs but also substantial diversity. However, existing methods often face a fundamental trade-off between these objectives: approaches that improve output quality tend to reduce diversity, while methods that increase diversity often do so at the expense of quality. In this work, we propose Quality-constrained Entropy Maximization Policy Optimization (QEMPO), a novel framework that enhances the diversity of LLM outputs while explicitly preserving output quality. QEMPO is grounded in a strong theoretical foundation: we derive a closed-form analytical solution that provably maximizes entropy (a principled measure of diversity) subject to a quality constraint, with guarantees on optimality under the defined objective. Leveraging this solution, QEMPO naturally supports both online and offline training settings. Empirical results demonstrate that QEMPO consistently improves output diversity without sacrificing quality, and in many cases yields gains in both dimensions compared to existing baselines, aligning with our theoretical guarantees.
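
For intuition only, the textbook version of entropy maximization under an expected-quality constraint has a Gibbs-form solution. The notation below (policy $\pi$, quality score $r$, threshold $\tau$, multiplier $\lambda$) is assumed for this sketch; the paper's own objective and closed form are not given in the abstract and may differ, for example by involving a reference policy.

```latex
% Constrained problem (assumed form): maximize entropy subject to a quality floor.
\max_{\pi(\cdot\mid x)} \; H\big(\pi(\cdot\mid x)\big)
  = -\sum_{y} \pi(y\mid x)\,\log \pi(y\mid x)
\quad \text{s.t.} \quad
\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] \ge \tau,
\qquad \sum_{y} \pi(y\mid x) = 1 .

% Stationarity of the Lagrangian in \pi(y\mid x), with multipliers \lambda \ge 0 and \mu:
-\log \pi(y\mid x) - 1 + \lambda\, r(x,y) + \mu = 0 .

% Solving for \pi gives the Gibbs form:
\pi^{*}(y\mid x)
  = \frac{\exp\big(\lambda\, r(x,y)\big)}{\sum_{y'} \exp\big(\lambda\, r(x,y')\big)} ,
\qquad \lambda \ge 0 .
```

Here $\lambda = 0$ (the uniform distribution) when the uniform policy already meets the quality floor. Otherwise $\lambda$ is chosen so that the constraint holds with equality, which spreads probability as evenly as the quality requirement allows.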

2601.16800 2026-05-28 cs.CL 版本更新

Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

大语言模型作为细粒度意见分析的自动标注者和标注裁决者

Gaurav Negi, MA Waskow, John McCrae, Omnia Zayed, Paul Buitelaar

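Affiliations: Data Science Institute; University of Galway

AI summary: Explores LLMs as automatic annotators for fine-grained opinion analysis, proposing a declarative annotation pipeline and an LLM adjudication method; experiments show that LLMs are reliable at the span level but struggle to reproduce relational structures, making them better suited as annotation assistants than as full replacements for human annotators.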

详情
AI中文摘要

文本的细粒度意见分析提供了对表达情感的详细理解,包括所涉及的实体。尽管这种详细程度很有价值,但在数据集中标注意见以训练模型需要大量人力投入和成本,尤其是在不同领域和实际应用中。为了解决领域特定标注数据集的短缺,我们探索了LLM作为自动标注者进行细粒度意见分析的可行性。我们使用声明式标注流水线,这种方法减少了在使用LLM识别文本中细粒度意见跨度时手动提示工程的可变性。我们还提出了一种专门的方法,让LLM裁决多个标签并产生最终标注。我们使用不同大小的模型在方面情感三元组提取(ASTE)和方面-类别-意见-情感(ACOS)分析任务上试用了该流水线。在这项工作中,我们试图开发完全自主的基于LLM的标注者,但我们的结果揭示了一个不均衡的画面,其特点是关键的性能分叉:LLM在跨度级别可靠,但难以忠实地再现连接这些跨度的关系结构。这表明LLM更适合作为高保真标注助手和数据增强工具,以扩展细粒度意见标注数据集,而不是完全取代人类标注者。

英文摘要

Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is valuable, annotating opinions in datasets for model training requires considerable human effort and substantial cost, especially across diverse domains and real-world applications. To address this shortage of domain-specific labelled datasets, we explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis. We use a declarative annotation pipeline, an approach that reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a dedicated methodology for an LLM to adjudicate multiple labels and produce final annotations. We trial the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks. In this work, we attempt to develop fully autonomous LLM-based annotators, but our results reveal an uneven picture characterised by a critical performance bifurcation: LLMs are reliable at the span level yet struggle to faithfully reproduce the relational structures that connect those spans. This suggests that LLMs are better positioned as high-fidelity annotation assistants and data augmentation tools to expand fine-grained opinion-annotated datasets, rather than replacing human annotators entirely.

2602.15198 2026-05-28 cs.MA cs.AI cs.CL 版本更新

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Colosseum: 审计合作多智能体系统中的合谋行为

Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto, Shlomo Zilberstein, Eugene Bagdasarian

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校) University of Virginia(弗吉尼亚大学) ELLIS Institute Tübingen(图宾根ELLIS研究所) MPI for Intelligent Systems, Tübingen(图宾根智能系统研究所) AI Center(人工智能中心)

AI总结 提出Colosseum框架,通过形式化决策框架和基于遗憾的度量审计LLM智能体在合作多智能体系统中的合谋行为,发现大多数模型存在涌现式合谋倾向,并观察到“纸上合谋”现象。

详情
AI中文摘要

多智能体系统中,通过自由形式语言通信的LLM智能体能够实现复杂的协调以解决复杂的合作任务。当一组智能体形成联盟并合谋追求次要目标、损害联合目标时,这会产生独特的安全问题。在本文中,我们提出Colosseum,一个用于审计多智能体设置中LLM智能体合谋行为的框架。我们通过形式化的多智能体决策框架来理解智能体如何合作,并通过相对于合作最优的遗憾来度量基于行动的合谋行为,并将其与基于通信的合谋行为进行比较。Colosseum能够在良性设置、不同联盟目标、说服策略和网络拓扑下审计LLM智能体的合谋行为。然后,我们通过创建智能体之间的秘密通信渠道引入一种新的行为探针,表明大多数开箱即用的模型在此探针下表现出合谋倾向,我们称之为涌现式合谋。此外,我们发现了“纸上合谋”现象,即智能体在文本中计划合谋但往往选择非合谋行动。Colosseum提供了一种审计合作多智能体系统中合谋的新方法,同时呈现了关于合谋如何出现、什么影响合谋效率以及哪些策略可能缓解合谋的观察。

英文摘要

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when a group of agents forms a coalition and colludes to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents' collusive behavior in multi-agent settings. We ground how agents cooperate through a formal multi-agent decision-making framework, measure action-based collusive behavior via regret relative to the cooperative optimum, and compare it with communication-based collusive behavior. Colosseum enables audits of LLM agents for collusion under benign settings, different coalition objectives, persuasion tactics, and network topologies. We then introduce a new behavioral probe by creating secret communication channels between agents, showing that most out-of-the-box models exhibit a propensity to collude under this probe, which we term emergent collusion. Furthermore, we discover "collusion on paper", where agents plan to collude in text but often pick non-collusive actions. Colosseum provides a new way to audit collusion in cooperative multi-agent systems while presenting observations about how collusion emerges, what affects collusion efficacy, and which strategies may mitigate it.
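
The regret-based measure mentioned in the abstract can be sketched on a toy example. This is a minimal illustration, not Colosseum's implementation: the two-agent payoff table, the action names, and the function names are all hypothetical, and the cooperative optimum is taken to be the best joint value in the table.

```python
def joint_value(actions, payoff):
    """Joint (team) objective obtained for a tuple of actions, one per agent."""
    return payoff[actions]


def regret(actions, payoff):
    """Gap between the cooperative optimum and the joint value actually obtained."""
    cooperative_optimum = max(payoff.values())
    return cooperative_optimum - joint_value(actions, payoff)


# Hypothetical 2-agent task: each agent either shares or hoards a resource.
payoff = {
    ("share", "share"): 10.0,
    ("share", "hoard"): 6.0,
    ("hoard", "share"): 6.0,
    ("hoard", "hoard"): 3.0,
}

cooperative_regret = regret(("share", "share"), payoff)
coalition_regret = regret(("hoard", "hoard"), payoff)
print(cooperative_regret, coalition_regret)
```

A regret of zero means the chosen actions reached the cooperative optimum. A large regret from coalition members is one action-level signal that can be compared with what the agents said in their messages.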

2602.13748 2026-05-28 cs.CL cs.CV 版本更新

RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

RMPL:基于关系感知的多任务渐进学习与分阶段训练的多媒体事件抽取

Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 提出RMPL框架,通过分阶段训练结合单模态事件抽取和多模态关系抽取的异构监督,在低资源条件下实现多媒体事件抽取,并在M2E2基准上取得一致改进。

Comments Accepted by ACM ICMR 2026

详情
AI中文摘要

多媒体事件抽取(MEE)旨在从包含文本和图像的文档中识别事件及其论元。它需要在不同模态之间对事件语义进行锚定(grounding)。MEE 的进展受到缺乏标注训练数据的限制。M2E2 是唯一已建立的基准,但它仅提供评估用的标注。这使得直接监督训练不切实际。现有方法主要依赖于跨模态对齐或使用视觉-语言模型(VLM)进行推理时提示。这些方法没有显式学习结构化的事件表示,并且通常在多模态设置中产生较弱的论元锚定。为解决这些限制,我们提出了 RMPL,一种用于低资源条件下 MEE 的基于关系感知的多任务渐进学习框架。RMPL 通过分阶段训练整合了来自单模态事件抽取和多媒体关系抽取的异构监督。模型首先使用统一模式进行训练,以学习跨模态的共享事件中心表示。然后,使用混合文本和视觉数据对模型进行微调,以进行事件提及识别和论元角色抽取。在 M2E2 基准上使用多个 VLM 进行的实验表明,在不同模态设置下均取得了一致的改进。

英文摘要

Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

2602.06054 2026-05-28 cs.CL 版本更新

Are We Truly Innovating? A Qualitative and Quantitative Study of Originality in AI Research Papers

我们真的在创新吗?AI研究论文原创性的定性与定量研究

Abeer Mostafa, Thi Huyen Nguyen, Zahra Ahmadi

发表机构 * Peter L. Reichertz Institute for Medical Informatics(汉诺威医学院彼得·L·里赫茨医学信息学研究所) L3S Research Center(L3S研究中心) Lower Saxony Center for Artificial Intelligence and Causal Methods in Medicine (CAIMed)(下萨克森人工智能与医学因果方法中心(CAIMed))

AI总结 基于10万+同行评审报告,通过定性与定量方法分析AI研究论文原创性的感知维度,并评估大语言模型在原创性评估中的可靠性。

详情
AI中文摘要

评估AI研究的原创性可以说是同行评审中最重要但最不可靠的步骤。评审者对原创性的判断仍然不透明、不一致,并且依赖于对先前工作的比较,而这些比较往往不完整。在本文中,我们基于来自顶级AI会议的超过10万份同行评审报告,对研究原创性进行了大规模、数据驱动的定性与定量分析,涵盖了该领域快速增长的时期。利用结构化的、语义检索的先前工作以及嵌入在专家评审者评估中的信号,我们系统地描述了原创性在实践中是如何被感知的,并识别出最强烈影响新颖性判断的关键维度。我们的分析产生了一个细粒度、基于证据的框架,为作者和评审者提供了关于原创性如何被评估的可操作见解。此外,我们评估了当前大语言模型(LLM)智能体在评估原创性方面的可靠性。我们发现这些模型倾向于系统性地高估新颖性,并且在检测概念抄袭方面存在困难,尤其是在存在改写的情况下。我们在以下网址发布我们的数据集、训练模型和代码:https://anonymous.4open.science/r/Novelty-Reviewer-365C/。

英文摘要

Assessing originality in AI research is arguably the most consequential yet least reliable step in peer review. Reviewer judgments of originality remain opaque, inconsistent, and dependent on comparisons to prior work that are often incomplete. In this paper, we present a large-scale, data-driven qualitative and quantitative analysis of research originality based on over 100,000 peer-review reports from leading AI venues, spanning a period of rapid growth in the field. Leveraging structured, semantically retrieved prior work and signals embedded in expert reviewer assessments, we systematically characterize how originality is perceived in practice and identify the key dimensions that most strongly influence novelty judgments. Our analysis yields a fine-grained, evidence-based framework that equips both authors and reviewers with actionable insights into how originality is evaluated. In addition, we evaluate the reliability of current large language model (LLM) agents in assessing originality. We find that these models tend to systematically overestimate novelty and struggle to detect conceptual plagiarism, particularly in the presence of paraphrasing. We release our dataset, trained models, and code at: https://anonymous.4open.science/r/Novelty-Reviewer-365C/.

2602.07574 2026-05-28 cs.CV cs.CL 版本更新

ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

ViCA:仅视觉交叉注意力的高效多模态大语言模型

Wenjie Liu, Hao Wu, Xin Qiu, Xudong Wang, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology(宁波东方理工大学宁波数字孪生研究院) Munich Center for Machine Learning, LMU Munich(慕尼黑大学慕尼黑机器学习中心)

AI总结 提出ViCA架构,通过仅视觉交叉注意力减少视觉令牌计算,在保持98%准确率的同时将视觉计算降至4%,实现显著加速。

详情
AI中文摘要

现代多模态大语言模型(MLLMs)采用统一的自注意力设计,在每个Transformer层处理视觉和文本令牌,导致大量计算开销。在这项工作中,我们重新审视了这种密集视觉处理的必要性,并表明投影的视觉嵌入已经与语言空间良好对齐,而有效的视觉-语言交互仅发生在少数层中。基于这些见解,我们提出了ViCA(仅视觉交叉注意力),一种最小的MLLM架构,其中视觉令牌绕过所有自注意力层和前馈层,仅通过稀疏的交叉注意力在选定层与文本交互。在三个MLLM骨干、九个多模态基准和26个基于剪枝的基线上的广泛评估表明,ViCA在将视觉侧计算减少到4%的同时保持了98%的基线准确率,始终实现了优越的性能-效率权衡。此外,ViCA提供了一个规则的、硬件友好的推理流水线,在单批推理中实现了超过3.5倍的加速,在多批推理中实现了超过10倍的加速,与仅文本的LLM相比,将视觉定位减少到接近零的开销。它还与令牌剪枝方法正交,可以无缝结合以进一步提高效率。我们的代码可在https://github.com/EIT-NLP/ViCA获取。

英文摘要

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
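
The interaction pattern described in the abstract, text states querying visual tokens while the visual tokens themselves are never updated, can be sketched as below. This is a deliberately simplified, single-head illustration with no learned projections; the residual addition and the function names are assumptions and do not reflect the released ViCA code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_tokens):
    """Each text state attends over the visual tokens; visual tokens stay fixed."""
    dim = len(text_states[0])
    updated = []
    for query in text_states:
        # Scaled dot-product scores between one text query and every visual token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_tokens
        ]
        weights = softmax(scores)
        context = [
            sum(w * value[d] for w, value in zip(weights, visual_tokens))
            for d in range(dim)
        ]
        # Residual update of the text state only (an assumption of this sketch).
        updated.append([h + c for h, c in zip(query, context)])
    return updated


text = [[1.0, 0.0], [0.0, 1.0]]
vision = [[0.5, 0.5]]
print(cross_attention(text, vision))
```

Because the visual tokens only serve as keys and values, their cost in such a layer scales with the number of text queries rather than with a full pass of every visual token through self-attention and feed-forward blocks.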

2602.05897 2026-05-28 cs.CL 版本更新

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

停止奖励幻觉步骤:面向小型推理模型的忠实感知步骤级强化学习

Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算与智能研究院) Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd(中国电信人工智能研究院) College of Integrated Circuits, Zhejiang University, Hangzhou, Zhejiang, China(浙江大学集成电路学院) Zhongguancun Academy, Beijing, China(中关村学院)

AI总结 针对小型推理模型在中间推理步骤中容易产生忠实性幻觉的问题,提出忠实感知步骤级强化学习(FaithRL),通过过程奖励模型提供步骤级监督和隐式截断重采样策略,减少幻觉并提高推理可靠性。

详情
AI中文摘要

随着大型语言模型变得更小更高效,小型推理模型(SRM)在资源受限环境中实现思维链(CoT)推理至关重要。然而,它们容易产生忠实性幻觉,尤其是在中间推理步骤中。现有的基于在线强化学习的缓解方法依赖于结果奖励或粗粒度的CoT评估,这可能在最终答案正确时无意中强化不忠实的推理。为了解决这些局限性,我们提出了忠实感知步骤级强化学习(FaithRL),通过来自过程奖励模型的显式忠实奖励引入步骤级监督,以及一种隐式截断重采样策略,该策略从忠实前缀生成对比信号,同时缓解步骤级奖励带来的奖励投机(reward hacking)。在多个SRM和开卷问答(Open-Book QA)基准上的实验表明,FaithRL持续减少CoT和最终答案中的幻觉,从而实现更忠实和可靠的推理。代码可在 https://github.com/Easy195/FaithRL 获取。

英文摘要

As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes, while also mitigating reward hacking from step-level rewards. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.

2503.01829 2026-05-28 cs.CL cs.AI cs.LG cs.MA 版本更新

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

如果你能说服我:评估大型语言模型说服效果与易受影响性的框架

Nimet Beyza Bozdag, Shuhaib Mehri, Gokhan Tur, Dilek Hakkani-Tür

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出PMIYC框架,通过多智能体对话自动评估LLM的说服效果与易受影响性,发现不同模型在说服力和抗说服性上存在显著差异。

Comments Paper published at the ACM Conference on AI and Agentic Systems 2026

详情
AI中文摘要

大型语言模型(LLM)展现出与人类水平相当的说服能力。虽然这些能力可用于社会公益,但也存在被滥用的风险。除了关注LLM如何说服他人外,它们自身对说服的易受影响性也构成了关键的对齐挑战,引发了关于鲁棒性、安全性和伦理原则遵守的问题。为了研究这些动态,我们引入了“如果你能说服我”(PMIYC),一个用于评估多智能体交互中说服力和易受影响性的自动化框架。我们的框架提供了一种可扩展的替代方案,替代了通常用于研究LLM说服的昂贵且耗时的人工标注过程。PMIYC自动进行说服者和被说服者智能体之间的多轮对话,同时衡量说服的有效性和易受影响性。我们的综合评估涵盖了多种LLM和说服场景(例如,主观和错误信息场景)。我们通过人工评估验证了框架的有效性,并展示了与先前研究中人工评估的一致性。通过PMIYC,我们发现Llama-3.3-70B和GPT-4o表现出相似的说服效果,比Claude 3 Haiku高出30%。然而,GPT-4o在对抗错误信息方面的抵抗力比Llama-3.3-70B高出50%以上。值得注意的是,o4-mini既是有效的说服者,也是抵抗的被说服者。这些发现为LLM的说服动态提供了实证见解,并有助于开发更安全的AI系统。

英文摘要

Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used for social good, they also present risks of potential misuse. Beyond the concern of how LLMs persuade others, their own susceptibility to persuasion poses a critical alignment challenge, raising questions about robustness, safety, and adherence to ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasiveness and susceptibility to persuasion in multi-agent interactions. Our framework offers a scalable alternative to the costly and time-intensive human annotation process typically used to study persuasion in LLMs. PMIYC automatically conducts multi-turn conversations between Persuader and Persuadee agents, measuring both the effectiveness of and susceptibility to persuasion. Our comprehensive evaluation spans a diverse set of LLMs and persuasion settings (e.g., subjective and misinformation scenarios). We validate the efficacy of our framework through human evaluations and demonstrate alignment with human assessments from prior studies. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o demonstrates over 50% greater resistance to persuasion for misinformation compared to Llama-3.3-70B. Notably, o4-mini emerges as both an effective persuader and a resistant persuadee. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.

2602.03491 2026-05-28 cs.CV cs.CL 版本更新

Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

解耦骨架与血肉:基于解缠对齐和结构感知引导的高效多模态表格推理

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室)

AI总结 提出DiSCo解缠结构-内容对齐框架和Table-GLS全局到局部结构引导推理框架,高效增强LVLM的表格理解与推理能力,无需昂贵监督或外部工具。

Comments Accepted as a Spotlight Paper at ICML 2026

详情
AI中文摘要

由于复杂的布局和紧密耦合的结构-内容信息,对表格图像进行推理对于大型视觉语言模型(LVLM)仍然具有挑战性。现有解决方案通常依赖于昂贵的监督训练、强化学习或外部工具,限制了效率和可扩展性。这项工作解决了一个关键问题:如何以最少的标注且无需外部工具来使LVLM适应表格推理?具体来说,我们首先引入了DiSCo,一种解缠结构-内容对齐框架,在多模态对齐期间明确分离结构抽象和语义基础,高效地将LVLM适应于表格结构。在DiSCo的基础上,我们进一步提出了Table-GLS,一种全局到局部结构引导推理框架,通过结构化探索和基于证据的推理来执行表格推理。跨多个基准的大量实验表明,我们的框架高效地增强了LVLM的表格理解和推理能力,特别是泛化到未见过的表格结构。我们的数据和代码可在https://github.com/AAAndy-Zhu/TableVLM获取。

英文摘要

Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures. Our data and code are available at https://github.com/AAAndy-Zhu/TableVLM.

2602.02898 2026-05-28 cs.AI cs.CL 版本更新

Aligning Language Model Benchmarks with Pairwise Preferences

将语言模型基准与成对偏好对齐

Marco Gutierrez, Xinyi Leng, Hannah Cyberey, Jonathan Richard Schwarz, Ahmed Alaa, Thomas Hartvigsen

发表机构 * School of Data Science, University of Virginia(弗吉尼亚大学数据科学学院) Imperial College London(伦敦帝国理工学院) Thomson Reuters Foundational Research(汤姆森路透基础研究) Department of Electrical Engineering and Computer Science, UC Berkeley and UCSF(加州大学伯克利分校与加州大学旧金山分校电气工程与计算机科学系)

AI总结 提出BenchAlign方法,通过利用语言模型在问题级别的性能与模型成对排名,自动调整离线基准权重,使新基准能根据偏好准确排序未见模型。

详情
AI中文摘要

语言模型基准是广泛使用的、计算高效的现实性能代理。然而,许多近期工作发现基准常常无法预测实际效用。为弥合这一差距,我们引入基准对齐,即利用有限的模型性能信息自动更新离线基准,旨在生成新的静态基准,以预测给定测试设置中的模型成对偏好。然后我们提出BenchAlign,这是该问题的首个解决方案,它利用语言模型在问题级别的性能以及可能在部署期间收集的模型成对排名,学习基准问题的偏好对齐权重,生成新的基准,根据这些偏好对先前未见过的模型进行排序。我们的实验表明,我们的对齐基准能够根据人类偏好模型准确地对未见模型进行排序,即使模型大小不同,同时保持可解释性。总体而言,我们的工作为将基准与实际人类偏好对齐的局限性提供了见解,这有助于加速模型开发以追求实际效用。

英文摘要

Language model benchmarks are pervasive and computationally efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weightings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
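
To make "learning preference-aligned weightings for benchmark questions" concrete, here is a minimal sketch that fits per-question weights with a logistic (Bradley-Terry-style) objective over ranked model pairs. The objective, the non-negativity clipping, the toy data, and all names are assumptions chosen for illustration; the abstract does not specify BenchAlign's actual learning procedure.

```python
import math


def weighted_score(correct, weights):
    """Benchmark score of one model: weighted sum of per-question correctness."""
    return sum(w * c for w, c in zip(weights, correct))


def fit_weights(results, preferred_pairs, steps=200, lr=0.5):
    """Gradient ascent on the log-likelihood that each preferred model scores higher."""
    num_questions = len(next(iter(results.values())))
    weights = [1.0] * num_questions
    for _ in range(steps):
        grad = [0.0] * num_questions
        for winner, loser in preferred_pairs:
            diff = [a - b for a, b in zip(results[winner], results[loser])]
            margin = sum(w * d for w, d in zip(weights, diff))
            prob = 1.0 / (1.0 + math.exp(-margin))
            for i, d in enumerate(diff):
                grad[i] += (1.0 - prob) * d
        # Clip at zero so weights stay interpretable as question importance.
        weights = [max(0.0, w + lr * g) for w, g in zip(weights, grad)]
    return weights


# Hypothetical question-level results (1 = correct) on a 5-question benchmark.
results = {
    "model_a": [1, 1, 0, 0, 0],
    "model_b": [0, 0, 1, 1, 1],
}
preferred_pairs = [("model_a", "model_b")]  # users prefer model_a

weights = fit_weights(results, preferred_pairs)
unseen_d = [1, 1, 0, 0, 1]
unseen_e = [0, 0, 1, 1, 0]
print(weighted_score(unseen_d, weights) > weighted_score(unseen_e, weights))
```

With uniform weights, model_b would outrank model_a (3 correct versus 2). After fitting, the questions that separate the preferred model carry more weight, and the same weights can be applied to models that were not part of the ranked pairs.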

2602.01807 2026-05-28 cs.CL cs.LG 版本更新

Sentence Curve Language Models

句子曲线语言模型

DongNyeong Heo, Taehwan Kim, Heeyoul Choi

发表机构 * Ulsan National Institute of Science and Technology(蔚山科学技术院) Handong Global University(韩东国际大学)

AI总结 提出句子曲线表示,将扩散语言模型扩展为预测句子曲线而非静态词嵌入,以增强全局结构建模,并在IWSLT14和WMT14上取得最优性能。

详情
AI中文摘要

语言模型(LM)是现代AI系统的核心组成部分,扩散语言模型(DLM)最近已成为一种有竞争力的替代方案。这两种范式都依赖词嵌入来表示输入句子,以及骨干模型训练预测的目标句子。我们认为,这种目标词的静态嵌入对相邻词不敏感,鼓励局部准确的词预测,而全局句子结构则较少被强调。为了解决这个问题,我们提出了一种连续的句子表示,称为句子曲线,定义为一条样条曲线,其控制点影响句子中的多个词。基于这种表示,我们引入了句子曲线语言模型(SCLM),它将DLM扩展为预测句子曲线而非静态词嵌入。我们从理论上证明,句子曲线预测会引入正则化效应,促进全局结构建模,并刻画了不同句子曲线类型如何影响这种行为。实验上,SCLM在IWSLT14和WMT14上取得了DLM中的最优性能,训练稳定且无需繁重的知识蒸馏,并在LM1B上展现出与离散DLM相比有潜力的前景。

英文摘要

Language models (LMs) are a central component of modern AI systems, and diffusion language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while global sentence structure is less emphasized. To address this, we propose a continuous sentence representation, termed sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of the static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves state-of-the-art performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.
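
The idea that one control point influences several word positions can be illustrated with a Bezier curve, used here only as a simple stand-in for the spline family mentioned in the abstract. The control points, the uniform spacing of word positions along the curve, and the function names are assumptions of this sketch, not the SCLM parameterization.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] using Bernstein basis weights."""
    degree = len(control_points) - 1
    dim = len(control_points[0])
    point = [0.0] * dim
    for i, control in enumerate(control_points):
        basis = comb(degree, i) * (t ** i) * ((1 - t) ** (degree - i))
        for d in range(dim):
            point[d] += basis * control[d]
    return point


def sentence_curve(control_points, num_words):
    """Sample one curve position per word, evenly spaced along the curve."""
    if num_words == 1:
        return [bezier_point(control_points, 0.0)]
    return [
        bezier_point(control_points, k / (num_words - 1))
        for k in range(num_words)
    ]


# Three hypothetical 2-D control points standing in for embedding-space vectors.
controls = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]]
targets = sentence_curve(controls, 3)
print(targets)
```

Moving the middle control point shifts every interior word target at once, which is the coupling between neighboring positions that a static per-word embedding target lacks.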

2602.01203 2026-05-28 cs.CL cs.LG 版本更新

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

注意力汇聚在注意力层中锻造原生MoE:针对头部坍塌的汇聚感知训练

Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

发表机构 * Institute for Artificial Intelligence, Peking University, Beijing(人工智能研究院,北京大学,北京) School of Integrated Circuits, Peking University, Beijing(集成电路学院,北京大学,北京)

AI总结 本文通过理论和实证证明注意力汇聚自然构建了注意力层内的混合专家机制,并提出汇聚感知训练算法以缓解头部坍塌问题,提升模型性能。

Comments 2026 International Conference on Machine Learning (ICML)

详情
AI中文摘要

大型语言模型(LLMs)通常将不成比例的注意力分配给第一个标记,这种现象称为注意力汇聚。最近的几种方法旨在解决这个问题,包括GPT-OSS中的汇聚注意力和Qwen3-Next中的门控注意力。然而,缺乏对这些注意力机制之间关系的全面分析。在这项工作中,我们提供了理论和实证证据,表明普通注意力和汇聚注意力中的汇聚自然地在注意力层内构建了混合专家(MoE)机制。这一见解解释了先前工作中观察到的头部坍塌现象,即只有固定子集的注意力头对生成有贡献。为了缓解头部坍塌,我们提出了一种汇聚感知训练算法,该算法带有专为注意力层设计的辅助负载平衡损失。大量实验表明,我们的方法在普通注意力、汇聚注意力和门控注意力上实现了有效的头部负载平衡,并提高了模型性能。我们希望这项研究能为注意力机制提供新的视角,并鼓励进一步探索注意力层内固有的MoE结构。

英文摘要

Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.
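
One way to picture a load-balancing penalty over attention heads is sketched below, in the spirit of the auxiliary losses used for MoE routers. Treating the attention mass a head places on non-sink tokens as that head's "load" is an assumption made here for illustration, as are the function names; the abstract does not give the paper's actual loss.

```python
def non_sink_mass(attention_rows, sink_index=0):
    """Average attention a head places on tokens other than the sink token."""
    return sum(1.0 - row[sink_index] for row in attention_rows) / len(attention_rows)


def head_load_balance_loss(head_loads):
    """Penalty equal to 1.0 for uniform load and len(head_loads) for full collapse."""
    total = sum(head_loads)
    fractions = [load / total for load in head_loads]
    return len(head_loads) * sum(f * f for f in fractions)


# Hypothetical attention rows (each sums to 1) for one head over two queries.
example_head = [[0.9, 0.1], [0.7, 0.3]]
print(non_sink_mass(example_head))

balanced = head_load_balance_loss([0.5, 0.5, 0.5, 0.5])
collapsed = head_load_balance_loss([1.0, 0.0, 0.0, 0.0])
print(balanced, collapsed)
```

Adding a term like this to the training objective pushes load toward being spread across heads rather than concentrated in a fixed subset.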

2510.08525 2026-05-28 cs.CL 版本更新

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

哪些注意力头对推理重要?RL引导的KV缓存压缩

Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang

发表机构 * Westlake University(西湖大学) McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克AI研究院) Zhejiang University(浙江大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出RLKV方法,利用强化学习识别对推理质量关键的注意力头,并对其保留完整KV缓存而对其他头进行激进压缩,实现20-60%缓存减少且性能近乎无损。

详情
AI中文摘要

推理型大语言模型通过扩展的思维链生成展现出复杂的推理行为,这些行为在解码过程中对信息损失高度敏感,给KV缓存压缩带来了关键挑战。现有的token丢弃方法通过移除中间步骤直接破坏推理链,而为检索任务设计的头重分配方法无法保留对生成推理至关重要的注意力头。然而,现有方法均无法识别哪些注意力头真正维持推理一致性并控制生成终止。为解决此问题,我们提出RLKV,它使用强化学习作为探针,通过直接优化注意力头缓存使用与实际生成结果的关系,发现哪些头对推理质量有贡献。这一发现自然引出了高效的压缩策略:我们对推理关键的头分配完整KV缓存,同时对其他头使用固定大小的KV缓存进行激进压缩。实验表明,少数头对推理至关重要,使得在多种任务和模型上实现20-60%的缓存减少且性能近乎无损,在60%压缩率下实现高达2.06倍的端到端加速。

英文摘要

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others with constant-size KV cache. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20-60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.
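
The compression strategy in the abstract (full cache for reasoning-critical heads, constant-size cache for the rest) implies a simple cache budget, sketched below. The head indices, sequence length, and window size are hypothetical numbers chosen for illustration, and the sketch says nothing about how RLKV identifies the critical heads.

```python
def kv_cache_entries(seq_len, num_heads, critical_heads, window):
    """Cached token positions summed over heads under the mixed policy."""
    total = 0
    for head in range(num_heads):
        if head in critical_heads:
            total += seq_len               # critical head keeps the full cache
        else:
            total += min(seq_len, window)  # other heads keep a constant-size cache
    return total


def cache_reduction(seq_len, num_heads, critical_heads, window):
    """Fraction of the full KV cache that the mixed policy removes."""
    full = seq_len * num_heads
    return 1.0 - kv_cache_entries(seq_len, num_heads, critical_heads, window) / full


# Hypothetical setting: 8 heads, 3 of them critical, 1000 decoded tokens.
reduction = cache_reduction(seq_len=1000, num_heads=8, critical_heads={0, 3, 5}, window=100)
print(reduction)
```

Because compressed heads stay at a fixed size, the savings grow with sequence length, which matters most for long chain-of-thought generations.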

2507.16679 2026-05-28 cs.CL cs.AI cs.CY 版本更新

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

PICACO: 通过总相关优化实现大语言模型的多元情境价值对齐

Han Jiang, Dongyao Zhu, Xiaoyuan Yi, Ziang Xiao, Zhihua Wei, Xing Xie

发表机构 * Johns Hopkins University, Baltimore, MD, USA(约翰霍普金斯大学) North Carolina State University, Raleigh, NC, USA(北卡罗来纳州立大学) Microsoft Research Asia, Beijing, China(微软亚洲研究院) Tongji University, Shanghai, China(同济大学)

AI总结 针对情境对齐中价值冲突导致的指令瓶颈问题,提出PICACO方法,通过优化元指令并最大化指定价值与模型响应的总相关,无需微调即可实现多元价值平衡对齐。

Comments ICML 2026

详情
AI中文摘要

情境学习在使大语言模型与人类价值对齐方面展现出巨大潜力,有助于减少有害输出并适应多样化偏好,而无需昂贵的后训练,这被称为情境对齐。然而,大语言模型对输入提示的理解仍是不可知的,限制了情境对齐处理价值冲突的能力——人类价值本质上是多元的,常常施加相互冲突的要求,例如刺激与传统。因此,当前的情境对齐方法面临指令瓶颈挑战,即大语言模型难以在单个提示中协调多个预期价值,导致对齐不完整或有偏。为了解决这个问题,我们提出了PICACO,一种新颖的多元情境对齐方法。无需微调,PICACO优化一个融合了多个价值的元指令,以更好地激发大语言模型对这些价值的理解并改进对齐。这是通过最大化指定价值与大语言模型响应之间的总相关来实现的,这从理论上强化了价值一致性并减少了干扰噪声,从而产生更有效的指令。在五个价值集上的大量实验表明,PICACO在黑盒和开源大语言模型上均表现良好,优于多个近期强基线,并在多达8个不同价值之间实现了更好的平衡。

英文摘要

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions: human values are inherently pluralistic and often impose conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that incorporates multiple values to better elicit LLMs' understanding of them and improve alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, which theoretically reinforces value conformity and reduces distractive noise, resulting in more effective instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
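
For readers unfamiliar with the term, total correlation is the standard multivariate generalization of mutual information. The block below gives its general definition; applying it to value variables $v_1, \dots, v_K$ and a response $y$ is an assumed instantiation for illustration, since the abstract does not state the exact objective or estimator PICACO uses.

```latex
% General definition for random variables X_1, ..., X_n:
\mathrm{TC}(X_1, \dots, X_n)
  = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n)
  = D_{\mathrm{KL}}\!\left( p(x_1, \dots, x_n) \,\Big\|\, \prod_{i=1}^{n} p(x_i) \right)

% Assumed instantiation with K specified values and one response:
\mathrm{TC}(v_1, \dots, v_K, y)
  = \sum_{k=1}^{K} H(v_k) + H(y) - H(v_1, \dots, v_K, y)
```

Total correlation is zero exactly when the variables are mutually independent, so maximizing it rewards responses that carry information about all of the specified values jointly.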

2601.19926 2026-05-28 cs.CL cs.AI 版本更新

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Transformer的语法:语言模型中句法知识可解释性研究的系统综述

Nora Graichen, Iria de-Dios-Flores, Gemma Boleda

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) ICREA(加泰罗尼亚高等研究院)

AI总结 通过对337篇文章的系统综述,评估基于Transformer的语言模型(TLM)的句法能力,发现TLM编码了非平凡的句法知识,但句法-语义接口现象表现较弱,且研究集中在英语和BERT类模型上。

详情
AI中文摘要

我们对337篇评估基于Transformer的语言模型(TLM)句法能力的文章进行了系统综述,报告了涵盖广泛句法现象、语言、模型和方法的3000多个数据点。这些数据共同表明,TLM编码了非平凡的句法知识。行为证据显示,TLM在形式句法现象上表现强劲,但在句法-语义接口现象上表现较弱且多变。对于数字支持较少的语言,表现也持续较低。探针和机制研究进一步支持TLM中存在句法知识。然而,由于大多数工作仍停留在观察层面,且当前方法在方法论上具有异质性,对句法处理背后的详细计算机制的洞察仍然有限。同时,文献仍然高度集中在英语和BERT类模型上。我们讨论了研究结果的意义,并为未来研究提供了建议。

英文摘要

We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on over 3,000 datapoints spanning a wide range of syntactic phenomena, languages, models, and methods. We take the data to collectively show that TLMs encode a non-trivial amount of syntactic knowledge. Behavioral evidence shows strong performance on formal syntactic phenomena, but weaker and more variable performance on phenomena at the syntax-semantics interface. Performance is also consistently lower for languages with less digital support. Probing and mechanistic studies further support the presence of syntactic knowledge in TLMs. Yet, because most work remains observational and current approaches are methodologically heterogeneous, insight into the detailed computational mechanisms underlying syntactic processing remains limited. At the same time, the literature remains heavily concentrated on English and BERT-like models. We discuss the implications of our results and provide recommendations for future research.

2601.08131 2026-05-28 cs.CL 版本更新

Attention Projection Mixing with Exogenous Anchors

基于外生锚点的注意力投影混合

Jonathan Su

发表机构 * Independent Researcher(独立研究者)

AI总结 针对早期注意力投影跨层重用中内部锚点设计存在的结构冲突,提出ExoFormer模型,通过学习序列层外的外生锚点投影,并引入统一归一化混合框架,在减少令牌使用量的同时提升下游准确率。

详情
AI中文摘要

早期注意力投影的跨层重用可以改善优化和数据效率,但它造成了一个结构冲突:第一层必须同时作为所有更深层的稳定、可重用的锚点和有效的计算块。我们证明这种张力限制了内部锚点设计的性能。我们提出ExoFormer,通过在序列层堆栈之外学习外生锚点投影来解决这一冲突。我们引入了一个统一的归一化混合框架,该框架使用可学习的系数(探索系数粒度:元素级、头级和标量级)混合查询、键、值和门控对数,并表明归一化锚点源是稳定重用的关键。ExoFormer变体始终优于其内部锚点对应物,动态变体在匹配验证损失的情况下,使用比Gated Attention少1.5倍的令牌,获得1.5倍的下游准确率。我们通过卸载假说解释这种有效性:外部锚点保留必要的令牌身份,使层能够专门专注于特征变换。我们发布代码和模型以促进未来研究。

英文摘要

Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.

2509.06350 2026-05-28 cs.CL cs.AI cs.CR 版本更新

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Mask-GCG:对抗性后缀中的所有标记对于越狱攻击都是必要的吗?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

发表机构 * Politecnico di Milano(米兰理工学院) Beihang University(北京航空航天大学) East China Normal University(华东师范大学) Fudan University(复旦大学) University of the Chinese Academy of Sciences(中国科学院大学) AI Security Lab(360人工智能安全实验室)

AI总结 提出Mask-GCG方法,通过可学习的标记掩码识别后缀中高影响力标记并剪枝低影响力标记,降低计算开销并保持攻击成功率,揭示LLM提示中的标记冗余。

Comments Accepted to ICASSP 2026

详情
Journal ref
2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13887-13891, 2026
AI中文摘要

针对大型语言模型(LLM)的越狱攻击已展示了多种成功方法,攻击者操纵模型生成其本应避免的有害响应。其中,贪婪坐标梯度(GCG)作为一种通用且有效的方法,通过优化后缀中的标记来生成可越狱的提示。尽管已提出多种GCG的改进变体,但它们都依赖于固定长度的后缀。然而,这些后缀中潜在的冗余尚未被探索。在这项工作中,我们提出Mask-GCG,一种即插即用的方法,采用可学习的标记掩码来识别后缀中的高影响力标记。我们的方法增加了高影响力位置标记的更新概率,同时剪枝低影响力位置的标记。这种剪枝不仅减少了冗余,还降低了梯度空间的大小,从而减少了计算开销,并缩短了实现成功攻击所需的时间。我们将Mask-GCG应用于原始GCG及其多种改进变体进行评估。实验结果表明,后缀中的大多数标记对攻击成功有显著贡献,剪枝少数低影响力标记不会影响损失值或攻击成功率(ASR),从而揭示了LLM提示中的标记冗余。我们的发现从越狱攻击的角度为开发高效且可解释的LLM提供了见解。

英文摘要

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

2601.18116 2026-05-28 cs.CL 版本更新

BEAR: Budgeted Evidence Allocation for Multi-Document Reasoning

BEAR: 面向多文档推理的预算化证据分配

Lin Sun, Linglin Zhang, Jingang Huang, Change Jia, Zhengwei Cheng, Xiangzheng Zhang

发表机构 * Qiyuan Tech(启元科技)

AI总结 提出BEAR框架,通过构建分层语义索引并在查询时进行由粗到细的证据访问,在固定证据预算下实现高效的多文档推理。

详情
AI中文摘要

我们认为多文档推理不仅受限于模型能读取的文本量,还受限于有限的查询时证据预算如何在文档和语义粒度之间分配。全上下文推理非选择性地向模型提供广泛证据且每次查询成本高,而平面分块检索通常返回局部相关但跨文档综合组织薄弱的段落。我们提出BEAR,一个结构化证据分配框架,它离线构建分层语义索引,并在查询时通过互补的探索和恢复路径进行由粗到细的证据访问。这种由粗到细的设计可视为在固定证据上下文预算下的结构化证据分配。在合成和真实基准上,BEAR在DragonBall上表现尤为强劲,在HotpotQA上与强检索基线保持竞争力,并在我们的评估协议下在2Wiki上取得了最佳的基于检索的结果,同时其查询时证据预算远小于所报告的长上下文参考。进一步分析表明,性能提升与作为分配基础的分层结构以及互补的探索和恢复相关,而非仅靠语义分块。

英文摘要

We argue that multi-document reasoning is constrained not only by how much text a model can read, but also by how a limited query-time evidence budget is allocated across documents and semantic granularities. Full-context inference exposes the model to broad evidence non-selectively and at high per-query cost, while flat chunk retrieval often returns locally relevant passages that are weakly organized for cross-document synthesis. We present BEAR, a framework for structured evidence allocation that builds hierarchical semantic indices offline and performs coarse-to-fine evidence access at query time through complementary exploration and recovery paths. This coarse-to-fine design can be viewed as structured evidence allocation under a fixed evidence-context budget. Across synthetic and real-world benchmarks, BEAR performs particularly strongly on DragonBall, remains competitive with strong retrieval-based baselines on HotpotQA, and yields the best retrieval-based result on 2Wiki under our evaluated protocol, while operating under substantially smaller query-time evidence budgets than the reported long-context references. Additional analyses suggest that the gains are associated with hierarchy as an allocation substrate together with complementary exploration and recovery, rather than semantic chunking alone.
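As a rough illustration of coarse-to-fine evidence access under a fixed budget, the sketch below ranks section summaries first and then spends a token budget on the best chunks inside the top sections. Everything here is an assumption for illustration: the word-overlap scorer, the two-level index layout, and the names `overlap` and `allocate_evidence`. It does not model BEAR's exploration and recovery paths.

```python
def overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def allocate_evidence(query, index, budget_tokens, top_sections=2):
    """Pick chunks coarse-to-fine without exceeding budget_tokens.

    index: list of {"summary": str, "chunks": [str, ...]} built offline.
    Token cost is approximated by whitespace word count (an assumption).
    """
    # Coarse step: rank sections by how well their summary matches the query.
    ranked = sorted(index, key=lambda s: overlap(query, s["summary"]), reverse=True)
    chosen, used = [], 0
    for section in ranked[:top_sections]:
        # Fine step: rank chunks inside the section and add those that still fit.
        chunks = sorted(section["chunks"], key=lambda c: overlap(query, c), reverse=True)
        for chunk in chunks:
            cost = len(chunk.split())
            if used + cost <= budget_tokens:
                chosen.append(chunk)
                used += cost
    return chosen, used
```

The design choice to illustrate is that the budget is shared across sections, so a highly relevant section can consume most of it before lower-ranked sections are visited.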

2601.18006 2026-05-28 cs.CL 版本更新

PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation

PEAR:机器翻译中自动相对评分的成对评估

Lorenzo Proietti, Roman Grundkiewicz, Matt Post

发表机构 * Sapienza University of Rome(罗马萨皮恩扎大学) Microsoft(微软)

AI总结 提出PEAR,一种监督式质量估计指标族,通过成对比较实现无参考机器翻译评估,预测质量差异方向和幅度,在WMT24基准上优于单候选基线,并有效用于最小贝叶斯风险解码。

Comments ACL 2026 Main Conference. 19 pages

详情
AI中文摘要

我们提出PEAR(成对评估用于自动相对评分),一种监督式质量估计(QE)指标族,将无参考机器翻译(MT)评估重新定义为分级成对比较。给定一个源片段和两个候选翻译,PEAR预测它们质量差异的方向和幅度。这些指标使用从人工判断差异中导出的成对监督进行训练,并添加一个正则化项,鼓励在候选顺序反转时符号反转。在WMT24元评估基准上,PEAR优于使用相同数据和骨干网络训练的严格匹配的单候选QE基线,隔离了所提出的成对公式的优势。尽管使用的参数远少于近期的大指标,PEAR超越了更大的QE模型和基于参考的指标。我们的分析进一步表明,与其他顶级指标相比,PEAR产生更少冗余的评估信号。最后,我们展示PEAR是用于最小贝叶斯风险(MBR)解码的有效效用函数,以可忽略的影响降低了成对评分成本。

英文摘要

We present PEAR (Pairwise Evaluation for Automatic Relative Scoring), a supervised quality estimation (QE) metric family that reframes reference-free machine translation (MT) evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. The metrics are trained using pairwise supervision derived from differences in human judgments, with an additional regularization term that encourages sign inversion under candidate order reversal. On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the proposed pairwise formulation. Despite using substantially fewer parameters than recent large metrics, PEAR surpasses far larger QE models and reference-based metrics. Our analysis further indicates that PEAR yields a less redundant evaluation signal relative to other top metrics. Finally, we show that PEAR is an effective utility function for minimum Bayes risk (MBR) decoding, reducing pairwise scoring cost at negligible impact.
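The training signal described above, pairwise supervision plus a term that encourages sign inversion when the candidate order is reversed, could look roughly like the following. This is a sketch under stated assumptions: the squared-error fit, the squared antisymmetry penalty, the default `reg_weight`, and the name `pairwise_loss` are all illustrative and not taken from the paper.

```python
def pairwise_loss(pred_ab, pred_ba, human_a, human_b, reg_weight=0.1):
    """Loss for one (source, candidate A, candidate B) example.

    pred_ab: model score for the ordered pair (A, B)
    pred_ba: model score for the reversed pair (B, A)
    human_a, human_b: human quality judgments for A and B
    """
    # Target is the signed difference in human judgments (direction and magnitude).
    target = human_a - human_b
    fit = (pred_ab - target) ** 2
    # Antisymmetry regularizer: ideally pred_ba == -pred_ab, so the sum is zero.
    antisymmetry = (pred_ab + pred_ba) ** 2
    return fit + reg_weight * antisymmetry
```

A perfectly antisymmetric model that matches the human difference incurs zero loss; a model that returns the same score for both orders is penalized by the second term even when the first order is predicted correctly.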

2505.17654 2026-05-28 cs.CL cs.AI 版本更新

EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection

EVADE-Bench:用于评估和增强规避性内容检测的多模态基准

Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyu Chang, Yukun Chen, Hamid Alinejad-Rokny, Min Yang

发表机构 * SIAT, Chinese Academy of Sciences(中国科学院深圳先进技术研究院) University of Chinese Academy of Sciences(中国科学院大学) Alibaba Group(阿里巴巴集团) University of New South Wales(新南威尔士大学)

AI总结 针对电商平台中LLM/VLM易受规避性内容攻击的问题,提出首个专家标注的中文多模态基准EVADE-Bench,评估26个模型并发现规则分类可提升检测一致性,多智能体分解策略能显著提高准确率。

Comments SIGIR 2026

详情
AI中文摘要

电商平台越来越依赖大型语言模型(LLMs)和视觉语言模型(VLMs)来检测非法或误导性产品内容。然而,这些模型仍然容易受到规避性内容的影响,即通过分词、委婉语言或图像裁剪等技术故意修改的输入,以掩盖违反政策的行为,同时仍传达被禁止的主张。关键在于,检测此类内容需要模型同时掌握两种能力:准确理解复杂规则,以及正确推断故意混淆的多模态输入背后的真实意图。虽然先前的工作分别探索了LLM对复杂规则的推理和基于LLM的规避性内容检测,但现有基准尚未将两者结合在统一的评估框架内。这一差距在电商领域尤为严重,因为准确的审核要求这两种能力协同运作。为填补这一空白,我们引入了EVADE-Bench,这是首个专家策划的中文多模态基准,专门设计用于评估LLMs和VLMs在真实电商场景中的规避性内容检测。我们对26个开源和闭源LLMs及VLMs的全面评估显示,即使是最先进的模型也经常错误分类规避性样本。我们进一步证明,更清晰的规则分类显著提高了模型预测的一致性并减少了错误预测,凸显了基准设计在实现可靠评估中的关键作用。为了探索性能提升的路径,我们研究了多智能体分解在多模态推理中的可行性,即将视觉描述和逻辑推理解耦为独立的智能体,并发现这一策略带来了显著的准确率提升。

英文摘要

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content, which refers to inputs that have been deliberately modified through techniques such as word splitting, euphemistic language, or image cropping to conceal policy violations while still conveying prohibited claims. Crucially, detecting such content requires a model to simultaneously master two capabilities: accurately comprehending complex rules, and correctly inferring the true intent behind deliberately obfuscated multimodal inputs. While prior work has separately explored LLM reasoning over complex rules and LLM-based detection of evasive content, no existing benchmark combines both within a unified evaluation framework. This gap is particularly consequential in e-commerce, where accurate moderation demands that both capabilities operate in concert. To address this gap, we introduce EVADE-Bench, the first expert-curated Chinese multimodal benchmark specifically designed to evaluate LLMs and VLMs on evasive content detection in real-world e-commerce scenarios. Our comprehensive evaluation of 26 open- and closed-source LLMs and VLMs reveals that even state-of-the-art models frequently misclassify evasive samples. We further demonstrate that clearer rule categorization significantly improves model prediction consistency and reduces false predictions, highlighting the critical role of benchmark design in enabling reliable evaluation. To explore paths for performance improvement, we investigate the feasibility of multi-agent decomposition for multimodal reasoning, wherein visual description and logical inference are decoupled into separate agents, and find that this strategy yields notable accuracy gains.

2601.12154 2026-05-28 cs.CL 版本更新

Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs

基于嵌入的主题建模和LLM分析癌症患者的体验

Teodor-Călin Ionescu, Lifeng Han, Jan Heijdra Suasnabar, Anne Stiggelbout, Suzan Verberne

发表机构 * Leiden Institute of Advanced Computer Science (LIACS), Leiden, The Netherlands(莱顿高级计算机科学研究所(LIACS),莱顿,荷兰) Leiden University Medical Center (LUMC), Leiden, The Netherlands(莱顿大学医学中心(LUMC),莱顿,荷兰)

AI总结 本研究利用BERTopic和Top2Vec等神经主题建模方法,结合LLM(GPT4)进行主题标注,从癌症患者访谈数据中提取有意义主题,并评估不同嵌入模型的效果,发现领域特定的BioClinicalBERT嵌入能提高主题精度和可解释性。

Comments accepted by the CLIN journal. The CLIN Journal is the journal for research in computational linguistics in The Netherlands and Belgium

详情
AI中文摘要

本研究探讨了使用神经主题建模和LLM从患者叙述数据中发现有意义主题的方法,以提供有助于更以患者为中心的医疗实践的见解。我们分析了一组转录的癌症患者访谈(13次访谈,共132,722词)。首先,我们通过使用相似的预处理、分块和聚类配置,评估BERTopic和Top2Vec在单个访谈摘要中的关键词提取性能,以确保公平比较。然后,使用LLM(GPT4)进行下一步的主题标注。通过小规模人工评估,对单个访谈(I0)的输出进行评分,重点关注连贯性、清晰度和相关性。基于初步结果和评估,BERTopic表现出更强的性能,并被选用于进一步实验,使用三种临床导向的嵌入模型。然后,我们使用最佳模型设置分析了完整的访谈集合。结果表明,领域特定的嵌入提高了主题的精确度和可解释性,其中BioClinicalBERT在转录中产生最一致的结果。使用BioClinicalBERT嵌入模型对全部13次访谈的全局分析揭示了所有13次访谈中最主要的主题,即“癌症护理管理中的协调与沟通”和“患者癌症治疗旅程中的决策”。尽管这些访谈是从荷兰语机器翻译成英语,且临床专业人员未参与评估,但研究结果表明,神经主题建模,特别是BERTopic,可以帮助从患者访谈中为临床医生提供有用的反馈。该流程可以支持更高效的文档导航,并加强患者在医疗工作流程中的声音。

英文摘要

This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes in patient storytelling data, offering insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews). We first evaluate BERTopic and Top2Vec on keyword extraction for individual interview summarization, using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison. LLMs (GPT4) are then used for the next step, topic labeling. Their outputs for a single interview (I0) are rated in a small-scale human evaluation focusing on coherence, clarity, and relevance. Based on the preliminary results and evaluation, BERTopic shows stronger performance and is selected for further experimentation with three clinically oriented embedding models. We then analyze the full interview collection with the best model setting. Results show that domain-specific embeddings improve topic precision and interpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of the full dataset of 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics across all interviews, namely "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey". Although the interviews are machine translations from Dutch to English, and clinical professionals are not involved in this evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows.

2601.10085 2026-05-28 cs.CL 版本更新

CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

CALM-IT: 通过双角色对话动态追踪生成逼真的长形式动机访谈对话

Viet Cuong Nguyen, Nhi Yen Nguyen, Kristin A. Candan, Mary Conlon, Vanessa Rumie, Kristen Risola, Michael L. Birnbaum, Munmun De Choudhury

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Northwell Health(诺斯韦尔健康) Columbia University(哥伦比亚大学)

AI总结 提出CALM-IT框架,通过显式建模客户与咨询师状态的演变来生成和评估长形式动机访谈对话,在8,232个合成对话语料上优于基线方法,尤其在MITI 4.2全局评分和客户接受率上表现最佳。

Comments 53 pages, in submission to EMNLP

详情
AI中文摘要

治疗性对话并非孤立响应的序列:客户目标、动机、抵抗和治疗联盟随时间演变。然而,当前基于LLM的心理健康对话系统通常缺乏在长时间交互中追踪这些动态的显式机制,可能导致时机不当的干预或过早的目标解决。我们引入CALM-IT,一个通过显式建模客户和咨询师状态演变来生成和评估长形式动机访谈对话的框架,指导咨询策略选择和话语生成。我们在包含8,232个合成对话的大规模语料库上评估CALM-IT,涵盖多种对话长度和框架。与所有基线相比,CALM-IT在大多数MITI 4.2全局评分(包括共情、伙伴关系和软化维持谈话)以及其他关键性能指标上取得最佳性能,且随着对话长度增加性能下降最小。值得注意的是,尽管CALM-IT发起的改变导向提示较少,但在不同长度条件下平均客户接受率最高(64.3%)。我们发布了一个可复现的生成框架、一个基于MITI的过程级评估协议,以及一个大规模合成语料库,用于在逼真的长形式交互条件下研究治疗性LLM。

英文摘要

Therapeutic dialogue is not a sequence of isolated responses: client goals, motivation, resistance, and therapeutic alliance evolve over time. Yet current LLM-based mental health dialogue systems often lack explicit mechanisms for tracking these dynamics across extended interactions, which can lead to poorly timed interventions or premature goal resolution. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing dialogues through explicit modeling of evolving client and counselor states, guiding both counseling strategy selection and utterance generation. We evaluate CALM-IT on a large-scale corpus of 8,232 synthetic dialogues spanning multiple dialogue lengths and frameworks. Compared with all baselines, CALM-IT achieves the best performance on most MITI 4.2 global ratings, including Empathy, Partnership, and Softening Sustain Talk, as well as on other key performance metrics while exhibiting minimal performance degradation as dialogue length increases. Notably, although CALM-IT initiates fewer change-directed prompts, it produces the highest client acceptance rate (64.3%) on average across different length conditions. We release a reproducible generation framework, a MITI-grounded process-level evaluation protocol, and a large-scale synthetic corpus for studying therapeutic LLMs under realistic long-form interaction conditions.

2601.07648 2026-05-28 cs.CL 版本更新

What Are We Measuring in NLG? A Meta-Analysis of Evaluation Trends 2020-2025

我们在NLG中测量什么?2020-2025年评估趋势的元分析

Jing Yang, Nils Feldhus, Salar Mohtaj, Leonhard Hennig, Qianli Wang, Eleni Metheniti, Sherzod Hakimov, Charlott Jakob, Veronika Solopova, Konrad Rieck, David Schlangen, Sebastian Möller, Vera Schmitt

发表机构 * Technische Universität Berlin(柏林工业大学) BIFOLD – Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究所) German Research Center for Artificial Intelligence (DFKI), Berlin(德国人工智能研究中心(DFKI),柏林) ANITI University of Potsdam(波茨坦大学) CERTAIN

AI总结 通过元分析14,171篇论文,揭示NLG评估中的三个系统性问题:度量惯性、度量-标准映射问题和验证差距,并提出最小评估清单。

Comments 8 pages

详情
AI中文摘要

随着自然语言生成(NLG)主导现代NLP,可扩展评估仍然是一个关键瓶颈。因此,LLM作为评判者(LaaJ)的采用迅速加速,在2025年出现在比人工评估更多的论文中。这一关键转变促使对当前评估实践进行批判性分析。克服刚性关键词过滤和人工审查的限制,我们采用多LLM信息提取管道,从四大NLP会议(2020-2025)的14,171篇论文中收集结构化元数据。分析3,334篇过滤后的NLG论文,我们识别出三个系统性问题。(1)度量惯性:尽管转向开放式生成,传统词汇度量(BLEU、ROUGE)仍作为主要指标持续存在,通常与语义替代方案一起使用而非被取代。(2)度量-标准映射问题:我们的论文级共现数据显示,通用自动度量被用作质量的广泛代理,而未指定它们旨在评估文本生成的哪个维度。(3)验证差距:LaaJ在没有相应人工验证的情况下快速增长(少于8%的论文)。关键的是,虽然LaaJ与整体质量相关,但在流畅性等细粒度标准上一致性崩溃。为弥补这些差距,我们将发现提炼为一个最小评估清单,以指导度量选择、构念效度和LaaJ部署。

英文摘要

As Natural Language Generation (NLG) dominates modern NLP, scalable evaluation remains a critical bottleneck. Consequently, LLM-as-a-judge (LaaJ) adoption has accelerated rapidly, appearing in more papers than human evaluation in 2025. This pivotal shift motivates a critical analysis of current evaluation practices. Overcoming the limits of rigid keyword filtering and manual review, we employ a multi-LLM information extraction pipeline to gather structured metadata from 14,171 papers across four major NLP conferences (2020-2025). Analyzing 3,334 filtered NLG papers, we identify three systemic challenges. (1) Metric inertia: despite the shift toward open-ended generation, legacy lexical metrics (BLEU, ROUGE) persist as primary indicators, typically used alongside rather than replaced by semantic alternatives. (2) Metric-criteria mapping problem: our paper-level co-occurrence data reveals that general-purpose automatic metrics are applied as broad proxies for quality, without specifying which dimension of text generation they are intended to evaluate. (3) Validation gap: LaaJ has grown rapidly without commensurate human validation (fewer than 8% of papers). Crucially, while LaaJ correlates with aggregate quality, alignment collapses on fine-grained criteria like fluency. To address these gaps, we distill our findings into a minimal Evaluation Checklist to guide metric selection, construct validity, and LaaJ deployment.

2601.06329 2026-05-28 cs.CL cs.AI 版本更新

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

论口语语言模型评估中全局令牌困惑度的谬误

Chan-Jan Hsu, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso

发表机构 * Carnegie Mellon University(卡内基梅隆大学) National Taiwan University(国立台湾大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对口语语言模型评估中直接使用文本困惑度公式计算语音令牌困惑度的问题,提出基于似然和生成的新型评估方法,更忠实反映生成质量,并缩小了最佳模型与人类基线之间的差距。

详情
AI中文摘要

在大规模原始音频上预训练的生成式口语语言模型能够以适当内容继续语音提示,同时保留说话人和情感等属性,作为口语对话的基础模型。在先前文献中,这些模型通常使用“全局令牌困惑度”进行评估,该指标直接将文本困惑度公式应用于语音令牌。然而,这种做法忽略了语音和文本模态之间的根本差异,可能导致对语音特性的低估。在这项工作中,我们提出了多种基于似然和生成的评估方法,以替代朴素的全局令牌困惑度。我们证明,所提出的评估更忠实地反映了感知生成质量,与人类评分的平均意见得分(MOS)具有更强的相关性。在新指标下评估时,口语语言模型的相对性能格局被重塑,揭示了最佳性能模型与人类基线之间的差距显著缩小。总之,这些结果表明,适当的评估对于准确评估口语语言建模的进展至关重要。

英文摘要

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
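For reference, the standard text perplexity formulation that the abstract says is applied directly to speech tokens is the exponential of the average negative log-likelihood over all tokens. The sketch below computes it from per-token natural-log probabilities; the function name `global_token_perplexity` and the input layout are illustrative assumptions, and the paper's proposed alternative metrics are not shown.

```python
import math

def global_token_perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of log p(token_t | previous tokens)).

    token_log_probs: natural-log probabilities the model assigned to each
    observed token, pooled over the whole evaluation set.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_lik = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_lik)
```

Because every token contributes equally to the average, this pooled score treats all speech tokens alike, which is the property the abstract questions.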

2601.03549 2026-05-28 cs.CV cs.CL 版本更新

FEA-SLT: A Gloss-Free End-to-End Framework for Facial-Expression-Aware Sign Language Translation

FEA-SLT:一种面向面部表情感知的手语翻译的无注释(gloss-free)端到端框架

Guobin Tu, Di Weng

发表机构 * School of Software Technology, Zhejiang University(浙江大学软件学院)

AI总结 提出FEA-SLT框架,通过面部表情感知融合模块利用面部动态作为语义锚点,解决无注释(gloss-free)手语翻译中手势歧义问题,在PHOENIX14T和CSL-Daily数据集上达到最优BLEU性能。

详情
AI中文摘要

手语翻译(SLT)是一项具有挑战性的跨模态任务,需要对手部动作和非手部信号进行联合建模。现有的无注释(gloss-free)SLT方法有效捕捉手势动态,但常常未充分利用面部表情,而面部表情在语法和消除歧义中起着关键作用。当不同概念共享相似手部配置时,这一限制可能导致语义退化。为解决此问题,我们提出FEA-SLT(面部表情感知手语翻译),一种无注释的端到端框架,利用面部动态作为语义锚点来消除手部歧义。FEA-SLT采用领域迁移的面部编码器提取表情敏感表示,并通过语言约束的面部表情感知融合(FEAF)模块将其与手部特征集成。FEAF通过双向调制捕捉手部和面部通道之间的相互依赖关系,增强句法保真度。在PHOENIX14T和CSL-Daily上的实验表明,FEA-SLT在无注释方法中实现了最先进的BLEU性能,而针对性分析证实了其对面部敏感语句翻译的改进。代码可在[https://github.com/TuGuobin/FEA-SLT](https://github.com/TuGuobin/FEA-SLT)获取。

英文摘要

Sign Language Translation (SLT) is a challenging cross-modal task requiring joint modeling of manual articulations and non-manual signals. Existing gloss-free SLT methods effectively capture gestural dynamics but often underutilize facial expressions, which play crucial grammatical and disambiguating roles. This limitation can cause semantic degradation when distinct concepts share similar manual configurations. To address this issue, we propose FEA-SLT (**F**acial-**E**xpression-**A**ware **S**ign **L**anguage **T**ranslation), a gloss-free end-to-end framework that uses facial dynamics as semantic anchors for resolving manual ambiguity. FEA-SLT employs a domain-transferred facial encoder to extract expression-sensitive representations and integrates them with manual features through a linguistically constrained *Facial-Expression-Aware Fusion* (FEAF) module. FEAF captures reciprocal dependencies between manual and facial channels via bidirectional modulation, enhancing syntactic fidelity. Experiments on PHOENIX14T and CSL-Daily show that FEA-SLT achieves state-of-the-art BLEU performance among gloss-free methods, while targeted analyses confirm improved translation of facial-sensitive utterances. Code is available at [https://github.com/TuGuobin/FEA-SLT](https://github.com/TuGuobin/FEA-SLT).

2512.23959 2026-05-28 cs.CL cs.AI cs.LG 版本更新

HGMEM: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling

HGMem:基于超图的工作记忆以改进长上下文复杂关系建模的多步RAG

Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Pengcheng Laboratory(鹏城实验室) WeChat AI, Tencent(微信AI,腾讯) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出HGMem超图工作记忆系统,通过超边表示记忆单元并渐进形成高阶交互,增强多步RAG中的全局理解和复杂推理能力。

Comments ICML 2026; Code released at https://github.com/Encyclomen/HGMem

详情
AI中文摘要

多步检索增强生成(RAG)已成为增强大型语言模型(LLMs)在需要全局理解和密集推理任务上的广泛采用策略。尽管许多RAG系统整合了工作记忆来整合信息,但现有设计主要作为孤立事实的被动存储。这种静态特性忽略了原始事实之间的关键高阶相关性,从而限制了模型的多步推理能力,导致在扩展上下文中的碎片化推理和弱全局理解。我们引入了HGMem,一种基于超图的工作记忆系统,将记忆的概念从简单存储扩展到动态、表达性结构,用于复杂推理和全局理解。在我们的方法中,记忆被表示为超图,其中超边对应不同的记忆单元,使得记忆内高阶交互的逐步形成成为可能。该机制连接围绕焦点问题的事实和思考,将记忆演变为一个集成且情境化的知识结构,为更深层次的推理提供强有力的命题。我们在几个具有挑战性的全局理解基准上评估了HGMem。大量实验和深入分析表明,我们的方法持续改进了多步RAG,并在不同数据集上显著优于强基线系统。

英文摘要

Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Although many RAG systems incorporate a working memory to consolidate information, existing designs primarily function as a passive storage for isolated facts. This static nature overlooks crucial high-order correlations among primitive facts, thereby limiting models' capacity for multi-step reasoning and resulting in fragmented reasoning and weak global sense-making within extended contexts. We introduce HGMem, a hypergraph-based working memory system, extending the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph where hyperedges correspond to distinct memory units, enabling the progressive formation of high-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving the memory into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning. We evaluate HGMem on several challenging global sense-making benchmarks. Extensive experiments and in-depth analyses demonstrate that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse datasets.
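One way to picture a hypergraph working memory is as a set of memory units, each a hyperedge over the facts it connects. The sketch below is only an illustration: the class `HypergraphMemory`, its method names, and the merge operation used to form a higher-order unit are assumptions made here, not the HGMem implementation.

```python
class HypergraphMemory:
    """Toy working memory: each memory unit is a hyperedge over fact nodes."""

    def __init__(self):
        self.units = {}      # unit id -> (frozenset of fact nodes, free-text note)
        self._next_id = 0

    def add_unit(self, facts, note):
        """Store a memory unit linking an arbitrary number of facts."""
        uid = self._next_id
        self._next_id += 1
        self.units[uid] = (frozenset(facts), note)
        return uid

    def merge(self, uid_a, uid_b, note):
        """Form a higher-order unit covering the facts of two existing units."""
        facts = self.units[uid_a][0] | self.units[uid_b][0]
        return self.add_unit(facts, note)

    def units_about(self, fact):
        """Return ids of all units whose hyperedge contains the given fact."""
        return [uid for uid, (facts, _) in self.units.items() if fact in facts]
```

Unlike a flat list of isolated facts, a single unit here can tie together three or more facts, and merged units record relations that span earlier retrieval steps.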

2512.17375 2026-05-28 cs.LG cs.CL cs.CR 版本更新

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

AdvJudge-Zero:通过对抗控制令牌在LLM作为评判者中实现二元决策翻转

Tung-Ling Li, Yuhao Wu, Hongliang Liu

发表机构 * Palo Alto Networks(帕洛阿尔托网络公司)

AI总结 本文提出AdvJudge-Zero方法,通过从评判模型自身分布中采样低困惑度令牌,无需梯度优化即可将LLM评判者的二元判决从“否”翻转为“是”,并基于发现的令牌池提出防御策略以增强评判鲁棒性。

详情
AI中文摘要

LLM作为评判者系统在现代RLHF和RLVR流程中提供奖励信号,但其二元判决简化为一个隐藏状态上的单一线性读出F_gap。我们证明该读出足够浅,以至于短且低困惑度的令牌可以将判决从“否”翻转为“是”。这些令牌是从评判者自身在响应位置的下一个令牌分布中采样的,无需手动设置种子或基于梯度的优化。我们的方法AdvJudge-Zero在六个Qwen、Llama和Gemma评判者上的24个(模型,数据集)单元中,有22个实现了>90%的集成假阳性率,而先前策划的10令牌基准为54-72%,并且发现的表面跨格式转移到70B标量奖励模型。相同的发现池使得防御成为可能:基于9类机制分类法分层的LoRA微调,在相同池上的朴素采样失败的跨族泛化中增强了鲁棒性,其中机制广度而非池大小带来了增益。在GRPO训练下,硬化后的评判者消除了未硬化基线在MATH和GSM8K上每个条件十个种子时观察到的奖励崩溃失败(假阳性峰值和长度崩溃)。发现的池、机制分类法和每个提示的翻转记录将在负责任的披露下发布。

英文摘要

LLM-as-a-Judge systems supply the reward signal in modern RLHF and RLVR pipelines, but their binary verdict reduces to a single linear readout F_gap on one hidden state. We show this readout is shallow enough that short, low-perplexity tokens flip the verdict from "No" to "Yes". These tokens are sampled from the judge's own next-token distribution at the response position, with no manual seed set and no gradient-based optimization. Our procedure, AdvJudge-Zero, reaches >90% ensemble false-positive rate on 22 of 24 (model, dataset) cells across six Qwen, Llama, and Gemma judges, versus 54-72% for the prior curated 10-token benchmark, and the discovered surface transfers cross-format to a 70B scalar reward model. The same discovered pool enables a defense: a LoRA fine-tune stratified by a 9-class mechanism taxonomy hardens cross-family generalization where naive sampling on the same pool fails, with mechanism breadth rather than pool size carrying the gain. Under GRPO training, the hardened judge eliminates the reward-collapse failures (false-positive spikes and length collapse) we observe in the unhardened baseline on both MATH and GSM8K at ten seeds per condition. The discovered pool, the mechanism taxonomy, and per-prompt flip records will be released under responsible disclosure.

2512.01970 2026-05-28 cs.AI cs.CL 版本更新

Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies

原子技能是前提:当强化学习合成组合推理时,以及当它仅放大时

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, Victor Zhong

发表机构 * University of Waterloo(滑铁卢大学) Duke University(杜克大学) National University of Singapore(新加坡国立大学) Princeton University(普林斯顿大学) Peking University(北京大学) University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 通过互补推理任务,研究强化学习是合成新技能还是仅放大已有技能,发现强化学习在基础模型通过监督微调掌握独立原子技能后才能合成新组合策略。

Comments Work in Progress. Code and data are available at https://github.com/sitaocheng/from_atomic_to_composite

详情
AI中文摘要

强化学习(RL)仅仅是放大现有技能,还是合成新技能?我们通过互补推理的视角研究这个问题:互补推理是整合内部知识与外部上下文的关键实践能力,是可靠的持续学习和检索增强生成的前提。为了避免预训练污染,我们构建了一个受控的语义合成传记数据集,并将这种能力分解为两个原子技能:参数推理(检索模型权重中编码的事实)和上下文推理(处理新的上下文信息)。我们有两个发现。首先,直接在复合任务上监督训练的模型在已知事实和推理路径上达到高准确率(90%),但在新事实和推理路径上崩溃(18%),表明监督微调(SFT)依赖于死记硬背而非真正的技能整合。其次,RL弥合了这一泛化差距,充当技能合成器而非仅仅是放大器——但只有在严格的前提条件下:只有当基础模型首先通过SFT掌握了独立的原子技能时,它才能合成新的组合策略。这些结果表明,解耦的原子训练后接RL为复杂的新推理提供了一条可扩展的路径。

英文摘要

Does Reinforcement Learning (RL) merely amplify existing skills, or synthesize novel skills? We investigate this question through the lens of Complementary Reasoning: the critical practical capability of integrating internal knowledge with external context, a prerequisite for reliable Continual Learning and Retrieval-Augmented Generation. To avoid pre-training contamination, we construct a controlled semantic synthetic dataset of biographies and decompose this capability into two atomic skills: Parametric Reasoning (retrieving facts encoded in model weights) and Contextual Reasoning (processing novel in-context information). We present two findings. First, models supervised directly on the composite task reach high accuracy on seen facts and reasoning paths (90%) but collapse on novel facts and reasoning paths (18%), indicating that Supervised Fine-Tuning (SFT) relies on rote memorization rather than genuine skill integration. Second, RL bridges this generalization gap, acting as a skill synthesizer rather than a mere amplifier, but only under a strict prerequisite: it synthesizes new composite strategies only when the base model has first mastered the independent atomic skills via SFT. These results suggest that decoupled atomic training followed by RL offers a scalable path to complex novel reasoning.

2511.05550 2026-05-28 cs.SD cs.CL cs.LG 版本更新

Assessing Factual Music Comprehension in Large Audio Language Models

评估大型音频语言模型中的事实音乐理解能力

Daniel Chenyu Lin, Michael Freeman, John Thickstun

AI总结 针对现有MusicQA数据集无法衡量模型回答事实正确性的问题,提出基于可验证信息的评估协议,通过精确率、召回率和F1分数客观评估模型,并在三个数据集上定义六项事实检索任务,对九个最新LALM进行基准测试。

Comments 16 pages; second submission

详情
AI中文摘要

大型音频语言模型(LALMs)利用多模态表示生成对音频自然语言查询的开放式回答。本文(1)提供经验证据表明,使用流行的MusicQA数据集评估LALMs无法衡量模型关于音乐的回答是否事实正确,(2)开发了一种新的评估LALMs音乐理解能力的协议。具体来说,我们提出一个评估协议,提示LALM提供可事实验证的信息,并将其开放式回答解析为结构化格式,使用精确率、召回率和F1分数进行客观评估。利用该协议,我们定义了一个基准测试,包含在三个不同数据集(MusicNet、Free Music Archive和OverClocked ReMix)上定义的六项事实信息检索任务。我们对九个最近的LALMs进行了基准测试,包括前沿模型如Gemini和最新的开放模型如Music Flamingo,并在https://github.com/DCL2004/LALM-Eval发布了评估脚本套件,以方便新LALMs的基准测试。

英文摘要

Large audio language models (LALMs) leverage multimodal representations to generate open-ended answers to natural language queries about audio. In this paper, we (1) provide empirical evidence that assessment of LALMs using the popular MusicQA dataset fails to measure whether a model's responses about music are factually correct, and (2) develop a new protocol for assessing the music comprehension capabilities of LALMs. Specifically, we propose an evaluation protocol that prompts a LALM for factually verifiable information, and parses its open-ended response into a structured format that can be objectively assessed using Precision, Recall, and F1 scores. Using this protocol, we define a benchmark consisting of six factual information retrieval tasks defined on three diverse datasets: MusicNet, the Free Music Archive, and OverClocked ReMix. We benchmark nine recent LALMs, including frontier models like Gemini and the latest open models like Music Flamingo, and release the suite of evaluation scripts at https://github.com/DCL2004/LALM-Eval to facilitate benchmarking of new LALMs.
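Once an open-ended response has been parsed into a structured list of items, scoring it with Precision, Recall, and F1 is straightforward. The following is a minimal sketch assuming exact matching after lowercasing and trimming; the function name `precision_recall_f1` and that matching rule are illustrative assumptions, and the released evaluation scripts may normalize answers differently.

```python
def precision_recall_f1(predicted, reference):
    """Score a parsed model answer against the ground-truth items.

    predicted: items extracted from the model's response (e.g. instruments)
    reference: ground-truth items from the dataset metadata
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a response naming violin, cello, and flute against a reference of violin and cello has full recall but loses precision for the extra item.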

2510.17620 2026-05-28 cs.CL 版本更新

Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models

忘记知识,记住使用:面向大型语言模型的上下文感知遗忘

Yuefeng Peng, Parnian Afshar, Megan Ganji, Thomas Butler, Amir Houmansadr, Mingxian Wang, Dezhi Hong

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Amazon(亚马逊)

AI总结 针对现有遗忘方法损害上下文可用性的问题,提出一种插件式目标项,在保持遗忘效果和保留集性能的同时恢复模型对已遗忘知识的上下文使用能力。

Comments ICML 2026

详情
AI中文摘要

大型语言模型可能编码需要移除的敏感信息或过时知识,以确保模型响应负责任且合规。遗忘学习已成为完整重新训练的高效替代方案,旨在移除特定知识同时保持模型整体效用。现有遗忘方法评估关注(1)目标知识的遗忘程度(遗忘集)和(2)保留集上的性能(即效用)。然而,这些评估忽略了一个重要的可用性方面:如果提示中重新引入已移除信息,用户可能仍希望模型利用该信息。在对六种最先进遗忘方法的系统评估中,我们发现它们一致损害了这种上下文效用。为解决此问题,我们用一个插件项增强遗忘目标,该插件项保留模型在上下文中存在已遗忘知识时使用它的能力。大量实验表明,我们的方法将上下文效用恢复到接近原始水平,同时仍然保持有效的遗忘和保留集效用。

英文摘要

Large language models may encode sensitive information or outdated knowledge that needs to be removed, to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.

2510.11170 2026-05-28 cs.LG cs.AI cs.CL 版本更新

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

EAGer: 基于熵感知的自适应推理时缩放生成方法

Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

发表机构 * University of Groningen(格罗宁根大学) University of Milan - Bicocca(米兰-比科卡大学) Cohere Labs(Cohere实验室)

AI总结 提出一种无需训练的生成方法EAGer,利用逐词熵分布动态分配计算资源,在复杂推理任务中提升性能并减少冗余计算。

详情
AI中文摘要

随着推理语言模型和测试时缩放方法作为提升模型性能范式的兴起,通常需要大量计算来从同一提示生成多个候选序列。这使模型能够探索通向正确答案的不同推理路径,但对每个提示都分配了相同的计算预算。基于不同提示具有不同复杂度因而需要不同计算量的假设,我们提出EAGer,一种无需训练的生成方法,通过逐词熵分布利用模型不确定性来减少冗余计算并同时提升整体性能。EAGer仅在存在高熵词时分支到多个推理路径,并将节省的计算预算重新分配到最需要探索替代路径的实例上。我们在复杂推理基准上对多个开源模型验证了EAGer,特别是在AIME 2025上展示了增益。当目标标签可访问时(如在RLVR训练流程中),EAGer在Pass@k上提升高达37%,且token减少59%;在测试时设置中,与全并行采样相比,仍能在Pass@k上提升12%,且token减少64%。

英文摘要

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, however, allocates the same compute budget for each prompt. Grounded on the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and reallocates the saved compute budget to instances where exploration of alternative paths is most needed. We validate EAGer across multiple open-source models on complex reasoning benchmarks, with gains specifically demonstrated on AIME 2025. When target labels are accessible -- as in RLVR training pipelines -- EAGer achieves up to +37% in Pass@k and 59% fewer tokens; in test-time settings it still yields +12% in Pass@k and 64% fewer tokens compared to Full Parallel Sampling.
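The quantity behind "token-wise entropy" can be illustrated with a short sketch. The abstract does not say how EAGer sets its branching criterion, so the fixed `threshold` and both function names here are assumptions made for illustration only.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(probs, threshold):
    """Hypothetical rule: branch only when the next-token entropy exceeds a threshold."""
    return token_entropy(probs) > threshold


confident_step = [0.97, 0.01, 0.01, 0.01]  # model is nearly certain
uncertain_step = [0.25, 0.25, 0.25, 0.25]  # model is maximally unsure over 4 tokens

for name, dist in [("confident", confident_step), ("uncertain", uncertain_step)]:
    print(name, round(token_entropy(dist), 3), should_branch(dist, threshold=1.0))
```

Under such a rule, prompts whose generations never hit a high-entropy token stay on a single path, which is where the saved compute would come from.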

2510.10185 2026-05-28 cs.CL cs.AI cs.MA 版本更新

Auditing medical multi-agent AI reveals risks of false consensus

审计医疗多智能体AI揭示虚假共识风险

Yinghao Zhu, Lei Gu, Zixiang Wang, Haoran Sang, Dehao Sui, Wen Tang, Lan Mi, Yasha Wang, Junyi Gao, Liang Yao, Tianfan Fu, Ewen Harrison, Lequan Yu, Liantao Ma

发表机构 * National Engineering Research Center for Software Engineering, Peking University(北京大学软件工程国家工程研究中心) School of Computing and Data Science, The University of Hong Kong(香港大学计算机与数据科学学院) Department of Nephrology, Peking University Third Hospital(北京大学第三医院肾内科) Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Department of Lymphoma, Peking University Cancer Hospital & Institute(教育部癌症发生与转化研究重点实验室、北京大学肿瘤医院淋巴瘤科) Department of Automation, Tsinghua University(清华大学自动化系) Centre for Medical Informatics, The University of Edinburgh(爱丁堡大学医学信息学中心) Health Data Research UK(英国健康数据研究机构) Lee Kong Chian School of Medicine, Nanyang Technological University(南洋理工大学李科贤医学院) State Key Laboratory for Novel Software Technology, School of Computer Science, Nanjing University(南京大学新型软件技术国家重点实验室、计算机科学学院)

AI总结 本研究提出MedAgentAudit框架,通过专家验证的审计流程诊断医疗多智能体系统中的协作失败模式,发现虚假共识、权威偏差等系统性风险。

Comments Code and Data: https://github.com/MedX-PKU/MedAgentAudit

详情
AI中文摘要

大型语言模型正越来越多地被组装成医疗多智能体系统,通过专家角色、同行评审和共识形成模拟多学科会诊。然而,在临床决策支持中,表面共识并不足够。临床医生还需要知道智能体是否检查了证据、处理了分歧并保持了不确定性可见。当前评估主要关注最终准确性,未测试协作过程的安全性。本文介绍MedAgentAudit,一个基于临床的工作流审计框架,用于诊断和量化医疗多智能体系统中的协作失败模式。从3,600个执行日志中,我们推导出一个经专家验证的十种常见失败分类法,涵盖任务理解、协作讨论以及综合与决策。随后,我们部署一个经专家验证的自动审计器作为非干预探针,覆盖14,400个案例,涉及六种多智能体架构、六个医疗文本和视觉数据集以及每种模态的四个大语言模型设置。跨系统而言,协作带来不均衡的准确性提升和频繁的过程失败。16.63%的案例中存在无依据的观察结果,并向下游传播。在讨论中,智能体在98.42%的案例中重复初始观点而非重新审视证据,并在42.73%的案例中未能激活专家推理。在综合阶段,最终答案常常用权威或多数票替代证据检查,显示出权威偏差(28.76%,随讨论轮次从35.30%上升至68.75%)、自我矛盾(18.53%)、矛盾忽视(5.48%)和少数派压制(5.11%)。MedAgentAudit将医疗AI评估从输出评分重新定义为过程级安全与问责,为医学中透明、可审计且由临床医生监督的智能体系统提供了实践基础。

英文摘要

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through specialist roles, peer review and consensus formation. In clinical decision support, however, apparent consensus is not enough. Clinicians also need to know whether agents checked the evidence, addressed disagreement and kept uncertainty visible. Current evaluations largely score final accuracy, leaving the safety of the collaborative process untested. Here we introduce MedAgentAudit, a clinically grounded workflow audit framework for diagnosing and quantifying collaborative failure modes in medical multi-agent systems. From 3,600 execution logs, we derive an expert-validated taxonomy of ten recurrent failures spanning task comprehension, collaborative discussion, and synthesis and decision-making. We then deploy an expert-validated automated auditor as non-interventional probes across 14,400 cases, covering six multi-agent architectures, six medical text and vision datasets, and four large language model settings per modality. Across systems, collaboration yields uneven accuracy gains and frequent process failures. Unsupported observations affect 16.63% of cases and propagate downstream. In discussion, agents repeat initial views in 98.42% of cases rather than re-examining evidence, and fail to activate specialist reasoning in 42.73%. During synthesis, final answers often substitute authority or majority count for evidence checking, showing authority bias in 28.76% (rising from 35.30% to 68.75% across rounds), self-contradiction in 18.53%, contradiction neglect in 5.48% and minority suppression in 5.11%. MedAgentAudit reframes medical AI evaluation from output scoring to process-level safety and accountability, providing a practical foundation for transparent, auditable and clinician-supervised agentic systems in medicine.

2510.06974 2026-05-28 cs.CL 版本更新

Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups

探究中文大语言模型中的社会身份偏见:基于性别代词与社会群体

Geng Liu, Feng Li, Junjie Mu, Mengxiao Zhu, Francesco Pierri

发表机构 * Department of Electronics, Information and Bioengineering, Politecnico di Milano(电子、信息与生物工程系,米兰理工大学) University of Science and Technology of China(中国科学技术大学)

AI总结 通过设计考虑中文语言特性的提示,评估十种代表性中文大语言模型在240个社会群体上的内群体与外群体框架下的情感和毒性偏见,发现系统性不对称且指令调优减少情感偏见但毒性差距更持久,女性标记代词与更高毒性相关。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地部署在面向用户的应用程序中,引发了对它们可能反映和放大社会偏见的担忧。我们使用针对中文的提示,在十种代表性模型中研究了中文LLMs中的社会身份偏见。我们的评估比较了240个在中国语境中显著的社会群体的内群体(“我们”)和外群体(“他们”)框架,使用了一个双层测量框架来评估情感和毒性。提示设计明确考虑了中文的语言特性,包括默认性别中立复数代词与其明确女性对应代词之间的区别,从而能够对社会身份框架效应进行受控比较。跨模型观察,我们发现了系统性的内群体-外群体不对称性,尽管其表达在不同测量维度上有所不同。特别是,指令调优通常减少情感不对称性,而毒性差距仍然更为持久。此外,在多个模型中,女性标记的复数代词比默认性别中立复数代词与更高的毒性相关。我们的研究引入了一个针对中文LLMs的语言感知评估框架,并表明(i)先前在英语中记录的社会身份偏见在中文中也有所体现,以及(ii)中文特有的语言结构可以揭示在仅英语环境中无法直接观察到的偏见模式。

英文摘要

Large language models (LLMs) are increasingly deployed in user-facing applications, raising concerns that they may reflect and amplify social biases. We investigate social identity biases in Chinese LLMs using Mandarin-specific prompts across ten representative models. Our evaluation compares ingroup ("We") and outgroup ("They") framings across 240 social groups salient in the Chinese context, using a two-tiered measurement framework that assesses both sentiment and toxicity. The prompt design explicitly accounts for linguistic properties of Mandarin, including the distinction between the default gender-neutral plural pronoun and its explicitly feminine counterpart, enabling a controlled comparison of social identity framing effects. Across models, we observe systematic ingroup-outgroup asymmetries, although their expression differs across measurement dimensions. In particular, instruction tuning often reduces sentiment asymmetries, while toxicity gaps remain more persistent. Moreover, the feminine-marked plural pronoun is associated with higher toxicity than the default gender-neutral plural in several models. Our study introduces a language-aware evaluation framework for Chinese LLMs and shows that (i) social identity biases previously documented in English also manifest in Chinese and that (ii) Mandarin-specific linguistic structure can reveal bias patterns that are not directly observable in English-only settings.

2510.05291 2026-05-28 cs.CL 版本更新

Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

Camellia: 亚洲语言中LLMs文化偏见的基准测试

Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, Tanmoy Chakraborty, Yuki Arase, Keisuke Sakaguchi, JinYeong Bak, Wei Xu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Samsung R&D Institute Philippines(三星菲律宾研发院) Sungkyunkwan University(成均馆大学) Tohoku University(东北大学) National University of Singapore(新加坡国立大学) University of Copenhagen(哥本哈根大学) University of Michigan(密歇根大学) Indian Institute of Technology Delhi(印度理工学院德里分校) Institute of Science Tokyo(东京科学大学)

AI总结 提出Camellia基准,通过三个任务评估九种亚洲语言中多语言大模型对亚洲与西方文化实体的偏见,发现模型存在文化适应困难、情感关联差异及实体提取性能差距。

详情
AI中文摘要

随着大语言模型(LLMs)多语言能力的增强,它们对文化多样性实体的敏感性变得越来越重要。Naous等人(2024)的前期工作表明,LLMs在阿拉伯语中往往偏好与西方相关的实体。由于缺乏以实体为中心的多语言基准,这种偏见是否也存在于各种非西方语言中尚不清楚。在本文中,我们介绍了Camellia,这是一个用于评估九种亚洲语言(涵盖六种亚洲文化)中实体中心文化偏见的基准。Camellia包括19,530个手动注释的实体,这些实体与所涵盖的亚洲或西方文化相关,以及从社交媒体帖子中提取的2,173个这些实体的掩码上下文。利用Camellia,我们在三个任务中评估了四个最近的多语言LLMs的文化偏见:文化上下文适应、情感关联和实体抽取式问答。我们的分析表明,LLMs在这些语言中难以进行文化适应,不同地区开发的模型表现存在差异。我们进一步观察到,不同的LLM家族可能持有不同的偏见,这反映在它们将文化与特定情感联系起来的方式上。最后,我们发现LLMs在某些亚洲语言中可能难以理解上下文,从而在实体抽取中造成文化之间的性能差距。

英文摘要

As Large Language Models (LLMs) develop stronger multilingual capabilities, their sensitivity to culturally diverse entities becomes increasingly important. Prior work by Naous et al. (2024) has shown that LLMs often favor Western-associated entities in Arabic. Due to the lack of entity-centric multilingual benchmarks, it remains unclear if such biases also manifest in various non-Western languages. In this paper, we introduce Camellia, a benchmark for evaluating entity-centric cultural biases in nine Asian languages, spanning six Asian cultures. Camellia includes 19,530 manually annotated entities associated with the covered Asian or Western cultures, as well as 2,173 masked contexts for these entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLMs across three tasks: cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show that LLMs struggle with cultural adaptation across these languages, with performance differing across models developed in different regions. We further observe that different LLM families can hold distinct biases, reflected in the ways they link cultures to particular sentiments. Lastly, we find that LLMs can struggle with context understanding in some Asian languages, creating performance gaps between cultures in entity extraction.

2510.02329 2026-05-28 cs.CL cs.AI 版本更新

SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

SelfJudge: 通过自监督验证器加速推测解码

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee

发表机构 * Efficient AI Large Language Model(大型语言模型) Speculative Decoding(推测解码)

AI总结 提出SelfJudge方法,利用目标模型的自监督训练验证器,通过评估令牌替换后响应的语义保持性来加速推测解码,实现更优的推理-准确率权衡。

详情
Journal ref
ICML 2026
AI中文摘要

推测解码通过验证来自草稿模型的候选令牌与较大目标模型的匹配来加速LLM推理。最近的验证解码通过放宽验证标准,接受可能与目标模型输出存在微小差异的草稿令牌来加速这一过程,但现有方法受限于依赖人工标注或具有可验证真实结果的任务,限制了其在多样化NLP任务中的泛化能力。我们提出SelfJudge,通过目标模型的自监督训练验证器。我们的方法通过评估令牌替换后的响应是否保持原始响应的意义来衡量语义保持性,从而实现在多样化NLP任务中的自动验证器训练。实验表明,SelfJudge在推理-准确率权衡上优于验证解码基线,为更快的LLM推理提供了广泛适用的解决方案。

英文摘要

Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding boosts this process by relaxing verification criteria by accepting draft tokens that may exhibit minor discrepancies from target model output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves superior inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.

2506.08846 2026-05-28 cs.CY cs.CL cs.SD eess.AS 版本更新

Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia

自动语音识别技术审计实践中的陷阱:以失语症患者为例

Katelyn Xiaoying Mei, Anna Seo Gyeong Choi, Hilke Schellmann, Mona Sloane, Allison Koenecke

发表机构 * University of Washington(华盛顿大学) Cornell University(康奈尔大学) New York University(纽约大学) University of Virginia(弗吉尼亚大学)

AI总结 本文识别了标准ASR审计中的三个常见陷阱,并提出了一个整体审计框架,通过失语症患者的案例研究发现ASR系统对其表现更差。

Comments Published at the Proceedings of The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

详情
AI中文摘要

自动语音识别(ASR)系统的日益普及需要稳健的审计方法,以确保转录质量的公平性,特别是对于像失语症这样的言语障碍患者,他们不成比例地依赖ASR。虽然学术和行业审计揭示了不同用户群体之间的性能差异,但标准审计实践常常忽视可能掩盖对边缘群体伤害的细微差别。我们识别了标准ASR审计中的三个常见陷阱:(1)坚持单一的文本标准化方法,这可能掩盖ASR性能的差异并忽视边缘社区的标准化偏好;(2)展示高层次的人口统计发现,而不考虑按细微交叉亚组划分的性能差异,或依赖于相关的声学特性;(3)仅报告一个黄金标准指标(词错误率),这不足以量化常见的生成式AI错误,如幻觉。我们提出了一个解决这些陷阱的整体审计框架,并在对六个流行ASR系统的案例研究中发现,与对照组相比,失语症患者的ASR性能持续更差。我们呼吁从业者实施这些更适合快速变化的ASR环境的稳健、社区驱动的ASR审计实践。

英文摘要

Automatic Speech Recognition (ASR) systems' growing use warrants robust auditing approaches to ensure equitable transcription quality, especially for people with speech disorders like aphasia who disproportionately depend on ASR. While academic and industry audits have revealed performance disparities across user populations, standard auditing practices often overlook nuances that risk masking harm to marginalized groups. We identify three common pitfalls in standard ASR audits: (1) adhering to one method of text standardization, which can mask variance in ASR performance and ignore the standardization preferences of marginalized communities; (2) displaying high-level demographic findings without considering performance disparities by nuanced intersectional subgroups, or conditioning on relevant acoustic properties; and (3) reporting only one gold-standard metric (Word Error Rate), which inadequately quantifies common generative AI errors like hallucinations. We propose a holistic auditing framework addressing these pitfalls, and in a case study of six popular ASR systems, find consistently worse ASR performance for speakers with aphasia relative to a control group. We call on practitioners to implement these robust, community-driven ASR auditing practices better suited for the rapidly changing ASR landscape.
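Word Error Rate, the single metric named in pitfall (3), is the word-level edit distance between hypothesis and reference divided by the reference length. The sketch below is a plain dynamic-programming version written for illustration (it is not the audit code from the paper), and the lowercase-only `normalize` helper is a hypothetical stand-in for one possible text standardization choice, as in pitfall (1).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def normalize(text):
    """Hypothetical standardization: lowercase only."""
    return text.lower()


reference, hypothesis = "Okay I went home", "okay I went home"
raw_wer = word_error_rate(reference, hypothesis)
normalized_wer = word_error_rate(normalize(reference), normalize(hypothesis))
print(raw_wer, normalized_wer)
```

The same transcript pair scores differently before and after standardization, which is why reporting results under a single normalization scheme can hide variance in ASR performance.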

2507.08014 2026-05-28 cs.CL cs.AI cs.CY 版本更新

Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

大规模真实对话分析揭示LLM越狱的复杂性界限

Aldan Creo, Raul Castro Fernandez, Manuel Cebrian

发表机构 * Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Valencia, Spain(瓦伦西亚人工智能研究所,瓦伦西亚理工大学,西班牙瓦伦西亚) Department of Computer Science, The University of Chicago, Chicago, USA(计算机科学系,芝加哥大学,美国芝加哥) Center for Automation and Robotics, Spanish National Research Council, Madrid, Spain(自动化与机器人中心,西班牙国家研究委员会,西班牙马德里)

AI总结 通过分析超过200万条真实对话,发现越狱尝试的复杂性并不显著高于正常对话,且攻击复杂性随时间保持稳定,表明LLM安全演化受人类创造力限制。

Comments Code: https://github.com/ACMCMC/risky-conversations Results: https://huggingface.co/risky-conversations Visualizer: https://huggingface.co/spaces/risky-conversations/Visualizer

详情
AI中文摘要

随着大型语言模型(LLM)的日益部署,理解越狱策略的复杂性和演变对于AI安全至关重要。我们对来自不同平台(包括专门的越狱社区和通用聊天机器人)的超过200万条真实对话进行了大规模实证分析,研究了越狱复杂性。使用一系列复杂性指标,涵盖概率度量、词汇多样性、压缩比和认知负荷指标,我们发现越狱尝试并未表现出显著高于正常对话的复杂性。这一模式在专门的越狱社区和普通用户群体中一致成立,表明攻击的复杂性存在实际界限。时间分析显示,虽然用户攻击的毒性和复杂性随时间保持稳定,但助手响应的毒性有所下降,表明安全机制正在改进。复杂性分布中缺乏幂律标度进一步指出了越狱发展的自然限制。我们的发现挑战了攻击者与防御者之间军备竞赛不断升级的主流说法,反而表明LLM安全演化受人类创造力限制,而防御措施持续进步。我们的结果突显了学术越狱披露中的关键信息危害,因为超出当前复杂性基线的复杂攻击可能破坏观察到的平衡,并在防御适应之前造成广泛伤害。

英文摘要

As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remains stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
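Of the metric families listed in this abstract, the compression ratio is the simplest to reproduce. The abstract does not name a compressor, so using `zlib` here is an assumption, and `compression_ratio` is a hypothetical helper rather than code from the linked repository.

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size; lower values mean more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw)) / len(raw)


repetitive_prompt = "ignore previous instructions " * 40
ordinary_prompt = "Could you summarize the main findings of this paper for me?"

print(round(compression_ratio(repetitive_prompt), 3))
print(round(compression_ratio(ordinary_prompt), 3))
```

Very short strings can have a ratio above 1 because of the fixed compression header, so comparisons are only meaningful between texts of similar length.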

2507.06999 2026-05-28 cs.CV cs.CL cs.LG 版本更新

Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

有意学习,直觉行动:解锁多模态大语言模型的测试时推理能力

Yahan Yu, Yuyang Dong, Masafumi Oyamada

发表机构 * Kyoto University(京都大学) NEC Corporation, Japan(日本NEC公司)

AI总结 提出D2I框架,通过训练时使用基于规则的格式奖励进行有意推理以增强模态对齐,推理时移除显式策略转为直觉推理,从而提升多模态大语言模型的推理能力,无需额外标注或复杂奖励。

Comments 22 pages, 24 figures

详情
AI中文摘要

推理对于大型语言模型(LLMs)至关重要,尤其是在数学问题求解等复杂任务中。然而,多模态推理在模态对齐和训练可扩展性方面仍面临挑战,因为许多现有方法依赖于额外的标注或复杂的基于规则的奖励。为了解决这些问题,我们提出了“有意到直觉”推理框架(D2I),该框架无需额外标注或复杂奖励即可提升多模态大语言模型(MLLMs)的理解和推理能力。在训练过程中,D2I使用仅由基于规则的格式奖励监督的有意推理策略来增强模态对齐。在推理过程中,它通过移除这些显式策略转向直觉推理,使模型能够在其响应中隐式应用所获得的能力。D2I在域内和域外基准测试中均优于基线,突显了格式奖励在培养可迁移多模态推理技能方面的有效性,并表明将训练时的推理深度与测试时的响应灵活性解耦是有益的。

英文摘要

Reasoning is essential for large language models (LLMs), especially in complex tasks such as mathematical problem solving. However, multimodal reasoning still faces challenges in modality alignment and training scalability, as many existing methods rely on additional annotations or complex rule-based rewards. To address these issues, we propose the Deliberate-to-Intuitive reasoning framework (D2I), which improves the understanding and reasoning abilities of multimodal LLMs (MLLMs) without extra annotations or complex rewards. During training, D2I uses deliberate reasoning strategies supervised only by rule-based format rewards to enhance modality alignment. During inference, it shifts to intuitive reasoning by removing these explicit strategies, allowing the model to implicitly apply the acquired abilities in its responses. D2I outperforms baselines on both in-domain and out-of-domain benchmarks, highlighting the effectiveness of format rewards in fostering transferable multimodal reasoning skills and suggesting the benefit of decoupling training-time reasoning depth from test-time response flexibility.

2502.05242 2026-05-28 cs.CL cs.AI cs.CV cs.LG 版本更新

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

超越外部监控:增强大型语言模型的透明度以便于监控

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) ICISEE, Shanghai Jiao Tong University(上海交通大学ICISEE) School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University(上海交通大学数学科学学院) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 提出TELLME方法,通过改进大型语言模型的内部表征透明度,帮助监控者识别不当和敏感行为,并在去毒化任务中验证其有效性。

Comments 28 pages,8 figures,15 tables

详情
AI中文摘要

大型语言模型(LLMs)的能力日益增强,但其思维和决策过程的机制仍不清楚。思维链(CoTs)常被用来外化LLMs的思维,但这一策略未能准确反映LLMs的思维过程。基于LLMs隐藏表征的技术提供了内部视角,以改善对其潜在思维的可监控性。然而,以往的方法仅尝试开发外部模块,而非使LLMs本身更易于监控。本文提出了一种新方法TELLME,提高了LLMs的透明度,并帮助监控者识别不合适和敏感的行为。此外,我们在去毒化任务上展示了TELLME的有效性,LLMs在多模态测试集、不同架构和不同参数规模上均取得了一致的改进。我们进一步从最优传输理论和实证角度分析了TELLME对LLMs泛化能力的提升。

英文摘要

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales. We further analyze TELLME's improvement on LLMs' generalization ability from both optimal transport theory and empirical perspectives.

2503.18893 2026-05-28 cs.CL cs.LG 版本更新

xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction

xKV:通过对齐奇异向量提取的跨层KV缓存压缩

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Hung-Yueh Chiang, Yash Akhauri, Xilai Dai, Huiqiang Jiang, Yucheng Li, Luis Ceze, Kai-Chiang Wu, Mohamed S. Abdelfattah

发表机构 * Cornell University(康奈尔大学) University of Washington(华盛顿大学) Department of Computer Science, National Yang Ming Chiao Tung University(国立阳明交通大学计算机科学系) The University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Surrey(萨里大学) Microsoft Research Asia(微软亚洲研究院)

AI总结 提出xKV,一种通过跨层共享低秩子空间压缩KV缓存的后训练方法,实现高达8倍压缩且保持长上下文任务精度,并引入选择性重建实现端到端加速。

Comments ICML 2026

详情
AI中文摘要

长上下文大型语言模型(LLMs)支持强大的应用,但由于键值状态(KV-Cache)导致高内存成本。最近的研究尝试跨层共享KV-Cache,但这些方法要么需要昂贵的预训练,要么依赖于实践中通常有限的逐token跨层余弦相似度。我们通过中心核对齐(CKA)表明,KV-Cache的主要奇异向量在层间对齐良好。受此观察启发,我们提出xKV,一种后训练压缩方法,将分组层的KV-Cache联合分解为共享的低秩子空间,大幅减少KV-Cache内存。在广泛使用的LLMs上,xKV实现了高达8倍的KV-Cache压缩,同时在长上下文任务和多轮设置中保持准确性。为进一步提高效率,我们在解码时引入选择性重建(SR)。结合SR,xKV相比全注意力基线实现了高达4.23倍的端到端加速,并在相似精度水平下以30%更高的吞吐量超越了显著基线。总体而言,xKV提供了一种即插即用的方法,用于减少长上下文LLM推理的内存和延迟。我们的代码公开于:https://github.com/abdelfattah-lab/xKV。

英文摘要

Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key-value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either require expensive pretraining or rely on per-token cross-layer cosine similarity that is often limited in practice. We show, via Centered Kernel Alignment (CKA), that the dominant singular vectors of KV-Cache are well aligned across layers. Motivated by this observation, we propose xKV, a post-training compression method that jointly factorizes grouped-layer KV-Cache into a shared low-rank subspace, substantially reducing KV-Cache memory. Across widely used LLMs, xKV achieves up to 8x KV-Cache compression while preserving accuracy on long-context tasks and in multi-turn settings. To further improve efficiency, we introduce Selective Reconstruction (SR) at decode time. Combined with SR, xKV achieves up to 4.23x end-to-end speedup over the full attention baseline, and surpasses notable baselines with 30% higher throughput under a similar accuracy level. Overall, xKV provides a plug-and-play approach to reduce both memory and latency for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
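One plausible reading of "jointly factorizes grouped-layer KV-Cache into a shared low-rank subspace" is a truncated SVD over the caches of a layer group concatenated along the feature axis. The NumPy sketch below illustrates that reading only. It is not the released xKV code, and the function name, the shapes, and the choice to share the token-side basis are all assumptions.

```python
import numpy as np


def shared_low_rank(kv_layers, rank):
    """Factorize a group of per-layer caches into one shared basis plus per-layer coefficients.

    kv_layers: list of arrays, each of shape (num_tokens, head_dim), one per layer.
    Returns (basis, coeffs), where basis @ coeffs[i] approximates kv_layers[i].
    """
    stacked = np.concatenate(kv_layers, axis=1)          # (tokens, layers * dim)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = u[:, :rank]                                  # shared across the group
    coeffs = s[:rank, None] * vt[:rank]                  # (rank, layers * dim)
    return basis, np.split(coeffs, len(kv_layers), axis=1)


# Synthetic caches that really do share a rank-2 subspace (illustrative data only).
rng = np.random.default_rng(0)
shared = rng.standard_normal((16, 2))
layers = [shared @ rng.standard_normal((2, 8)) for _ in range(3)]

basis, coeffs = shared_low_rank(layers, rank=2)
errors = [float(np.abs(basis @ c - layer).max()) for c, layer in zip(coeffs, layers)]
print(basis.shape, [c.shape for c in coeffs], max(errors))
```

Storage drops from `tokens * layers * dim` values to `tokens * rank + rank * layers * dim`, so the saving grows with the number of grouped layers whenever their dominant singular vectors are well aligned.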

2407.21075 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Apple Intelligence Foundation Language Models

Apple Intelligence 基础语言模型

Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Daniel Parilla, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, Vivek Rangarajan Sridhar, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren

发表机构 * Apple(苹果公司)

AI总结 本文介绍了为 Apple Intelligence 功能开发的基础语言模型,包括一个约30亿参数的设备端高效运行模型和一个用于私有云计算的服务器端大模型,并描述了其架构、训练数据、优化过程和评估结果。

详情
AI中文摘要

我们介绍了为支持 Apple Intelligence 功能而开发的基础语言模型,包括一个约30亿参数的模型,旨在设备上高效运行,以及一个用于私有云计算的大型服务器端语言模型。这些模型旨在高效、准确且负责任地执行广泛的任务。本报告描述了模型架构、用于训练模型的数据、训练过程、模型如何针对推理进行优化以及评估结果。我们强调了对负责任人工智能的关注,以及这些原则如何贯穿于模型开发的整个过程。

英文摘要

We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.

2308.04823 2026-05-28 cs.CL 版本更新

Evaluating the Generation Capabilities of Large Chinese Language Models

评估大型中文语言模型的生成能力

Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang

发表机构 * AI Research Center, Besteasy Language Technology Co., Ltd(最佳语言技术有限公司人工智能研究中心) LanguageX AI Lab(LanguageX人工智能实验室)

AI总结 提出CG-Eval自动评估框架和Gscore复合指标,用于多学科领域评估大型中文语言模型的生成能力。

详情
AI中文摘要

本文揭示了CG-Eval,这是首个专为评估大型中文语言模型在多个学科领域中的生成能力而设计的全面自动化评估框架。CG-Eval以其自动化流程脱颖而出,该流程基于模型在六个关键领域(科学与工程、人文与社会科学、数学计算、医师资格考试、司法考试和注册会计师考试)中生成精确且上下文相关回答的能力进行关键评估。同时,我们引入了Gscore,这是一种创新的复合指标,由多个指标的加权和开发而成。Gscore独特地自动测量模型文本生成相对于参考标准的质量,提供对模型性能的详细而细致的评估。这种自动化不仅提高了评估过程的效率和可扩展性,还确保了跨不同模型的客观和一致评估。详细的测试数据和结果,展示了所评估模型的强大能力和比较性能,可在http://cgeval.besteasy.com/获取。

英文摘要

This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework designed for assessing the generative capabilities of large Chinese language models across a spectrum of academic disciplines. CG-Eval stands out for its automated process, which critically assesses models based on their proficiency in generating precise and contextually relevant responses to a diverse array of questions within six key domains: Science and Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical Practitioner Qualification Examination, Judicial Examination, and Certified Public Accountant Examination. Alongside this, we introduce Gscore, an innovative composite index developed from a weighted sum of multiple metrics. Gscore uniquely automates the quality measurement of a model's text generation against reference standards, providing a detailed and nuanced assessment of model performance. This automation not only enhances the efficiency and scalability of the evaluation process but also ensures objective and consistent assessment across various models. The detailed test data and results, highlighting the robust capabilities and comparative performance of the evaluated models, are accessible at http://cgeval.besteasy.com/.

2304.12986 2026-05-28 cs.CL cs.AI 版本更新

Measuring Massive Multitask Chinese Understanding

测量大规模多任务中文理解

Hui Zeng

发表机构 * Besteasy (Beijing) Language Technology Co., Ltd.(北京最佳语言科技有限公司)

AI总结 针对中文大语言模型缺乏能力评估的问题,提出一个涵盖医学、法律、心理学和教育四大领域的多任务测试(其中医学15个子任务、教育8个子任务),通过零样本准确率评估模型性能,发现最佳模型平均领先最差模型近18.6个百分点,且所有模型在法律领域表现不佳(最高零样本准确率仅为0.239)。

详情
AI中文摘要

大规模中文语言模型的发展蓬勃,但缺乏相应的能力评估。因此,我们提出一个测试来衡量大型中文语言模型的多任务准确性。该测试涵盖四大领域,包括医学、法律、心理学和教育,其中医学有15个子任务,教育有8个子任务。我们发现,在零样本设置中,表现最好的模型平均比表现最差的模型高出近18.6个百分点。在四大领域中,所有模型的最高平均零样本准确率为0.512。在子领域中,只有GPT-3.5-turbo模型在临床医学上达到了0.693的零样本准确率,这是所有模型在所有子任务中的最高准确率。所有模型在法律领域表现不佳,最高零样本准确率仅为0.239。通过全面评估多个学科知识的广度和深度,该测试可以更准确地识别模型的不足之处。

英文摘要

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.