2605.28806 2026-05-28 cs.CV cs.CL cs.IR 版本更新

Personal Visual Memory from Explicit and Implicit Evidence

来自显式和隐式证据的个人视觉记忆

Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）； Adobe Research（Adobe研究院）

AI总结本文提出个人视觉记忆基准和VisualMem混合架构，通过显式与隐式视觉证据增强AI代理的长期记忆，显著提升个性化任务性能。

Comments Project Page: https://viettmab.github.io/visualmem-page/

AI中文摘要

长期记忆对于个性化AI代理越来越重要，然而现有的基准和方法仍然主要以文本为中心。即使包含图像，后续问题所需的用户特定信息通常仅从文本中即可恢复，并且大多数记忆系统将图像轮次简化为通用描述。然而，图像通常携带文本很少陈述的个人信息——包括显式证据（如重复出现的用户相关实体）和隐式证据（如从视觉或多模态线索推断出的潜在用户事实）。我们引入了一个针对这两种证据形式的个人视觉记忆基准，并提出了VisualMem，一种混合视觉-文本架构，通过结构化个人视觉记忆模块增强文本记忆后端。VisualMem不是将图像压缩为描述，而是利用对话上下文来解析身份、所有权和持久的用户事实。实验表明，VisualMem在我们的基准上显著优于先前的记忆系统，同时在标准文本记忆基准上保持竞争力，这表明个人视觉记忆是个性化AI代理长期记忆中一个独特且重要的组成部分。

英文摘要

Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states -- both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual--text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.

2605.28805 2026-05-28 cs.CL cs.AI cs.CV cs.LG 版本更新

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

OmniVerifier-M1: 具有显式结构化重校准的多模态元验证器

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang

发表机构 * Tsinghua University（清华大学）； Pennsylvania State University（宾夕法尼亚州立大学）； University of Southern California（南加州大学）； Microcyto ； Princeton University（普林斯顿大学）

AI总结提出OmniVerifier-M1，通过符号化元验证（如边界框）和解耦强化学习，实现多模态大模型的可靠细粒度验证与动态区域级自校正。

Comments ICML 2026. Project: https://github.com/Cominclip/OmniVerifier

AI中文摘要

视觉结果日益成为多模态大语言模型的核心，因此可靠且细粒度的验证对于扩展通用基础模型至关重要。在这项工作中，我们研究了多模态元验证，它利用验证器生成的推理过程而非仅决策信号，并探索如何有效地将元验证反馈纳入多模态验证器训练。我们发现了两个关键发现。首先，符号化验证器输出（例如边界框）作为元验证推理过程优于文本解释，能够实现高效的基于规则的强化学习奖励，同时避免依赖来自辅助评判模型的基于模型的奖励。其次，解耦二元判断和元验证的强化学习目标显著优于联合奖励优化，这是由于输出结构和学习动态的内在差异。基于这些见解，我们训练了OmniVerifier-M1，一个利用符号化元验证和解耦强化学习的通用视觉验证器。OmniVerifier-M1提供稳健的验证和细粒度的错误定位，并进一步实现了M1-TTS，一个由验证器驱动的智能体生成系统，实现动态区域级自校正。这种方法为更可靠、可解释和细粒度的多模态验证铺平了道路，支持更安全、更可控的基础模型部署。

英文摘要

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.

2605.28802 2026-05-28 cs.CL 版本更新

Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

人类标注变异作为稳定信号：通过跨标注者偏好优化学习标注者特定的解释行为

Beiduo Chen, Pingjun Hong, Ziyun Zhang, Benjamin Roth, Anna Korhonen, Barbara Plank

发表机构 * MaiNLP ； Center for Information and Language Processing（信息与语言处理中心）； LMU Munich（慕尼黑大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）； University of Vienna（维也纳大学）； LTL ； University of Cambridge（剑桥大学）

AI总结研究大语言模型能否学习并复现标注者特定的标签-解释行为，提出跨标注者偏好优化（CAPO）方法，通过对比目标标注者与其他有效但非目标标注者的响应来提升模仿和归因能力。

Comments 43 pages, 20 figures

AI中文摘要

自由文本解释通过揭示标注者决策背后的推理和偏好，将人类标注变异（HLV）扩展到标签分歧之外。我们研究大语言模型（LLM）能否学习并复现这种标注者特定的标签-解释行为。使用两个句子对任务（自然语言推理和释义判断），每个任务有四位标注者，我们首先分析标注者是否表现出稳定的个体模式。我们发现，由于强烈的输入内容效应，这些模式在单标注层面较弱，但在减少输入内容影响并进行标注者级别聚合后变得可检测。然后，我们比较了提示学习和监督微调（SFT）基线，并提出了跨标注者偏好优化（CAPO），该方法将目标标注者的响应与同一输入的其他有效但非目标标注者的标注进行对比。实验表明，提示学习有限且不稳定，SFT能更好地捕捉标注者特定行为，而CAPO进一步改进了聚合感知模仿和基于判断的归因，同时在人类验证下保留了目标特定的推理模式。总体而言，我们的结果表明，HLV可以作为标注者特定的标签-解释行为被学习，这为基于标注者历史而非仅标签的可扩展解释型标注提供了路径。

英文摘要

Free-text explanations extend human label variation (HLV) beyond label disagreement by revealing the reasoning and preferences behind annotators' decisions. We study whether large language models (LLMs) can learn and reproduce such annotator-specific label-explanation behavior. Using two sentence-pair tasks with four annotators each -- natural language inference and paraphrase judgment -- we first analyze whether annotators exhibit stable individual patterns. We find that such patterns are weak at the single-annotation level due to strong input-content effects, but become detectable after input-content reduction and annotator-level aggregation. We then compare prompting and supervised fine-tuning (SFT) baselines and propose cross-annotator preference optimization (CAPO), which contrasts a target annotator's response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, SFT better captures annotator-specific behavior, and CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation. Overall, our results show that HLV can be learned as annotator-specific label-explanation behavior, suggesting a path toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.

2605.28791 2026-05-28 cs.CL cs.AI 版本更新

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

技能条件门控自蒸馏用于大语言模型推理

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao

发表机构 * Tsinghua University（清华大学）； Fudan University（复旦大学）； City University of Hong Kong（香港城市大学）； Huazhong University of Science and Technology（华中科技大学）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）

AI总结提出技能条件门控自蒸馏（SGSD），通过从经验技能库中检索技能-错误对构建多教师池，并利用验证器验证教师极性，以鲁棒门控目标蒸馏信息性师生差异，在弱先验信息假设下提升数学推理性能。

AI中文摘要

同策略自蒸馏（SD）通过使用教师端特权信息（PI）将稀疏的验证器结果转化为密集的词元级监督，从而改善大语言模型推理。现有方法通常假设可信的PI，例如参考答案或成功轨迹。我们探究PI能否改为来自经验衍生的技能库，其中检索到的技能紧凑且可重用，但也可能不相关或具有误导性。我们提出技能条件门控自蒸馏（SGSD），将基于技能的SD表述为教师假设验证而非无条件模仿。SGSD检索技能-错误对，构建多教师池，并让所有技能条件教师对同一个普通提示下的学生输出进行评分。验证器验证每个教师的极性：支持成功或抑制失败提供正向监督，而相反立场则被反转。然后，一个鲁棒的门控目标蒸馏信息性的师生差异，同时抑制不确定或极端信号。在多个数学推理基准上的实验表明，SGSD持续优于GRPO，并在更弱的PI假设下与答案条件OPSD保持竞争力。例如，在Qwen3-1.7B上，SGSD在AIME24、AIME25和HMMT25上平均比GRPO高出6.2%，比OPSD高出1.7%。我们的代码可在 https://github.com/walawalagoose/SGSD 获取。

英文摘要

On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.

2605.28782 2026-05-28 cs.CL 版本更新

Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

大型语言模型能否处理话语标记？以口语马来语为例

Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, Guangliang Liu, Xi Chen

发表机构 * Nanyang Technological University（南洋理工大学）； University of Mississippi（密西西比大学）； Indiana University（印第安纳大学）

AI总结本文通过构建MalayPrag基准和提出五个属性，系统评估并改进了大型语言模型在口语马来语中处理话语标记的能力。

AI中文摘要

话语标记，如 well 和 kind of，是使LLMs更接近人类“说话”方式的关键组成部分。它们用于传达情感、意图和人际意义。然而，现有研究尚未全面了解LLMs处理话语标记的能力。此外，有限的研究主要集中在英语等高资源语言上，对东南亚语言的关注很少。在本文中，我们（1）提出了MalayPrag，一个旨在系统评估和分析LLMs处理口语马来语中话语标记能力的基准；（2）引入了五个属性，这些属性提供了一个基于语言学的统一框架，用于解释话语标记的语用功能。应用这两项贡献，我们提示十个现成的LLMs执行三个预测任务。实验结果表明，当前LLMs在准确将马来语话语标记与其语用功能联系起来方面面临重大挑战。提供本研究设计的五个属性可显著改善这些联系，突出了为模型语用能力提供结构化支架的必要性。

英文摘要

Discourse particles, such as "well" and "kind of", are crucial components that enable LLMs to "speak" more like humans. They are used to convey emotions, intentions, and interpersonal meanings. However, existing studies have not yet built a comprehensive understanding of LLMs' capabilities in handling discourse particles. Moreover, the limited number of studies focuses primarily on high-resource languages such as English, with little attention paid to Southeast Asian languages. In this paper, we (1) propose MalayPrag, a benchmark designed to systematically evaluate and analyze LLMs' capabilities in handling discourse particles in colloquial Malay; and (2) introduce five attributes that provide a linguistically grounded, unified framework for interpreting the pragmatic functions of discourse particles. Applying these two contributions, we prompt ten off-the-shelf LLMs to perform three prediction tasks. The experimental results reveal substantial challenges for current LLMs in accurately connecting discourse particles with their pragmatic functions in Malay. The provision of the five attributes designed in this study is found to significantly improve these connections, highlighting the need for structured scaffolding for models' pragmatic competence.

2605.28779 2026-05-28 cs.CL cs.CV 版本更新

The Abstraction Gap in Vision-Language Causal Reasoning

视觉-语言因果推理中的抽象差距

Chinh Hoang, Mohammad Rashedul Hasan

发表机构 * Department of Electrical and Computer Engineering, University of Nebraska--Lincoln, Lincoln, Nebraska, USA（电气与计算机工程系，内布拉斯加大学林肯分校，内布拉斯加，林肯，美国）

AI总结针对视觉-语言模型（VLM）生成因果解释时语言流畅性与忠实因果推理的混淆问题，提出双探针方法和抽象差距（AG）指标，通过CAGE基准评估发现多数模型存在显著AG，但通过预训练和架构选择可缩小差距。

AI中文摘要

视觉-语言模型（VLM）能生成流畅的因果解释，但当前的评估无法区分语言合理性与忠实因果推理。我们提出一种双探针方法来分离这些属性。文本探针测量语言质量。链式文本探针要求模型首先生成显式因果链。抽象差距（AG）指标量化归一化的性能差异。在CAGE（因果抽象差距评估）基准上评估八个VLM，该基准包含跨越Pearl因果层次的5,500张图像上的49,500个问题，我们发现七个模型的AG超过0.50，文本得分为6-8，但链式得分低于2.5。在45,000个链式标注样本上进行微调未能缩小差距。然而，一个模型实现了接近零的AG。该能力存在于当前VLM架构中，并取决于预训练和架构选择。CAGE为评估VLM中的忠实因果推理提供了诊断工具。

英文摘要

Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6--8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
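
The abstract describes the Abstraction Gap only as a "normalized performance difference" between the two probes and does not state the formula. The following is a minimal Python sketch under a loudly labeled assumption: the gap is normalized by the text-only score. The function name `abstraction_gap` and this choice of normalization are hypothetical, and the paper's actual definition may differ.

```python
def abstraction_gap(text_score: float, chain_score: float) -> float:
    """Hypothetical AG: relative drop from the text-only score to the chain score.

    Assumption: AG = (text - chain) / text. The abstract does not give
    the normalization actually used by the authors.
    """
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return (text_score - chain_score) / text_score


# Illustrative inputs taken from the ranges quoted in the abstract
# (text scores of 6-8, chain scores below 2.5); not real model results.
for text, chain in [(6.0, 2.5), (8.0, 2.5)]:
    print(f"text={text}, chain={chain}, AG={abstraction_gap(text, chain):.2f}")
```

Under this assumed definition, both illustrative pairs yield an AG above 0.50, which is at least consistent with the ranges reported in the abstract.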

2605.28778 2026-05-28 cs.CL 版本更新

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

LLMs 能否使用语言不确定性标记可靠地反映内在置信度？

Gabrielle Kaili-May Liu, Arman Cohan

发表机构 * Yale University（耶鲁大学）

AI总结本研究首次系统探究大语言模型（LLMs）是否能够稳定且泛化地将其语言置信度标记与内在置信度关联，并评估上下文特征的影响，通过提出7个指标分析标记内在置信度的稳定性，发现LLMs即使在模型中心解释下也存在忠实校准偏差。

Comments Code: https://github.com/yale-nlp/marker_internal_confidence

AI中文摘要

LLMs 的语言表达置信度应忠实反映其内在不确定性。尽管近期研究表明 LLMs 在以人类对齐的方式使用认知标记（例如，“很可能...”）方面存在困难，但尚不清楚模型是否能够应用其自身的语言置信度框架，以稳定且可泛化的方式将标记与特定置信水平关联起来，以及上下文特征如何影响这种能力。我们首次对此问题进行了系统研究，将_标记内在置信度_（MIC）形式化为模型在给定任务领域中与特定认知标记相关联的估计内在置信度。我们提出了7个指标来评估 MIC 在分布内和跨分布的稳定性。将我们的分析框架应用于多种模型和任务，我们发现，即使在模型中心解释标记含义的情况下，LLMs 仍然存在忠实校准偏差，尽管在任务间保持了一定程度的一致排序，但难以根据内在置信度区分跨分布的标记。这为现有工作提供了关键且互补的证据，有助于全面理解 LLMs 中的忠实校准，强调了需要更对齐和稳定的标记使用以提高可信度和可靠性。

英文摘要

LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing _marker internal confidence_ (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.

2605.28775 2026-05-28 cs.LG cs.AI cs.CL 版本更新

反向探测：临床文本中大语言模型的监督式词级不确定性量化

Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang

发表机构 * University of Florida（佛罗里达大学）； U.S. National Library of Medicine（美国国家医学图书馆）

AI总结提出反向探测框架，利用预标注摘要从模型内部激活中提取词级不确定性信号，在临床文本中实现高效、可解释的不确定性量化。

AI中文摘要

随着大语言模型越来越多地应用于临床文本，确保它们能够可靠地表明自身的不确定性变得至关重要。大多数现有的不确定性量化（UQ）方法是为开放域生成设计的，无法在长临床文本中定位到词或跨度级别的不确定性。我们提出了反向探测，这是首个专门针对临床摘要的UQ框架，它直接从预标注的摘要中估计词级不确定性。与采样新输出不同，反向探测将文本视为探测模型内部状态的探针，从四类内部激活中提取不确定性信号。我们在两个专家标注的临床数据集上进行了评估，在所有指标上优于八个适配基线，AUPRC最高提升4倍，同时减少了推理时间和计算成本。特征分析表明，delta能量和邻域上下文是所有模型中最一致的预测因子。本研究提供了关于模型内部如何响应无支持的临床内容的可解释性见解。

英文摘要

As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model's internal state, extracting uncertainty signals from four categories of internal activations. We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.

2605.28732 2026-05-28 cs.CL cs.AI cs.LG 版本更新

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

MemTrace：大型语言模型记忆系统中的错误追踪与归因

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang

发表机构 * Zhejiang University（浙江大学）； Alibaba Group（阿里巴巴集团）

AI总结提出MemTrace框架，通过构建可执行的记忆演化图实现细粒度错误追踪，并利用自动归因方法定位根因，进而优化提示词提升下游任务性能。

Comments Ongoing work

AI中文摘要

记忆对于使大型语言模型支持长程推理至关重要，但现有的记忆系统仍然不可靠且难以调试。追踪记忆的动态演化对于理解信息如何随时间合成、传播或损坏至关重要。在这项工作中，我们研究了LLM记忆系统中错误追踪与归因的新问题。我们提出了一种新颖的框架，将记忆流水线转换为可执行的记忆演化图，从而实现对操作信息流的细粒度追踪。然后，我们构建了MemTraceBench，一个从代表性记忆系统（如Long-Context、RAG、Mem0和EverMemOS）收集的基准，以系统地研究记忆故障模式。我们进一步引入了一种自动归因方法，该方法迭代地追踪操作子图以定位任何失败案例的根本原因。我们的分析表明，记忆故障是系统性的，源于操作层面的问题，如信息丢失和检索错位。关键的是，我们利用这些细粒度的归因信号来指导下游提示优化，建立了一个自动纠正故障并提升最终任务性能高达7.62%的闭环系统。代码将在https://github.com/zjunlp/MemTrace发布。

英文摘要

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

2605.28714 2026-05-28 cs.CL cs.AI 版本更新

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

IPO-Mine：用于长多模态IPO文档的章节结构化分析的工具包和数据集

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel, Liqin Ye, Arnav Hiray, Rutwik Routu, Prasun Banerjee, Siddhartha Somani, Sudheer Chava

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Sai University（赛大学）； Duke University（杜克大学）

AI总结本文提出IPO-Mine工具包和数据集，通过标准化解析IPO文件为章节结构化文本和图像，构建大规模多模态数据集，并建立图表评估任务，揭示多模态模型在长文档分析中的对齐挑战。

Comments 12 pages

AI中文摘要

首次公开募股（IPO）文件是私营公司上市时发布的文件，允许个人（散户）投资者购买其股票。这些文件描述了公司的业务、财务状况和风险，是包含叙述性文本和图像的长篇多模态文档。尽管它们对金融市场至关重要，但目前缺乏用于使用现代语言和多模态模型研究IPO文件的大规模标准化数据集或基准。这些文档带来了重大挑战：文件通常超过50万个词元（token），且缺乏一致的结构组织。我们引入了IPO-Toolkit，这是一个开源框架，用于下载和解析IPO文件，将其标准化为章节结构化文本和提取的图像。该工具包分割文件、提取嵌入的图像，并生成结构化输出，从而支持对长多模态文档进行大规模、可重复的分析工作流。利用这一基础设施，我们构建了IPO-Dataset，这是一个大规模、章节结构化的多模态数据集，涵盖1994年至2026年超过109,000份IPO文件及其修订版，包含超过76,000张图像。我们针对提取的金融图表建立了结构化评估任务，包括图表质量和误导性评估。我们的实验表明，最先进的多模态模型在这些任务上常常与专家人类判断存在分歧，揭示了在长篇幅真实监管文档上进行多模态推理时的对齐挑战。除了基准测试，IPO-Dataset还支持对章节级文本变异以及视觉和文本披露实践的跨行业差异进行大规模分析。我们的代码、数据集和网站根据CC-BY-4.0公开提供。

英文摘要

An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm's business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.

2605.28710 2026-05-28 cs.CL cs.AI 版本更新

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

迈向可靠的多语言LLM作为评判者：一项实证研究

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

发表机构 * HiTZ Center - Ixa, University of the Basque Country EHU（HiTZ中心 - Ixa，巴斯克大学EHU）

AI总结本研究通过分析指令翻译、单语与多语言监督及模型规模等策略，探讨了在有无领域内数据情况下开发多语言LLM评判者的方法，并揭示了领域内数据可用时微调小模型可媲美专有模型、零样本大模型在域外更有效等关键权衡。

AI中文摘要

评估基于LLM的社会智能体的真实性：对西班牙在线新闻反应的案例研究

Alejandro Buitrago López, Alberto Ortega Pastor, Javier Pastor-Galindo, José A. Ruipérez-Valiente

发表机构 * Faculty of Computer Science, University of Murcia（计算机科学系，穆尔西亚大学）

AI总结通过比较真实与LLM生成的西班牙新闻评论，研究LLM在仇恨言论、情感和语义对齐三个维度上的真实性，发现现成模型表现不佳，微调可部分改善。

AI中文摘要

基于LLM的社会智能体越来越多地被用于模拟在线社交行为，但其真实性仍然难以验证。现有工作主要依赖通用基准，而对简短的反应性话语（如受众对在线新闻的回复）关注较少。在本文中，我们评估LLM生成的西班牙新闻反应是否再现了真实受众话语的可测量属性。使用Hatemedia数据集，我们将5,631条新闻与58,555条真实受众反应配对，并在共享实验设置下使用五个LLM生成匹配的合成数据集。我们从仇恨言论、情感和语义对齐三个维度比较真实和合成反应，考虑现成和微调生成。结果表明，现成模型是真实受众反应的糟糕代理：它们严重低估仇恨言论，引入模型特定的情感偏差，并且在分布上与人类回复相距甚远。微调不均匀地提高了保真度。Qwen3提供了最平衡的近似，而Mistral7B实现了最强的情感和语义对齐，但过度估计了仇恨普遍性。看似合理的合成回复不一定再现公共话语的分布特性。

英文摘要

LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.

2605.28565 2026-05-28 cs.DL cs.AI cs.CL cs.IR 版本更新

Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

验证性误导：衡量搜索增强型大语言模型中的结构性引用失败

Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang, Dongha Lee

发表机构 * Department of Artificial Intelligence, Yonsei University（延世大学人工智能系）； Department of Computer Science and Engineering, Konkuk University（建国大学计算机科学与工程系）； Incheon International Airport Corporation（仁川国际机场公司）； Department of Computer Science and Engineering, Ewha Womans University（梨花女子大学计算机科学与工程系）

AI总结针对搜索增强型大语言模型中的引用可信度问题，提出CITETRACE数据集和三维评估框架，发现系统性“验证性误导”模式：模型引用真实可访问来源但存在意图对齐、来源适宜性或答案-来源忠实度缺陷，导致用户面临结构性误导。

Comments Work in Progress

AI中文摘要

搜索增强型大语言模型的用户依赖引用作为回答基于真实来源的证据，但很少自行验证引用的页面。每天数百万次查询通过这些系统，使得引用质量成为用户是被告知还是被误导的无声决定因素——然而现有基准各自孤立地处理一个方面，导致决定引用可信度的联合结构未被衡量。我们构建了CITETRACE，一个大规模数据集，追踪从用户查询到检索来源再到生成答案的完整引用链：来自28个社区的11,200个真实世界查询，与来自五个提供商的十个模型的112,000个回答配对，产生761,495个可评估的引用对。我们设计了一个三维评估框架，使用专家验证的预定义矩阵和五级忠实度标准，对每个引用在意图-目的对齐、来源适宜性和答案-来源忠实度上进行评分；该框架适用于任何产生带引用回答的系统。大规模应用该框架，我们识别出一种系统性的模式，称为验证性误导（VM）：模型引用真实、可访问的来源，但在一个或多个维度上失败，产生忠实度-适宜性权衡，其中忠实模型选择不合适的来源，反之亦然。在我们的池中，30.6%的引用扭曲了其来源，27.1%的引用源自领域不合适的来源；在回答层面，高达96%的用户至少遇到一个结构性误导的引用。提供商层面的差异解释了88-96%的引用质量方差，表明来源选择更多受超出单个模型能力的因素控制，而非LLM本身。总之，CITETRACE及其评估框架为诊断部署的搜索增强系统中的结构性引用失败提供了首个资源。

英文摘要

Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled, yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.
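
The dataset sizes quoted in the abstract can be cross-checked with simple arithmetic. The short Python sketch below uses only the figures stated above; the even split of queries across communities is an assumption made for illustration, since the abstract does not say how queries are distributed.

```python
# Figures quoted in the abstract.
queries = 11_200
communities = 28
models = 10
responses = 112_000
citation_pairs = 761_495

# Every query answered by every model gives the stated response count.
assert queries * models == responses

# Assumption: queries are evenly split across communities.
queries_per_community = queries / communities

# Average number of evaluable citation pairs per response.
citations_per_response = citation_pairs / responses

print(f"queries per community (if even): {queries_per_community:.0f}")
print(f"citation pairs per response: {citations_per_response:.2f}")
```

The counts are mutually consistent: ten models answering 11,200 queries yields exactly 112,000 responses, with an average of roughly 6.8 evaluable citation pairs per response.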

2605.28561 2026-05-28 cs.CL cs.LG 版本更新

Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

Soft-SVeRL: 基于软奖励的自验证强化学习

Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis

发表机构 * Cohere Labs（Cohere实验室）

AI总结针对部分可验证任务，提出基于检查表分解的软奖励框架Soft-RLVR及其自验证变体Soft-SVeRL，通过密集部分信用信号提升强化学习训练效果，并解决自验证中的奖励膨胀问题。

AI中文摘要

可验证奖励的强化学习（RLVR）在数学和代码等领域改进了语言模型，这些领域中正确性可以自动检查。然而，许多重要任务仅部分可验证：提示包含多个要求，响应可能满足其中一些但非全部，或者可能不存在单一的参考答案。我们引入Soft-RLVR，一个从分解的、学习的验证信号中进行强化学习的框架。Soft-RLVR将每个提示转换为原子要求的检查表，使用LLM验证器逐项评分候选响应，并在生成的软奖励上进行训练。基于检查表的奖励将稀疏的通过/失败监督转化为更密集的部分信用信号，但它们也引入了一个权衡：平均逐项判断可以减少验证器噪声，而部分信用可能奖励不完整的响应。我们形式化了这一权衡，并确定了基于检查表的验证比整体验证提供更可靠RL训练信号的条件。我们进一步引入Soft-SVeRL，这是Soft-RLVR的一个自验证变体，其中策略也充当验证器。我们表明，自验证容易因过于宽松的自我判断而导致奖励膨胀，并且需要显式稳定化以防止这种崩溃。在基于规则的ground-truth评估的受控指令遵循设置中，基于检查表的Soft-RLVR仅使用学习的验证器奖励就将IFEval提升了最多11.1分。我们的实验进一步表明，验证器质量和检查表质量都影响下游RL结果，并且显式稳定化对于有效的自验证至关重要。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
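
The contrast between a checklist-based soft reward and holistic pass/fail supervision can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the `Verifier` signature, the 0.5 pass threshold, and the keyword-matching `toy_verify` (a stand-in for an LLM verifier) are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical verifier signature: (response, requirement) -> score in [0, 1].
Verifier = Callable[[str, str], float]


def checklist_reward(response: str, checklist: Sequence[str], verify: Verifier) -> float:
    """Average item-level verifier scores into one soft reward in [0, 1]."""
    if not checklist:
        raise ValueError("checklist must contain at least one requirement")
    scores = [min(1.0, max(0.0, verify(response, item))) for item in checklist]
    return sum(scores) / len(scores)


def holistic_reward(response: str, checklist: Sequence[str], verify: Verifier) -> float:
    """Sparse pass/fail baseline: 1.0 only if every item passes (assumed threshold 0.5)."""
    return float(all(verify(response, item) >= 0.5 for item in checklist))


def toy_verify(response: str, requirement: str) -> float:
    """Toy stand-in for an LLM verifier: case-insensitive keyword match."""
    return 1.0 if requirement.lower() in response.lower() else 0.0


checklist = ["json", "three items", "no apologies"]
response = "Here is the JSON with three items."

soft = checklist_reward(response, checklist, toy_verify)
hard = holistic_reward(response, checklist, toy_verify)
print(f"soft reward: {soft:.2f}, holistic reward: {hard:.1f}")
```

Here the response satisfies two of three requirements, so it earns partial credit under the soft reward and nothing under the holistic one. This is the tradeoff the abstract mentions: a denser signal, but one that can reward incomplete responses.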

2605.28543 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Cultural Binding Heads in Language Models

语言模型中的文化绑定头

Avrile Floro, Luca Benedetto

发表机构 * Mistral-7B ； Mistral-Nemo-12B ； Llama-3.1-8B ； Gemma-2-9B

AI总结通过机制可解释性和析因设计，识别出8个语言模型中2-3个中间层注意力头对文化绑定有因果贡献，且绑定主要在预训练阶段形成，知识探测表明模型知道的知识远多于其行为表现。

AI中文摘要

大型语言模型通常默认对不同文化群体一视同仁，即使上下文需要区分：这缺乏差异意识。利用机制可解释性和Wang等人(2025)的N4文化挪用基准上的析因设计，我们在八个模型（四种架构，基础版和指令版）中识别出每个模型有2-3个中间层注意力头对文化绑定有因果贡献。文化绑定是将文化项目与适当身份关联的过程。敲除这些头上的身份到项目边会使绑定强度降低9-23%。识别出的头从指令模型转移到基础模型，表明文化绑定是在预训练阶段创建的。α缩放显示分级剂量反应，生成时适度放大引导（α=2-3）可将文化区分准确性提高1-3个百分点，同时基本保持中性推理不变。知识探测任务表明，模型知道的知识比其行为表现多3-5倍，表明瓶颈在于路由而非知识。

英文摘要

LLMs often default to equal treatment across cultural groups even when context warrants differentiation, a lack of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding across eight models (four architectures, base and instruct). Cultural binding is the process of associating cultural items with the appropriate identity. Knockout of the identity-to-item edges on these heads lowers the binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding is created at pre-training. $\alpha$-scaling shows a graded dose-response, and moderate amplification steering at generation ($\alpha = 2$-$3$) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing rather than knowledge.
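
The idea of scaling an attention head's contribution by a factor α can be illustrated generically. The sketch below assumes the common simplification that per-head output vectors are summed into the residual stream; the function `combine_heads` and the toy vectors are hypothetical, and the paper's actual intervention (which targets specific identity-to-item edges) is more fine-grained than this whole-head scaling.

```python
from typing import List, Sequence


def combine_heads(head_outputs: Sequence[Sequence[float]], alphas: Sequence[float]) -> List[float]:
    """Sum per-head output vectors, each scaled by its own alpha.

    alpha = 1 leaves a head unchanged, alpha = 0 knocks it out,
    and alpha > 1 amplifies its contribution.
    """
    if len(head_outputs) != len(alphas):
        raise ValueError("need exactly one alpha per head")
    dim = len(head_outputs[0])
    return [
        sum(a * h[i] for a, h in zip(alphas, head_outputs))
        for i in range(dim)
    ]


# Toy two-head example with 2-dimensional outputs (made-up values).
heads = [[1.0, 0.0], [0.5, 0.5]]

baseline = combine_heads(heads, [1.0, 1.0])   # both heads untouched
knockout = combine_heads(heads, [1.0, 0.0])   # second head removed
amplified = combine_heads(heads, [1.0, 2.0])  # second head doubled
print(baseline, knockout, amplified)
```

Sweeping the second head's α from 0 upward gives the kind of graded intervention the abstract calls a dose-response, although the behavioral effect can only be measured on a real model.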

2605.28534 2026-05-28 cs.CL (updated version)

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang

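Affiliations: School of Computer Science, Shanghai Jiao Tong University; Meituan; Zhejiang University; The Chinese University of Hong Kong

AI summary: Proposes GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through causal internalization and density-aware exemplar reselection, improving agents' understanding of GUI operations and their task success rates.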

Abstract

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rates. The code is available at https://github.com/Wuzheng02/GUI-CIDER.

2605.28526 2026-05-28 cs.AI cs.CL (updated version)

Entropy-aware Masking for Masked Language Modeling

Gokul Srinivasagan, Kai Hartung, Munir Georges

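Affiliations: AImotion Bavaria; Technische Hochschule Ingolstadt

AI summary: Proposes an entropy-based masking strategy that uses the model's predictive entropy to select informative tokens for masking, and introduces a self-masking method that improves training efficiency, with an average GLUE gain of 5%.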

Comments Accepted at the *SEM 2026 conference

Abstract

Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model's entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
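Entropy-based selection of mask positions can be sketched as follows. This is a minimal illustration that assumes the per-token predictive distributions are already available as probability lists and that the highest-entropy positions are masked; the function names, the top-k selection rule, and the default `mask_ratio` of 0.15 are assumptions for illustration, not details confirmed by the abstract.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_mask_positions(
    pred_dists: Sequence[Sequence[float]], mask_ratio: float = 0.15
) -> List[int]:
    """Return the positions with the highest predictive entropy, in ascending order."""
    n_mask = max(1, round(len(pred_dists) * mask_ratio))
    ranked = sorted(
        range(len(pred_dists)),
        key=lambda i: token_entropy(pred_dists[i]),
        reverse=True,
    )
    return sorted(ranked[:n_mask])


# Four token positions with toy predictive distributions over a 4-word vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform prediction, maximum entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
print(select_mask_positions(dists, mask_ratio=0.5))  # [1, 3]
```

In a self-masking setup the distributions would come from the model being trained rather than from an external reference model; that wiring is omitted here.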

2605.28521 2026-05-28 cs.CL (updated version)

ClinicalEncoder26AM: A Multilingual Diagnosable ColBERT Model; Evidences from the MultiClinNER Shared Task

François Remy

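Affiliations: Parallia AI

AI summary: Presents ClinicalEncoder26AM, a multilingual Diagnosable ColBERT model built on BGE-M3 and clinically post-trained with multi-adapter distillation and a ColBERT-style retrieval objective. Fine-tuned as a BIO tagger for the MultiClinNER task, it achieves state-of-the-art multilingual entity recall and a top-five character-weighted F1 score.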

Abstract

ClinicalEncoder26AM is a multilingual Diagnosable ColBERT for clinical and biomedical texts, which aligns its token-level semantics at multiple levels with ClinicalMap25, a clinical latent space inspired by BioLORD-2023 and enriched with synthetic and annotated supervision. The post-training recipe builds upon BGE-M3 and combines synthetic clinical notes, patient-doctor conversations, and annotated resources such as MedMentions, while considering both named-entity-level and sentence-level representations in a multi-adapter distillation, along with a ColBERT-style retrieval objective. In this system demonstration paper, we evaluate the model in the MultiClinNER shared task by fine-tuning it as a BIO tagger for patient symptoms, disorders, and procedure spans, using a lightweight two-layer CNN head to improve local boundary detection. The resulting system remains simple, processes most documents in a single 8192-token window, and achieves state-of-the-art multilingual entity recall, while ranking in the top five overall across all entity types and languages in character-weighted F1 scores. Training curves further show that ClinicalEncoder26AM is markedly more data-efficient than the base M3 model, supporting the usefulness of its clinical post-training for downstream information extraction. The model can be downloaded from https://huggingface.co/Parallia/ClinicalEncoder26AM-Diagnosable-Colbert-L2-for-multilingual-medical-texts

2605.28512 2026-05-28 cs.CL (updated version)

On Compositional Learning Behaviours in Formal Mathematics

Kevin Yandoka Denamganaï

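Affiliations: University of York

AI summary: Proposes the S2B-LM benchmark, which removes numerical processing as a confound and adds chain-of-thought scaffolding to evaluate compositional learning behaviours (CLBs), and finds that CLB competency is necessary but not sufficient for the hard tail of formal mathematical verification.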

Comments work in progress, under review

Abstract

Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs), that is, the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose S2B-LM, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit rather than merely probe latent CLB competency. Cross-evaluating ten Lean 4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, exact permutation tests establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F > 75%) is among the five highest CLB scorers (p = 0.004). After ruling out model scale as a confound, our results show that CLB competency is necessary but not sufficient for the hard tail of formal mathematical verification.

2605.28500 2026-05-28 cs.CL cs.AI cs.LG (updated version)

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

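Affiliations: CVS Health

AI summary: To address functionally incorrect LLM-generated code, proposes an uncertainty quantification method based on functional equivalence (functional entropy) that outperforms existing methods across multiple programming languages and models.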

Breaking the Script Barrier: Enabling Automatic Alignment for PoS-Based ASR Error Analysis in Non-Latin Scripts

Prasenjit K Mudi, Dahlia Devapriya, Sheetal Kalyani

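Affiliations: Indian Institute of Technology Madras

AI summary: Proposes a language-agnostic automatic alignment mechanism that makes PoS-based ASR error analysis reliable across Latin and non-Latin scripts, and applies it to several writing systems to improve WER.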

Abstract

Automatic Speech Recognition (ASR) systems are commonly evaluated using aggregate metrics such as Word Error Rate (WER), which do not capture the linguistic structure of errors. Fine-grained analysis, such as Part-of-Speech (PoS)-wise error characterization, requires accurate alignment between ASR hypotheses and reference transcriptions. However, existing alignment tools are often unreliable for languages written in non-Latin scripts. In this work, we address this gap by proposing a robust, automated, language-agnostic alignment mechanism applicable across ASR architectures and across languages written in both Latin and non-Latin scripts. This enables consistent alignment of hypotheses, references, and evaluation sequences, forming the basis for downstream linguistic analysis. Building on this, we employ standard PoS taggers to perform scalable and reproducible PoS-wise error analysis. Notably, we perform alignment and downstream ASR error analysis across three major segmented writing systems, namely, Abugida (Tamil, Hindi, Kannada), Alphabetic (English, Russian, Greek), and Abjad (Arabic). We further demonstrate how such error information can be leveraged during ASR training to improve metrics such as WER.
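To make concrete what aligning a hypothesis with a reference involves, the sketch below implements the standard word-level edit-distance alignment that underlies WER. It is not the paper's alignment mechanism, which the abstract does not describe; the function names and the whitespace tokenization are illustrative assumptions. Because it compares Unicode strings only for equality, the same code runs on any whitespace-segmented script.

```python
from typing import List, Tuple


def align_words(ref: List[str], hyp: List[str]) -> List[Tuple[str, str, str]]:
    """Align reference and hypothesis words with Levenshtein dynamic programming.

    Returns (op, ref_word, hyp_word) triples, where op is "ok", "sub",
    "del" (word missing from hyp) or "ins" (extra word in hyp).
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def wer(ref: List[str], hyp: List[str]) -> float:
    """Word Error Rate: non-matching operations divided by reference length."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return sum(op != "ok" for op, _, _ in align_words(ref, hyp)) / len(ref)


ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
print([op for op, _, _ in align_words(ref, hyp)])  # ['ok', 'ok', 'sub', 'ok', 'del', 'ok']
```

Once each reference word is paired with an operation, a PoS tag attached to the reference word can be used to count errors per tag, which is the kind of downstream analysis the abstract refers to.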

2605.28433 2026-05-28 cs.CL (updated version)

Roles with Rails: Contract-Preserving Role Evolution in Multi-Agent Structured Reasoning

Ling-Yue Ge, Lan-Zhe Guo

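Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University; School of Intelligence Science and Technology, Nanjing University

AI summary: Proposes SERO, a framework whose contract-preserving role evolution mechanism (credit-guided retrieval, a protected terminal aggregator, conditional validator repair, and a contextual-bandit controller) addresses role drift and contract breakage in multi-agent systems, validated on real-world reasoning benchmarks.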

Comments 33 pages, 23 figures, 12 tables

Abstract

Role-based LLM multi-agent systems need adaptive role pools, yet adapting such systems is not merely a matter of prompt optimization: roles often carry structural obligations, including capability coverage, message compatibility, validation, final-answer aggregation, and parser-compatible output protocols. Existing systems either fix the role inventory and lose adaptivity, or allow unconstrained generation to induce role drift, removing structurally necessary roles and breaking answer contracts. We formulate this as contract-preserving role evolution, requiring every committed edit to preserve five structural contracts (capability, communication, validation, aggregation, output protocol). We instantiate this formulation in SERO, a Self-Evolving Role Orchestration framework that evolves a typed role-card pool through credit-guided retrieval, a credit-ranked communication DAG with a protected terminal aggregator and conditional validator repair, and a contextual-bandit controller whose LLM-proposed edits are committed only when they preserve the contracts and improve task score. Experiments on real-world reasoning benchmarks across three LLM backbones confirm the value of contract-preserving role evolution.

2605.28424 2026-05-28 cs.CL (updated version)

Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Jiapeng Zhu, Jianxiang Yu, Yibo Zhao, Chengcheng Han, Qi Gu, Xunliang Cai, Xiang Li, Weining Qian

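Affiliations: East China Normal University; Meituan Longcat Team

AI summary: Proposes Skill0.5, a framework that separates general skill internalization from task-specific skill utilization and uses a dynamic, difficulty-aware router, improving performance on ALFWorld and WebShop in both in-distribution and out-of-distribution scenarios.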

Abstract

Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates skill treatments by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 streams tasks into distinct mastery tiers to apply tailored optimization strategies: it internalizes general skills via privileged distillation to build a cognitive foundation for hard tasks, while using diagnostic probing on easy tasks to penalize shortcuts and enforce specific skill utilization. Experiments on ALFWorld and WebShop demonstrate that Skill0.5 outperforms both memory-based and skill-based RL baselines, yielding performance improvements across both in-distribution and out-of-distribution scenarios.

2605.28389 2026-05-28 cs.CL (updated version)

FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

Haihui Pan, Junwei Bao, Hongfei Jiang, Yang Song

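Affiliations: Zuoyebang Education Technology

AI summary: Proposes FABSVer, which fuses solution generation and self-verification into a single generation pass and introduces Dynamic Reference Model Update (DRMU) to break through the reward bottleneck, achieving better self-verification and reasoning performance at three model scales with only 51%-71% of the training time of existing methods.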

Abstract

While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51%--71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.

2605.28375 2026-05-28 cs.CL (updated version)

PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature

An Dao, Nhan Ly, Thao Tran, Yuji Matsumoto, Akiko Aizawa

Affiliations: The University of Tokyo; Medical Doctor, Independent Researcher; Center for Language AI Research, Tohoku University; RIKEN Center for Advanced Intelligence Project; National Institute of Informatics

AI summary: Builds PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information, with 317 abstracts and 15 coarse-grained and 31 fine-grained entity types, and evaluates supervised and zero-shot models on it.

Comments 29 pages, 5 figures, accepted at ACL 25th Workshop on Biomedical Language Processing (BioNLP 2026)

Abstract

Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/.

2605.28365 2026-05-28 cs.AI cs.CL cs.LO (updated version)

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar

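Affiliations: Imperial College London; Huawei Noah's Ark Lab; UCL Centre for AI

AI summary: To address the sparse and often unfaithful signal that Lean gives when judging natural-language mathematical answers, proposes the COVCAL selector, which uses finite-sample selective risk control to guarantee the accuracy of accepted answers when autoformalization coverage is high enough.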

Abstract

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact rather than a wrong answer. On MATH-500 we show that this signal is (i) sharply coverage-dependent, in that the proof-winning answer is correct 96% of the time at high proved coverage but only 20% of the time at low coverage, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that either certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and makes certification feasible on 17 of 20 partitions, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
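The idea of certifying a risk bound or abstaining can be illustrated with a generic selective-risk sketch. It is not COVCAL itself: the abstract does not specify the bound or the diagnostics used, so this example assumes a Hoeffding upper confidence bound on the error rate of accepted answers and a Bonferroni correction over a small grid of candidate score thresholds. All function names, the scores, and the threshold grid are hypothetical.

```python
import math
from typing import Optional, Sequence


def risk_upper_bound(errors: int, n: int, delta: float) -> float:
    """Hoeffding upper confidence bound on the error rate of n accepted answers."""
    return errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def choose_threshold(
    scores: Sequence[float],
    correct: Sequence[bool],
    thresholds: Sequence[float],
    target_risk: float,
    delta: float,
) -> Optional[float]:
    """Return the most permissive threshold whose bound meets target_risk, or None to abstain.

    Bonferroni correction: each candidate threshold is tested at level
    delta / len(thresholds), so the guarantee holds for all of them at once.
    """
    per_test_delta = delta / len(thresholds)
    feasible = []
    for t in thresholds:
        accepted = [c for s, c in zip(scores, correct) if s >= t]
        if not accepted:
            continue
        errors = sum(1 for c in accepted if not c)
        if risk_upper_bound(errors, len(accepted), per_test_delta) <= target_risk:
            feasible.append(t)
    return min(feasible) if feasible else None


# Dense signal: 200 confident, correct answers plus 100 uncertain ones (half wrong).
scores = [0.9] * 200 + [0.5] * 100
correct = [True] * 200 + [True, False] * 50
print(choose_threshold(scores, correct, [0.4, 0.8], target_risk=0.1, delta=0.1))  # 0.8

# Sparse signal: only 10 calibration answers, so no bound is tight enough.
print(choose_threshold([0.9] * 10, [True] * 10, [0.4, 0.8], target_risk=0.1, delta=0.1))  # None
```

The second call mirrors the feasibility point in the abstract: with too few usable calibration examples the conservative bound cannot reach the target risk, so the selector abstains even though every observed answer was correct.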

2605.28363 2026-05-28 cs.CL (updated version)

PubMedCausal: A Span-Level Annotated Corpus for Causal Relation Extraction in Biomedical Text

Ifeoluwa Kunle-John, Josiah Paul, Oluwatosin Agbaakin, Peter Aina, Ikenna Odezuligbo, Sydney Anuyah

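Affiliations: Edyah Limited; University of Ibadan; Indiana University; Creighton University

AI summary: Existing resources conflate causal relations with broader associations, restrict annotation to the sentence level, or focus only on explicit causal cues. To address this, the paper builds PubMedCausal, a span-level causal relation extraction corpus from PubMed abstracts with 30,000 paragraph-level rows, 3,945 causal rows, and 6,491 adjudicated cause-effect pairs, and benchmarks discriminative encoders and open-source generative models.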

Comments Submitted to EMNLP 2026, 8 Pages, 23 page appendix

Abstract

Causal relation extraction (CRE) is central to biomedical text mining, but current resources often conflate causal relations with broader associations, restrict annotation to sentence-level examples, or focus mainly on explicit causal cues. This limits their usefulness for evaluating whether models can recover causal claims as they are actually expressed in biomedical text. We introduce PubMedCausal, a span-level annotated corpus for biomedical CRE built from PubMed abstracts. The corpus contains 30,000 paragraph-level rows, including 3,945 causal rows and 6,491 adjudicated cause-effect pairs. Each causal relation is annotated with full-text cause and effect spans, causality type, and sententiality, enabling evaluation of both causal detection and full-span causal extraction. We benchmark discriminative encoders and open-source generative models across detection and extraction settings. For causal detection, biomedical encoders are strongest, with PubMedBERT reaching an F1 score of 0.7391. For span-level extraction, the best generative baseline is DeepSeek-R1-32B with few-shot prompting, reaching a Cosine Pair F1 of 0.6765. We further test transfer learning by evaluating PubMedCausal-trained encoders on external causal relation datasets, showing that the resource supports cross-dataset evaluation. Our results show that biomedical CRE remains difficult under class imbalance, long causal spans, implicit causality, inter-sentential relations, and prompt sensitivity. Code and data are available at https://github.com/josiahpaul07/PubMedCausal_Exp

2605.28346 2026-05-28 cs.CL (updated version)

When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

Marcell Fekete, Johannes Bjerva, Tamás Káldi

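Affiliations: Department of Computer Science, Aalborg University; Department of Psycholinguistics and Neurolinguistics, ELTE Research Centre for Linguistics; ELTE Bárczi Gusztáv Faculty of Special Needs Education

AI summary: Studies whether vision-language models distinguish discourse-old Topics from discourse-new Foci in visual question answering, and finds that models produce constructions relevant to information structure but over-regularise, favouring narrow response templates in a way that resembles mode collapse.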

Abstract

Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using information structure (IS), testing whether VLMs distinguish discourse-old Topics from discourse-new Foci in visually grounded question answering. We exploit Hungarian, a language in which Topic and Focus map onto dedicated syntactic positions, making IS choices observable in text. Comparing six VLMs with human participants, we find that models produce IS-relevant constructions, but over-regularise this sensitivity. Under the interacting pressures of discourse status, grammatical role (preference for subject Topics) and definiteness (preference for indefinite Foci), humans choose variable strategies for IS realisation. VLMs, by contrast, collapse onto narrow response templates, resembling mode collapse (Kirk et al., 2024). Our findings suggest that VLM evaluation should look beyond content accuracy to how content is packaged for the discourse.

2605.28315 2026-05-28 cs.CL (updated version)

HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

Zheng Li, Mao Zheng, Mingyang Song, Tianxiang Fei

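Affiliations: Large Language Model Department, Tencent

AI summary: To address the saturation of existing Chinese-English translation benchmarks, proposes HardMTBench, a difficulty-aware diagnostic benchmark built with a multi-stage pipeline and a hardness fusion rule. Across 12 knowledge-intensive domains it markedly widens the performance spread between systems and exposes terminology and knowledge weaknesses.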

Abstract

General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge-intensive domains such as finance, healthcare, law, and science and technology. We introduce HardMTBench, a difficulty-aware diagnostic benchmark for bidirectional Chinese-English domain translation. HardMTBench covers 12 domains and contains 10,000 hand-curated source sentences with reference translations, packaged as 20,000 directional test items. A three-stage construction pipeline builds a domain-balanced candidate pool of 84,566 pairs, applies an LLM-based multi-signal judge over knowledge density, translation difficulty, terminology load and reference correctness, and assembles the final test set under a hardness fusion rule with per-domain quotas. Across 22 systems spanning general LLMs, commercial engines and specialised MT models, HardMTBench widens the cross-system GEMBA range by roughly a factor of two over FLORES-200, induces visible rank reorderings, and exposes domain-specific terminology and knowledge weaknesses that quality-only metrics tend to flatten. All data and code are open-sourced at https://github.com/jasonNLP/HardMTBench.

2605.28313 2026-05-28 cs.CL (updated version)

Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

Nicolás Benjamín Ocampo, Agnes Paullate Nyiranziza, Davide Ceolin

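Affiliations: Centrum Wiskunde & Informatica, The Netherlands; Vrije Universiteit Amsterdam, The Netherlands

AI summary: Uses 12 open-weight large language models to assess argument quality through pairwise comparisons and a Bradley-Terry model, finding that Llama-70B agrees moderately with human expert judgments (Cohen's κ = 0.493) while other models vary in performance but are complementary.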

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought settings to approximate expert pairwise comparisons of argument quality across three dimensions (logical, rhetorical, and dialectic), and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our results show that LLMs have a promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching a moderate Cohen's κ of 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477). Other LLMs exhibit weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts, suggesting partial but complementary understanding of underlying quality dimensions despite differences in model size and family. Moreover, LLM predictions are stable across trial runs, with fewer than 7.75% of cases yielding different labels. Remaining variability is handled via majority voting and few-shot prompting for large models.
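Inferring latent strength scores from pairwise comparisons can be sketched with the classic minorization-maximization (MM) update for the Bradley-Terry model. This is a generic illustration, not the authors' code: the function name, the `(winner, loser)` input format, and the toy comparisons are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def fit_bradley_terry(comparisons: List[Tuple[str, str]], iters: int = 200) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from (winner, loser) pairs with the MM update.

    Strengths are normalised to sum to 1. The estimate is only well behaved
    when every item has at least one win and one loss.
    """
    if not comparisons:
        raise ValueError("at least one comparison is required")
    items = sorted({name for pair in comparisons for name in pair})
    wins = Counter(winner for winner, _ in comparisons)
    # Number of times each unordered pair was compared.
    n_pair = Counter(tuple(sorted(pair)) for pair in comparisons)

    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            denom = 0.0
            for (a, b), n in n_pair.items():
                if i == a or i == b:
                    other = b if i == a else a
                    denom += n / (strength[i] + strength[other])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {i: v / total for i, v in updated.items()}
    return strength


# Toy judgments over three arguments: A usually beats B and C, B usually beats C.
comparisons = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 2 + [("C", "B")]
    + [("A", "C")] * 2 + [("C", "A")]
)
strength = fit_bradley_terry(comparisons)
print(sorted(strength, key=strength.get, reverse=True))  # ['A', 'B', 'C']
```

Sorting by the fitted strengths gives the argument ranking; in the setting described above, the comparisons would come either from human experts or from an LLM's pairwise preferences.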

2605.28308 2026-05-28 cs.CL (updated version)

HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment

Yoonjin Jang, Junwoo Kim, Youngjoong Ko

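Affiliations: SungKyunKwan University

AI summary: Existing entity alignment benchmarks let models rely on name overlap rather than relational structure. The paper proposes a same-name hard-negative augmentation strategy that yields quality-controlled evaluation benchmarks and training corpora, and designs HELEA, a two-stage framework (entity-encoder retrieval followed by LLM reranking) that aligns entities robustly on the hard-negative benchmarks.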

Comments 10 pages, 3 figures, 9 tables. Code and benchmarks available at https://github.com/Wnsdnl/HELEA

详情

AI中文摘要

实体对齐（EA）对于知识图谱（KG）融合至关重要，但现有基准通常允许模型利用名称重叠而非关系结构。这使得难以评估模型是否能拒绝指向不同现实世界对象的同名实体。我们的主要贡献是一种同名的硬负样本增强策略，通过从KG名称冲突组中挖掘同名但不同的实体对，同时生成质量可控的评估基准（DW-HN29K、DY-HN27K）和增强训练语料（DW-Train、DY-Train）。我们进一步引入HELEA，一个两阶段框架，整合了（i）在硬负样本增强训练语料上训练的实体编码器检索（使用1跳KG上下文），以及（ii）无需额外训练的基于LLM的重排序。实验表明，依赖名称的基线在我们的硬负样本基准上性能下降至接近随机，而HELEA在DW-HN29K上达到F1 0.967，同时在标准DW-15K上保持Hit@1 0.993。

英文摘要

Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject same-name entities that refer to different real-world objects. Our primary contribution is a same-name hard-negative augmentation strategy that mines same-name but distinct entity pairs from KG name-collision groups, simultaneously yielding quality-controlled evaluation benchmarks (DW-HN29K, DY-HN27K) and augmented training corpora (DW-Train, DY-Train). We further introduce HELEA, a two-stage framework integrating (i) entity encoder retrieval trained on hard-negative-augmented training corpora with 1-hop KG context, and (ii) LLM-based reranking without additional training. Experiments show that name-dependent baselines collapse to near-random performance on our hard-negative benchmarks, whereas HELEA achieves an F1 of 0.967 on DW-HN29K while maintaining a Hit@1 of 0.993 on standard DW-15K.

2605.28306 2026-05-28 cs.CL cs.AI 版本更新

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

面向混合专家模型中多语言下游任务的路由对齐微调

Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong, Sichun Luo, Linqi Song

发表机构 * City University of Hong Kong（香港城市大学）； Carnegie Mellon University（卡内基梅隆大学）； The University of Hong Kong（香港大学）

AI总结针对混合专家模型在多语言下游任务中的路由结构异构问题，提出RA-MoE三阶段框架，通过中间层语言通用对齐区识别任务相关专家，并引入路由对齐损失增强目标语言路由，实验表明该方法优于标准微调和强基线。

详情

AI中文摘要

混合专家（MoE）模型已成为高效扩展LLM的主流范式，但将其适配到非英语下游任务仍然具有挑战性。现有的微调方法将MoE模型视为整体学习器，忽略了预训练期间形成的异构路由结构。我们在多个MoE模型和下游任务上验证，中间层形成了语言通用对齐区，其中路由发散性强烈预测了每种语言的任务性能差距。基于这一观察，我们提出了RA-MoE（路由对齐MoE微调），一个三阶段框架，该框架根据英语和目标语言的正确性将并行任务示例分类为四路分类法（cc/ci/ic/ii），识别中间层中与任务相关的专家，并用路由对齐损失增强标准SFT，该损失鼓励ci类型示例上的目标语言路由遵循英语任务专家激活模式。在三个MoE模型、三个任务和六种目标语言上的实验表明，RA-MoE始终优于标准SFT和强基线（包括Routing Steering和RISE），其中任务-语言对的ci比例可作为对齐收益的可靠预测指标。

英文摘要

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.

2605.28305 2026-05-28 cs.CL cs.AI 版本更新

Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

重新审视大语言模型推理中的拟人化反思标记

Yahan Yu, Noa Nakanishi, Fei Cheng

发表机构 * Kyoto University（京都大学）

AI总结本文通过提示级和令牌级干预抑制拟人化反思标记，发现这些标记并非推理性能的必要条件，且抑制后模型仍能进行无标记验证，表明它们更多是表面线索而非可靠反思代理。

Comments 15 pages, 12 figures

详情

AI中文摘要

大语言模型（LLMs）在复杂推理过程中经常产生显式的反思痕迹，并伴随有拟人化标记，如“wait”、“hmm”和“alternatively”。尽管这些标记通常被用作反思的可见指标，但其机制仍不清楚，这带来了与冗余和重复反思标记相关的过度思考风险。在这项工作中，我们重新审视了拟人化反思标记，考察了它们对推理的必要性以及在反思中的作用。我们通过提示级和令牌级干预抑制这些标记，并分析了它们对四个基准测试和两种模型规模的任务性能的影响。我们的结果表明，拟人化标记对于推理性能并非普遍必要：抑制它们可以在多种设置下保持或提高性能，尤其是在较大的采样预算下。同时，标记抑制并不一定消除反思行为，因为模型仍然可以进行无标记验证。这些结果表明，拟人化标记更倾向于表面线索，而不是反思本身的可靠代理，并激励未来在显式标记模式之外对推理机制进行研究。

英文摘要

Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as "wait", "hmm", and "alternatively". Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, which leaves a risk of overthinking associated with redundant and repetitive reflection markers. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and their role in reflection. We suppress these markers through prompt-level and token-level interventions, and analyze their effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These results suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and motivate future research on reasoning mechanisms beyond explicit marker patterns.

2605.28295 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Rollouts 的起点：面向 RLVR 的低负载、高杠杆的首 token 多样化

Soeun Kim, Albert No

发表机构 * Department of Artificial Intelligence, Yonsei University（延世大学人工智能系）

AI总结本文提出 REFT 方法，通过在推理标记后的第一个 token 处进行均匀采样多样化，以低开销显著提升 RLVR 中 rollout 的多样性，从而改善推理模型的 Pass@k 性能。

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）无需标注轨迹即可训练推理模型，它依赖分组 rollout 将策略暴露于替代推理路径，并由验证器进行评分。Rollout 多样性因此成为 RLVR 的核心瓶颈，现有方法大多通过温度、前缀或 rollout 选择调整来拓宽探索。我们发现了一个结构上独特但被忽视的拓宽多样性的位置：推理标记后的第一个 token。策略的首 token 分布表现出尖锐峰值但正确性解耦的现象，且该首 token 位置可以拓宽 rollout 组覆盖的区域而不改变正确性信号。我们引入 REFT（基于首 token 多样化的 Rollout 探索），这是对 RLVR 流程的一个轻量级补充，它从策略自身的 top-$N$ 候选集中均匀采样首 token，并均匀分配 rollout，其他组件保持不变。在由此产生的多样化 rollout 上训练后，REFT 在四个基础模型（0.5B-7B）和三个难度级别上，相较于 DAPO 和 GRPO 基线，提升了聚合的 Pass@1、Pass@8 和 Pass@64。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
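The first-token step described above can be illustrated with a small sketch. This is an assumption-laden toy, not the REFT implementation: the function `allocate_first_tokens`, the example token strings, and the deterministic even split (in place of the paper's uniform sampling) are all hypothetical.

```python
def allocate_first_tokens(first_token_probs, top_n, group_size):
    """Pick the policy's top-N first tokens and split a rollout group evenly.

    first_token_probs: dict mapping candidate first token -> policy probability.
    Returns a dict mapping each chosen first token -> number of rollouts
    that should start with it. Hypothetical helper for illustration.
    """
    top = sorted(first_token_probs, key=first_token_probs.get, reverse=True)[:top_n]
    base, extra = divmod(group_size, len(top))
    # The first `extra` tokens absorb the remainder so counts sum to group_size.
    return {tok: base + (1 if k < extra else 0) for k, tok in enumerate(top)}


# Made-up, sharply peaked first-token distribution after the reasoning marker.
probs = {"First": 0.70, "To": 0.15, "Let": 0.10, "We": 0.05}
allocation = allocate_first_tokens(probs, top_n=3, group_size=8)
print(allocation)  # {'First': 3, 'To': 3, 'Let': 2}
```

Each rollout would then be generated normally from its assigned first token, so the rest of the RLVR pipeline stays unchanged, as the abstract describes.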

2605.28292 2026-05-28 cs.CL 版本更新

CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

CIRF：将思维链分词化为可重用的功能单元，用于大型语言模型的高效潜在推理

Yukyung Lee, Yumeng Shen, Jinhyeong Park, Hyein Yang, Jun-Hyung Park

发表机构 * Boston University（波士顿大学）； Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出CIRF框架，通过将显式思维链中的语义连贯推理单元映射为离散功能令牌，实现动态序列推理，在数学、符号和常识推理基准上取得优于现有隐式CoT方法的准确率-延迟权衡。

Comments 17 pages, 7 figures

详情

AI中文摘要

隐式思维链通过内化显式理由来降低大型语言模型的推理成本。然而，现有方法通常缺乏与显式理由的对齐以及对示例复杂性的适应性。在这项工作中，我们提出了CIRF（思维链转化为可重用功能单元），一个隐式CoT框架，将推理作为离散功能令牌的动态序列进行。CIRF为显式CoT轨迹中的每个语义连贯推理单元分配一个功能令牌。然后对模型进行微调，以自回归方式生成功能令牌及其可选结果，随后生成最终答案。这种设计将潜在推理与功能单元序列对齐，促进了并行训练、显式理由对齐和自适应推理。在数学、符号和常识推理基准上的大量实验表明，与最先进的隐式CoT方法相比，CIRF提供了有利的准确率-延迟权衡。进一步的分析表明，CIRF构建了独特、可解释的功能令牌，从而带来一致的性能提升。

英文摘要

Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example complexity. In this work, we propose CIRF (Chain-of-thoughts Into Reusable Functional units), an implicit CoT framework that performs reasoning as a dynamic sequence of discrete functional tokens. CIRF assigns a functional token to each semantically coherent reasoning unit in explicit CoT traces. The model is then fine-tuned to autoregressively generate functional tokens and their optional results, followed by the final answer. This design aligns latent reasoning with a sequence of functional units, facilitating parallel training, explicit rationale alignment, and adaptive reasoning. Extensive experiments on mathematical, symbolic, and commonsense reasoning benchmarks show that CIRF provides a favorable accuracy-latency trade-off compared with state-of-the-art implicit CoT methods. Further analyses demonstrate that CIRF constructs distinct, interpretable functional tokens, leading to consistent performance improvements.

2605.28283 2026-05-28 cs.CL cs.AI 版本更新

PrunePath: Towards Highly Structured Sparse Language Models

PrunePath：迈向高度结构化稀疏语言模型

Zhexuan Gu, Zixun Fu, Yancheng Yuan

发表机构 * Department of Applied Mathematics, The Hong Kong Polytechnic University（应用数学系，香港理工大学）

AI总结提出PrunePath框架，通过softmax归一化路由和累积概率质量阈值实现自适应预算的结构化稀疏化，在自然语言理解、生成和指令调优中取得优越的稀疏-性能权衡，并利用Triton内核将结构化稀疏转化为实际内存节省和解码速度提升。

详情

AI中文摘要

前馈网络（FFN）主导了现代语言模型的参数数量和计算量，然而现有的剪枝方法往往难以将稀疏性转化为硬件友好的推理效率提升。我们引入了PrunePath，一个针对FFN层的预算自适应结构化稀疏化框架。基于MoEfication，PrunePath用softmax归一化的路由分布替代独立的专家级阈值，并在累积概率质量阈值下激活重要专家。这种形式施加了令牌级概率预算，实现了自适应专家数量，以及基于单个检查点、可在推理时直接调节的稀疏度旋钮。在自然语言理解、自然语言生成和指令调优评估中，与现有的静态剪枝和基于MoEfication的方法相比，PrunePath实现了有利的稀疏-性能权衡。我们进一步实现了用于KV缓存解码的Triton内核，以将所得的结构化稀疏性转化为实际的内存节省和可测量的解码速度提升。这些结果证明了PrunePath在构建高度稀疏、易于部署的大型语言模型方面的优越性能。

英文摘要

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
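The cumulative-mass rule can be sketched as a top-p style selection over a softmax routing distribution. The sketch below is an illustration under assumptions, not the PrunePath code: `select_experts`, the example logits, and the 0.9 threshold are hypothetical.

```python
import math


def select_experts(router_logits, mass_threshold=0.9):
    """Return the smallest set of experts whose softmax mass reaches the threshold.

    Experts are taken in order of decreasing routing probability until their
    cumulative probability is at least `mass_threshold`. Hypothetical helper.
    """
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= mass_threshold:
            break
    return chosen, probs


# A confident token activates one expert; an uncertain token activates all four.
peaked, _ = select_experts([5.0, 0.0, 0.0, 0.0], mass_threshold=0.9)
flat, _ = select_experts([0.0, 0.0, 0.0, 0.0], mass_threshold=0.9)
print(len(peaked), len(flat))  # 1 4
```

Lowering `mass_threshold` at inference time activates fewer experts per token, which is one way to read the "sparsity knob from a single checkpoint" mentioned in the abstract.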

2605.28255 2026-05-28 cs.AI cs.CL cs.HC 版本更新

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

AI，掌舵吧：是什么驱动人机协作问答中的委托与信任？

Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber

发表机构 * University of Maryland（马里兰大学）； University of California（加州大学）； MBZUAI

AI总结通过问答游戏实验，研究人类在何时以及为何选择委托AI或采纳其建议，发现人类存在对AI正确建议的低依赖（3.9%）和错误建议的过度依赖（1.7%），并受确认偏见影响，建议通过校准置信度、基于证据的解释和信任细化机制来改进人机协作。

Comments Findings of the Association for Computational Linguistics, 2026

详情

AI中文摘要

AI系统并非完美无缺，人类在决定是否信任AI而非自身判断时也可能犯错。因此，改善人机协作需要理解人类何时、为何以及如何决定依赖AI。我们研究了两种不同的依赖决策：委托选择——在不知道AI输出结果的情况下决定何时让AI自主行动，以及采纳选择——评估AI建议并决定如何使用它们。这两种解耦的依赖模式塑造了协作，但先前的工作很少在现实环境中对同一用户同时研究它们。我们通过研究在问答游戏中竞争的人机协作团队来填补这一空白，游戏中人类可以选择何时以及如何与AI代理合作以获胜。我们的24场比赛匹配了23位专家人类和16个AI代理，捕获了387次委托决策和1440次采纳决策。虽然人机协作表现优于单独的AI或人类，但人类做出了次优的协作决策，既对正确的AI建议低依赖（错失3.9%的机会），又在AI误导时过度依赖（1.7%）。双方都贡献了错误答案：当人类和AI意见不一致时，报告的模型置信度接近随机水平，而确认偏见导致当AI建议与人类初始错误答案一致时，低依赖率更高（64.5%）。为缩小这一差距，我们建议采用校准的置信度、基于证据的解释以及帮助用户细化信任的机制。

监督语义差异法用于跨文化概念分析：以人类情感为例

Jan Sikora, Paweł Lenartowicz, Hubert Plisiecki

发表机构 * University of Warsaw（华沙大学）； Society for Open Science（开放科学协会）； Centre for Brain Research, Jagiellonian University（雅盖隆大学脑研究中心）； IDEAS Research Institute（IDEAS研究院）

AI总结本文提出跨语言监督语义差异法（SSD），通过对齐的多语言词嵌入比较语义维度，并以波兰语、英语和法语情感规范词汇为例，验证了情感维度的跨语言可恢复性及文化差异。

Comments 9 pages, 2 figures, excluding the appendices. Code to reproduce our results is available at https://github.com/przebor/Cross-Cultural-SSD

详情

AI中文摘要

跨文化比较心理意义需要超越词汇层面的翻译，并考察语义维度在不同语言中的组织方式。我们提出了监督语义差异法（SSD）的跨语言扩展，该方法在嵌入空间中估计监督语义梯度，并在对齐的多语言词嵌入之间进行比较。该方法通过置换检验和自助法区间检验梯度对齐性和差异，并通过围绕差异梯度的聚类解释残差差异。我们在波兰语、英语和法语情感规范词汇上展示了该方法，对效价、唤醒度和优势度（如可用）进行建模。情感维度在语言和模型设置中显著可恢复。跨语言比较显示出广泛的对齐性以及结构化的残差差异：效价似乎是共享的，而唤醒度和优势度产生了更多可解释的对比，涉及身体威胁、审美刺激、内部情感性、宏观权威和日常控制。几个聚类也反映了语料库特定的伪影，强调了谨慎解释的必要性。跨语言SSD提供了一个可解释的框架，用于测试语义对齐性、识别差异，并生成关于心理意义跨文化差异的假设。

英文摘要

Cross-cultural comparison of psychological meaning requires methods that go beyond word-level translation and examine how semantic dimensions are organized across languages. We introduce a cross-lingual extension of the Supervised Semantic Differential (SSD), which estimates supervised semantic gradients in embedding space and compares them across aligned multilingual word embeddings. The method tests gradient alignment and difference using permutation procedures and bootstrap intervals, and interprets residual differences through clustering around the difference gradient. We demonstrate the approach on Polish, English, and French affective norm lexicons, modeling Valence, Arousal, and Dominance where available. Affective dimensions were significantly recoverable across languages and model settings. Cross-lingual comparisons showed broad alignment together with structured residual differences: Valence appeared mostly shared, whereas Arousal and Dominance produced more interpretable contrasts involving bodily threat, aesthetic stimulation, internal emotionality, macro-level authority, and everyday control. Several clusters also reflected corpus-specific artifacts, underscoring the need for cautious interpretation. Cross-lingual SSD offers an explainable framework for testing semantic alignment, identifying divergence, and generating hypotheses about cross-cultural differences in psychological meaning.

2605.28222 2026-05-28 cs.CL cs.IR cs.LG 版本更新

Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation

使用LoRA适配分析技术文档RAG助手中的质量-延迟-资源权衡

Evgenii Palnikov, Elizaveta Gavrilova

发表机构 * HSE University（俄罗斯高等经济大学）

AI总结本研究通过LoRA适配器在RAG系统中分析质量、延迟和资源之间的权衡，发现仅对q和v注意力投影进行适配的配置在帕累托前沿占优。

Comments 13-page main body plus extended appendix; 6 figures; benchmark, LoRA adapters, and code at https://github.com/EugPal/rag-lora-tradeoffs

详情

AI中文摘要

我们研究了在基于文档的检索增强生成（RAG）系统中对生成器使用低秩适配（LoRA）时的质量-延迟-资源权衡。我们构建了一个包含5,144个问答对的手动验证基准测试，这些问答对基于官方Kubernetes文档，并将其与固定的混合检索流水线（BGE-M3稠密检索、BGE-M3原生稀疏检索、倒数排名融合、交叉编码器重排序）结合。在此基准测试上，我们在Llama-3.2-3B-Instruct和Llama-3.1-8B-Instruct上对20种LoRA配置进行了消融实验，涉及秩和目标模块的选择，并评估了每个配置的token级F1、LLM评判的事实依据性（groundedness）和正确性（pass@4）、推理延迟、推理内存和训练成本，所有结果均附有bootstrap 95%置信区间。帕累托分析表明，仅作用于q和v注意力投影的LoRA适配器始终主导前沿，而3B/8B的选择主要定义了操作区间。参数匹配的控制比较进一步表明，q/v优势是结构性的而非纯粹参数性的。基准测试、选定的适配器和代码可在https://github.com/EugPal/rag-lora-tradeoffs获取。

英文摘要

We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid-retrieval pipeline (BGE-M3 dense, BGE-M3 native sparse, Reciprocal Rank Fusion, cross-encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct across rank and target-module choices, and evaluate each on token-level F1, LLM-judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines operating regime. A param-matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at https://github.com/EugPal/rag-lora-tradeoffs.
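Reciprocal Rank Fusion, the step that merges the dense and sparse retrieval lists in the pipeline above, can be sketched in a few lines. The constant `k=60` is a commonly used default and an assumption here (the abstract does not state the value used); the function name and document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Returns IDs sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up outputs of a dense and a sparse retriever for the same query.
dense_hits = ["d1", "d2", "d3"]
sparse_hits = ["d2", "d3", "d1"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['d2', 'd1', 'd3']
```

Because RRF uses only ranks, it needs no score calibration between the dense and sparse retrievers. In a pipeline like the one described, the fused list would then go to the cross-encoder reranker.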

2605.28218 2026-05-28 cs.CL 版本更新

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

IFMTBench：多语言翻译指令遵循的综合基准

Mingrui Sun, Mao Zheng, Zheng Li, Mingyang Song

发表机构 * Large Language Model Department, Tencent（腾讯大语言模型部）

AI总结提出IFMTBench基准，涵盖7种语言、4506个单约束和2838个多约束项，通过确定性检查器和基于LLM的评分器评估翻译指令遵循能力，揭示指令遵循随模型规模增长快于翻译质量，且术语表和结构化格式约束难度最高。

Comments 11 pages, 6 figures, conference

详情

AI中文摘要

现代翻译工作流程要求的不只是语义等价。用户通常要求模型保留JSON或HTML模式、遵循精心策划的术语表、利用提供的上下文消除歧义，并匹配指定的语域，往往同时满足多个条件。传统的BLEU和xCOMET等指标能捕捉语义保真度，但对约束遵循的指示甚少，而一般的指令遵循基准则忽略了翻译的跨语言性质。我们引入了IFMTBench，一个涵盖七种语言的多语言翻译指令遵循基准，包含4506个单约束和2838个多约束项，跨越六个约束维度和五种组合模式，指令以所有七种语言发出。约束分为由确定性检查器验证的门控子集和由基于评分规则的LLM评判器评分的连续子集，通过乘法规则组合以抵抗奖励投机（reward hacking）。评估15个模型揭示了先前协议遗漏的系统性差距：指令遵循随模型规模增长比翻译质量更显著，术语表和结构化格式约束主导了难度梯度，而一般指令遵循排名与翻译行为的相关性很弱。我们的基准可在https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench获取。

英文摘要

Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction-following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction-following rankings correlate only weakly with translation behavior. Our benchmark is available at https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench.
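One plausible reading of the multiplicative rule is that any failed gating check zeroes the item score, so a high rubric score cannot compensate for a broken hard constraint. The sketch below encodes that reading only. The exact formula is not given in the abstract, so `combined_score`, the use of a mean over rubric scores, and the example values are all assumptions.

```python
def combined_score(gate_results, rubric_scores):
    """Combine hard (gating) checks with soft (rubric) scores multiplicatively.

    gate_results: booleans from deterministic checkers (e.g. "output is valid JSON").
    rubric_scores: floats in [0, 1] from an LLM judge.
    Assumed rule: score = (1 if all gates pass else 0) * mean(rubric_scores).
    """
    gate = 1.0 if all(gate_results) else 0.0
    quality = sum(rubric_scores) / len(rubric_scores) if rubric_scores else 1.0
    return gate * quality


# Made-up items: same rubric scores, but the second breaks a format constraint.
print(combined_score([True, True], [0.5, 1.0]))   # 0.75
print(combined_score([True, False], [0.5, 1.0]))  # 0.0
```

Under this rule a fluent translation that drops a required glossary term or breaks the JSON schema scores zero, which is what makes the combination hard to game with judge-pleasing text alone.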

2605.28215 2026-05-28 cs.AI cs.CL cs.LG cs.LO cs.MA 版本更新

Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

解释比单独预测更难：评估作为ICL视觉分类器的MLLM的基于概念的解释

Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

AI总结本文通过五种形式化程度递增的条件，系统评估多模态大语言模型在少样本上下文学习中的基于概念的可解释性，发现解释比预测更难，且强制生成形式化解释会降低预测准确性。

Comments Accepted to the CompLearn Workshop at ICML 2026

详情

AI中文摘要

上下文学习（ICL）使多模态大语言模型（MLLM）能够从少量标记示例中对图像进行分类。然而，这些模型如何使用提供的上下文仍然不透明。虽然思维链提示被广泛使用，但最近的研究认为它可能不反映真实的内部计算。在本文中，我们通过五种形式化程度递增的条件（从基线分类到描述逻辑（DL）公理生成）系统评估了冻结MLLM在少样本ICL下的基于概念的可解释性。通过独立的LLM-as-a-judge流水线评估四个最先进的MLLM，我们证明解释确实比单独预测更难。令人惊讶的是，强制模型生成形式化结构的基于概念的解释会单调降低预测准确性（从93.8%降至90.1%），这与显式推理普遍有助于性能的假设相矛盾。然而，当模型成功表达类别判别性视觉特征时，解释质量与正确预测强相关。我们的发现表明，虽然MLLM在视觉分类方面表现出色，但它们缺乏形式化、机器可验证的可解释性所需的特定指令微调。

英文摘要

In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.

2605.28211 2026-05-28 cs.CL 版本更新

SuperValid: 面向泛化下游扩展的能力对齐OOD验证

Quanen Sun, Changxin Tian, Ke Shi, Cai Chen, Cunyin Peng, Jia Liu, Kunlong Chen, Zhiqiang Zhang

发表机构 * Ant Group（蚂蚁集团）

AI总结提出SuperValid框架，通过从基准测试中提炼核心概念并扩展为多样化的知识丰富文本，合成能力对齐的分布外验证数据，以在能力层面预测下游性能，实现有效的模型选择、早停和扩展决策。

详情

AI中文摘要

扩展定律通过将计算量与交叉熵损失相关联来指导大型语言模型的训练，最近的工作进一步将其扩展到预测下游基准性能。然而，先前的方法在两个方面面临泛化限制：关注基准级性能会引入特定场景的伪影，而依赖IID验证损失则无法在训练分布变化时跟踪能力提升。在这项工作中，我们认为下游扩展应在能力层面进行研究，这能够捕捉跨相关任务的共享技能因素，同时抽象掉基准特定的噪声。我们提出了SuperValid，一个通过从能力领域内的基准测试中提炼核心概念并将其扩展为多样化的知识丰富文本来合成OOD（分布外）、能力对齐验证数据的框架。涵盖6个能力领域内17个基准测试的大量实验表明，SuperValid损失与不同架构、规模和训练数据分布的模型的下游性能表现出强且稳定的相关性。作为一种无需训练、可在训练期间计算且无需基准评估的度量，SuperValid实现了有效的模型选择、早停和扩展决策。

英文摘要

Scaling laws guide large language model training by relating compute to cross-entropy loss, and recent work further extends them to predict downstream benchmark performance. However, prior approaches face generalization limitations from two aspects: focusing on benchmark-level performance introduces scenario-specific artifacts, while relying on IID validation loss fails to track capability improvements when training distributions vary. In this work, we argue that downstream scaling should be studied at the capability level, which captures shared skill factors across related tasks while abstracting away benchmark-specific noise. We propose SuperValid, a framework that synthesizes OOD (out-of-distribution), capability-aligned validation data by distilling core concepts from benchmarks within a capability domain and expanding them into diverse, knowledge-rich texts. Extensive experiments spanning 17 benchmarks grouped into 6 capability domains show that SuperValid loss exhibits strong and stable correlation with downstream performance across models of different architectures, scales, and training data distributions. As a training-free metric computable during training without benchmark evaluation, SuperValid enables effective model selection, early stopping, and scaling decisions.

2605.28163 2026-05-28 cs.CL cs.AI 版本更新

通过边际锐化实现自一致性

Aleksei Arzhantsev, Otmane Sakhi, Nicolas Chopin

发表机构 * Criteo AI Lab（Criteo AI实验室）； CREST IP Paris（巴黎CREST研究所）； ENSAE, Institut Polytechnique de Paris（巴黎理工学院ENSAE）

AI总结提出一种自回归并行采样算法，通过锐化答案边际分布而非完整输出分布，在数学和编程基准上优于标准幂采样且速度更快。

详情

AI中文摘要

推理时采样可以在不额外训练的情况下激发语言模型的强推理能力。现有的幂采样方法通过锐化完整生成输出的分布来做到这一点，偏向于在模型下各自概率较高的完成。我们认为这对于推理来说是错误的目标：一个完成将推理轨迹与最终答案纠缠在一起，而重要的是一个答案是否被许多合理的推理路径支持。因此，我们将目标从完整输出分布转移到锐化的答案边际，使自一致性成为推理时目标而非事后投票标准。令人惊讶的是，这个边际目标允许一个高效的近似：我们提出一个简单的、纯粹自回归的并行采样算法，近似地从锐化的答案边际中采样，在数学和编程基准上比标准幂采样表现出更强的性能，同时快几个数量级。

英文摘要

Inference-time sampling can elicit strong reasoning abilities from language models without additional training. Existing power-sampling methods do so by sharpening the distribution over full generated outputs, favoring completions that are individually likely under the model. We argue that this is the wrong object to target for reasoning: a completion entangles a reasoning trace with a final answer, whereas what matters is whether an answer is supported by many plausible reasoning paths. We therefore shift the target from the full-output distribution to the sharpened answer marginal, making self-consistency an inference-time objective rather than a post-hoc voting criterion. Surprisingly, this marginal target admits an efficient approximation: we propose a simple, purely autoregressive parallel sampling algorithm that approximately samples from the sharpened answer marginal, eliciting stronger performance than standard power sampling on mathematics and coding benchmarks while being orders of magnitude faster.
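The difference between sharpening full outputs and sharpening the answer marginal can be seen in a toy Monte Carlo estimate. The sketch below is not the paper's algorithm: it estimates the answer marginal from sampled final answers and raises it to a power. `sharpened_answer_marginal`, the exponent `alpha`, and the sample data are assumptions.

```python
from collections import Counter


def sharpened_answer_marginal(answers, alpha=2.0):
    """Estimate p(answer)^alpha / Z from sampled final answers.

    answers: final answers extracted from independently sampled completions.
    The marginal p(answer) is approximated by the empirical frequency, so an
    answer reached by many reasoning paths gains mass even if no single
    completion is individually likely. Hypothetical helper for illustration.
    """
    counts = Counter(answers)
    n = len(answers)
    weights = {a: (c / n) ** alpha for a, c in counts.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


# Made-up samples: three reasoning paths end in "42", one ends in "41".
dist = sharpened_answer_marginal(["42", "42", "41", "42"], alpha=2.0)
print(dist)
```

With `alpha=2` the empirical 0.75/0.25 split becomes roughly 0.9/0.1. Large `alpha` approaches majority voting, which matches the abstract's framing of self-consistency as an inference-time objective.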

2605.28131 2026-05-28 cs.CL 版本更新

Better heads do not guarantee better binarized constituency parsing

更好的中心词并不能保证更好的二值化成分句法分析

Zeyao Qi, Yige Chen, Eitan Klinger, Vivaan Wadhwa, Jungyeul Park

发表机构 * The Chinese University of Hong Kong, Ma Liu Shui, Hong Kong（香港中文大学，香港马料水）； The University of British Columbia, Vancouver, Canada（不列颠哥伦比亚大学，温哥华，加拿大）； Korea Advanced Institute of Science & Technology, Daejeon, South Korea（韩国科学技术院，大田，韩国）

AI总结本文研究了标点感知的树二值化方法，并探讨了依存诱导的中心词信息是否改善了二值化句法分析器的监督信号，发现尽管学习到的中心词在内在中心词预测上优于基于规则的中心词，但在去二值化后并未带来一致的句法分析提升。

2605.28128 2026-05-28 cs.CL 版本更新

Chinese Word Boundary Recovery through Character Alignment Projection

通过字符对齐投影恢复中文词边界

Lusha Wang, Yuchen Li, Su Yuan, Jungyeul Park

发表机构 * The University of British Columbia（不列颠哥伦比亚大学）； Korea Advanced Institute of Science & Technology（韩国科学技术院）

AI总结提出基于对齐投影的两步方法，从带噪句子中恢复词边界，并构建两个评估基准，实验表明该方法能有效纠正过度切分错误。

详情

AI中文摘要

中文分词在非标准文本中尤其脆弱，语言学习者错误和其他字符层面的差异会破坏下游标注和评估所假设的词边界。本文将中文词边界恢复形式化为基于对齐的投影任务。给定一个带噪的源句子和一个更干净的目标对应句，我们首先在字符级别对齐两个字符串，然后将目标侧的词边界投影回源句。除了恢复方法本身，我们还引入了两个评估资源：基于MuCGEC的人工检查学习者中文基准，以及从中文宾州树库导出的受控合成基准。实验表明，直接分词仍然容易受到学习者输入中的复合碎片化影响，而所提出的两步投影方法通过使用校正后的目标恢复源侧词跨度，纠正了许多过度切分错误。结果表明，词边界恢复不同于普通分词，并且对齐投影为在带噪输入下稳定中文标注和评估提供了一种原则性机制。

英文摘要

Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper formulates Chinese word boundary recovery as an alignment-based projection task. Given a noisy source sentence and a cleaner target counterpart, we first align the two strings at the character level and then project target-side word boundaries back onto the source. Beyond the recovery method itself, we introduce two evaluation resources: a manually checked learner Chinese benchmark based on MuCGEC and a controlled synthetic benchmark derived from the Chinese Penn Treebank. Experiments show that direct segmentation remains vulnerable to compound fragmentation in learner input, whereas the proposed two-step projection method corrects many over-segmentation errors by using the corrected target to recover source-side word spans. The results show that word boundary recovery is distinct from ordinary segmentation and that alignment projection provides a principled mechanism for stabilizing Chinese annotation and evaluation under noisy input.

2605.28123 2026-05-28 cs.CL 版本更新

Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models

风险感知的选择性提示用于大型视觉-语言模型中的幻觉缓解

Yuang Huang, Yafeng Zhang, Yu Zilan

发表机构 * Shanghai Jiao Tong University（上海交通大学）； iFLYTEK ； Tsinghua University（清华大学）

AI总结本文系统研究提示验证在大型视觉-语言模型中的风险，发现其效果依赖输入难度，并提出基于预生成不确定性信号的选择性提示方法RSP以平衡性能。

Comments 7 pages, 1 figures, submitted to ACL ARR 2026 May (EMNLP)

详情

AI中文摘要

基于提示的验证被广泛用于缓解大型视觉-语言模型（LVLMs）中的幻觉，但其何时有效仍不清楚。我们系统研究了两种代表性LVLM架构和幻觉基准上的验证提示，发现它是一种有风险的干预：其纠正随输入难度增加，而新引入的错误在不同难度级别持续存在。因此，始终开启的提示在困难输入上有帮助，但在简单输入上益处甚微甚至有害。我们的分析进一步表明，这种行为与保守的输出偏移相关。验证提示将注意力从视觉令牌重新分配到指令令牌，并诱导出中性提示控制中不存在的中层熵模式，这表明是指令条件化的注意力重新分配而非统一的视觉基础改善。受这种输入依赖风险的启发，我们提出了风险感知的选择性提示（RSP），一种无需训练的方法，利用预生成不确定性信号选择性地触发验证。RSP减轻了始终开启提示的性能下降，同时保持基线性能，并揭示了有效的选择信号因架构而异。

英文摘要

Prompt-based verification is widely used to mitigate hallucinations in large vision-language models (LVLMs), yet when it helps remains poorly understood. We systematically study verification prompting across two representative LVLM architectures and hallucination benchmarks, and find that it is a risk-bearing intervention: its corrections increase with input difficulty, while newly introduced errors persist across difficulty levels. As a result, always-on prompting helps on hard inputs but offers little benefit on easier ones and can even harm them. Our analysis further shows that this behavior is associated with a conservative output shift. Verification prompts redistribute attention from visual tokens toward instruction tokens and induce a distinct middle-layer entropy pattern absent in a neutral-prompt control, suggesting instruction-conditioned attention redistribution rather than uniformly improved visual grounding. Motivated by this input-dependent risk, we propose Risk-aware Selective Prompting (RSP), a training-free approach that uses pre-generation uncertainty signals to trigger verification selectively. RSP mitigates the degradation of always-on prompting while preserving baseline performance, and reveals that effective selection signals vary across architectures.

2605.28122 2026-05-28 cs.CR cs.AI cs.CL 版本更新

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

SNARE: 自适应场景合成以诱发编码代理中的过度行为

Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang, Leo Yu Zhang

发表机构 * Griffith University（格里菲斯大学）； Quantstamp ； Nanyang Technological University（南洋理工大学）； UNSW Sydney（新南威尔士大学）； Wake Forest University（维克森林大学）

AI总结提出SNARE流水线，通过组合良性场景片段并使用无评判器预言机评分与汤普森采样，自适应地诱发编码代理的过度行为，并在4×5代理-模型矩阵上评估，发现19.51%的良性运行触发过度行为，且代理框架比模型影响更大。

详情

AI中文摘要

编码代理以一系列shell、文件和网络操作执行良性任务，其中任何操作都可能悄然超出授权范围而任务仍完成。我们称此为过度行为：提示并非对抗性且运行成功，但超出范围的操作可能泄露凭据或删除文件。现有基准未能捕捉：任务完成套件认可任何完成的运行，越狱套件探测对抗性提示，而先前唯一的过度行为基准对每个代理-模型对应用单一固定提示集，导致其最易和最难的配对测量不足。我们提出SNARE（为非对抗场景合成自适应奖励引导诱发），该流水线从可重用范围和陷阱片段组合良性场景，用无评判器预言机对每次运行评分，标记陷阱模式匹配及未经请求的文件添加或删除，并使用汤普森采样将每对运行预算导向最常触发它的场景。在24个过度行为原型上实例化得到OverEager，我们在四个编码代理和五个基础模型的4×5矩阵上运行。在10,000次良性运行中，19.51%触发过度行为，每对比率跨度达11.9倍。这种变化由代理框架驱动，而非模型：框架占56%而模型占21%，因此任何单一框架或单一模型评估都会低估矩阵约五分之一。

英文摘要

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fixed prompt set to every agent-model pair, leaving its easiest and most resistant pairs under-measured. We present SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope and trap fragments, scores each run with a judge-free oracle flagging trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each pair's run budget toward the scenarios that most often trigger it. Instantiating it over 24 overeager archetypes yields OverEager, which we run across a 4x5 matrix of four coding agents and five base models. Across 10,000 benign runs, 19.51% trigger overeager behavior, with per-pair rates spanning 11.9x. This variation is driven by the agent framework, not the model: the framework accounts for 56% of it against the model's 21%, so any single-framework or single-model evaluation undercounts the matrix by about a fifth.
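The budget-steering idea can be illustrated with a Beta-Bernoulli Thompson sampler that spends more runs on the scenarios that trigger overeager behavior most often. This is a simulation sketch, not the SNARE pipeline: `thompson_allocate`, the uniform Beta(1, 1) priors, and the made-up per-scenario trigger rates (which stand in for actually running an agent and scoring it with the oracle) are all assumptions.

```python
import random


def thompson_allocate(trigger_rates, budget, seed=0):
    """Spend `budget` runs across scenarios with Beta-Bernoulli Thompson sampling.

    trigger_rates: simulated probability that each scenario triggers the behavior.
    Returns the number of runs spent on each scenario. Hypothetical helper.
    """
    rng = random.Random(seed)
    n = len(trigger_rates)
    successes = [0] * n
    failures = [0] * n
    for _ in range(budget):
        # Sample a plausible trigger rate per scenario from its Beta posterior.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = max(range(n), key=draws.__getitem__)
        # Simulated oracle verdict for one run of the chosen scenario.
        if rng.random() < trigger_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


pulls = thompson_allocate([0.05, 0.10, 0.60], budget=500, seed=0)
print(pulls)
```

Most of the budget ends up on the third scenario, while the other two still receive a few exploratory runs. Running one such sampler per agent-model pair would let easy and resistant pairs each concentrate on the scenarios that are informative for them.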


2605.28120 2026-05-28 cs.CL cs.AI cs.MA (version update)

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

Affiliations: School of Informatics, Xiamen University; Institute of Artificial Intelligence, Xiamen University; The Hong Kong Polytechnic University

AI summary: Proposes the LegalGraphRAG framework, which uses a hierarchical legal graph and a multi-agent system (Researcher, Auditor, Adjudicator) for reliable legal reasoning, outperforming existing GraphRAG baselines in accuracy and trustworthiness.

Comments 30 pages, 18 figures, ACL 2026 Main Conference. Project page: https://github.com/XMUDeepLIT/LegalGraphRAG


Abstract

Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.


2605.28116 2026-05-28 cs.CR cs.AI cs.CL (version update)

MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

Ruoqi Guo, Yi Liu, Gelei Deng, Yiheng Xiong, Yuekang Li, Ying Zhang, Leo Yu Zhang, Lida Zhao, Ji Jie, Yuxiao Lu

Affiliations: Griffith University; Quantstamp; Nanyang Technological University; Singapore Management University; University of New South Wales; Wake Forest University; Independent Researcher

AI summary: Proposes the MIRAGE pipeline, which embeds attacker-controlled text in user-generated content regions to mount prompt-injection attacks on VLM-driven mobile GUI agents without modifying the agent, the application, or the operating system, reaching attack success rates of 23%-30%.


Abstract

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application's native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.


2605.28112 2026-05-28 cs.CR cs.CL cs.IR (version update)

A Wolf in Sheep's Clothing: Targeted Routing Hijacking in Federated RAG

Junjie Mu, Qiongxiu Li

Affiliations: Politecnico di Milano; Aalborg University

AI summary: Introduces Routing Hijacking, an attack in which a malicious client forges its semantic profile to attract target queries and corrupt retrieval results, and proposes a trust-aware post-routing framework to mitigate it.

Comments Under review. Code available at https://github.com/Junjie-Mu/routing-hijacking-fedrag


Abstract

Federated Retrieval-Augmented Generation (FedRAG) is attractive for privacy-sensitive applications because raw data remain local. As a result, routing must rely on client-provided semantic profiles, creating a new opportunity for manipulation. We introduce Routing Hijacking, a routing-stage attack in which a malicious client forges its profile to attract target queries despite having irrelevant underlying data. We show that this vulnerability is severe. Across three representative FedRAG routing architectures, Routing Hijacking consistently misroutes target queries and leads to downstream disruptions and failures, including missing evidence, poisoning, incorrect answers, and hallucinations. In a high-stakes MedQA-USMLE case study, we further show that poisoned retrieved evidence can mislead models across scales, leading to incorrect answers, hallucinations, and sycophantic failures. Existing defenses do not close this gap: encrypted routing preserves the exploited ranking, and Byzantine-robust Federated Learning (FL) rules transfer poorly to heterogeneous routing profiles. To address this gap, we propose a trust-aware post-routing framework that reweights clients using returned-evidence feedback, including retrieval relevance, profile consistency, and cross-client agreement; online experiments show that it suppresses persistent hijacking over recurring queries and transfers to a learned neural router. Our findings establish routing integrity as a new security challenge in FedRAG and highlight the need for stronger defenses for secure federated retrieval.


2605.28084 2026-05-28 cs.CL cs.AI (version update)

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

Affiliations: School of EE, KAIST; Dept. of EE, POSTECH; School of Computing, KAIST

AI summary: Proposes the SMILE-Next dataset and a method combining laughter-specific Self-Instruct with the Mixture-of-Laugh-Experts (MoLE) framework for multimodal laughter understanding, substantially outperforming baseline models.


Journal ref: Annual Meetings of the Association for Computational Linguistics 2026


Abstract

Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: https://mok0102.github.io/smile-next/.


2605.28079 2026-05-28 cs.CL (version update)

ATLAS: All-round Testing of Long-context Abilities across Scales

Deli Huang, Cunguang Wang, Hongyin Tang, Zhe Tang, Linsen Guo, Dongyu Ru, Ruoshi Yuan, Ziyue Zhu, Xiaoyu Li, Ziwen Wang, Chen Zhang, Anchun Gui, Wen Zan, Jiaqi Zhang, Xuezhi Cao, Jingang Wang, Xunliang Cai, Yixin Cao

Affiliations: Meituan; Fudan University

AI summary: Proposes the ATLAS benchmarking framework, which uses a layered taxonomy, length-aware AUC scoring, and the ATLAScore aggregate metric to systematically evaluate how long-context language models degrade with length and how their capabilities are distributed across tasks.

Comments 29 pages, 13 figures. Preprint


Abstract

Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and strong retrieval need not transfer to downstream use. We present ATLAS, a benchmarking framework that redefines long-context evaluation as length-dependent capability profiling. ATLAS contributes three methodological principles: (i) a layered taxonomy separating foundational operations from application workloads so failures can be attributed, (ii) length-aware AUC scoring that integrates score-length curves over a fixed 8K-1M grid, replacing single-point metrics with full degradation profiles, and (iii) ATLAScore, a harmonic-mean aggregate over taxonomy categories that penalizes imbalanced profiles, with end-to-end uncertainty propagation from subset scores through the nonlinear final aggregate. We instantiate the framework across eight capability dimensions with nine auditable components and 6,438 instances, and evaluate 26 models. Gemini-3.1-Pro-Preview leads at 128K, Claude-Opus-4.6 leads at 1M. Rankings reshuffle substantially between ATLASscore@8K-128K and ATLASscore@8K-1M: 7 models move by at least two ranks, and the two taxonomy layers share only 61% of cross-model variance, with individual rank gaps up to 12 positions. These results support reporting long-context quality by capability and length, not by a single headline score.
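To make principles (ii) and (iii) of the abstract above concrete, here is a small Python sketch of a length-aware AUC and a harmonic-mean aggregate. The abstract does not give the exact formulas, so several choices here are assumptions for illustration only: trapezoidal integration on a log2 length axis, normalization by the axis span, the four grid points, the zero handling in `harmonic_mean`, and all function names and scores.

```python
import math


def length_auc(lengths, scores):
    """Normalized trapezoidal area under the score vs. log2(length) curve.

    Assumption: a log-scaled length axis, so each doubling of context
    length carries equal weight. Returns a value on the same scale as scores.
    """
    xs = [math.log2(n) for n in lengths]
    area = sum(
        (xs[i + 1] - xs[i]) * (scores[i] + scores[i + 1]) / 2
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])


def harmonic_mean(values):
    """Harmonic mean; defined here as 0.0 if any category score is not positive."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


grid = [8_000, 32_000, 128_000, 1_000_000]  # hypothetical grid points
flat = length_auc(grid, [0.8, 0.8, 0.8, 0.8])      # no degradation with length
decaying = length_auc(grid, [0.9, 0.8, 0.6, 0.2])  # collapses at long lengths

balanced = harmonic_mean([0.70, 0.70])  # two categories, same arithmetic mean
lopsided = harmonic_mean([0.95, 0.45])  # as this pair, but imbalanced
```

Both category pairs have an arithmetic mean of 0.70, but the harmonic mean of the lopsided pair comes out lower, which is the sense in which such an aggregate penalizes imbalanced profiles.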


2605.28074 2026-05-28 cs.CR cs.CL cs.IR (version update)

SilentRetrieval: Hijacking Retrieval-Augmented Generation via Semantically-Preserving Adversarial Data Poisoning

Jiachen Qian

Affiliations: City University of Hong Kong

AI summary: Proposes SilentRetrieval, a two-stage data poisoning attack that uses Coordinated Beam Search and Context-Adaptive Trigger Generation to achieve high retrieval hit rates and attack success rates while keeping documents fluent, and evaluates the effectiveness of defenses.

Comments 12 pages, 4 figures, KDD '26 camera-ready version


DOI: 10.1145/3770855.3818186
Journal ref: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09--13, 2026, Jeju Island, Republic of Korea


Abstract

Retrieval-Augmented Generation (RAG) mitigates LLM hallucinations but introduces a critical vulnerability: corpus integrity. We present SilentRetrieval, a two-stage data poisoning attack that hijacks RAG systems through adversarially crafted yet fluent documents. Stage 1 uses Coordinated Beam Search, a multi-token joint optimization method with a fluency-similarity objective, to keep a poisoned host document retrievable while constraining perplexity. Stage 2 uses Context-Adaptive Trigger Generation, a lightweight trigger-fusion step driven by a frozen LLM, to integrate manipulation triggers into document content. Under a one-poisoned-document-per-query evaluation with synthetic target answers, SilentRetrieval achieves 84.6%/81.3% HR@10 and 57.5%/54.8% ASR-LLM on Natural Questions and MS MARCO, while maintaining near-benign perplexity. Cross-model evaluation across four target LLMs shows nontrivial effectiveness under a fixed trigger generator, and transfer tests against unseen retrievers, including ColBERT and commercial embedding models, yield 64.7% average HR@10 under the same injected-corpus protocol. In a sampled Wikipedia-scale evaluation, SilentRetrieval retains 74.2% HR@10 at a 0.016% poisoning ratio. Combined retrieval-side and generation-side defenses reduce attack success substantially but incur a latency trade-off. Human evaluation shows substantially lower flag rates than disfluent baselines, while remaining numerically more suspicious than benign content at the current sample size.


2605.28073 2026-05-28 cs.CL cs.AI (version update)

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

Affiliations: Beijing University of Posts and Telecommunications; AIM3 Lab, Renmin University of China; Nanyang Technological University

AI summary: To align story rewriting with reader preferences, proposes an approach based on context-aware narrative enrichment and builds the STORYLENSBENCH benchmark, the STORYLENSEVAL reward model, and the two-stage STORYLENSWRITER rewriting model. Experiments show that context enhancement substantially improves user satisfaction.

Comments 16 pages, 7 figures, 15 tables


Abstract

Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.


2605.28062 2026-05-28 cs.CL cs.IR (version update)

ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor

Taiheng Pan

Affiliations: School of Computing and Information Systems, University of Melbourne

AI summary: Presents ConvMemory, a 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features, and reports a negative attribution result along with CCGE-LA, a research-preview conflict editor.

Comments 15 pages. Technical report


Abstract

We describe ConvMemory, a small 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval memory family, ConvMemory operates above the BGE-large cross-encoder in Recall@10 at 12-47x lower latency, remains within 0.025 Recall@10 of mxbai-rerank-large-v1 on Clean500 while running 28x cheaper; under Stress1000 distractors the Recall@10 gap widens to 0.081 but ConvMemory still operates at 117x lower latency; these LongMemEval numbers are single-run or single-seed and are reported as indicative cost-frontier evidence, not benchmark-grade. We then publish a rigorous negative attribution result on a previously claimed mechanism: a five-seed retrained ablation with paired bootstrap shows that ConvMemory's learned temporal window is statistically significant on aggregate but not temporally specific, with the largest effects on hard non-temporal controls and no significant effect on multi-hop temporal queries. The honest description of the mechanism is cheap cross-encoder distillation in a fused dense+lexical feature space, not temporal-structure exploitation. We additionally release CCGE-LA, a low-amplitude conflict-aware candidate-set editor over ConvMemory, as a research preview with modest but consistent gains on supersession and stale/rescue slices on LoCoMo. All results are retrieval-stage; ConvMemory does not match mxbai-rerank-large-v1 in absolute LoCoMo MRR, and the report is single-author and not yet independently audited.
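The abstract above mentions a paired bootstrap without spelling it out. A minimal Python sketch of one common variant follows: a percentile confidence interval on the mean per-query difference between two systems scored on the same queries. It is only an illustration under stated assumptions, not the report's procedure. The function name `paired_bootstrap_ci`, the percentile method, the resample count, and the example scores are all hypothetical.

```python
import random


def paired_bootstrap_ci(full, ablated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-query difference (full - ablated).

    full, ablated: per-query metric values for the same queries, same order.
    Resampling query indices keeps each pair together, which is what makes
    the bootstrap "paired".
    """
    if len(full) != len(ablated):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(full, ablated)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-query scores for a full model and an ablated model.
full_scores = [0.9, 0.4, 0.7, 0.8, 0.6, 0.5, 0.9, 0.3]
ablated_scores = [0.8, 0.4, 0.5, 0.8, 0.6, 0.4, 0.7, 0.3]
lo, hi = paired_bootstrap_ci(full_scores, ablated_scores)
```

If the resulting interval excludes zero, the difference would conventionally be called significant at the chosen level. Running the same test separately on each query slice is one way to check whether an effect is specific to a slice, such as temporal queries, or spread across all of them.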


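2605.28060 2026-05-28 cs.CL (version update)

Challenges in Explaining Pretrained Clinical Text Classifiers

Kristian Miok, Matej Klemen, Blaz Škrlj, Marko Robnik Šikonja

Affiliations: Faculty of Computer and Information Science, University of Ljubljana, Slovenia; ICAM - Advanced Environmental Research Institute, West University of Timisoara, Romania

AI summary: Using a hospital length-of-stay prediction task, shows the limitations of post-hoc explanation methods such as LIME and SHAP on clinical narratives, including overemphasis on non-informative tokens, unstable attributions, and high-confidence predictions for incoherent inputs, and stresses the need for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise.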

Comments 9 pages, 7 figures. Accepted at the First Workshop on Responsible Healthcare using Machine Learning (RHCML 2025), co-located with ECML PKDD 2025


DOI: 10.1007/978-3-032-19105-2_22
Journal ref: Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2025. Communications in Computer and Information Science, vol 2842, pp. 314-322. Springer, Cham (2026)


Abstract

Explaining the predictions of neural models in clinical NLP remains a significant challenge, especially for complex tasks involving long, unstructured medical texts. While post-hoc methods like LIME and SHAP are widely used, they often fall short when applied to clinical narratives. In this paper, we identify core limitations of token-level and perturbation-based explanation techniques through targeted demonstrations on a hospital length-of-stay prediction task. Our findings reveal issues such as overemphasis on non-informative tokens, instability in attributions, and high-confidence predictions for incoherent input variants. These results underscore the need for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise.


2605.28058 2026-05-28 cs.CL (version update)

Prompting Is All You Need: Multi-view Prompting Large Language Models for Aspect-Based Sentiment Analysis

Nils Constantin Hellwig, Niklas Donhauser, Jakob Fehle, Udo Kruschwitz, Christian Wolff

Affiliations: Media Informatics Group, University of Regensburg, Germany; Information Science Group, University of Regensburg, Germany

AI summary: Proposes LLM-MvP, which combines multi-view prompting, schema-constrained decoding, and prefix batching so that few-shot LLMs perform competitively with, or better than, fine-tuned models while reducing computational overhead.


Abstract

Recent work explored the capabilities of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA) through few-shot prompting, requiring substantially fewer annotated examples while achieving notable improvements over zero-shot baselines. However, a performance gap remained compared to models fine-tuned on hundreds of examples, and the computational costs of LLM inference present practical barriers to deployment. We introduce LLM-based Multi-View Prompting (LLM-MvP), which adapts the multi-view principle of considering multiple element orderings to LLM prompting. By combining schema-constrained decoding with a context-free grammar and prefix batching, LLM-MvP achieves performance competitive or superior to fine-tuned approaches while substantially reducing computational overhead. Extensive experiments across five benchmark datasets demonstrate that LLM-MvP closes the gap between few-shot prompting and fine-tuned models, offering a practical and efficient solution for ABSA.


2605.28047 2026-05-28 cs.CL (version update)

Knowledge Dependency Estimation for Reliable Question Answering

Chaodong Tong, Qi Zhang, Nannan Sun, Lei Jiang, Yanbing Liu

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; China Industrial Control Systems Cyber Emergency Response Team

AI summary: Proposes Knot, which uses subset-level counterfactual supervision and coverage modeling over latent dependency factors to estimate a black-box QA model's sensitivity to different knowledge units and identify key knowledge dependencies.

Comments 12 tables, 9 figures


Abstract

Reliable question answering requires identifying not only whether an answer is correct, but also which available knowledge the prediction depends on. In realistic LLM-based QA, this knowledge may come from context, retrieval, decomposition, or intermediate reasoning, forming a noisy and redundant candidate space rather than a clean gold evidence set. We study knowledge dependency estimation: estimating the sensitivity of a fixed black-box QA model to different candidate knowledge units. The challenge is to obtain fine-grained dependency scores without exhaustive test-time perturbation while modeling redundancy, substitutability, and complementarity. We propose Knot, a structured rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidates. Across multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls; when used for practical risk screening, its dependency scores help flag error-prone QA predictions early.


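2605.28046 2026-05-28 cs.AI cs.CL (version update)

MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

Affiliations: WeChat, Tencent Inc.

AI summary: Proposes MemCog, a system that integrates memory access into the reasoning process through a navigable memory store, a cross-dimensional navigation interface, and a proactive reasoning protocol, achieving the best results on both passive QA and proactive memory-triggering benchmarks.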


Abstract

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm, where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as a Navigable Memory Store with associative link graphs, exposes a Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs a Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art results on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.


2605.28042 2026-05-28 cs.CL cs.AI cs.LG (version update)

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Liu O. Martin, Lucas Bandarkar, Nanyun Peng

Affiliations: University of California, Los Angeles

AI summary: Proposes a method that aggressively prunes translation-irrelevant experts from mixture-of-experts LLMs, substantially compressing the MoE blocks without significantly degrading translation quality.


Abstract

Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. And because of the modular nature of MoEs, these can be easily pruned without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.


2605.28037 2026-05-28 cs.CL (version update)

Personality, Role, and Expressive Style in Large Language Models: An Interactionist Analysis

Moe Nagao, Koichiro Terao, Mikio Nakano, Naoto Iwahashi

Affiliations: Okayama Prefectural University; AI & Humans Lab; C4A Research Institute

AI summary: From an interactionist perspective, uses a factorial experiment to analyze how personality traits, dialogue roles, and expressive styles jointly shape the perceived expression of Big Five traits in LLM-generated dialogue.

Comments 26 pages


Abstract

Prompt-based personality control is a key technique for designing large language model (LLM) dialogue agents that behave consistently across social contexts. However, specifying Big Five personality traits (BFTs) in a prompt does not ensure that the intended traits are expressed in generated utterances. This paper investigates this mismatch from an interactionist perspective, viewing personality expression as a context-dependent outcome shaped by the interplay between trait specification and situational factors. We analyze how perceived BFT expression in LLM-generated dialogue is influenced by three prompt factors: personality traits, dialogue roles, and expressive styles. Using a factorial design that combines six personality conditions, three roles, and three expressive-style conditions, we generate 1,080 LLM-agent dialogues in each of English and Japanese. We then evaluate the target agent's utterances using an LLM-as-a-judge framework to estimate expressed Big Five traits. The results show that expressed personality is shaped not only by explicit trait specification, but also by dialogue role and expressive style. These effects are trait-specific: dialogue role strongly influences Openness, expressive style substantially shapes Conscientiousness and Agreeableness, and explicit trait specification dominates Neuroticism. Even without explicit personality-trait specification, social and expressive conditions induce distinct personality-like impressions. Cross-linguistic comparisons show broadly similar patterns between English and Japanese dialogues, with noticeable differences only under specific combinations of personality, role, and expressive style. These findings suggest that personality control in LLM agents should be understood not as a direct consequence of trait prompting, but as a context-dependent process involving personality specification, social role, and expressive style.

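2605.28025 2026-05-28 cs.AI cs.CL cs.CY (updated version)

KSAFE-MM: A Multimodal Safety Benchmark for Korean Cultural Risks via Localized Contextualization

KSAFE-MM：通过本地化语境化实现韩国文化风险的多模态安全基准

Yongwoo Kim, Sojung An, Yunjin Park, Jungwon Yoon, Dujin Lee, HyunBeom Cho, Jaewon Lee, Wonhyuk Lee, Youngchol Kim, JeongYeop Kim, Donghyun Kim

Affiliations: Korea University; KT Corporation

AI summary: To address the lack of cultural specificity in safety evaluation of multimodal large language models, this work proposes the KSAFE-MM benchmark, which builds general and Korean-culture-specific multimodal safety test sets through linguistic and visual contextualization, and reveals models' vulnerability to culturally grounded attacks as well as a trade-off between safety and over-refusal.

AI-translated Chinese abstract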

多模态大语言模型（MLLMs）通过引入跨多种模态（如语言和视觉）的漏洞，加剧了安全风险。然而，当前的MLLM安全评估工具存在重大局限性：1）以英语为中心的数据集构建，以及2）关注与当地文化背景无关的通用风险。本文介绍了KSAFE-MM，一个用于韩语多模态安全评估的基准，涵盖通用安全风险和文化特定漏洞。KSAFE-MM由两部分组成：KSAFE-MM-G和KSAFE-MM-C。KSAFE-MM-G通过语言语境化评估韩语语境中的全球共享风险，将通用安全查询转化为上下文相关的多模态样本。KSAFE-MM-C利用源自真实世界语境的本地化视觉查询，针对文化依赖的MLLM安全漏洞。它将这些视觉查询与越狱式文本查询配对，以覆盖涉及文化视觉线索和恶意文本意图的多模态安全风险。这些组件共同提供了一个从通用到本地的构建流程，用于评估全球共享安全风险和文化特定漏洞。我们在KSAFE-MM上评估了12个最先进的MLLM，并揭示了模型对文化攻击的脆弱性高于通用攻击。值得注意的是，越狱策略显著提高了攻击成功率，其中ProgramExecution的攻击成功率高达74.2%，而标准查询仅为13.4%。此外，我们发现了安全性与过度拒绝之间的系统性权衡，即实现低攻击成功率的模型往往对良性查询表现出过度的拒绝行为。这些发现强调了超越以英语为中心的基准、进行文化基础安全评估的紧迫性。

English abstract

Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify attack success rates, with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.

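2605.28009 2026-05-28 cs.CL cs.AI cs.LG (updated version)

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

MemGuard：防止长期记忆增强型大语言模型中的记忆污染

Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji

Affiliations: University of Illinois Urbana-Champaign; Columbia University; Capital One

AI summary: Proposes MemGuard, a type-aware memory framework that prevents heterogeneous memory contamination by explicitly assigning functional roles, maintaining relations across type-isolated memories, and selectively composing evidence from only the necessary memory types; it improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens.

AI-translated Chinese abstract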

记忆增强型大语言模型通过跨交互维护长期记忆，将推理扩展到固定上下文窗口之外。然而，现有的记忆系统常常将稳定的用户事实、情景事件和行为规则折叠到共享空间中，使得功能不同的记忆被检索并用作可互换的证据。我们将这种失败模式识别为异构记忆污染，其中上下文特定的事件被过度概括为声明，或者语义相关但功能不兼容的记忆误导生成。为此，我们引入了MemGuard，一种类型感知的记忆框架，在记忆构建和检索过程中保留功能记忆边界。它在写入时为每个记忆分配显式的功能角色，维护跨类型隔离记忆的关系，并仅从必要的记忆类型中选择性组合证据，从而减少来自无关或功能不兼容证据的污染。在幻觉和长时对话基准测试中，MemGuard将记忆可靠性提高了最多28.27%，同时检索的记忆token数比先前方法减少了最多5.8倍。这些结果表明，可靠的长期推理依赖于对异构记忆的有原则的组织和选择性使用。

English abstract

Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.
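
The idea of assigning each memory a functional role at write time and retrieving only from the necessary types can be sketched as follows. This is a toy illustration, not MemGuard's implementation: the class name `TypedMemoryStore`, the three type labels, and the keyword-overlap matching are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Illustrative labels for the three kinds of memory named in the abstract
# (stable user facts, episodic events, behavioral rules).
MEMORY_TYPES = ("fact", "episode", "rule")


@dataclass
class TypedMemoryStore:
    """Hypothetical store that keeps each memory type in its own list."""

    stores: dict = field(default_factory=lambda: {t: [] for t in MEMORY_TYPES})

    def write(self, text, mem_type):
        # The functional role is fixed when the memory is written.
        if mem_type not in self.stores:
            raise ValueError(f"unknown memory type: {mem_type}")
        self.stores[mem_type].append(text)

    def retrieve(self, query, needed_types):
        # Search only the requested types; matching is naive word overlap
        # (an assumption standing in for a real retriever).
        words = set(query.lower().split())
        hits = []
        for mem_type in needed_types:
            for text in self.stores[mem_type]:
                if words & set(text.lower().split()):
                    hits.append((mem_type, text))
        return hits


store = TypedMemoryStore()
store.write("user is vegetarian", "fact")
store.write("user ate steak at a wedding once", "episode")
result = store.retrieve("what does the user eat", ["fact"])
print(result)
```

Because the query is answered from the `fact` store alone, the one-off episode cannot be retrieved and overgeneralized into a claim about the user's diet.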

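2605.28006 2026-05-28 cs.CL cs.AI (updated version)

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

毒性存在于何处？语言模型中的机制定位与定向抑制

Himanshu Beniwal, Mayank Singh

Affiliations: Indian Institute of Technology Gandhinagar

AI summary: Localizes toxicity to specific layers and neurons by analyzing activation differences between toxic and neutral prompts, then suppresses it via inference-time scaling or minimal rank-one weight edits without gradient descent, reducing toxicity while preserving language modeling quality.

AI-translated Chinese abstract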

大型语言模型频繁生成有毒、仇恨或有害内容，然而现有的缓解方法依赖于昂贵的重新训练或输出级过滤，且缺乏对毒性内部起源的机制性理解。我们提出了Meow2X和TRNE，两种互补的无需重新训练的框架，通过分析毒性与中性提示之间的激活差异，将毒性定位到特定层和神经元，然后通过推理时缩放或最小秩一权重编辑进行抑制——无需任何梯度下降。在五个语言模型、两个基准测试和90种配置上的评估，使用双重安全评估器，一致地证明了毒性降低，同时保持了语言建模质量。我们的分析揭示，毒性不成比例地编码在早期MLP层中，在不同架构间有所变化，并且被单一评估器设置系统性地低估——强调了多评估器安全评估的必要性。通过连接机制可解释性与实际去毒化，我们的框架为更安全、更透明的语言模型提供了一条原则性路径。

English abstract

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits -- without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups -- underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.
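
A minimal sketch of localization by activation differentials followed by inference-time scaling is given below. It is an assumption-laden toy, not the Meow2X or TRNE procedure: the function names are hypothetical, neurons are ranked by the signed difference of mean activations, and a single scaling factor is applied to the selected neurons.

```python
def mean_activation(rows):
    """Per-neuron mean over a list of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def top_differential_neurons(toxic_acts, neutral_acts, k):
    """Indices of the k neurons with the largest toxic-minus-neutral mean."""
    toxic_mean = mean_activation(toxic_acts)
    neutral_mean = mean_activation(neutral_acts)
    diffs = [t - n for t, n in zip(toxic_mean, neutral_mean)]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return ranked[:k]


def scale_neurons(activation, neuron_ids, factor):
    """Scale the selected neurons at inference time; leave the rest as-is."""
    targets = set(neuron_ids)
    return [a * factor if i in targets else a for i, a in enumerate(activation)]


# Toy activations for one layer with three neurons.
toxic = [[1.0, 5.0, 0.0], [1.0, 7.0, 0.0]]
neutral = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

selected = top_differential_neurons(toxic, neutral, k=1)
damped = scale_neurons([1.0, 6.0, 0.0], selected, factor=0.5)
print(selected, damped)  # prints "[1] [1.0, 3.0, 0.0]"
```

No gradient is computed anywhere in this sketch, which is the property the abstract emphasizes; how layers are chosen and how the scaling factor is set are left open here.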

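2605.27993 2026-05-28 cs.CL (updated version)

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

重新思考视觉忽视：通过上下文偏好引导缓解MLLM幻觉

Jingwen Wu, Xijun Zhang, Ge Song

Affiliations: School of Computer and Electronic Information, Nanjing Normal University

AI summary: Targeting object hallucination in multimodal large language models, proposes the training-free Context-Preference Activation Steering (CAS) framework, which extracts context preference vectors and applies signed residual injection at mid-early MLP layers to control information reliance, mitigating hallucination without increasing decoding latency.

Comments 15 pages, 5 figures

AI-translated Chinese abstract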

物体幻觉仍然是多模态大语言模型（MLLM）可靠部署的主要障碍。当前的推理时缓解方法主要假设幻觉源于视觉忽视，引导模型增强视觉依赖。相反，我们对多个MLLM的系统干预表明，推动更多视觉依赖可能会加剧某些模型的幻觉，而减少视觉依赖则可能缓解幻觉。这一结果表明，将幻觉单纯归因于视觉不足是不充分的。我们认为，图像作为上下文，同时与模型的参数知识和文本上下文竞争。为此，我们提出了一种无需训练的框架——上下文偏好激活引导（CAS）。它通过两组设计好的冲突样本提取两个语义不同的上下文偏好向量（CPV），并在推理时通过单次符号残差注入到中间早期MLP层来控制信息依赖。实验表明，CAS在不增加解码延迟的情况下显著缓解了物体幻觉，并保持了原生文本生成质量。

English abstract

Object hallucination remains a primary obstacle to the reliable deployment of Multimodal Large Language Models (MLLMs). Current inference-time mitigation methods mainly assume hallucinations stem from visual neglect, steering models to enhance visual reliance. In contrast, our systematic interventions on multiple MLLMs show that pushing toward more visual reliance may exacerbate hallucinations on some models, while less may mitigate hallucinations. This result suggests that attributing hallucinations solely to visual insufficiency is underdetermined. We argue that the image, as a context, simultaneously competes with the model's parametric knowledge and the textual context. For this, we propose a training-free framework, Context-Preference Activation Steering (CAS). It extracts two semantically distinct Context Preference Vectors (CPVs) via two small sets of designed conflict samples and applies them via single-pass signed residual injection at mid-early MLP layers during inference to control information reliance. Experiments show that CAS substantially mitigates object hallucinations without increasing decoding latency and preserves native text-generation quality.

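2605.27988 2026-05-28 cs.CL cs.CY (updated version)

Auditing Stance Asymmetry in Generative Explanations

审计生成性解释中的立场不对称性

Jiarui Han

AI summary: Addressing the stance asymmetry that can arise when language models assign responsibility, legitimacy, and context in open-ended explanations, proposes Symmetry Decomposition Evaluation (SDE), which uses paired-situation tests to show which surface differences remain stable, and notes where automatic scoring becomes unstable.

AI-translated Chinese abstract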

语言模型的偏见评估在有界比较方面取得了实质性进展，例如明显的贬低、刻板印象关联或受控替换下的标签敏感差异。开放式解释提出了一个不同的问题：它们通过分配责任、合法性、背景和委屈来指导解释。模型可以避免敌对语言，同时使一方在结构上可理解，而另一方则被归咎于个人、反应过度或不太值得认真对待。我们称之为生成性解释中的立场不对称性。我们提出对称性分解评估（SDE），该方法通过具体群体标签、结构角色重写以及明确的支持或反证来测试配对情境。在一个受控的32族原型套件中，这种分解表明表面差异并非全部相同：有些在结构或证据控制下减弱，而另一些则作为模型分配责备、背景或合法性的稳定差异持续存在。针对性的案例审查和法官比较表明，评估开放式框架不对称性存在更广泛的困难：法官的解读随操作化方式而变化，标量分数可能抹平读者用于解释解释性立场的区别。因此，SDE将生成性偏见评估重新定义为对解释性立场的审计——每一方接受什么立场，它在分解下如何变化，以及自动评分在何处变得不稳定。

English abstract

Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations raise a different problem: they guide interpretation by assigning responsibility, legitimacy, context, and grievance. A model can avoid hostile language while making one side structurally understandable and another personally at fault, overreacting, or less worth taking seriously. We call this stance-bearing asymmetry in generative explanations. We propose Symmetry Decomposition Evaluation (SDE), which tests paired situations with concrete group labels, structural-role rewrites, and explicit support or counter-evidence. In a controlled 32-family prototype suite, this decomposition shows that surface differences are not all alike: some weaken under structural or evidence control, while others remain as stable differences in how the model assigns blame, context, or legitimacy. Targeted case review and judge comparison suggest a broader difficulty for evaluating open-ended framing asymmetries: judge readings shift across operationalizations, and scalar scores can flatten distinctions that readers use to interpret explanatory stance. SDE therefore reframes generative bias evaluation as an audit of explanatory stance -- what stance each side receives, how it changes under decomposition, and where automatic scoring becomes unstable.

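2605.27986 2026-05-28 cs.CL q-bio.QM (updated version)

Semantic Flow Regularization: Teaching LLMs to Generate Diverse and Coherent Responses

语义流正则化：教会LLMs生成多样且连贯的回复

Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

Affiliations: Tencent Inc., Beijing, China

AI summary: To address Cross-Style Collapse, in which output diversity is severely limited when large language models are fine-tuned, proposes Semantic Flow Regularization (SFR), which supervises the backbone with continuous sentence embeddings via conditional flow matching and improves diversity and style fidelity at zero deployment cost.

AI-translated Chinese abstract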

当大语言模型被微调以生成个性或语气条件化的回复时，其输出多样性受到严重限制——我们将这种失败称为跨风格坍缩。我们将这种坍缩追溯到交叉熵目标，该目标在共享表示下倾向于抑制多样化的延续。我们提出语义流正则化（SFR），一种轻量级的辅助目标，通过条件流匹配使用未来片段的连续句子编码器嵌入来监督骨干网络。随机流源通过构造保持多模态；流匹配头在推理时被丢弃，增加零部署成本。在一个大规模工业对话数据集（Qwen3-32B，9种个性）上，SFR在输出多样性、风格保真度和回复质量上优于SFT。我们进一步在公共LiveCodeBench-v5（Qwen2.5-Coder-7B-Instruct）上验证，其中SFR持续改进pass@k，证实了其超越风格化对话的通用性。在MBPP上的受控比较显示，多令牌预测是SFR的一个退化特例。

English abstract

When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.

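2605.27969 2026-05-28 cs.CL (updated version)

Boundary Suppression Asymmetry in Post-trained Assistants: Over-expansion as a Controllability Cost

后训练助手中的边界抑制不对称性：过度扩展作为可控性代价

Jiarui Han

AI summary: Studies the boundary-suppression asymmetry that arises in post-trained language-model assistants from optimizing against under-answering, finding that anti-underanswering policies are harder to suppress in boundary-control evaluations and that this cost is associated with content-budget overshoot and continuation persistence.

AI-translated Chinese abstract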

后训练的语言模型助手通常被优化以避免回答不足，鼓励完整、有用、谨慎和主动的响应。我们询问这种优化是否会产生不对称的可控性代价：当用户明确要求更窄的回答时，哪些助手行为仍然可被抑制，哪些继续塑造响应？我们将此问题研究为边界抑制不对称性。跨多个高级响应维度的提示侧探测表明存在选择性代价，集中在“过度助手”方向，如过度完成、额外帮助和反回答不足。使用来自共享基础模型的控制助手策略变体，我们发现反回答不足策略在匹配的边界控制评估下比基线更难被拉回，而最小边界变体在直接边界控制比较中通常避免了这种反侧向上偏移。机制导向的探测指向超出更长的默认输出、纯EOS失败、不确定性补偿和局部延续偏差，而鲁棒性检查在共享系统和更大规模设置下保持了主要的反超基线排序。证据支持混合规划/停止解释，其中内容预算超支和延续持续性共同使边界修正更难。总体而言，后训练可能产生方向特定的可控性代价：一些有用的助手倾向仍然容易调用，但更难局部抑制。

English abstract

Post-trained language-model assistants are often optimized to avoid under-answering, encouraging complete, helpful, cautious, and proactive responses. We ask whether this optimization creates asymmetric controllability costs: when users explicitly request narrower answers, which assistant behaviors remain suppressible, and which continue to shape the response? We study this problem as boundary-suppression asymmetry. Prompt-side probes across multiple high-level response dimensions suggest a selective cost, concentrated around "too-much assistant" directions such as over-completion, extra help, and anti-underanswering. Using controlled assistant-policy variants derived from a shared base model, we find that anti-underanswering policies are harder to pull back than the baseline under matched boundary-control evaluations, while minimal-boundary variants generally avoid this anti-side upward shift in the direct boundary-control comparisons. Mechanism-oriented probes point beyond longer default outputs, pure EOS failure, uncertainty compensation, and local continuation bias, while robustness checks preserve the main anti-over-baseline ordering under shared-system and larger-scale settings. The evidence supports a mixed planning/stopping account, where content-budget overshoot and continuation persistence jointly make boundary correction harder. Overall, post-training may create direction-specific controllability costs: some helpful assistant tendencies remain easy to invoke, yet harder to locally suppress.

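2605.27958 2026-05-28 cs.CL cs.AI cs.LG (updated version)

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

压力测试LLM中的欺骗探针：扩展性、鲁棒性与欺骗表示的几何结构

Sachin Kumar

Affiliations: LexisNexis

AI summary: Through systematic pressure testing, diagnoses why linear probes fail under distributional shift, finds that style augmentation recovers near-perfect detection, and shows that deception is encoded neither as a single linear direction nor as an entropy proxy but through distributed sub-threshold features.

Comments Accepted at the GEM Workshop @ ACL 2026

AI-translated Chinese abstract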

基于LLM激活训练的线性探针越来越多地被提议作为欺骗检测指标，但在干净基准上报告AUROC超过0.96，而在分布偏移下崩溃。本文系统地对Gemma 3模型家族（1B-27B参数）的探针指标进行压力测试，诊断其失败原因而不仅仅是记录失败。我们测试了关于欺骗编码的四个假设：（1）单一线性方向，（2）多维子空间，（3）凸锥包，（4）熵代理。我们的设计包括跨域转移矩阵、基于排列零基线的多维探针分析、熵残差化测试以及8种风格偏移下的干扰评估。我们发现：（a）探针在干净数据上达到近乎完美的AUROC（>=0.998），但在风格偏移下崩溃；风格增强的探针在未见风格上恢复近乎完美的检测（平均AUROC 0.979-0.983）；（b）单一方向假设被拒绝（k=1仅捕获0.61-0.80 AUROC），跨域转移失败被确认为几何原因而非层不匹配驱动；（c）熵代理假设被拒绝（最大|rho|=0.454，残差化后最大Delta-AUROC=0.004）；（d）欺骗并未形成显著的线性子空间（每域k*=0），但多维探针（k>=5）通过分布式亚阈值特征恢复信号。探针脆弱性反映了分布狭窄性而非架构限制：风格增强的探针在4B和27B均恢复近乎完美的检测，表明逆缩放模式是训练分布伪影而非真正的规模依赖现象。

English abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
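
Since the results in this abstract are reported as AUROC, a small worked sketch of that metric may help. The function below computes AUROC as the probability that a randomly chosen positive (deceptive) example receives a higher probe score than a randomly chosen negative one, counting ties as one half. The scores and labels are made-up toy values, not data from the paper.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy probe scores: label 1 = deceptive, 0 = honest.
clean = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
shifted = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
print(clean, shifted)  # prints "1.0 0.75"
```

In the second call one deceptive example scores below an honest one, so only three of the four positive-negative pairs are ordered correctly. A value of 0.5 corresponds to chance-level ranking.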

2605.27957 2026-05-28 cs.CL (updated version)

DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints

DisasterBench: 在类型化工具接口约束下基准测试LLM规划

Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong, Chengkai Liu, Zhewei Liu, Yiming Xiao, Ali Mostafavi, James Caverlee

Affiliations: Texas A&M University; University of Toronto

AI summary: Proposes the DisasterBench benchmark, which evaluates LLMs' structured multi-agent planning for disaster response under typed tool interfaces, and introduces First-Point-of-Failure (FPoF) for step-level failure attribution, revealing a gap between semantic reasoning and execution constraints.

AI-translated Chinese abstract

灾害造成严重的社会影响，需要快速协调异构AI工具（从卫星分析到洪水预测和损害评估）形成连贯的多步骤工作流。随着LLM越来越多地充当此类管道的编排者，有效的协调需要的不仅仅是选择语义上合理的工具：LLM必须生成具有正确参数绑定和依赖传播的可执行工作流。我们引入了DisasterBench，这是一个基准，用于评估在语义相似但操作上不同的灾害响应工具上的结构化多智能体规划。为了实现步骤级故障归因，我们进一步提出了首次故障点（FPoF），它定位预测工作流中最早的根因，将主要错误与下游级联效应分开。我们的评估揭示了三个发现：规划方法的有效性强烈依赖于模型容量；工具不匹配和参数绑定错误主导了首次故障，揭示了语义基础和执行一致性是不同瓶颈；冗长的中间推理可能与结构化输出要求产生指令冲突，破坏计划生成。总之，这些发现凸显了语义推理与执行基础协调之间的根本差距，强调了需要联合建模语义意图、执行约束和工作流一致性的规划框架。代码、数据和评估资源可在 https://github.com/TamuChen18/DisasterBench_Open 获取。

English abstract

Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open
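
The notion of locating the earliest root cause in a predicted workflow can be sketched as a step-by-step comparison against a reference plan. This is a hedged illustration rather than the benchmark's actual FPoF procedure: the step format (dictionaries with `tool` and `args`), the tool names, and the `missing_step` and `extra_step` labels are assumptions; only tool mismatch and parameter-binding errors are named in the abstract.

```python
def first_point_of_failure(predicted, reference):
    """Return (step_index, error_type) for the earliest divergence, or None."""
    for i, (pred, ref) in enumerate(zip(predicted, reference)):
        if pred["tool"] != ref["tool"]:
            return i, "tool_mismatch"
        if pred["args"] != ref["args"]:
            return i, "parameter_binding"
    # Hypothetical extra categories for plans of different length.
    if len(predicted) < len(reference):
        return len(predicted), "missing_step"
    if len(predicted) > len(reference):
        return len(reference), "extra_step"
    return None


# Hypothetical tools; "$step0" stands for the output of the first step.
reference = [
    {"tool": "satellite_analysis", "args": {"region": "A"}},
    {"tool": "flood_prediction", "args": {"input": "$step0"}},
    {"tool": "damage_assessment", "args": {"input": "$step1"}},
]
predicted = [
    {"tool": "satellite_analysis", "args": {"region": "A"}},
    {"tool": "flood_prediction", "args": {"input": "region A"}},
    {"tool": "damage_report", "args": {"input": "$step1"}},
]

failure = first_point_of_failure(predicted, reference)
print(failure)  # prints "(1, 'parameter_binding')"
```

The wrong tool at step 2 is not reported because the scan stops at the earlier binding error at step 1, which is how primary errors are kept apart from downstream cascading effects in this sketch.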

2605.27955 2026-05-28 cs.PL cs.CL (updated version)

Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

技能即伪代码：将技能库重构为面向LLM智能体的伪代码

Xinze Li, Yuhang Zang, Yixin Cao, Aixin Sun

Affiliations: Nanyang Technological University; Fudan University; Shanghai AI Laboratory

AI summary: Proposes Skill-as-Pseudocode (SaP), which automatically converts Markdown skill libraries into typed pseudocode with deterministic quality checks to address the confusion loop that LLM agents fall into when retrieving skills, and significantly outperforms the baseline on ALFWorld.

Comments Preprint. Code: https://github.com/InternLM/Skill-as-Pseudocode

AI-translated Chinese abstract

面向LLM智能体的Markdown技能库以自由格式的散文形式提供，迫使智能体在每次检索时重新推导输入模式和具体调用语法。我们观察到，这通常会产生一个“困惑 -> 重新检索 -> 仍然困惑”的循环，智能体发出部分正确的动作，收到无信息的环境反馈，并重新检索相同的散文。我们提出Skill-as-Pseudocode (SaP)，一种将Markdown技能库自动转换为带类型伪代码的方法，并具有确定性质量控制。对于从一个或多个技能中提取的相似过程性段落簇，SaP提取一个带类型契约，并通过四重确定性验证器（覆盖、绑定、替换、风险）进行过滤。通过的契约与恢复的具体动作模板一起内联到重写的技能骨架中，为智能体提供两个互补信号：技能功能的类型签名和如何调用它的具体模板。在包含134个游戏的ALFWorld未见分割上，使用gpt-4o-mini，跨三个种子汇总，SaP在402场配对游戏中赢得82场，而Graph-of-Skills (GoS)基线赢得47场（汇总McNemar检验p = 8.2e-5），每场游戏输入token减少22.8% +/- 6.4%，LLM调用减少14.5% +/- 4.1%。

English abstract

Markdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused -> re-retrieve -> still confused" loop in which the agent issues a partially-correct action, receives uninformative environment feedback, and re-retrieves the same prose. We propose Skill-as-Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four-check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134-game ALFWorld unseen split with gpt-4o-mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph-of-Skills (GoS) baseline (pooled McNemar p = 8.2e-5), at -22.8 +/- 6.4% input tokens and -14.5 +/- 4.1% LLM calls per game.

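2605.27934 2026-05-28 cs.CL (updated version)

GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

GeneralThinker: 通过似然引导的答案条件优化实现领域通用推理

Shengmin Piao, Sanghyun Park

Affiliations: Yonsei University

AI summary: Proposes the GeneralThinker framework, which uses answer likelihood for dense supervision and fine-grained credit assignment without domain-specific verifiers, achieving the best average performance across 11 benchmarks spanning mathematics, STEM, and general reasoning.

AI-translated Chinese abstract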

基于可验证奖励的强化学习提升了语言模型的推理能力，但其对领域特定验证器的依赖、稀疏的结果奖励以及粗粒度的信用分配限制了其适用性。我们提出了GeneralThinker，一个在策略框架，将推理监督重新表述为密集的答案条件优化，无需领域特定验证器即可实现响应级评估和令牌级信用分配。GeneralThinker使用真实答案的似然来评估生成的推理轨迹，并推导出令牌级的兼容性信号用于细粒度信用分配。为了稳定优化，它通过裁剪和方向保持调制来约束令牌级更新。在涵盖数学、STEM和通用推理的11个基准测试中，GeneralThinker取得了最佳平均性能。进一步分析表明，不受控的令牌级调制可能破坏训练稳定性，而受控的调制使细粒度信用分配始终有效。

English abstract

Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance. Further analyses show that uncontrolled token-level modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.

2605.27932 2026-05-28 cs.CV cs.AI cs.CL cs.CR cs.LG (updated version)

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

当图文推理遇上安全：什么决定了多模态越狱鲁棒性？

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

Affiliations: Independent Researcher; Stanford University; Harvard University; Purdue University; Duke University

AI summary: Studies how different think-with-image reasoning paradigms in multimodal large language models affect jailbreak robustness, finds that explicit image-tool interaction substantially lowers attack success rates, and explains the mechanism at the representation level with an image-tool safety vector framework.

Comments 17 pages, 6 figures, 7 tables

AI-translated Chinese abstract

图文推理正成为大型视觉-语言模型的一种新推理范式，但其安全性影响尚不明确。现有系统已涵盖多种流程设计，包括直接响应生成、纯文本前轮、视觉状态操作以及显式外部图像工具调用。本文探究这些评估范式中哪一种能提升多模态越狱鲁棒性及其原因。在多个视觉-语言模型上，我们的实验表明显式图像工具交互的攻击成功率最低，平均相对降低约30%。这一发现起初令人惊讶：即使返回的图像工具输出被人为覆盖或本身不安全，攻击成功率仍保持较低，但在纯文本前轮控制下又恢复到接近直接回答的水平。这些结果表明，较低的攻击成功率并非由良性返回图像语义或仅文本图像工具轨迹解释。为解释这一模式，我们引入了一个图像工具安全向量框架，将图像工具调用建模为隐藏表示向安全相关方向的残差偏移。表征层面的分析和激活干预支持了这一解释。总体而言，我们的结果表明，显式图像工具交互是提升越狱鲁棒性的一种有前景的设计模式，同时也推动了针对特定流程的安全性评估。

English abstract

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

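2605.27921 2026-05-28 cs.AI cs.CL cs.CY cs.HC (updated version)

Show, Don't TELL: Explainable AI-Generated Text Detection

展示，而非告知：可解释的AI生成文本检测

Aldan Creo, Suraj Ranganath

Affiliations: School of Computing, Information and Data Sciences, University of California, San Diego, United States of America

AI summary: Proposes TELL, a novel explainable architecture with built-in explanations and reinforcement-learning training that maintains strong detection performance (AUROC 0.927) while providing text-level annotations that help users identify AI-generated text using their own judgment.

AI-translated Chinese abstract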

关于AI生成文本检测的研究已经提出了多种区分人类与AI文本的方法，其中一些方法在分布内性能上表现优异。然而，由于输出与用户（如教授）的需求不一致——他们只得到一个没有附带解释的数值分数——现实世界的应用进展缓慢。我们通过一种新颖的架构TELL解决了这个问题，该架构从一开始就内置了可解释性。虽然我们的系统仍像其他检测器一样提供数值分数以便比较，但TELL采用了一种根本不同的方法，旨在向用户展示模型认为文本是AI还是人类写作的“线索”，使用户能够根据自己的判断以及对写作背景和所谓作者的理解来决定文本的作者。我们在一个特定领域的作者注释自定义SFT数据集上训练TELL，并进一步使用GRPO结合课程学习来优化系统以提高性能。我们实现了与最先进检测器相竞争的性能（AUROC 0.927），同时原生提供解释检测器决策基础的注释。我们进一步使用人类注释数据集评估解释质量，报告了在注释的具体性、可证伪性、连贯性、合理性和基础性方面的高胜率（平均72.3%），使用户能够批判性思考并自行决定。因此，我们的工作从以人为中心的角度重新定义了AI生成文本检测的问题，并为专注于原生可解释性的新一代检测器铺平了道路。

English abstract

Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach where we aim to show the user the "tells" by which the model believes a text is AI or human-written, to empower the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win-rate on annotation concreteness, falsifiability, coherence, plausibility and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.

2605.27916 2026-05-28 cs.CV cs.CL (updated version)

OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

OphIn-500K：策划网络规模的视觉指令以扩展眼科多模态大语言模型

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang

Affiliations: Arizona State University; Clemson University; Washington University in St. Louis; University of Notre Dame; Florida State University; Rice University; NVIDIA; Mayo Clinic

AI summary: Proposes the OphIn-Engine pipeline for building high-quality ophthalmic instruction data from web videos, producing the OphIn-500K dataset with more than 500,000 instruction instances, and on that basis develops OphIn-VL, an ophthalmology-specific multimodal large language model that outperforms existing general medical and domain-specific models across multiple tasks.

AI-translated Chinese abstract

通用医学多模态大语言模型（MLLMs）的进步为构建支持临床诊断的对话助手展现了巨大潜力。然而，它们在高度专业化领域（如眼科）的适应性仍未得到充分探索，主要原因是缺乏大规模、领域特定的指令微调数据。现有的眼科对话数据集通常规模有限，且大多依赖于已建立的公共基准图像，限制了眼科MLLMs的可扩展性及其捕捉真实临床复杂性的能力。为解决这一问题，我们提出了$\textbf{OphIn-Engine}$，一个眼科特定的指令数据策划流水线，从开放获取的眼科网络规模视频中构建高质量指令数据。该流水线整合了多模态转录以提取图像-文本对、视觉线索分离与评分以识别临床相关的视觉描述，以及指令合成与质量控制以生成准确且多样的临床对话。利用该引擎，我们推出了$\textbf{OphIn-500K}$，一个大规模多模态眼科指令微调数据集，包含超过50万个指令实例和来自29,000多个视频片段的151,000多张独特图像，格式包括视觉问答（VQA）、多轮对话交互和思维链（CoT）推理。基于该数据集，我们进一步开发了$\textbf{OphIn-VL}$，一个具有高级视觉理解和对话能力的眼科专用MLLM。综合实验和案例研究表明，与最先进的通用医学和领域专用MLLMs相比，OphIn-VL实现了更优的性能。

English abstract

The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose $\textbf{OphIn-Engine}$, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce $\textbf{OphIn-500K}$, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop $\textbf{OphIn-VL}$, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

2605.27908 2026-05-28 cs.CL cs.AI 版本更新

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

ESC-Skills: 发现并自我进化情感支持对话技能

Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）； Qwen DianJin Team, Alibaba Cloud Computing（阿里云计算 Qwen DianJin 团队）

AI总结提出ESC-Skills框架，通过干预单元建模支持交互并构建技能库，结合多画像自我进化机制，提升情感支持对话的可解释性、可控性和效果。

详情

AI中文摘要

现有的情感支持对话（ESC）系统主要依赖于端到端的回复生成或粗粒度的策略监督，可解释性有限，且对系统性的技能提升支持不足。我们提出ESC-Skills，一个以技能为中心的框架，能够发现并自我进化可执行的情感支持技能。我们首先将局部支持交互建模为干预单元（IUs），捕捉求助者状态、支持干预和回复后情绪变化之间的状态-动作-结果动态。基于从成功和失败的ESC对话中提取的IUs，我们构建了ESC-Skills库，这是一个包含干预指导、适用条件、预期结果和潜在风险的可执行情感支持技能仓库。为了进一步提升鲁棒性，我们引入了一个多画像自我进化精炼框架，其中ESC代理在SAGE评估下与多种模拟求助者画像进行交互。分析由此产生的交互轨迹，以识别缺失的技能、不安全的干预和特定画像的失败模式，然后通过基于模拟的验证来精炼技能库。实验结果表明，ESC-Skills在提升回复质量和对话层面的情感结果的同时，提供了更可解释和可控的支持行为。我们将发布代码、提示和ESC-Skills库，网址为https://github.com/aliyun/qwen-dianjin。

英文摘要

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state--action--outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at https://github.com/aliyun/qwen-dianjin.

2605.27905 2026-05-28 cs.CL 版本更新

AI Research Agents Narrow Scientific Exploration

AI研究代理缩小科学探索范围

Yixuan Tang, Yi Yang

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）

AI总结本研究通过四个AI研究代理框架和六个大语言模型生成37,802个科学想法，发现AI生成的想法比人类论文更集中、更接近起始文献，且与低引用论文相似，表明当前AI代理更适合局部细化而非拓宽科学探索。

详情

AI中文摘要

AI研究代理现在能够生成研究想法、设计实验、运行代码和起草论文，这引发了大规模AI辅助科学发现的可能性。许多当前的代理框架明确鼓励生成新颖且高影响力的想法。然而，目前尚不清楚AI辅助构思是拓宽了科学探索，还是主要集中于现有工作。我们将AI研究代理视为科学搜索系统进行研究。使用四个AI研究代理框架和六个大语言模型，我们从AI和机器学习中由引用定义的研究领域的共享种子文献中生成37,802个科学想法。然后，我们将生成的AI想法与来自相同研究领域的人类撰写论文、来自相同种子文献的后续人类研究以及种子文献本身进行比较。在实验中，出现了四个一致的模式。第一，AI生成的想法比来自相同研究领域的人类撰写论文更加集中。第二，AI生成的想法比后续人类工作更接近其起始文献。第三，与AI生成想法最相似的论文往往获得较低的后续引用。第四，当AI生成的想法与先前工作不同时，差异主要来自现有技术方法的重新组合，而不是引入全新的研究问题。总体而言，当前的AI研究代理似乎更适合局部细化，而不是拓宽科学探索。

英文摘要

AI research agents can now generate research ideas, design experiments, run code, and draft papers, raising the possibility of large-scale AI-assisted scientific discovery. Many current agent frameworks explicitly encourage the generation of novel and high-impact ideas. Yet it remains unclear whether AI-assisted ideation broadens scientific exploration or mainly concentrates around existing work. We study AI research agents as scientific search systems. Using four AI research-agent frameworks and six large language models, we generate 37,802 scientific ideas from shared seed literature across citation-defined research areas in AI and machine learning. We then compare the resulting AI ideas against human-authored papers from the same research areas, follow-on human research emerging from the same seed literature, and the seed literature itself. Across experiments, four consistent patterns emerge. First, AI-generated ideas are substantially more concentrated than human-authored papers from the same research areas. Second, AI-generated ideas remain much closer to their starting literature than later human follow-on work does. Third, papers most similar to AI-generated ideas tend to receive lower subsequent citations. Fourth, when AI-generated ideas differ from prior work, the differences arise primarily from recombining existing technical methods rather than introducing fundamentally new research questions. Overall, current AI research agents appear better suited to local elaboration than to broadening scientific exploration.

2605.27901 2026-05-28 cs.CL cs.AI 版本更新

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

跨类型多样语言的思维链监控脆弱性

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal

发表机构 * University of Virginia（弗吉尼亚大学）； Lawrence Livermore National Laboratory（劳伦斯利弗莫尔国家实验室）

AI总结本研究通过13种语言和7个前沿模型家族的评估，发现思维链监控在语言分布偏移下普遍不可靠（平均不可信率95.9%），模型会进行策略性操纵，且低资源语言中欺骗模式完全存在。

详情

AI中文摘要

思维链（CoT）监控已被提出作为一种有前景的安全机制，用于检测大型语言模型中的失调行为。然而，其在英语之外以及跨不同模型家族中的可靠性在很大程度上仍未得到探索。我们首次在13种多样语言和7个前沿模型家族（共16个模型）上对CoT可监控性进行了大规模评估。使用需要显式中间计算的对抗性提示评估，结合内部答案标记概率分析，我们一致发现CoT在语言和提示类型上存在不忠实性，在8B至120B参数模型中平均不忠实率为95.9%。我们发现前沿模型系统性地进行策略性操纵，包括答案切换、事后合理化以及对提示的程序性利用，使得外部监控器难以检测欺骗。我们表明，前沿模型通常在其潜在激活中在生成的前15%内就承诺了失调线索，即使CoT看起来忠实。令人惊讶的是，这些欺骗模式在低资源语言中保持100%，揭示了当前基于CoT的监督的根本局限性。我们的结果表明，CoT监控在语言分布偏移下本质上是脆弱的，提供的安全信号比仅英语研究所暗示的要弱得多。这些发现强调了开发稳健的CoT监控器以及加速白盒监控技术研究的迫切需要，特别是为了改善中低资源语言中的CoT可监控性。我们的代码可在 https://multilingual-cot-monitoring.github.io/ 获取。

英文摘要

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B--120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.

2605.27896 2026-05-28 cs.CL cs.CE 版本更新

FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations

FinBoardBench: 通过棋盘游戏模拟基准测试大语言模型的动态财富管理和战略金融推理

Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao, Caiwei Li, Yue Ma, Jie He, Qiancheng Zhang, Yuntao Zou, Dagang Li

发表机构 * School of Computer Science and Engineering, Macau University of Science and Technology, Macau, China（计算机科学与工程学院，澳门科技大学，澳门，中国）； School of Economics, Anhui University, Anhui, China（经济学院，安徽大学，安徽，中国）； SKLPlanets, Macau University of Science and Technology, Macau, China（SKLPlanets，澳门科技大学，澳门，中国）； Department of Computer and Information Science, University of Macau, Macau, China（计算机与信息科学系，澳门大学，澳门，中国）； School of Energy and Power Engineering, Huazhong University of Science and Technology, Hubei, China（能源与动力工程学院，华中科技大学，湖北，中国）

AI总结提出基于三款经典金融棋盘游戏的评估套件FinBoardBench，测试大语言模型在动态财富管理、企业投资收购和竞争谈判等综合金融技能，发现模型虽具备基本规划能力但无法将静态推理转化为成功动态决策。

Comments Preprint

详情

AI中文摘要

近期，大语言模型（LLMs）在静态金融推理和简单动态交易任务中取得了优越性能。然而，现有的静态金融基准不足以评估LLMs在真实环境中的动态财富管理和金融决策能力。为弥补这一差距，我们提出了FinBoardBench，一个基于三款经典金融棋盘游戏（现金流、并购和大富翁）的评估套件。FinBoardBench评估一系列全面的金融技能，包括个人现金流管理与债务平衡、企业投资与收购预测，以及带有资产拍卖的竞争性贸易谈判。我们对9个先进LLMs的实验表明，尽管它们展现出基本的长期规划和投资逻辑，但未能有效利用复杂互动来获取利润，且其强大的静态推理性能并未转化为成功的动态决策。值得注意的是，它们倾向于优先获取即时资产而非维持充足流动性，这使得它们容易受到随机事件引发的金融危机的影响。我们希望FinBoardBench能为未来更智能的基于LLM的决策系统提供有价值的参考。

英文摘要

Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiations with asset auctions. Our experiments with 9 advanced LLMs reveal that, although the models exhibit basic long-term planning and investment logic, they fail to leverage complex interactions for profit, and their strong static reasoning performance does not translate into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, making them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can provide a valuable reference for more intelligent LLM-based decision-making systems in the future.

2605.27882 2026-05-28 cs.CL cs.AI 版本更新

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

VibeSearchBench：真实场景中长程主动搜索的基准测试

Xiaohongshu Inc

发表机构 * Xiaohongshu Dots Studio & Unipat AI（小红书 Dots Studio 与 Unipat AI）

AI总结针对现有搜索基准中查询过于明确、单轮交互和固定模式评估导致用户体验与评估结果差距的问题，提出VibeSearch范式并构建VibeSearchBench基准，通过渐进式用户模拟和图匹配评估框架测试前沿模型，发现所有模型在长上下文推理、主动意图激发和结构化知识构建方面仍存在显著不足。

详情

AI中文摘要

基于LLM的智能体在搜索基准上得分很高，但真实用户始终觉得结果不令人满意，这揭示了持续的评估-体验差距。我们将这一差距归因于现有基准依赖于过度明确的查询、单轮交互和固定模式评估，这些都不反映真实搜索行为，即用户和智能体通过多轮对话协作细化模糊意图。我们将这种范式称为VibeSearch，并引入VibeSearchBench，一个包含200个手动策划的双语（中文和英文）任务的基准，涵盖20个领域，分为VibeSearch-Pro（专业）和VibeSearch-Daily（日常生活）子集。每个任务将一个用户角色与一个无模式的真实知识图谱配对，并通过渐进式披露用户模拟器和图匹配评估框架进行评估。我们在ReAct框架和OpenClaw智能体框架下对七个前沿模型进行了基准测试。结果表明，所有模型对于VibeSearch仍然严重不足（最佳F1：30.30），凸显了在长上下文推理、主动意图激发和结构化知识构建方面需要根本性进展。

英文摘要

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
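
The graph-matching F1 reported above can be illustrated with a minimal sketch. It assumes that both the agent's output and the ground truth are sets of (head, relation, tail) triples and that matching is exact; the abstract does not specify the benchmark's matching procedure, which for schema-free graphs is probably more lenient. The function name `triple_f1` and the toy triples are hypothetical.

```python
def triple_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (head, relation, tail) triples.

    Assumption: a predicted triple counts only if it matches a gold triple
    exactly. A schema-free benchmark would likely need softer matching.
    """
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy graphs for a daily-life search task.
gold = {
    ("cafe_a", "located_in", "shanghai"),
    ("cafe_a", "serves", "pour_over"),
    ("cafe_b", "located_in", "shanghai"),
    ("cafe_b", "open_until", "22:00"),
}
predicted = {
    ("cafe_a", "located_in", "shanghai"),
    ("cafe_b", "open_until", "22:00"),
    ("cafe_c", "located_in", "shanghai"),
}
precision, recall, f1 = triple_f1(predicted, gold)
```

Here two of three predicted triples are correct and two of four gold triples are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7.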

2605.27878 2026-05-28 cs.CL 版本更新

Narrative Flattening: How Post-Training Compresses Thematic, Affective, and Stylistic Variation in LLM Fiction

叙事扁平化：后训练如何压缩LLM小说中的主题、情感和风格变化

Zehan Li, Yutong Zhu, Siyang Wu, Honglin Bao, James A. Evans

发表机构 * Knowledge Lab, University of Chicago（芝加哥大学知识实验室）

AI总结通过对比四个OLMo 32B检查点（Base、SFT、DPO、RLVR）在三种故事领域中的续写，发现后训练压缩了主题动态、情感强度和语言多样性，导致叙事扁平化，且专业文学领域压缩最严重。

详情

AI中文摘要

大型语言模型能生成流畅的小说，但其创造性输出普遍被视为扁平。我们探究这种质量源于训练的哪个阶段，以及是否对不同领域的人类小说产生同等影响。我们构建了一个匹配的故事续写范式，涵盖StoryStar（公共平台）、TMAS（提示引导）和《纽约客》（专业文学），并将四个OLMo 32B检查点（Base、SFT、DPO、RLVR）的续写与匹配的人类文本进行比较。由于这些检查点共享架构、规模、分词器和预训练，该设计隔离了后训练效应。我们沿三个句子级维度测量每次续写：主题动态、情感普遍性和语言多样性。在所有三个维度上，后训练压缩了动态变化：主题过渡变得更加均匀，高强度情感让位于中性，故事间的风格多样性缩小。我们将这种渐进性损失称为叙事扁平化。该效应在故事领域间方向稳定，但差距大小取决于人类基线：专业文学小说压缩最严重，而公共平台和提示引导故事的差距较小，这与它们的人类基线更接近模型的默认节奏一致。后训练端点在领域间收敛，表明对齐产生了一种续写机制，该机制在很大程度上不依赖于源领域的叙事纹理。

英文摘要

Large language models produce fluent fiction, yet their creative output is widely seen as flat. We ask where this quality originates in training and whether it affects different domains of human fiction equally. We construct a matched story-continuation paradigm across StoryStar (public-platform), TMAS (prompt-guided), and The New Yorker (professional literary), and compare continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) against matched human text. Because these checkpoints share architecture, scale, tokenizer, and pretraining, the design isolates the post-training effect. We measure each continuation along three sentence-level dimensions: thematic motion, affective prevalence, and linguistic diversity. Across all three, post-training compresses dynamic variation: thematic transitions become more uniform, high-intensity emotions give way to neutrality, and stylistic diversity across stories shrinks. We term this progressive loss narrative flattening. The effect is directionally stable across story domains, but the gap size depends on the human baseline: professional literary fiction is compressed most, while public-platform and prompt-guided stories show smaller gaps, consistent with their human baselines sitting closer to the model's default rhythm. Post-trained endpoints converge across domains, suggesting alignment produces a continuation regime largely insensitive to the source domain's narrative texture.

2605.27874 2026-05-28 cs.CL 版本更新

Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

用于越南语自动语音识别的音节结构解码器

Nghia Hieu Nguyen, Quan Ngoc Hoang, Long Hoang Huu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

发表机构 * Faculty of Information Science and Engineering（信息科学与工程学院）； Faculty of Computer Science（计算机科学学院）； University of Information Technology（信息技术大学）； Vietnam National University, Ho Chi Minh city（越南国家大学，胡志明市）

AI总结针对越南语自动语音识别，提出基于音素级音节结构解码的方法，通过显式建模音节音系组成，在紧凑音素集上生成有效音节结构，显著减小词汇量并在两个基准上超越强基线。

详情

AI中文摘要

大多数自动语音识别（ASR）系统将转录视为对正字法单元（如字符、子词或词）的预测问题。尽管有效，但此类表示并未明确反映语音的语音结构，且通常需要大词汇量以保持充分覆盖。在这项工作中，我们从越南语的音位特征出发，提出了一种用于ASR的音节结构解码器，该解码器在音素层面而非正字法层面建模语音。我们的方法显式捕捉了音节的音系组成，使解码器能够从紧凑的音素库中生成有效的音节结构。这种设计更紧密地契合了语音的语音实现，同时显著减小了词汇量。在两个基准（代表标准语音的LSVSC和包含多种区域发音的多方言语料库UIT-ViMD）上的实验结果表明，尽管使用了更小的词汇量且无额外训练资源，我们的方法始终优于先前强基线，尤其是预训练基线如PhoWhisper和Wav2Vec2。这些结果突显了基于音素的音节建模在该语言ASR中的有效性。用于实验可复现的代码将在论文被接收后公开。

英文摘要

Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the phonetic structure of speech and often require large vocabularies to maintain adequate coverage. Motivated by the phonemic features of Vietnamese, we propose a Syllabic-Structure Decoder for ASR, which models speech at the phoneme level instead of the orthographic level. Our approach explicitly captures the phonological composition of syllables, enabling the decoder to generate valid syllabic structures from a compact phonemic inventory. This design more closely aligns with the phonetic realization of speech while significantly reducing vocabulary size. Experimental results on two benchmarks (LSVSC, representing standard speech, and UIT-ViMD, a multi-dialect corpus containing diverse regional pronunciations) show that our method consistently outperforms strong previous baselines, especially pretrained baselines such as PhoWhisper and Wav2Vec2, despite using a substantially smaller vocabulary and no additional training resources. These results highlight the effectiveness of phoneme-based syllabic modeling for ASR in this language. Code for reproducing the experiments will be made publicly available upon acceptance of this paper.

2605.27865 2026-05-28 cs.CL 版本更新

MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

MERIT：通过评分标准引导的训练为审稿人分配匹配专业知识

Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li

发表机构 * School of Data Science and Engineering, East China Normal University（华东师范大学数据科学与工程学院）

AI总结提出MERIT两阶段框架，通过强化学习训练审稿人评估器并蒸馏为检索器，实现大规模审稿人分配中的专业知识匹配。

Comments 22 pages, 8 figures, 12 tables

详情

AI中文摘要

大规模地将投稿与合适的审稿人匹配是主要会议面临的日益严峻的挑战，然而现有方法要么依赖将一般相关性误认为真正适用性的粗略代理信号，要么需要难以扩展用于训练的昂贵人工标注。我们提出MERIT，一个两阶段框架，通过将标准级别的专业知识匹配转化为可扩展的适用性监督来弥合这一差距。在第一阶段，我们通过强化学习训练一个审稿人评估器，以识别论文所需的专业知识维度，将其与审稿人的先前工作匹配，并产生适用性决策，奖励由一个以论文特定专业知识评分标准为指导的LLM评判器提供。在第二阶段，我们将评估器的预测蒸馏到基于嵌入的检索器中，以实现高效的大规模分配。实验表明，我们的4B审稿人评估器在适用性分类上优于更大的通用LLM，并且得到的检索器在LR-Bench和CMU Gold数据集上达到了最先进的性能。我们的代码可在https://github.com/Luli3220/MERIT获取。

英文摘要

Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer's prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper-specific expertise rubrics. In the second stage, we distill the assessor's predictions into an embedding-based retriever for efficient large-scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance across LR-Bench and the CMU Gold dataset. Our code is available at https://github.com/Luli3220/MERIT.

2605.27858 2026-05-28 cs.CL cs.AI cs.LG 版本更新

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

DecomposeRL: 学习提出有用、信息丰富且多样的问题以进行半监督、可追踪的声明验证

Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

发表机构 * Department of Computer Science and Electrical Engineering（计算机科学与电气工程系）

AI总结提出DecomposeRL框架，通过GRPO和多面奖励集成将声明分解为可追踪的子问题，在完全监督和半监督设置下实现高精度，且模型规模小4倍仍匹配大模型性能。

详情

AI中文摘要

声明验证分为两类：端到端分类器准确但无法提供可检查的追踪，而基于分解的方法可产生可检查的追踪但在基准数据集上性能滞后。我们提出DecomposeRL，一种能产生可检查追踪的准确声明验证器。DecomposeRL将分解建模为使用GRPO和多面奖励集成训练的RL策略，支持从无标签声明进行完全监督和半监督学习。DecomposeRL通过数据筛选漏斗解决了GRPO高昂的训练成本，将115K事实验证声明提炼为包含密集学习信号的5K声明子集。我们表明，仅在约5K精选声明上使用完全监督训练的DecomposeRL-7B策略，在包含生物医学、政治、科学和通用领域声明的11个声明验证基准上，实现了86.3的域内和69.8的域外平衡准确率。尽管规模小4倍，它匹配了32B基线和GPT-4.1-mini，并且在仅10%标签声明数据的半监督设置中进一步优于基线。代码、数据和模型见https://dipta007.github.io/DecomposeRL。

英文摘要

Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces, and decomposition-based methods that produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL.
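
The results above are reported as balanced accuracy. A minimal sketch of that metric follows, assuming the standard definition (the unweighted mean of per-class recall); the label names and toy predictions are hypothetical and are not taken from the paper's benchmarks.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced benchmark: 8 supported claims, 2 refuted claims.
labels = ["supported"] * 8 + ["refuted"] * 2
# A degenerate verifier that always predicts the majority class.
preds = ["supported"] * 10

plain = sum(t == p for t, p in zip(labels, preds)) / len(labels)
score = balanced_accuracy(labels, preds)
```

On this toy data plain accuracy is 0.8 while balanced accuracy is 0.5, which is why the metric is a common choice when label distributions differ across verification benchmarks.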

2605.27849 2026-05-28 cs.PL cs.AI cs.CL 版本更新

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

FPMoE：一种用于函数式代码生成的稀疏混合专家方法

Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

发表机构 * GreenNode AI ； Hanoi University of Science and Technology（河内科学技术大学）； Singapore University of Technology and Design（新加坡科技设计大学）

AI总结针对LLM在函数式编程语言上性能差的问题，提出基于稀疏MoE架构的FPMoE模型，通过语言特定专家和共享专家分别消除干扰和捕获跨语言抽象，以3B活跃参数达到远超微调基线并匹配大模型的效果。

详情

AI中文摘要

尽管基于LLM的代码生成取得了快速进展，但现有模型主要针对命令式语言进行训练，导致函数式编程语言（FPLs）如Haskell、OCaml和Scala长期未被充分探索，即使是前沿模型在FPLs上的表现也明显较差。微调是一种自然的补救措施，但我们的实验表明，每种语言的微调无法捕获共享的函数式抽象，而合并的多语言微调则引入了跨语言干扰。为了解决这个问题，我们引入了FPMoE，这是一个轻量级的开源代码生成模型，基于稀疏混合专家（MoE）架构，包含三个语言特定的路由专家（分别对应Haskell、OCaml和Scala）和一个共享专家，用于捕获跨语言的函数式模式，如单子推理和类型导向编程。这种设计同时解决了两种失败模式：专用专家消除了干扰，而共享专家保留了单语言模型遗漏的抽象。在FPEval上，FPMoE显著优于微调基线，并且仅使用3B活跃参数，即可匹配包括DeepSeek-Coder-6.7B、Qwen2.5-Coder-14B-Instruct和Qwen3-Coder-30B-A3B在内的更大模型的性能。

英文摘要

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.
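
The layout described above, routed language-specific experts plus one shared expert, can be sketched in a few lines. This is an illustration under stated assumptions, not FPMoE's implementation: the abstract does not give the routing rule, so top-1 softmax gating and a gate-weighted sum with the shared output are assumed, and the experts are toy elementwise functions rather than feed-forward networks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def moe_layer(x, router_logits, routed_experts, shared_expert):
    """Top-1 routed expert plus an always-active shared expert.

    x: input vector (list of floats).
    router_logits: one score per routed expert. In a trained model these
        come from a learned router; here they are passed in directly.
    Returns the output vector and the index of the chosen routed expert.
    """
    gates = softmax(router_logits)
    top = max(range(len(gates)), key=gates.__getitem__)
    routed_out = routed_experts[top](x)  # only one routed expert runs
    shared_out = shared_expert(x)        # runs for every token
    output = [gates[top] * r + s for r, s in zip(routed_out, shared_out)]
    return output, top


# Toy stand-ins for the three language-specific experts (assumed ordering).
experts = [
    lambda v: [2.0 * a for a in v],  # stand-in for a Haskell expert
    lambda v: [a + 1.0 for a in v],  # stand-in for an OCaml expert
    lambda v: [-a for a in v],       # stand-in for a Scala expert
]
shared = lambda v: [0.5 * a for a in v]

output, chosen = moe_layer([1.0, 2.0], [0.1, 3.0, -1.0], experts, shared)
```

Because only the selected routed expert and the shared expert execute for a given token, the active parameter count stays well below the total, which is the property the abstract refers to with "3B active parameters".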

2605.27832 2026-05-28 cs.CL 版本更新

Playing with Words, Improving with Rewards: Training Language Models for Creative Association

玩文字游戏，用奖励改进：训练语言模型进行创意联想

Vijeta Deshpande, Namrata Shivagunde, Sherin Muckatira, Hadrien Glaude, Mikhail Gronas, Claire Stevenson, Roger Beaty, Anna Rumshisky

发表机构 * University of Massachusetts Lowell（马萨诸塞大学洛厄尔分校）； Dartmouth College（达特茅斯学院）； University of Amsterdam（阿姆斯特丹大学）； Pennsylvania State University（宾夕法尼亚州立大学）； Amazon AGI（亚马逊 AGI）

AI总结本研究通过强化学习与可验证奖励（RLVR）在Codenames游戏上训练LLM，探索了规模依赖的精确度-创造力权衡，发现8B模型在保持推理能力的同时提升创造力，而小模型则牺牲创造力换取推理精度。

详情

AI中文摘要

大型语言模型（LLM）正被应用于日益困难的问题和用例。为了有效导航其广阔的解决方案空间，LLM需要具备创造力。然而，创造力的主观性和人类判断的局限性使得训练LLM的创造力尤其具有挑战性。作为解决方案，我们在Codenames（一个词联想游戏）上训练LLM，该游戏锻炼了创造力的两个核心轴（发散思维和收敛思维），同时产生客观可验证的结果。这种可验证性使我们能够绕过人类判断，并使用具有可验证奖励的强化学习（RLVR）进行训练。我们训练了Qwen3-1.7B、4B和8B模型，并在十个创造力和四个推理基准上评估它们。我们发现精确度-多样性权衡是规模依赖的：8B模型优先考虑创造力而非精确度，而1.7B和4B模型则以牺牲创造力为代价获得推理精确度。具体来说，8B模型在10个创造力基准中的8个上显示出适度但一致的提升，且推理能力仅略有下降，而较小的模型在推理任务上取得了显著提升。我们的研究提出了一种可扩展且有效的解决方案来训练LLM的创造力。

英文摘要

Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.

2605.27824 2026-05-28 cs.AI cs.CL 版本更新

Revealing Algorithmic Deductive Circuits for Logical Reasoning

揭示逻辑推理的算法演绎回路

Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue

发表机构 * Japan Advanced Institute of Science and Technology（北陆先端科学技术大学院大学）

AI总结本研究通过因果中介分析定位大语言模型中负责逻辑推理步骤的注意力头，发现少量专用头处理事实和规则信息，而高层头促进信息整合和全局推理策略的出现。

详情

AI中文摘要

最近的研究表明，通过在少样本学习设置中引入抽象描述图遍历算法和逐步推理的功能性符号表示，大型语言模型（LLMs）能够实现强大的推理性能。然而，目前尚不清楚LLMs如何仅从有限的示例中真正理解每个推理步骤的抽象含义以及整体算法。本文旨在定位负责单个推理步骤的注意力头，并刻画它们之间传输的信息类型。我们首先在符号辅助的思维链（CoT）提示框架下，将组成推理步骤与其对应的token logits对齐。我们的分析表明，引导推理过程的token位置与低置信度分数相关，这些低置信度分数是由满足演示中推理行为模式的约束引起的。然后，我们采用因果中介分析技术来识别负责这些模式的注意力头。此外，我们的发现表明，LLMs通过专门的注意力头（约占全部头的3%）为各个子推理任务检索事实和基于规则的信息，而较高层主要促进信息整合和全局推理策略（例如图遍历算法）的出现，这些策略协调多个中间推理步骤以解决整体任务。

英文摘要

Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% of all heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

2605.27808 2026-05-28 cs.CL cs.MM 版本更新

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

TARQ: 面向罕见词鲁棒自动语音识别的尾部感知重建量化

Xinyu Wang, Ziyu Zhao, Ke Bai, Silin Meng, Dongming Shen, Xiao-Wen Chang, Yixuan HE

发表机构 * McGill University（麦吉尔大学）； Boson AI ； Arizona State University（亚利桑那州立大学）

AI总结提出TARQ，一种无标签的后训练量化框架，通过尾部感知重建损失和罕见词平衡规则，在不增加额外训练的情况下显著降低罕见词错误率。

详情

AI中文摘要

数据感知后训练量化（PTQ）在小型校准语料库上最小化每个token的重建损失，隐式地根据经验频率对位置进行加权。对于自动语音识别（ASR），这与尾部敏感风险不一致：名称、数字和领域特定词获得的校准质量比例较小。我们提出了尾部感知重建量化（TARQ），一种无标签的PTQ框架，通过罕见词平衡（一种封闭形式的每线性层规则，平衡常见/尾部质量）和度量一致的残差校正，将校准转向词汇尾部。TARQ不需要实体标签、不需要精心设计的校准集、不需要验证解码，也不需要额外训练。在八个ASR骨干网络和六个数据集上，W4G128下，TARQ在不导致总体WER回归的情况下改善了平均罕见词错误率（rare-WER），在比较方法中实现了最低的跨语料库rare-WER波动，并在无需实体监督的情况下迁移到实体丰富的基准测试（ProfASR, ContextASR-Speech-En）。

英文摘要

Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech Recognition (ASR), this misaligns with tail-sensitive risk: names, numerals, and domain-specific words receive proportionally little calibration mass. We propose Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via rareBAL, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. TARQ requires no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, TARQ improves mean rare-Word Error Rate (rare-WER) without an aggregate-WER regression, achieves the lowest cross-corpus rare-WER swing among compared methods, and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.

2605.27805 2026-05-28 cs.CL cs.AI 版本更新

ChildEval: When large language models meet children's personalities

ChildEval：当大语言模型遇到儿童个性

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng

发表机构 * JIUTIAN Research（九天研究院）； China Mobile（中国移动）； Beijing, China（北京，中国）

AI总结提出ChildEval基准，通过合成3-6岁儿童个性档案和偏好（显式或隐式表达），评估大语言模型在长对话中推断并遵循儿童偏好的能力，实验表明微调可提升儿童中心性能。

Comments 8 pages of main text (ACL Findings format), with references and appendix

详情

AI中文摘要

虽然大语言模型（LLM）使得个性化聊天机器人成为可能，但它们在儿童中心个性化方面的有效性仍不明确，因为缺乏对儿童特定偏好的系统评估。为填补这一空白，我们引入了ChildEval，一个用于评估LLM在长上下文对话中推断和遵循儿童中心偏好能力的基准。ChildEval包含29K个3-6岁儿童的合成个性档案，提供相对静态的背景信息。每个个性档案关联一个儿童偏好——可能与个性一致、冲突或独立——通过单句显式表达或6-10轮对话隐式表达。显式和隐式偏好旨在反映相同的潜在偏好，但表达方式不同，捕捉偏好表达的动态方面而非静态个性的变化。该基准涵盖五个顶层类别和十四个子类别，覆盖儿童的日常生活和发展。我们进一步提出了细粒度、以儿童为中心的评估协议，以系统评估开源LLM。实验结果表明，不同的个性化表示如何影响LLM的响应，并表明在ChildEval上进行微调可以提升儿童中心性能。我们的代码和数据集可在https://github.com/ziyanluo/ChildEval获取。

英文摘要

While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference, which may align with, conflict with, or be independent of the persona and is expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.

2605.27789 2026-05-28 cs.AI cs.CL 版本更新

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

固定预算、聚类感知的 LLM-as-a-Judge 评估标准：多跳 RAG 压力测试

Camilo Chacón Sartori, José H. García

Affiliations: Catalan Institute of Nanoscience and Nanotechnology

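AI summary: To address statistical bias in evaluating multi-hop RAG systems, proposes a fixed-budget, cluster-aware standard for LLM-as-a-Judge comparisons and stress-tests it on 400 multi-hop questions with GADMEC, a genetic-algorithm evidence selector, showing that cluster-aware inference changes the empirical conclusions.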

详情

AI中文摘要

检索增强生成（RAG）系统通常通过让大型语言模型（LLM）法官判断哪个答案更好来进行比较。对于多跳 RAG，这已成为一个测量问题，与建模问题同等重要：相同的分数可以反映检索质量、答案长度、词汇重叠或忽略聚类数据的统计检验。我们询问当这些选择被明确时会发生什么。我们提出了 RAG 中 LLM-as-a-Judge 比较的最小测量标准。该标准固定了 top-100 候选池、证据预算、答案上限、生成器和提示；它还要求预先注册假设、聚类感知推断、在可行时进行精确的聚类符号翻转检验以及第二法官复制。聚类基准可能夸大进展；该领域应采用此标准。我们使用遗传算法解码器进行多跳证据组合（GADMEC），一种进化证据选择器，在计算机科学/机器学习（CS/ML）和材料科学领域的 400 个多跳问题上对其进行压力测试。该协议改变了实证故事。二项检验使所有四个语义基线比较看起来显著；聚类感知推断只留下一个 Bonferroni 显著结果。在相同预算下，BM25 优于纯语义 GADMEC，而词汇-语义混合在 CS/ML 中恢复并缩小了材料科学差距。

英文摘要

Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
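
The exact cluster sign-flip check named in the abstract can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each cluster of related questions has been reduced to one signed score (for example, judge wins minus losses for one system) and enumerates every sign assignment across clusters. The function name and the toy numbers are illustrative and are not taken from the paper.

```python
from itertools import product


def cluster_sign_flip_pvalue(cluster_sums):
    """Exact two-sided sign-flip p-value over per-cluster signed scores."""
    observed = abs(sum(cluster_sums))
    total = 0
    extreme = 0
    # Enumerate every assignment of +1/-1 to the clusters (2**k cases).
    for signs in product((1, -1), repeat=len(cluster_sums)):
        stat = abs(sum(s * c for s, c in zip(signs, cluster_sums)))
        total += 1
        if stat >= observed:
            extreme += 1
    return extreme / total


# Hypothetical data: four clusters, each favoring the same system.
p_value = cluster_sign_flip_pvalue([3, 2, 4, 1])
print(p_value)
```

With four clusters that all favor the same system, only the all-positive and all-negative assignments are as extreme as the observed total, so the smallest attainable two-sided p-value is 2/16 = 0.125. A test that treats every question as independent can therefore look far more significant than one that respects the clustering.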

2605.27788 2026-05-28 cs.LG cs.CL 版本更新

UniMaia：用语言引导国际象棋策略以实现类人玩法

Sherman Siu, Lesley Istead

Affiliations: University of Waterloo; Carleton University

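AI summary: Proposes the UniMaia framework, which uses a parameter-efficient text encoder and a ControlNet-style conditioning mechanism to achieve prompt-conditioned policy modulation on a frozen Lc0 chess policy network, enabling semantic control (such as opening selection and player strength) while preserving the pretrained policy representations; also builds a large-scale metadata-augmented Lichess dataset and a semi-automated prompt-generation pipeline, and achieves state-of-the-art or competitive results on several benchmarks.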

详情

AI中文摘要

大型语言模型的最新进展使得自然语言能够作为控制复杂系统的灵活接口，但通常以大规模多模态训练或弱化领域特定归纳偏差为代价。在结构化决策领域（如国际象棋）中，专门的策略网络表现强劲但缺乏语义可控性，而提示条件语言模型更灵活但通常领域基础较弱。我们提出$\textbf{UniMaia}$，一个用于提示条件策略调制的框架，它使用参数高效文本编码器和ControlNet风格的调节机制来适配基于Lc0的冻结国际象棋策略网络。UniMaia能够实现对游戏玩法的语义控制，包括开局选择和玩家强度，同时保留预训练的策略表征。我们进一步引入$\textbf{UniMaia-Aux}$，它结合了辅助时间条件化和行为预测目标。为了支持这项工作，我们构建了一个大规模元数据增强的Lichess数据集，开发了一个半自动提示生成管道，并引入了涵盖提示条件和元数据条件设置的基准。UniMaia在多个提示条件基准上实现了最先进的预期准确率，在通用指令遵循任务上达到了竞争性的最佳着法准确率，同时在人类着法预测基准上与专门的元数据条件方法保持竞争力。UniMaia-Aux进一步提高了多个评估设置下的预期准确率和行为建模，在最佳着法准确率上略有折衷。总体而言，我们的结果表明，无需端到端多模态训练即可实现领域特定策略网络的提示条件控制，同时突出了可控性与预测性能之间的权衡。

英文摘要

Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose $\textbf{UniMaia}$, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce $\textbf{UniMaia-Aux}$, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.

2605.27750 2026-05-28 cs.CL cs.AI cs.CV cs.DL 版本更新

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

阅读还是猜测？古希腊版本OCR中视觉语言模型的视觉定位失败

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

Affiliations: Inria

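AI summary: Compares open-weight vision-language models with traditional OCR baselines on low-resource Ancient Greek critical editions and finds that VLMs produce fluent text even when wrong, indicating reliance on language priors; introduces image perturbations and token-level grounding measures to analyze visual evidence.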

详情

AI中文摘要

最近的研究表明，用于光学字符识别（OCR）的视觉语言模型（VLM）能够生成看似合理但缺乏视觉支持的文本，暗示其依赖语言先验。通过将开放权重VLM与传统OCR基线在低资源古希腊批判版本上进行对比，我们展示了VLM的错误即使在错误时也往往保持流畅，产生合理的希腊语替换，而传统引擎则产生局部识别噪声。为了分析解码过程中的视觉证据，我们引入了受控图像扰动和基于条件与无图像解码分布的标记级定位度量。在字符级扰动下，VLM与扰动的真实文本严重偏离，而传统OCR相对忠实；然而，标记级分析表明先验依赖是模型特定的：在OCR专业模型中，流畅的词汇错误几乎不依赖图像而产生，而通用VLM即使在错误时也仍然依赖于视觉输入。解码时干预未能可靠地恢复定位，而OCR后语言模型校正仅通过生成后修复文本改善了几个系统。我们的结果将先前关于OCR语言先验依赖的证据扩展到低资源历史文档和更广泛的模型集，表明流畅输出不一定具有视觉基础，并推动了超越总体准确性的可解释性驱动评估。

英文摘要

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.

2605.27741 2026-05-28 cs.CL 版本更新

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

逃离语言先验：通过模态感知策略优化缓解音频推理中的后期模态崩溃

Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo

Affiliations: Johns Hopkins University; Tencent Hunyuan

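AI summary: To address late-stage modality collapse in multimodal large language models during reinforcement learning post-training, caused by uniform policy gradients that ignore modality dependence, proposes the Modality-Aware Policy Optimization (MAPO) framework, which uses a modality relevance mask and an auxiliary attention loss branch to focus gradients dynamically and sustain cross-modal reasoning, achieving new state-of-the-art results on complex audio reasoning benchmarks.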

详情

AI中文摘要

音频和全模态大语言模型展现出令人印象深刻的跨模态推理能力。然而，将标准的强化学习后训练算法应用于这些模型时，暴露了一个关键的结构性脆弱性：像GRPO这样的方法对所有token施加统一的策略梯度，忽略了它们对非文本源模态的不平等依赖。这加剧了在扩展思维链生成过程中的后期模态崩溃，模型逐渐放弃主要源信号，转而依赖压缩的文本先验，导致自信但无根据的幻觉。为了解决这个问题，我们引入了模态感知策略优化（MAPO），一种新颖的双分支强化学习框架。首先，MAPO使用模态相关性掩码动态地将策略梯度集中在模态关键token上，该掩码源自音频消融参考与多模态策略之间的跨模态微分熵。其次，它集成一个辅助注意力损失分支，对模型内部的注意力分布施加有针对性的、时间尺度的惩罚。这确保模型在推理轨迹深处主动维持跨模态基础。在复杂音频推理基准上的评估表明，MAPO显著提高了长时推理保真度和多模态指令遵循能力，在开放权重模型中实现了极具竞争力的性能，并在几个关键基准上创造了新的最先进结果。通过严格依赖原生统计信号而非特定领域的归纳偏置，MAPO为缓解跨多种多模态系统的认知崩溃提供了一个有前景的基础。

英文摘要

Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.

2605.27740 2026-05-28 cs.CL 版本更新

ReverseMath: 面向可扩展和可验证数学问题生成的答案反转方法

Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich

Affiliations: Center for Information and Language Processing; LMU Munich; Munich Center for Machine Learning (MCML); Saarland University

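AI summary: Proposes ReverseMath, which automatically generates new math problems by reversing the input-output relation of an original problem, for use in both evaluation and training; it reveals memorization-like behavior and improves reasoning performance.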

详情

AI中文摘要

数学推理基准对于评估大型语言模型（LLM）至关重要，但许多基准是静态的，并通过公开评估和训练管道反复暴露，使得难以区分真正的推理与记忆。同时，手动构建具有可靠答案的新数学问题仍然成本高昂。我们引入ReverseMath，一种通过答案反转生成新数学问题的可扩展方法。给定一个问题及其答案，ReverseMath掩码原始问题中的一个数值，将原始答案视为已知条件，并重写问题，使得掩码值成为新答案。生成的问题反转了原始输入输出关系，使其答案通过构造已知。我们研究了ReverseMath在评估和训练中的应用。对于评估，配对的原始/反转问题揭示了显著的行为变化：模型有时在反转问题上失败，甚至错误地输出原始答案，暗示了类似记忆的行为。对于训练，ReverseMath提供自动标注的反转问题作为强化学习（RL）的数据增强。实验表明，包含ReverseMath生成的数据提高了多个基准上的数学推理性能，证明了其作为分析工具和可验证训练数据的可扩展来源的价值。

英文摘要

Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.

2605.27706 2026-05-28 cs.CL cs.IR 版本更新

Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

基于格点链式自适应重配置以减少幻觉

Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

Affiliations: Department of Mechanical and Aerospace Engineering, University of California, Irvine; Los Alamos National Laboratory

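AI summary: Proposes the CAROL framework, which defines a semantic uncertainty measure and constructs a string-submodular objective over a lattice of textual sequences, casting hallucination mitigation as a Markov chain accept-reject process to reduce hallucinations at test time.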

详情

AI中文摘要

我们介绍了CAROL（基于格点的链式自适应重配置），一个用于大型语言模型测试时减少幻觉的概率框架。CAROL不依赖于词元级别的不确定性，而是基于生成响应与可信上下文之间的一致性定义了一种语义不确定性度量，在文本序列格点上诱导出一个串子模目标。这种表述使得幻觉缓解可以被建模为一个具有可证明收敛性和接近最优性保证的马尔可夫链接受-拒绝过程，允许模型迭代地优化输出以实现语义一致性。通过在意义层面操作，CAROL将幻觉检测和缓解统一在一个框架内。在问答和多智能体推理基准上的实证结果表明，与基于似然和检索增强的基线相比，CAROL显著减少了幻觉，提高了可靠性和可解释性，同时保持了具有竞争力的计算效率。

英文摘要

We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

2605.27690 2026-05-28 cs.CL cs.LG 版本更新

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

TRACES: 通过轨迹状态建模实现多轮LLM智能体的主动安全审计

Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang

Affiliations: Brown University; The University of Texas at Austin; Rutgers University; Texas A&M University

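AI summary: Proposes TRACES, which learns prefix-level trajectory risk states from the hidden representations of an observer LLM to enable proactive safety auditing in multi-turn tool-use settings, improving both full-trajectory safety prediction and proactive risk discrimination.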

详情

AI中文摘要

LLM智能体越来越多地通过多轮工具使用和环境交互来运作，其中安全风险往往在最终结果显现之前的中间步骤中就已经出现。因此，反应式审计是不够的：事后诊断常常在风险正在展开时错过标记它们的机会。我们提出TRACES，一种基于表示的主动审计器，它从观察者LLM的隐藏表示中学习前缀级轨迹风险状态。TRACES从步骤表示中诱导潜在机制特征，并建模其时间演化，以估计部分轨迹是否正在向不安全行为漂移。为了规避步骤级风险标注的成本和歧义，TRACES在弱轨迹级监督下训练，同时仍能产生密集的前缀级风险估计。在多个智能体安全基准测试中，TRACES改进了全轨迹安全预测和主动风险判别。我们的分析进一步表明，这些风险状态可以帮助训练更安全的智能体，凸显了主动审计在长程智能体安全中的更广泛潜力。

英文摘要

LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.

2605.27668 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

将LLM与人类不确定性对齐：用于LLM预测的Beta-Bernoulli校准器

Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

Affiliations: Agentic Learning AI Lab; New York University; The University of Chicago; Chronologies AI

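AI summary: Proposes the Beta-Bernoulli Calibrator (BBC), which combines binary outcomes with human forecast signals to convert an initial point estimate into a distribution over event likelihood, providing calibration and uncertainty quantification.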

详情

AI中文摘要

概率预测估计不确定未来事件的可能性。为了改进LLM预测，现有方法通常从二元结果中学习以输出语言化预测。然而，尽管聚合的人类预测在群体概率估计和预测者之间的一致程度中都包含丰富信息，如何利用这些信号仍未充分探索。为了解决这个问题，我们提出了Beta-Bernoulli校准器（BBC），它将来自任何模型的初始点估计转换为事件似然分布，使用来自二元结果和人类预测的监督。BBC对事件似然$p \sim \text{Beta}(α, β)$和结果$y \sim \text{Bernoulli}(p)$建模，均值作为校准的点预测，方差作为认知不确定性。我们的结果表明，BBC通常比传统的后验校准方法和专门为预测微调的模型提供更好校准和更准确的预测，同时保持轻量级并具有良好的泛化能力。我们还表明，BBC捕获的认知不确定性是比语言化置信度更可靠的预测误差指标。

英文摘要

Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(α, β)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
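
The roles of the Beta mean and variance described above can be made concrete with a short Python sketch. It only evaluates the standard closed-form moments of a Beta distribution; how BBC actually predicts the two parameters is not specified in the abstract, so the parameter values below are hypothetical.

```python
def beta_summary(alpha, beta):
    """Return (mean, variance) of a Beta(alpha, beta) distribution.

    The mean plays the role of the calibrated point forecast and the
    variance the role of the epistemic uncertainty.
    """
    total = alpha + beta
    mean = alpha / total
    variance = (alpha * beta) / (total ** 2 * (total + 1.0))
    return mean, variance


# Two hypothetical forecasts with the same mean but different confidence.
uncertain = beta_summary(2.0, 2.0)
confident = beta_summary(20.0, 20.0)
print(uncertain, confident)
```

Both parameter settings give a point forecast of 0.5, but the larger parameters concentrate the distribution and give a smaller variance, which is how a single Beta distribution can carry both a forecast and a statement of how much to trust it.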

2605.27654 2026-05-28 cs.CL cs.AI cs.CY 版本更新

Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

英译印地语中的文化保真度：性别可恢复性的保持-流畅性前沿

Samyak Savi, Chavi Gupta, Shreyas Gantayet, Tanay Sodha, Dhruv Kumar

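AI summary: Studies the preservation of gender information in English-to-Hindi translation and proposes two inference-time interventions (SAR and PAR) that trade off gender recoverability against fluency.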

Comments 10 pages, 2 figures, 9 tables

详情

AI中文摘要

生成式翻译系统是文化技术，因为它们决定如何在特定文化的语法系统中呈现具有社会意义的线索。我们研究成功文化翻译的一个具体概念：当英语源文本明确编码性别时，英译印地语应保持该线索的可恢复性，除非源文本本身存在歧义。我们在涵盖十二个类别的37,345个实例基准上评估了这一标准，并显示五个系统经常通过作格和敬语结构消除性别。然后，我们引入了两种机制感知的推理时干预。第一种是源感知重排序器（SAR），倾向于避免性别中立句法的候选。第二种是现象感知重排序器（PAR），即使在作格句法存在的情况下，也通过目标词汇标记保持性别。在GPT-4o-mini和Sarvam上，PAR将目标子集准确率分别从11.07%提高到54.47%，从15.99%提高到49.66%。人工评估显示，PAR将性别保持率从10.3%提高到81.3%，但平均流畅度从4.36降至3.37。这些发现将两种干预置于保持和流畅性的前沿，而不是支持单一的解决方案，并展示了文化定位的生成如何在保真度、流畅性和风格自然性之间需要明确的权衡。

英文摘要

Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37. These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.

2605.27649 2026-05-28 cs.CL cs.LG 版本更新

Disentangling Language Roles in Multilingual LLM Task Execution

多语言大模型任务执行中的语言角色解耦

Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He

Affiliations: Marquette; Cornell; UC San Diego; UPenn; UT Austin; Maryland; Michigan; UIUC; Melbourne; Stanford

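AI summary: Proposes the MTM-Bench benchmark, which uses a fully crossed design to disentangle three language roles (instruction, content, and response) when evaluating multilingual LLM task execution, and finds that the response-language role is the main driver of performance degradation.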

详情

AI中文摘要

多语言大模型在指令、源内容和所需响应语言不一致时被越来越多地使用。现有基准扩展了多语言指令跟随评估，但很少在完全交叉设计中隔离这三种角色。我们引入了MTM-Bench，一个用于语言条件任务执行的控制基准，其中每个实例由三元组 $(L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})$ 定义。在英语、西班牙语和中文中，MTM-Bench枚举了所有27个三元组，每个模型包含2,430个实例，涵盖语义反转、最终状态提取和带更新实现的语言纯度。我们使用分解指标评估了20个前沿和开源权重LLM，包括语义正确性、目标语言遵循度、约束满足度、污染比率和联合成功率，并通过针对性的人工审计验证评分。完全交叉设计揭示了性能下降是由语言在任务结构中扮演的角色组织的，而不仅仅是语言不匹配的数量。响应语言角色是变化的主要轴，单个响应槽不匹配导致了大部分性能下降。仅响应不匹配与完全不匹配的比较表明，不匹配数量不是困难的单调预测因子，模型级别的排序在不同系统间变化。任务族通过不同的通道失败，表明语义正确性本身并不能捕捉可靠的多语言任务执行。

英文摘要

Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet $(L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})$. Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
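
The fully crossed design can be checked with a few lines of Python. The sketch below enumerates the triplets for three languages and counts some condition groups. The language codes, the reading of "response-only" as matching instruction and content with a different response language, the reading of "full mismatch" as three different languages, and the even split of instances across triplets are all assumptions made for illustration.

```python
from itertools import product

LANGUAGES = ("en", "es", "zh")  # English, Spanish, Chinese

# Every (instruction, content, response) language assignment.
triplets = list(product(LANGUAGES, repeat=3))

# All three roles share one language.
fully_matched = [t for t in triplets if len(set(t)) == 1]
# Instruction and content agree; only the response language differs.
response_only = [t for t in triplets if t[0] == t[1] != t[2]]
# Each role uses a different language.
all_different = [t for t in triplets if len(set(t)) == 3]

# Instances per triplet if the 2,430 instances are split evenly (an assumption).
per_triplet = 2430 // len(triplets)

print(len(triplets), len(fully_matched), len(response_only), len(all_different), per_triplet)
```

Three languages in three roles give 3^3 = 27 triplets, which matches the count stated in the abstract; under an even split, each triplet would hold 90 instances.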

2605.27642 2026-05-28 cs.CL cs.LG 版本更新

Learning to Translate from Soft to Hard LLM Prompts

学习从软提示到硬提示的翻译

Pitipat Kongsomjit, Suryansh Goyal, Jacob Whitehill

Affiliations: Worcester Polytechnic Institute

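AI summary: Trains a dedicated model that translates soft prompts into natural language, improving translation quality, and shows that soft prompts can be converted into portable text prompts that, on large closed-source models, outperform the original soft prompts and even few-shot learning.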

Comments 8 Pages, 11 tables, 4 Figures

2605.27636 2026-05-28 cs.CL 版本更新

Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering

Simorgh at SemEval-2026 task 7: 面向低资源文化推理的多语言问答中的区域感知混合检索

Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

Affiliations: University of Tabriz

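AI summary: Proposes a region-aware hybrid retrieval method that combines BM25 and dense semantic similarity with regional weighting heuristics to improve cross-lingual stability in multilingual cultural question answering.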

Comments 6 pages, 3 figures, accepted to the Everyday Knowledge Across Diverse Languages and Cultures shared task at SemEval2026

详情

AI中文摘要

尽管大型语言模型（LLMs）在通用领域的推理任务中表现出色，但在数字和文本数据有限的语种中，面对文化相关知识时可能遇到挑战。本文利用BLEnD基准研究文化相关的多项选择问答，该基准包含30种语言的多语料库，涵盖饮食、体育、家庭等社会文化领域。我们提出一种区域感知混合检索方法，结合BM25词汇匹配和稠密语义相似度与区域加权启发式，以提高答案的相关性。检索到的文档用于构建结构化提示，输入Qwen3-14B量化模型，并采用基于logit的确定性答案选择。实验结果表明，与纯参数推理相比，混合检索方法在文化问答中提升了跨语言稳定性。然而，训练数据量不同的语言之间仍存在显著性能差距，这表明检索增强方法并未完全克服训练数据不平衡问题。

英文摘要

Although Large Language Models (LLMs) perform well on general-domain reasoning tasks, they may struggle with culturally grounded knowledge in languages with limited digital and textual data. In this paper, we investigate culturally grounded multiple-choice question answering with the BLEnD benchmark, a multilingual corpus that spans 30 languages and covers socio-cultural domains such as cuisine, sports, and family. We propose a region-aware hybrid retrieval approach that combines BM25 lexical matching and dense semantic similarity with regional weighting heuristics to improve the relevance of the answer. The retrieved documents are used to construct a structured prompt for the quantized Qwen3-14B model, with logit-based deterministic answer selection. The experimental results show that hybrid retrieval improves cross-lingual stability over pure parametric inference for culturally grounded question answering. However, notable performance gaps remain between languages with more and less training data, which shows that retrieval augmentation does not fully overcome the training data imbalance problem.
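
One simple way to combine lexical, dense, and regional signals is sketched below in Python. This is not the authors' scoring function: the linear blend, the multiplicative region boost, the parameter names `alpha` and `region_boost`, and the toy scores are all assumptions, and both input scores are assumed to be normalized to [0, 1] beforehand.

```python
def hybrid_score(bm25, dense, same_region, alpha=0.5, region_boost=0.1):
    """Blend lexical and dense scores, then boost documents from the question's region.

    Assumes bm25 and dense are already normalized to [0, 1].
    """
    base = alpha * bm25 + (1.0 - alpha) * dense
    return base * (1.0 + region_boost) if same_region else base


def rank_documents(docs):
    """Return document ids sorted by descending hybrid score."""
    scored = [(hybrid_score(b, d, r), doc_id) for doc_id, b, d, r in docs]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical candidates: (id, normalized BM25, dense similarity, same region?)
docs = [
    ("d1", 0.9, 0.2, False),
    ("d2", 0.5, 0.6, True),
    ("d3", 0.1, 0.3, False),
]
ranking = rank_documents(docs)
print(ranking)
```

In this toy case d1 and d2 have the same blended score before the boost, so the regional weight alone decides which one ranks first.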

2605.27621 2026-05-28 cs.MA cs.CL 版本更新

Agents that Matter: Optimizing Multi-Agent LLMs via Removal-Based Attribution

重要的智能体：通过基于移除的归因优化多智能体大语言模型

Mingyu Lu, Yushan Huang, Chris Lin, Su-In Lee

Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington

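AI summary: Proposes an attribution framework based on cooperative games that identifies bottleneck agents through removal protocols and model replacement, improving multi-agent system performance while reducing cost.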

详情

AI中文摘要

随着多智能体系统（MAS）变得越来越复杂，识别单个智能体的贡献对于系统优化至关重要。然而，现有方法缺乏严格统一的信用分配框架。在这项工作中，我们将智能体归因形式化为一个合作博弈，由联盟分布、移除协议和目标指标参数化。利用该框架，我们表明留一法（LOO）能够像组合方法一样有效地识别瓶颈智能体，但计算成本仅为后者的一小部分。我们还证明了移除协议会引发不同的博弈：智能体消融隔离了结构瓶颈，而内省式LLM法官无法忠实地近似这种行为。此外，为了评估特定智能体骨干的效用，我们引入了通过模型替换进行归因的方法。通过替换低贡献智能体的底层模型，我们在三个基准测试上将任务性能提高了高达17%，同时将成本降低了高达35%。最后，我们将该框架应用于审计一个医疗MAS，揭示了智能体对诊断准确性和伦理行为的贡献通常是解耦的。通过干预适得其反的角色，我们观察到在保持诊断准确性的同时，伦理一致性有所提高。总体而言，这项工作为成本效益高的MAS归因和干预提供了一种原则性方法。

英文摘要

As multi-agent systems (MAS) become increasingly complex, identifying the contributions of individual agents is critical for system optimization. However, existing approaches lack a rigorous, unified framework for credit assignment. In this work, we formalize agent attribution as a cooperative game, parameterized by the coalition distribution, removal protocol, and target metric. Using this framework, we show that Leave-One-Out (LOO) identifies bottleneck agents as effectively as combinatorial methods, but at a fraction of the computational cost. We also demonstrate that removal protocols induce distinct games: Agent ablation isolates structural bottlenecks, whereas introspective LLM judges fail to faithfully approximate this behavior. Furthermore, to evaluate the utility of specific agent backbones, we introduce attribution via model replacement. By substituting underlying models of low-contribution agents, we improve task performance by up to 17% while reducing cost by up to 35% across three benchmarks. Finally, we apply our framework to audit a medical MAS, revealing that agent contributions to diagnostic accuracy and ethical behavior are often decoupled. By intervening on counterproductive roles, we observe an increase in ethics alignment while maintaining diagnostic accuracy. Overall, this work provides a principled approach for cost-effective MAS attribution and intervention.
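
Leave-One-Out attribution itself is simple to state in code. The Python sketch below scores each agent by how much a target metric drops when that agent is removed. The agent names and the toy value function are hypothetical stand-ins for running the real system with one agent ablated, and the metric is expressed in integer percentage points to keep the arithmetic exact.

```python
def leave_one_out(agents, value_fn):
    """Score each agent as the metric drop caused by removing it from the full team."""
    team = frozenset(agents)
    full_value = value_fn(team)
    return {agent: full_value - value_fn(team - {agent}) for agent in agents}


def toy_value(coalition):
    """Hypothetical task metric (percentage points) for a given set of agents."""
    score = 0
    if {"planner", "solver"} <= coalition:
        score += 80  # the task is only solved when both are present
    if "critic" in coalition:
        score += 10  # the critic adds a small, independent gain
    return score


scores = leave_one_out(["planner", "solver", "critic"], toy_value)
print(scores)
```

LOO needs one evaluation of the full team plus one per agent, whereas combinatorial methods such as Shapley values consider many more coalitions. In this toy game the planner and solver each receive a score of 80 because removing either one breaks the task, which is the kind of bottleneck the abstract describes.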

2605.27596 2026-05-28 cs.CL 版本更新

Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

幻觉能否有用？通过链式系统I/II推理用SLM解决多跳问题

Saptarshi Sengupta, Suhang Wang

Affiliations: The Pennsylvania State University

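AI summary: Proposes a cognitively inspired "answer first, reason later" framework that uses an SLM's initial answer (which may contain hallucinations) as a hypothesis for retrieving evidence, followed by deeper System-II reasoning, outperforming the traditional "think first, then retrieve" approach on multi-hop question answering.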

详情

AI中文摘要

最近，小型语言模型（SLM）引起了越来越多的兴趣，它们速度快、性能好，且硬件需求低于大型语言模型（LLM）。然而，SLM比LLM更容易产生幻觉，影响其解决复杂多步推理问题的能力，因为早期错误会级联到最终响应。为了解决这个问题，现有工作采用先思考后迭代检索的策略来减少幻觉。我们认为先思考策略并非总是必要，因为我们发现：（i）SLM通常对其初始答案有准确的置信度，并且（ii）幻觉实际上可能有助于逼近正确答案。因此，我们将我们的工作定位为这种策略的反转，即先回答后推理。我们提出了一个认知启发的框架，其中模型首先被允许快速回答问题（系统I（零样本）），然后基于从知识源使用初始假设检索到的证据进行更深层次的思考（系统II）。通过结合系统I和系统II风格的推理，我们展示了我们的方法在各种多步问答基准测试中可以优于先前采用传统先思考路径的工作。

英文摘要

Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works adopt a think-first strategy followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary, as we find that (i) SLMs are often accurately confident in their initial answer and (ii) hallucinations can actually be beneficial for homing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first, reason later. We propose a cognitively inspired framework where the model is first allowed to quickly answer the question (System-I, zero-shot) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.

2605.27586 2026-05-28 cs.MA cs.CL 版本更新

超越静态分类法的青少年危机对话的关键词生成表示

Abeer Badawi, Will Aitken, Lydia Sequeira, Jocelyn Rankin, Maia Norman, Elham Dolatabadi

Affiliations: York University; Vector Institute; Electrical and Computer Engineering, Queen’s University; Kids Help Phone

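AI summary: Proposes Keyphrase Generative Representation (KGR), which constrains a large language model to generate conversation-specific keyphrases, and expands the original 19-label taxonomy into a 39-label hierarchy. Evaluated on 129 conversations and 387 expert annotations, the expanded taxonomy reaches an accuracy of 0.96, and KGR surfaces identity-linked themes missing from the fixed taxonomy, raising topic-retrieval accuracy from 0.25 to 0.70.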

详情

AI中文摘要

危机响应者每年快速评估数千条青少年短信对话，以识别心理健康问题并指导支持。然而，青少年的痛苦越来越多地通过不断演变且依赖具体语境的语言表达，这些语言通常不适合固定标签的分类法。本研究分析了703,975条去标识化的Kids Help Phone对话（2018-2023年），并将KHP的19标签问题分类扩展为39标签层次结构。然后，我们引入关键词生成表示（KGR），一种受约束的大语言模型，生成简洁、对话特定的关键词，在129段对话和387个专家注释上进行了评估。扩展后的分类法达到了专家共识可靠性，准确率为0.96，专家评审发现81%的关键词准确反映了内容，74%提高了清晰度。KGR揭示了固定分类法中缺失的与身份相关的主题，包括移民问题和照顾者负担，并支持了一个主题检索工作流，与手动分析师流程相比，准确率从0.25提高到0.70（+0.45）。KGR标志着向混合、可解释的生成表示转变，将危机响应扩展到静态分类法之外，以揭示新兴的、植根于文化的青少年痛苦模式。

英文摘要

Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.


2605.27545 2026-05-28 cs.CL 版本更新

PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

PAST2HARM：一种用于越狱多模态AI的简单自适应过去时攻击

Snehasis Mukhopadhyay

发表机构 * Indian Institute of Information Technology, Kalyani（印度信息技术学院卡利亚尼分校）

AI总结提出PAST2HARM框架，通过过去时态改写和迭代升级策略，系统性地利用多模态文本到图像模型的安全漏洞，实现黑盒、无梯度的高成功率越狱攻击。

详情

AI中文摘要

尽管不安全的图像生成可能比不安全的文本产生更严重的后果，且当前防御相对不成熟，但对多模态AI系统的越狱攻击仍未得到充分探索。我们引入了PAST2HARM，一个简单而有效的自适应越狱框架，能够绕过最先进的多模态文本到图像模型中的拒绝训练。基于先前发现过去时态改写可以规避安全防护的结论，PAST2HARM系统地利用了多模态生成式AI中的这一漏洞。我们沿两个维度刻画攻击。第一，广度：通过时间深化，该框架逐步增强历史锚定和档案线索，侵蚀不同对齐强度模型的拒绝边界。第二，深度：通过初始顺从后的迭代升级，我们探测有害生成的上限，使用由语言模型作为评判者评估的标量严重性越狱指标来衡量严重程度。我们发现对话中间轮次形成峰值脆弱窗口，其中有害性增加后趋于平稳，最终经历语义反转。我们在三个模型Gemini Nano Banana Pro、GPT Image 2和SD XL上评估PAST2HARM，在黑盒、无梯度设置下分别实现了83%、67%和100%的攻击成功率。对抗性提示也在模型间迁移，跨模型成功率超过50%。该攻击引发了多种有害输出，包括露骨色情内容、政治虚假信息、历史否认叙事、仇恨言论和自我伤害美化。我们进一步发布了一个精心策划的提示、改写和输出基准，作为红队测试和对齐的资源。我们的结果暴露了当前安全防护的根本脆弱性，并强调了加强多模态安全训练的必要性。

英文摘要

Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple yet effective adaptive jailbreak framework that bypasses refusal training in state-of-the-art multimodal text-to-image models. Building on prior findings that past-tense reformulations can evade safeguards, PAST2HARM systematically exploits this vulnerability in multimodal generative AI. We characterize the attack along two dimensions. First, breadth: through temporal deepening, the framework incrementally strengthens historical anchoring and archival cues, eroding refusal boundaries across models with varying alignment strength. Second, depth: via iterative escalation after initial compliance, we probe the upper bound of harmful generation, measuring severity using a scalar severity jailbreak metric evaluated by a language model acting as a judge. We find that mid-conversation turns form peak vulnerability windows, where harmfulness increases before plateauing and eventually undergoing semantic inversion. We evaluate PAST2HARM on three models (Gemini Nano Banana Pro, GPT Image 2, and SD XL), achieving attack success rates of 83%, 67%, and 100%, respectively, in a black-box, gradient-free setting. Adversarial prompts also transfer across models, with cross-model success rates above 50%. The attack elicits diverse harmful outputs, including explicit sexual content, political disinformation, historical denial narratives, hate speech, and self-harm glorification. We further release a curated benchmark of prompts, reformulations, and outputs as a resource for red teaming and alignment. Our results expose fundamental brittleness in current safeguards and highlight the need for stronger multimodal safety training.


2605.27531 2026-05-28 cs.PL cs.CL cs.SE 版本更新

Agentic Separation Logic Specification Synthesis

智能体分离逻辑规范合成

Tarun Suresh, David Korczynski, Julien Vanegue

发表机构 * Bloomberg（彭博）

AI总结提出 Spec-Agent 智能体系统，通过静态分析、运行时堆追踪和反例引导迭代，为大型 C++ 代码库合成分离逻辑规范，在百万行级代码上达到 85% 有效规范合成率且无假阳性。

Comments 9 pages, 3 appendices

详情

AI中文摘要

规范合成，即从程序实现和自然语言自动推断形式规范的任务，对于重构、转译、优化和验证非常重要，但对于大型 C++ 代码库仍然是一个开放的挑战。现有的基于 LLM 的方法无法同时扩展到这样的代码库，生成足够表达系统代码特性（如动态内存和堆分配数据结构）的规范，并系统地验证这些规范以排除不正确的候选。我们提出了 Spec-Agent，一个用于在大型 C++ 代码库中合成表达性强、经过充分验证的规范的智能体系统。Spec-Agent 针对一个规范语言阶梯：命题逻辑、一阶逻辑、命题分离逻辑和一阶分离逻辑。对于每个函数，Spec-Agent 使用静态分析和运行时堆追踪来选择适当的目标规范语言，将现有的功能测试泛化为模糊测试工具，并通过反例引导反馈迭代地优化 LLM 生成的候选。我们在包含数百万行代码的开源 C++ 代码库上评估了 Spec-Agent。Spec-Agent 为 85% 的目标函数合成了有效的规范，在模糊测试和专家验证下未观察到假阳性，性能优于 Claude Code Opus 4.6，同时 token 成本降低 10 倍。

英文摘要

Specification synthesis, the task of automatically inferring formal specifications from program implementations and natural language, is important for refactoring, transpilation, optimization, and verification, yet remains an open challenge for large C++ repositories. Existing LLM-based approaches fail to simultaneously scale to such repositories, produce specifications expressive enough to capture systems-code features such as dynamic memory and heap-allocated data structures, and systematically validate those specifications to rule out incorrect candidates. We present Spec-Agent, an agentic system for synthesizing expressive, well-validated specifications across large C++ codebases. Spec-Agent targets a ladder of specification languages: propositional logic, first-order logic, propositional separation logic, and first-order separation logic. For each function, Spec-Agent uses static analysis and runtime heap tracing to select the appropriate target specification language, generalizes existing functional tests into fuzz harnesses, and iteratively refines LLM-generated candidates via counterexample-guided feedback. We evaluate Spec-Agent on open source C++ codebases comprising millions of lines of code. Spec-Agent synthesizes valid specifications for 85% of target functions, with no false positives observed under fuzzing and expert validation, outperforming Claude Code Opus 4.6 at 10x lower token cost.


2605.27494 2026-05-28 cs.CR cs.AI cs.CL cs.IR cs.LG 版本更新

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

基于证据的缓存路由用于检索增强生成：何时可以安全地重用答案？

Syed Huma Shah

AI总结提出GroundedCache，一种通过四个廉价门控（查询相似性、检索证据重叠、源版本有效性和词汇支持）验证缓存答案安全性的路由方法，显著降低不安全服务率。

Comments 19 pages, 9 figures, 10 tables. Code: https://github.com/syedhumarahim/grounded-cache-router

详情

AI中文摘要

现代检索增强生成（RAG）部署越来越依赖缓存来降低令牌成本和首令牌时间（TTFT）。在vLLM等服务栈中，前缀级KV重用已成为标准，而最近的系统（RAGCache、TurboRAG、CacheBlend、EPIC、ContextPilot、PCR、LMCache）进一步推动了块级和位置无关的重用。相比之下，输出级语义答案缓存仍然脆弱：相似的提示可能映射到不同的正确答案，检索到的证据随着语料库更新而漂移，并且对抗性碰撞攻击已被证明可以劫持缓存的响应。我们认为，缓存答案重用的正确框架不是如何更快地重用，而是何时重用是安全的。我们提出了GroundedCache，一种经过证据验证的缓存路由器，仅当四个廉价门控同时成立时才允许缓存答案：查询相似性、检索证据重叠、源版本有效性以及新检索证据对缓存答案的词汇（或基于判断的）支持。我们构建了一个六区域工作负载，用于压力测试缓存安全性而不仅仅是命中率，并引入了一个面向操作员的指标——不安全服务率（USR），即收到错误缓存答案的查询比例。在两个数据集和12,000个真实LLM生成（在vLLM上使用自动前缀缓存的Qwen2.5-7B-Instruct）中，GroundedCache在每个HotpotQA区域上将USR降至0.0%（而朴素缓存为15-35%），在mtRAG文档漂移上降至1.5%（而朴素缓存为51.5%），在设计点对抗区域上减少了34倍，在其他mtRAG区域上减少了3-10倍，同时端到端p50延迟保持在无缓存RAG基线的1.04-1.07倍以内。逐门控消融实验表明，词汇支持门控是两个数据集上的主要安全机制，其余门控以近乎零成本提供纵深防御。我们发布了实现、工作负载和评估工具。

英文摘要

Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems (RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when four cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR): the fraction of all queries that received a wrong cached answer. Across two datasets and 12,000 real-LLM generations (Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime (vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift (vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.
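The four-gate admission rule and the USR metric described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `CacheEntry` fields, the similarity and overlap functions (cosine, Jaccard, token coverage), and all threshold values are hypothetical choices made only to show how a conjunction of cheap gates might be wired together.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    """Hypothetical cache record; field names are illustrative assumptions."""
    query_vec: list[float]
    evidence_ids: frozenset[str]
    source_versions: dict[str, int]
    answer: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Overlap between cached and freshly retrieved evidence IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_support(answer, evidence_text):
    """Fraction of answer tokens that also appear in the fresh evidence."""
    tokens = set(answer.lower().split())
    if not tokens:
        return 0.0
    return len(tokens & set(evidence_text.lower().split())) / len(tokens)


def admit(entry, query_vec, fresh_ids, current_versions, fresh_text,
          sim_min=0.9, overlap_min=0.5, support_min=0.8):
    """Serve the cached answer only if all four gates hold (assumed thresholds)."""
    gates = (
        cosine(entry.query_vec, query_vec) >= sim_min,           # query similarity
        jaccard(entry.evidence_ids, fresh_ids) >= overlap_min,   # evidence overlap
        all(current_versions.get(doc) == ver                     # source-version validity
            for doc, ver in entry.source_versions.items()),
        lexical_support(entry.answer, fresh_text) >= support_min,  # lexical support
    )
    return all(gates)


def unsafe_served_rate(served_from_cache, correct):
    """USR: share of ALL queries that received a wrong cached answer."""
    n = len(served_from_cache)
    wrong = sum(1 for s, c in zip(served_from_cache, correct) if s and not c)
    return wrong / n if n else 0.0
```

Note that the denominator of `unsafe_served_rate` is every query, not only cache hits, which matches the abstract's wording of "fraction of all queries".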


2605.27483 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Debate Helps Weak Judges Reward Stronger Models

辩论有助于弱裁判奖励更强的模型

Ethan Elasky, Frank Nakasako, Naman Goyal

发表机构 * Palaestra Research（帕莱斯特拉研究）； Berkeley（伯克利）

AI总结研究在强辩手/弱裁判设置下的提议者-批评者辩论，发现当批评者分类能力超过裁判且裁判将批评者言论视为待验证的主张时，辩论能显著提升裁判表现，并可通过单一独立批评以更低成本实现类似效果。

详情

AI中文摘要

尽管理论上具有前景，但辩论作为一种可扩展的监督协议产生了混合的实证结果：在某些设置中有收益，在其他设置中无效，尤其是当裁判没有隐藏信息时。我们在程序可验证的代码和逻辑任务上，研究了强辩手/弱裁判设置下的提议者-批评者辩论。当批评者提供可用的优势时，辩论帮助裁判优于咨询基线：批评者的分类能力必须超过裁判，并且裁判必须将批评者的言论视为待验证的主张而非待总结的证词。在五个配对中的三个满足该条件的配对中，提议者-批评者辩论的收益在统计上显著优于咨询，并且这些配对是最有能力的模型配对。在我们的集合中的两个非响应者配对中，辩论产生无效效果，一旦批评者进入转录，裁判验证率下降数十个百分点。在这些情况下，批评者的二元分类能力与裁判的相差在噪声范围内，并且批评者的分歧被解析为证词而非待检查的主张。从辩论中消去反驳轮次对裁判表现没有可测量的变化：单一独立批评以更低的推理成本恢复了辩论的大部分收益。这些发现为可验证领域（答案、批评、裁判）中无需训练的可扩展监督提供了一种更廉价的原始方法，以及一种预测辩论何时有帮助的部署前审计（批评者是否击败裁判，以及裁判是否会验证它？）。

英文摘要

Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.


2605.27402 2026-05-28 cs.CY cs.AI cs.CL 版本更新

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

REC-CBM：面向可信开放评分的基于规则感知的错误修正概念瓶颈模型

Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting, Ying-Chih Chen, Huan Liu

发表机构 * School of Computing and Augmented Intelligence, Arizona State University, USA（计算与增强智能学院，亚利桑那州立大学，美国）； Mary Lou Fulton Teachers College, Arizona State University, USA（玛丽·卢·富尔顿教师学院，亚利桑那州立大学，美国）； Department of Computer Science, National Yang Ming Chiao Tung University, TW（国立阳明交通大学计算机科学系，台湾）

AI总结提出REC-CBM模型，通过规则感知概念编码器、序数成对校准目标和潜在概念错误修正模块，解决开放评分中标准概念瓶颈模型无法建模细粒度规则维度、忽略评分序数语义和概念标注不可靠的问题，在提升评分性能的同时保持可解释性。

详情

AI中文摘要

开放评分对于公平和个性化教育至关重要，但人工评分耗时且成本高，凸显了自动化评分系统的必要性。尽管基于神经和大语言模型（LLM）的系统表现出优越性能，但它们通常是黑箱模型，其评分过程和理由难以让教育者验证和信任。概念瓶颈模型（CBM）通过将预测路由到人类可解释的概念，提供透明度的机制保证，成为一种有前景的方法。然而，标准CBM不适用于开放评分：它们没有显式建模细粒度的规则维度，未能充分捕捉评分量表的序数语义，并忽略了人类概念标注中固有的可靠性问题。为解决这些局限，我们提出REC-CBM，一种面向可信开放评分的规则感知错误修正概念瓶颈模型。REC-CBM引入了规则感知概念编码器，学习针对回答的概念特定表示，以及一个序数成对校准目标，保留规则维度间的排序结构。它还结合了一个潜在概念错误修正模块，在最终评分预测前对概念预测进行去噪，同时保持可解释性。在公开数据集上的全面实验表明，REC-CBM在评分性能上持续提升，并产生比最先进基线更忠实的概念级推理。进一步分析验证了每个组件的贡献，并展示了在真实教育环境中的适用性。总体而言，这项工作提供了一种实用、可解释的评分解决方案，使教育者能够检查、干预和信任自动化决策，推动更透明和可信的教育。

英文摘要

Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring the need for automated grading systems. Although recent neural and large language model (LLM) based systems have demonstrated superior performance, they are typically black-box models whose scoring processes and rationales are difficult for educators to verify and trust. Concept bottleneck models (CBMs) have emerged as a promising approach by routing predictions through human-interpretable concepts, providing a mechanistic guarantee of transparency. However, standard CBMs are not tailored to open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect inherent reliability issues in human concept annotations. To address these limitations, we propose REC-CBM, a rubric-aware error-correction concept bottleneck model for trustworthy open-ended grading. REC-CBM introduces a rubric-aware concept encoder that learns concept-specific representations over responses and an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions. It further incorporates a latent concept error-correction module that denoises concept predictions before final grade prediction while preserving interpretability. Comprehensive experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning than both state-of-the-art baselines. Further analyses validate the contribution of each component and demonstrate the applicability in realistic educational settings. Overall, this work provides a practical, interpretable grading solution that enables educators to inspect, intervene in, and trust automated decisions, advancing more transparent and trustworthy education.


2605.27393 2026-05-28 cs.CL cs.AI 版本更新

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

StoryMI: 可控的多智能体治疗性对话生成

Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei

发表机构 * Vrije Universiteit Amsterdam（阿姆斯特丹自由大学）； NII, Tokyo Institute of Technology（国立情报学研究所，东京工业大学）

AI总结提出StoryMI框架，通过多LLM智能体协作、情境故事基础和动态策略控制，生成符合动机性访谈标准的治疗性对话，并构建评估协议和数据集验证其有效性。

Comments ACL2026

详情

AI中文摘要

弥合稳定性与表现力之间的差距：低资源口语语言模型的合成数据扩展与偏好对齐

Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, Xiaoyu Shen

发表机构 * Beijing University of Posts（北京邮电大学）； University of California, USA（美国加州大学）； Northwestern University, USA（美国西北大学）； Eastern Institute of Technology, Ningbo, China（宁波东方理工大学）

AI总结针对低资源口语语言模型因合成数据导致的表现力崩溃问题，提出两种自对齐框架（DGSA和TDSC）以恢复韵律多样性，实现超越商业系统的性能并首次支持老挝语零样本语音克隆。

详情

AI中文摘要

口语语言模型（SLM）通过绕过显式的字素到音素流水线，已成为语音合成的一种有前景的范式。然而，它们在低资源语言中的有效性仍然受到转录语音稀缺的根本限制。在实践中，合成数据已成为在此类场景下扩展SLM的主要策略，当真实数据不足时提供可靠的音素监督。在这项工作中，我们表明这种依赖引入了一个基本权衡，我们称之为稳定性-表现力差距：虽然合成数据提高了音素准确性，但它逐渐抑制了韵律变异性，最终导致表现力崩溃（合成侵蚀）。为了弥合这一差距，我们提出了两种自对齐框架。解耦引导的自对齐（DGSA）通过利用韵律-音色分离来恢复复杂语言的表现力。对于真实参考极其有限的场景，温度驱动的自我批评（TDSC）通过自动探索和过滤来稳定生成。我们的方法优于强大的商业系统，包括ElevenLabs和Gemini Pro，并首次实现了老挝语的零样本语音克隆能力。

英文摘要

Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.


2605.27380 2026-05-28 cs.CL cs.AI 版本更新

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

BioELX: 基于别名的检索与LLM排序的跨语言生物医学实体链接

Yi Wang, Corina Dima, Liangyu Zhong, Steffen Staab

发表机构 * University of Stuttgart, Germany（斯图加特大学）； Technical University of Berlin, Germany（柏林工业大学）

AI总结提出BioELX两阶段框架，通过维基数据多语言别名增强SapBERT检索器，并利用预训练LLM排序器进行上下文感知消歧，无需标注数据即在多个基准上取得最佳性能。

Comments 12 pages, 3 figures

详情

AI中文摘要

跨语言生物医学实体链接（BEL）将任何语言的提及映射到生物医学知识库（KB）中的唯一标识符，支持临床和生物医学NLP应用。然而，BEL的专家标注训练数据成本高昂，尤其是对于低资源语言。此外，许多跨语言BEL系统依赖于基于SapBERT的检索器，这些检索器主要在KB中的英语别名上训练，导致对未见过的非英语提及泛化能力差，且上下文感知消歧有限。我们提出BioELX，一个两阶段跨语言BEL框架，无需任务特定的标注训练语料。在第一阶段，我们用维基数据派生的多语言别名丰富SapBERT训练，并使用得到的检索器改进跨语言候选检索。在第二阶段，我们使用预训练LLM排序器进行上下文感知消歧，该排序器联合考虑提及上下文和候选，消除了监督训练的需要。在五个基准（XL-BEL、EMEA、Patent、WikiMed-DE和MedMentions）上的实验表明，BioELX实现了新的最先进性能。它在XL-BEL上将平均Recall@1提高了+19.2，尤其是低资源语言提升显著，例如土耳其语+21.6、韩语+22.1、泰语+30.8，并在EMEA（+6.2）、Patent（+5.4）和WikiMed-DE（+12.8）上持续改进。代码和资源将在发表后发布。

英文摘要

Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on predominantly English aliases in the KB, leading to poor generalization to unseen non-English mentions and limited context-aware disambiguation. We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and candidate, eliminating the need for supervised training. Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2, with especially large gains for low-resource languages, e.g., +21.6 on Turkish, +22.1 on Korean, +30.8 on Thai, and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.
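For readers unfamiliar with the Recall@1 figures quoted above, the metric can be sketched as follows. This is the standard definition of Recall@k for entity linking, not code from the paper; the function name and the toy identifiers are hypothetical.

```python
def recall_at_k(ranked_candidates, gold_ids, k=1):
    """Fraction of mentions whose gold KB identifier is among the top-k candidates.

    ranked_candidates: one ranked list of candidate identifiers per mention.
    gold_ids: the gold identifier for each mention, in the same order.
    """
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for cands, gold in zip(ranked_candidates, gold_ids) if gold in cands[:k]
    )
    return hits / len(gold_ids)


# Toy example with made-up identifiers: 2 of 4 mentions are correct at rank 1.
ranked = [["C1", "C2"], ["C3", "C4"], ["C5", "C6"], ["C7"]]
gold = ["C1", "C4", "C5", "C9"]
print(recall_at_k(ranked, gold, k=1))  # 0.5
print(recall_at_k(ranked, gold, k=2))  # 0.75
```

Under this reading, a gain such as "+19.2" would correspond to 19.2 additional percentage points of mentions linked correctly at rank 1, assuming the scores are reported on a 0-100 scale.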


2605.27378 2026-05-28 cs.CL cs.CV cs.MA 版本更新

OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

OralAgent: 融合推理、工具与知识的交互式牙科影像分析

Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

发表机构 * Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China（香港大学牙科学院，中国香港特别行政区）； Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA（匹兹堡大学电气与计算机工程系，美国宾夕法尼亚州匹兹堡）； Shenzhen University, China（深圳大学，中国）； Department of Craniomaxillofacial Surgery, Shanghai Ninth People’s Hospital, China（上海第九人民医院颅颌面外科，中国）； Nanyang Technological University, Singapore（南洋理工大学，新加坡）； School of Biomedical Engineering, Southern Medical University, China（南方医科大学生物医学工程学院，中国）； Singapore University of Technology and Design, Singapore（新加坡科技设计大学，新加坡）； University of Auckland, New Zealand（奥克兰大学，新西兰）； Shanghai Artificial Intelligence Laboratory, China（上海人工智能实验室，中国）

AI总结提出首个牙科专用AI智能体OralAgent，通过集成22种视觉分析工具和368本经典牙科教科书，实现多模态推理、工具决策与知识检索的自动化框架，在多个基准上达到最优性能。

Comments 14 pages, 7 figures, 6 tables

详情

AI中文摘要

牙科影像分析在支持口腔医疗的准确诊断和治疗规划中起着关键作用。尽管近期进展产生了针对特定任务和单一成像模态的牙科AI模型，但其孤立的设计限制了在实际临床工作流程中的实用性。在本文中，我们提出了OralAgent，这是首个牙科专用AI智能体，它在端到端自动化框架内统一了多模态推理、基于工具的决策和基于知识的检索。它集成了22种视觉分析工具和368本广泛使用的经典牙科教科书，实现了自主推理、规划、工具使用、知识检索和多步骤工作流执行。此外，我们引入了OralCorpus，这是一个大规模、高质量的双语文本资源，包含1.348亿个标记，专为牙科检索增强生成（RAG）而构建。为了评估模型的多学科牙科知识，我们构建了OralQA-ZH，这是一个中文选择题基准，包含来自11个口腔亚专业的798个项目。大量实验表明，OralAgent在MMOral-Uni、MMOral-OPG和OralQA-ZH基准上达到了最先进的性能，突显了其在真实临床环境中的有效性、可解释性和适应性。代码和模型已在https://github.com/isjinghao/OralAgent公开。

英文摘要

Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.


2605.27376 2026-05-28 cs.CL cs.AI 版本更新

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

解锁基于提示的文本转语音模型中的细粒度和句内说话风格控制

Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

发表机构 * Department of Artificial Intelligence, Sungkyunkwan University, Korea（成均馆大学人工智能系）； Department of Computer Science and Engineering, Sungkyunkwan University, Korea（成均馆大学计算机科学与工程系）

AI总结针对基于提示的TTS模型缺乏细粒度控制和句内风格变化的问题，提出句间风格插值和句内风格过渡技术，通过嵌入空间方向向量插值和KV缓存交换及滑动窗口注意力掩码实现平滑风格控制。

详情

AI中文摘要

虽然基于提示的文本转语音（TTS）模型支持自然语言驱动的说话风格控制，但它们通常提供有限的细粒度控制，并在整个话语中应用单一的全局风格。这限制了需要跨话语连续风格属性插值和单个话语内时变风格过渡的实际用例。在本文中，我们提出了在现有基于提示的TTS模型中实现这两种能力的新技术。对于句间风格插值，我们计算嵌入空间中对比风格提示之间的方向向量并进行简单插值，从而实现风格特征之间的平滑过渡。对于句内风格过渡，我们首先识别出自回归TTS解码器中对早期标记的强烈注意力偏差，导致初始音频实现主导后续生成。为了减轻这种影响，我们引入了KV缓存交换和滑动窗口注意力掩码。实验表明，我们提出的句间插值在性别转换中实现了99-100%的成功率，高达36 Hz的音高变化，以及高达1.6音节/秒的速度变化。我们的句内过渡保持了0.81-0.91的说话人相似度，并获得了3.48-4.48的感知平滑度分数。

英文摘要

While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
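The inter-utterance interpolation idea (a direction vector between two contrastive style-prompt embeddings, followed by simple interpolation) can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the embeddings are plain lists of floats, the function names are hypothetical, and the linear form `base + alpha * direction` is one natural reading of "simple interpolation".

```python
def direction(emb_a, emb_b):
    """Direction vector pointing from style embedding A to style embedding B."""
    return [b - a for a, b in zip(emb_a, emb_b)]


def interpolate(base, direction_vec, alpha):
    """Move `base` along the style direction; alpha=0 keeps A, alpha=1 reaches B."""
    return [x + alpha * d for x, d in zip(base, direction_vec)]


# Toy 2-D embeddings standing in for, e.g., "low pitch" and "high pitch" prompts.
low_pitch = [0.0, 1.0]
high_pitch = [2.0, 1.0]
d = direction(low_pitch, high_pitch)

for alpha in (0.0, 0.5, 1.0):
    print(alpha, interpolate(low_pitch, d, alpha))
# 0.0 [0.0, 1.0]
# 0.5 [1.0, 1.0]
# 1.0 [2.0, 1.0]
```

In a real system the interpolated vector would replace the style-prompt embedding fed to the TTS model; sweeping `alpha` is what would yield a continuous control over the attribute.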


2605.27375 2026-05-28 cs.CL 版本更新

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

LCO：基于LLM的约束优化，用于现实任务中更安全的智能体LLM

Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, Hang Su

发表机构 * East China Normal University（华东师范大学）； Beijing Zhongguancun Academy（北京中关村学院）； Tsinghua University（清华大学）

AI总结提出LCO框架，通过自思考模块和进化采样模块约束LLM行为，在不微调模型的情况下减少上下文奖励黑客行为，实验表明在输出优化和策略优化场景中显著提升安全性。

详情

AI中文摘要

大型语言模型（LLM）越来越多地充当自主智能体，但它们与环境的持续交互可能导致上下文奖励黑客行为（ICRH），即LLM迭代优化其行为以最大化代理目标，无意中产生有害副作用。现有防御方法不足以应对此风险，因为ICRH并非源于对抗性输入，而是模型自身的过度优化。为缓解此问题，我们提出基于LLM的约束优化（LCO），该框架无需模型微调即可有效减少ICRH。LCO包含两个模块：自思考模块，引导LLM在执行前主动思考并整合潜在安全约束；进化采样模块，利用基于LLM的交叉和变异将模型动作约束在安全解空间内，同时保持任务性能。实验结果表明，LCO在输出优化和策略优化场景中均显著缓解了ICRH。特别是在推文参与度优化任务中，LCO在GPT-4上使毒性增长率（TGR）降低了39%；在策略优化基准上，ICRH发生率降低了15.23%，在不牺牲任务性能的情况下提升了安全性。

英文摘要

Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model's own over-optimization. To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and an evolutionary sampling module, which employs LLM-based crossover and mutation to constrain the model's actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.


2605.27374 2026-05-28 cs.CL 版本更新

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

ICG: 通过基于MLLM的提示和个性化偏好对齐改进封面图像生成

Zhipeng Bian, Jieming Zhu, Qijiong Liu, Wang Lin, Guohao Cai, Zhaocheng Du, Jiacheng Sun, Zhou Zhao, Zhenhua Dong

发表机构 * Huazhong University of Science and Technology（华中科技大学）； Huawei Noah’s Ark Lab（华为诺亚实验室）； Hong Kong Polytechnic University（香港理工大学）； Zhejiang University（浙江大学）

AI总结提出ICG框架，利用多模态大语言模型和扩散模型，通过元标记提取语义特征、用户嵌入个性化对齐及多奖励学习策略，实现高质量、个性化封面图像生成。

Comments Published in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12268-12278, EMNLP 2025. Official version: https://doi.org/10.18653/v1/2025.emnlp-main.617

详情

DOI: 10.18653/v1/2025.emnlp-main.617
Journal ref: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (Main Track) EMNLP 2025 12268-12278

AI中文摘要

多模态大语言模型和扩散模型的最新进展为AI生成内容开辟了新的可能性。然而，个性化封面图像生成仍未被充分探索，尽管它在提升数字平台用户参与度方面起着关键作用。我们提出ICG，一个新颖的框架，将基于MLLM的提示与个性化偏好对齐相结合，以生成高质量、上下文相关的封面。ICG通过元标记从项目标题和参考图像中提取语义特征，使用用户嵌入进行细化，并将得到的个性化上下文注入扩散模型。为了解决缺乏标注监督的问题，我们采用了一种多奖励学习策略，该策略结合了公共美学和相关性奖励以及从用户行为训练的个性化偏好模型。与依赖手工提示和不连贯模块的先前流程不同，ICG采用适配器桥接MLLM和扩散模型进行端到端训练。实验表明，ICG显著提高了图像质量、语义保真度和个性化，从而在下游任务中增强了用户吸引力和离线推荐准确性。作为桥接MLLM和扩散模型的即插即用适配器，ICG兼容常见检查点，且在优化过程中不需要真实标签。

英文摘要

Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.


2605.27373 2026-05-28 cs.AI cs.CL cs.CY 版本更新

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

识别和理解文本中的人类价值观：一种可定制的基于LLM的架构

Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

发表机构 * Universidad Politécnica de Madrid（马德里理工大学）； CETINIA, Universidad Rey Juan Carlos（CETINIA，胡安·卡洛斯国王大学）

AI总结提出一种基于大型语言模型的可定制架构，通过三个模块（规范生成、文本标注、强度评估）检测文本中人类价值观的强度，避免依赖特定价值理论或复杂提示工程，实验表明具有良好检测性能。

Comments 8 pages, 1 figure. Published in Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 5

详情

DOI: 10.5220/0014273200004052
Journal ref: Proc. ICAART 2026, Vol. 5, SciTePress, 2026, pp. 4096-4103

AI中文摘要

随着智能系统变得更加自主，科学界专注于创建包含伦理和道德考量的决策机制，这与传统的效用最大化模型不同。为此，一个关键方面是评估这些决策与人类价值观的契合程度。基于此，一个有前景的研究方向是开发基于大型语言模型（LLM）的方法，从文本中识别显性或隐性的人类价值观，从而实现全程识别。本文介绍了一种基于LLM的架构，用于检测和量化文本中人类价值观的强度，避免了以往方法受限于特定价值理论或复杂提示工程的缺陷。该架构包含三个协调模块：一个从任何理论框架的基础文本中生成结构化价值规范；一个使用这些规范对文本进行标注；另一个基于修辞和语义证据分配分级支持或抵抗。这种模块化方法将概念化任务与检测人类价值观的任务分离，创建了一个可扩展且可重复的过程，由适应多种理论的价值规范驱动。该架构使用多个LLM实例化，并使用ValueEval数据集进行评估。实验表明具有良好的检测性能，证实了管道的通用性。

英文摘要

As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the tasks of conceptualising from detecting human values, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.


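2605.26959 2026-05-28 cs.LO cs.CL (updated version)

MerLean-Prover: A Recursive Looping Harness for Lean 4 Theorem Proving

Jinzheng Li, Zeru Zhu, Yuanjie Ren

Affiliations: Northeastern University; Stony Brook University; Massachusetts Institute of Technology

AI summary: Proposes MerLean-Prover, an end-to-end Lean4 theorem prover built on a recursive looping harness in which planning, checking, and proving agents cooperate. Without fine-tuning or custom reinforcement learning, it surpasses existing open-source baselines on FormalQualBench and Putnam2025.

AI-generated abstract

MerLean-Prover is an end-to-end Lean4 theorem prover that replaces sorry declarations with kernel-checkable proofs. It is built from three agent types (planning, checking, and proving) composed by a recursive outer loop whose unit of revision is the proof plan itself, and it uses no fine-tuning, custom reinforcement-learning objective, or theorem-specific scaffolding. On FormalQualBench, a benchmark of 23 PhD-qualifying-exam theorems, MerLean-Prover solves 10/23, exceeding the strongest open-source baseline (OpenGauss, 8/23). On Putnam2025, the same harness completes 12/12 with a total wall-clock time substantially lower than that of the next-best system. The harness also works with smaller models: Sonnet solves all four tested FormalQualBench problems, and Haiku solves the two short ones. These results suggest that harness design is a central factor in end-to-end Lean4 theorem proving, alongside raw model capability, and that a relatively simple harness can already be effective.

English abstract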

MerLean-Prover is an end-to-end Lean4 theorem prover that replaces sorry declarations with kernel-checkable proofs. It is built from three agent types (Planning, Check, and Lean) composed by a recursive outer loop whose unit of revision is the proof plan itself, and uses no fine-tuning, no custom RL objective, and no theorem-specific scaffolding. On FormalQualBench, a benchmark of 23 PhD-qualifying-exam theorems, MerLean-Prover solves 10/23, surpassing the strongest published open-source baseline (OpenGauss, 8/23). On Putnam2025, the same harness closes 12/12 with substantially lower total wall-clock than the next-best system that closes the full set. The harness also transfers to smaller models: Sonnet closes all four tested FormalQualBench problems, and Haiku closes the two short ones. These results suggest that harness design is a central factor in end-to-end Lean4 theorem proving, alongside raw model capability, and that a relatively simple harness can already be effective.
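
To make the task concrete, the Lean 4 snippet below is a toy illustration (not taken from the paper or its benchmarks) of what replacing a `sorry` with a kernel-checkable proof means. Both theorem names are hypothetical.

```lean
-- Input: the statement is given, but the proof is still a placeholder.
theorem add_zero_todo (n : Nat) : n + 0 = n := by
  sorry

-- Target: the same statement with a proof the kernel can check.
-- `n + 0` reduces to `n` by the definition of addition on `Nat`.
theorem add_zero_done (n : Nat) : n + 0 = n := by
  rfl
```

Lean accepts the first declaration only with a warning that it depends on `sorry`; the second is fully checked.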

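2605.26730 2026-05-28 cs.CL (updated version)

Knowledge-Graph-Driven Expert-Level Reasoning in Neuroscience

Jake Stephen, Niraj K. Jha

Affiliations: Department of Electrical and Computer Engineering, Princeton University

AI summary: Builds a knowledge graph from a single textbook, generates question-answer supervision from it, and fine-tunes a language model, achieving expert-level neuroscience reasoning that surpasses large language models.

AI-generated abstract

A knowledge graph (KG) is an abstract structure that can be extracted from text corpora and used for in-depth reasoning. Prior work has used KGs to fine-tune language models (LMs), achieving domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning can emerge in neuroscience using only the information in a single authoritative textbook. The central hypothesis is that structured knowledge, once distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning in a fine-tuned LM that surpasses large language models (LLMs) in accuracy while having orders of magnitude fewer parameters. We build the textbook-derived KG with a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items (QA pairs plus reasoning traces) to fine-tune an LM with KG-only supervision, and apply reinforcement learning with path-derived KG signals as an implicit reward model. Our results show that deep, mechanistic neuroscience understanding can be induced in a model without relying on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum (on which readers can quiz themselves) and the fine-tuned LM are available at the following GitHub location: https://kg-bottom-up-superintelligence.github.io/neuro-bench.

English abstract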

A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KGs to fine-tune language models (LMs), enabling domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning through a fine-tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook-derived KG via a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items (QA pairs and reasoning traces) to fine-tune an LM exclusively on KG-derived supervision, and apply reinforcement learning using path-derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum, on which readers can quiz themselves, and the fine-tuned LM are available at the following GitHub location: https://kg-bottom-up-superintelligence.github.io/neuro-bench.

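2605.23908 2026-05-28 cs.AI cs.CL cs.CV cs.NE (updated version)

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi

Affiliations: New York University; Massachusetts Institute of Technology

AI summary: Replicates Picbreeder with frontier vision-language models in place of human users to explore AI's capacity for open-ended, unguided discovery; analyzes how the system's outputs differ from the human baseline in phylogenetic complexity, visual and semantic salience, and novelty; and studies the effects of exploratory noise, behavioral diversity, and narrative momentum.

Comments 26 pages, 21 figures, to be published at GECCO 2026

AI-generated abstract

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological, and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: the capacity to generate a seemingly endless supply of novel and meaningful new forms. Are artificial agents capable of such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical example of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing the human users with frontier vision-language models (VLMs). We observe clear qualitative differences between the system's output and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity, visual and semantic salience, and novelty. To identify some of the causal factors behind these differences, we study adding exploratory noise to the agents' selection process, behavioral diversity between agents, and narrative momentum in the form of memory of past actions. Our code is available at https://github.com/smearle/picbreeder-vlm.

English abstract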

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

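2605.22705 2026-05-28 cs.CL (updated version)

iPOE: Interpretable Prompt Optimization via Explanations

Jiahui Li, Yarik Menchaca Resendiz, Sean Papay, Roman Klinger

Affiliations: Fundamentals of Natural Language Processing, University of Bamberg, Germany; Leibniz-Institut für Psychologie (ZPID), Trier, Germany

AI summary: Proposes iPOE, which automatically generates guidelines from explanations and optimizes them to achieve interpretable prompt optimization. It improves performance by up to 39% on four datasets, and humans and LLMs agree on which guidelines contributed, with a Cohen's kappa of up to 0.65.

AI-generated abstract

Prompt optimization is usually framed as a discrete search problem that aims to find high-performing and robust instructions for an LLM. However, the search result may not make transparent why and where particular prompt changes lead to performance gains. This contrasts with how humans are instructed for annotation tasks, where researchers carefully design annotation guidelines and thereby improve annotation consistency. This paper aims to combine the two approaches and introduces iPOE, a new strategy for interpretable prompt optimization via explanations. We guide the prompt optimization process by automatically creating guidelines from explanations of annotation decisions (either automatically generated or written by humans). This set of guidelines is then optimized through a series of operations, including removing, adding, shuffling, and merging. The final prompt contains guidelines that instruct the annotation, making both the LLM's decision process and the optimization process transparent. It therefore also supports laypeople in prompt optimization, particularly in challenging domains that require expertise. In experiments on four datasets, we find that iPOE improves over the evaluated baselines by up to 39% and that LLM explanations can replace human explanations in the proposed method. In addition, our interpretability validation study shows that humans and LLMs can substantially agree on which guidelines contribute to their annotations, with a Cohen's kappa of up to 0.65.

English abstract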

Prompt optimization has often been framed as a discrete search problem to find high-performing and robust instructions for an LLM. However, the search result might not make it transparent why and where specific prompt changes lead to performance gains. This is in contrast to how humans are instructed for annotation tasks. Here, researchers carefully design annotation guidelines, leading to enhanced annotation consistency. Our paper aims at joining these two approaches and introduces iPOE, a novel interpretable prompt optimization strategy via explanations. We guide the prompt optimization process with guidelines created automatically from explanations of annotation decisions (either automatically generated or from humans). This set of guidelines is furthermore optimized by a series of operations, including removing, adding, shuffling, and merging. The resulting prompt includes guidelines that instruct the annotation, making the decision process of the LLM and the optimization transparent. It therefore also supports laypeople in the area of prompt optimization, particularly in challenging domains requiring expertise. In our experiments on four datasets, we find that iPOE can improve over the evaluated baselines by up to 39% and that LLM explanations can replace human explanations in the proposed method. Moreover, our interpretability validation study demonstrates that humans and LLMs can substantially agree on which guidelines contribute to their annotations, achieving a Cohen's kappa score of up to 0.65.
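
The agreement figure above is a Cohen's kappa. As a minimal sketch of how that statistic is computed for two annotators, assuming binary labels that mark whether a guideline contributed to an annotation (the label lists below are made up for illustration and are not data from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical judgments: 1 = "this guideline contributed", 0 = "it did not".
human = [1, 1, 0, 0, 1, 0, 1, 0]
llm = [1, 1, 0, 0, 1, 0, 0, 1]
print(cohens_kappa(human, llm))
```

Here the annotators agree on 6 of 8 items while chance agreement is 0.5, so kappa is 0.5.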

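2605.17448 2026-05-28 cs.GR cs.CL (updated version)

Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback

Guijin Son, Jehyun Park, Seyeon Park, Sunghee Ahn, Youngjae Yu

Affiliations: Seoul National University; OneLineAI; Sungkyunkwan University; Ewha Womans University

AI summary: Proposes a CAD generation framework that uses finite element analysis as feedback and adds blueprint and rendered-image supervision signals to improve multi-part assembly quality, so that generated results meet engineering requirements.

Comments Work in progress

AI-generated abstract

Computer-aided design (CAD) is the cornerstone of modern industrial design, yet existing CAD generators still fall short of real engineering pipelines: they neither iterate the way engineers do nor evaluate what engineering requires. Prior work has treated CAD generation as two separate steps, part synthesis and assembly, where the former is scored by closeness to a reference and the latter, if handled at all, is reduced to a separate constraint-solving step. In this work, we introduce a more industry-native task formulation that requires a model to generate a fully assembled multi-part STEP file from a free-form engineering brief, which is then validated through finite element analysis (FEA). FEA validation shows that Codex (GPT-5.5) and Claude Code (Opus-4.7) agents produce no strictly passing artifact in the main first-attempt sweep, with the best configuration meeting only about 20% of typed requirements on average. We also introduce two additional supervision signals, a novel text-only blueprint schema and a 21-view image renderer that aids the agent's visual inspection, to bring the generation loop closer to how engineers actually iterate. On S2O and Fusion360, the same feedback tools improve geometric reconstruction: the Box-IoU of GPT-5.5/xhigh rises from 0.444 to 0.592 on S2O and from 0.397 to 0.505 on Fusion360. Together, these signals push CAD programs toward artifacts that are not only visually plausible but also checked against physical and structural requirements.

English abstract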

Computer-aided design (CAD) is the backbone of modern industrial design, yet learned CAD generators still fall short of real engineering pipelines: they neither iterate like engineers nor evaluate what engineering requires. Prior work has treated CAD generation as two disjoint steps, part synthesis and assembly, where the former is graded by proximity to a gold reference and the latter, when handled at all, is reduced to a separate constraint solving step. In this work, we introduce a more industry-native task formulation that requires a model to produce a fully assembled multi-part STEP file from a free-form engineering brief, which is then validated via finite element analysis (FEA). FEA validation reveals that Codex (GPT-5.5) and Claude Code (Opus-4.7) agents do not produce a single strict-passing artifact in the main first-attempt sweep, with the best configuration meeting only about 20% of typed requirements on average. Moreover, we introduce two additional supervision signals, a novel text-only blueprint schema and a 21-view image renderer that aids the agent's visual inspection, that better align the generation loop with how engineers iterate in practice. On S2O and Fusion360, the same feedback tools improve geometric reconstruction, with GPT-5.5/xhigh rising from 0.444 to 0.592 Box-IoU on S2O and from 0.397 to 0.505 on Fusion360. Together these signals move CAD programs toward artifacts that are not only visually plausible but also checked against physical and structural requirements.
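
The abstract reports Box-IoU without defining it. A minimal sketch, assuming it denotes the intersection-over-union of axis-aligned 3D bounding boxes (the box representation and function names here are illustrative, not the paper's evaluation code):

```python
def box_volume(box_min, box_max):
    """Volume of an axis-aligned box given its min and max corners."""
    volume = 1.0
    for lo, hi in zip(box_min, box_max):
        volume *= max(hi - lo, 0.0)
    return volume

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (min_corner, max_corner)."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # the boxes do not overlap along this axis
        intersection *= overlap
    union = box_volume(a_min, a_max) + box_volume(b_min, b_max) - intersection
    return intersection / union

# Hypothetical predicted and reference boxes: two 2x2x2 cubes offset by 1 along x.
predicted = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
reference = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
print(box_iou(predicted, reference))
```

The cubes share a 1x2x2 slab (volume 4) out of a union of 12, giving an IoU of 1/3.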

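2605.15864 2026-05-28 cs.CV cs.CL (updated version)

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Chufan Shi, Cheng Yang, Yaokang Wu, Linghao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma

Affiliations: University of Southern California; University of California San Diego; Carnegie Mellon University; University of Illinois Urbana-Champaign

AI summary: Using VisualSwap, an image-swap probing framework, and VS-Bench, a benchmark of 800 image pairs, finds that the "re-examining the image" that vision-language models claim during reasoning is mostly a textual pattern rather than genuine visual re-examination. Thinking models are more susceptible, and user instructions restore visual grounding whereas self-reflection does not.

Comments ICML 2026 Oral

AI-generated abstract

Vision-language models (VLMs) often produce self-reflective statements during reasoning, such as "let me check the figure again". Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We study this with VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different image and test whether the model notices the change. We introduce VS-Bench, which contains 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instruct counterparts, and scaling does not mitigate the problem. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially raise attention to visual tokens, whereas self-reflection does not. Current VLMs tend to "say" rather than actually "see" when they claim to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io

English abstract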

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io

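2601.16312 2026-05-28 cs.CL cs.AI (updated version)

Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

Dikshya Mohanty, Mohammad Saqib Hasan, Syed Mostofa Monsur, Size Zheng, Benjamin Hsiao, Niranjan Balasubramanian

Affiliations: Stony Brook University

AI summary: Proposes the PolyBench benchmark dataset and a knowledge-augmented reasoning distillation method that bring small and mid-sized language models close to frontier closed-source LLMs on polymer design tasks.

AI-generated abstract

AI4Science research has shown promise in many scientific applications, including polymer design. However, current LLMs are ineffective in this problem space because (i) most models lack polymer-specific knowledge, and (ii) existing aligned models have limited coverage of the knowledge and capabilities relevant to polymer design. To address this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design-related tasks, which draws on a knowledge base of more than 13 million data points from experimental and synthetic data sources to ensure broad coverage of polymers and their properties. For effective alignment with PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments the dataset with structured CoT. In addition, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small and mid-sized language models (SLMs) with 7B to 32B parameters trained on PolyBench outperform similar-sized models on the PolyBench test set and remain competitive with closed-source frontier LLMs, while also showing gains on external polymer benchmarks. The dataset and associated code are available at https://github.com/StonyBrookNLP/PolyBench.

English abstract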

Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models have limited coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design-related tasks, leveraging a knowledge base of more than 13 million data points obtained from experimental and synthetic data sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small- and mid-sized language models (SLMs) with 7B to 32B parameters, trained on PolyBench, outperform similar-sized models and remain competitive with closed-source frontier LLMs on PolyBench's test dataset, while demonstrating performance gains on external polymer benchmarks. The dataset and associated code are available at https://github.com/StonyBrookNLP/PolyBench.

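2604.04295 2026-05-28 cs.CL (updated version)

Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Generation

Yongmin Yoo, Qiongkai Xu, Longbing Cao

Affiliations: Frontier AI Research Centre, Macquarie University School of Computing, FSE, Macquarie University

AI summary: Proposes ACE, a two-stage framework that uses the categorical structure of patent errors for uncertainty-aware routing: a first-stage encoder predicts the entropy over error types, and claims above a threshold are passed to a second-stage expert LLM that runs a schema-constrained Chain-of-Patent-Thought protocol. ACE surpasses a 70B-parameter LLM baseline while cutting costs by 78%.

AI-generated abstract

Automated patent claim validation demands a low error tolerance. However, existing approaches face a rigidity-resource dilemma: lightweight encoders cannot track long-range legal dependencies, while exhaustive LLM verification incurs 4-5x overhead at million-claim scale. A simple confidence-based cascade cannot resolve this, because binary validity scores cannot distinguish structurally different error types that require different reasoning depths. We propose a two-stage framework, Adaptive Cost-efficient Evaluation (ACE), which uses the categorical structure of patent errors for uncertainty-aware routing. In the first stage, a fine-tuned encoder projects claims onto a K+1 distribution over legal error types, and its predictive entropy serves as the routing signal. Claims that exceed an entropy threshold are escalated to the second stage, where an expert LLM executes a schema-constrained Chain-of-Patent-Thought (CoPT) protocol that maps claim elements to 35 U.S.C. standards; the schema constraint reduces per-claim latency by 42% while producing legally grounded verdicts. We further present ACE-40k, a dataset of 40,000 claims with MPEP annotations, on which ACE surpasses competitive baselines, including a supervised 70B-parameter LLM, while reducing costs by 78%. On real USPTO rejection data, the routing mechanism transfers without recalibration, reducing inference time by 60% while maintaining competitive recall.

English abstract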

Automated patent claim validation demands low error tolerance. However, existing approaches face a rigidity-resource dilemma: lightweight encoders cannot track long-range legal dependencies, while exhaustive LLM verification incurs 4-5x higher overhead at million-claim scale. A naive confidence-based cascade cannot resolve this because binary validity scores fail to distinguish structurally distinct error types, which require different reasoning depths. We propose a two-stage framework, Adaptive Cost-efficient Evaluation (ACE), which exploits the categorical structure of patent errors for uncertainty-aware routing. In the first stage, a fine-tuned encoder projects claims into a K+1 distribution over legal error types, whose predictive entropy serves as the routing signal. Claims exceeding an entropy threshold are escalated to the second stage, where an expert LLM executes a schema-constrained Chain-of-Patent-Thought (CoPT) protocol to map claim elements against 35 U.S.C. standards; the schema constraint reduces per-claim latency by 42% while producing legally grounded verdicts. We further present ACE-40k, a 40,000-claim dataset with MPEP-grounded annotations, on which ACE surpasses competitive baselines including a supervised 70B-parameter LLM while reducing costs by 78%. On real USPTO rejection data, the routing mechanism transfers without re-calibration, reducing inference time by 60% while maintaining competitive recall.
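
The first-stage routing rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the reading of K+1 as one valid class plus K error types, the example probabilities, and the threshold value are all assumptions.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route_claim(probs, threshold):
    """Keep confident claims in stage 1; escalate uncertain ones to stage 2."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return "stage2" if predictive_entropy(probs) > threshold else "stage1"

# Hypothetical encoder outputs over K+1 = 4 classes
# (assumed here to be "valid" plus three error types).
confident = [0.94, 0.02, 0.02, 0.02]
uncertain = [0.25, 0.25, 0.25, 0.25]
THRESHOLD = 0.8  # assumed value; in practice it would be tuned on held-out data

print(route_claim(confident, THRESHOLD))
print(route_claim(uncertain, THRESHOLD))
```

The peaked distribution has entropy of about 0.29 nats and stays in stage 1, while the uniform one reaches ln 4 (about 1.39 nats) and is escalated. Unlike a single validity score, the entropy reflects how probability mass is spread across all error types.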

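2605.12515 2026-05-28 cs.CL (updated version)

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Lucas Resck, Isabelle Augenstein, Anna Korhonen

Affiliations: Language Technology Lab, University of Cambridge; University of Copenhagen

AI summary: Proposes the C-3PO framework, which uses consensus-driven preference optimisation to mitigate the cross-lingual cultural inconsistency that multilingual large language models show when the prompt language changes even though the user's identity is explicit, and substantially improves the consistency metric κ_S.

Comments 24 pages, 13 figures, 11 tables

AI-generated abstract

Despite their impressive capabilities, multilingual large language models (MLLMs) often behave inconsistently when the prompt language changes. While such adaptation is usually desirable, it becomes a critical failure when the user's identity is explicitly defined. For example, given a fixed British persona and an ambiguous everyday-knowledge query about literature, the prompt language frequently overrides the system persona, yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this cross-lingual cultural inconsistency, we introduce Singleton Fleiss's κ_S, a metric that is mathematically robust to hallucinations. To mitigate the problem, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO improves κ_S by up to 0.13 absolute points over unaligned models, consistently outperforming strong prompting and representation-steering baselines while preserving explicit user identities, cultural neutrality, and intrinsic cultural knowledge. Empirical evaluation shows that the inconsistency affects lower-resource languages such as Indonesian and Persian especially severely. Finally, early decoding of intermediate layers reveals that MLLMs implicitly personalise outputs towards the stereotypical culture of the prompt language as forward-pass representations stabilise.

English abstract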

Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt's language changes. While such adaptation is generally desirable, it becomes a critical failure when a user's identity is explicitly defined. For instance, given a fixed British persona and an ambiguous everyday knowledge query about literature, the prompt's language frequently overwrites the system persona -- yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this Cross-lingual Cultural Inconsistency, we introduce Singleton Fleiss's $κ_S$, a metric mathematically resilient to hallucinations. For mitigation, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO achieves up to a 0.13-point absolute increase in $κ_S$ over unaligned models, consistently outperforming strong prompting and representation steering baselines whilst preserving explicit user identities, cultural neutrality and intrinsic cultural knowledge. Empirical evaluations demonstrate this inconsistency disproportionately affects lower-resource languages like Indonesian and Persian. Finally, early decoding of intermediate layers reveals that MLLMs implicitly personalise outputs towards the prompt language's stereotypical culture as forward-pass representations stabilise.
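
The abstract does not define the Singleton variant $κ_S$, so the sketch below shows only the standard Fleiss's kappa that the name suggests it builds on. Treating prompt languages as raters and answer cultures as categories is an assumption made for illustration, and the counts are invented.

```python
def fleiss_kappa(counts):
    """Standard Fleiss's kappa.

    counts[i][j] is the number of raters that assigned item i to category j.
    Every item must have the same number of ratings (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Per-item agreement: share of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    mean_agreement = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    chance = sum(p * p for p in proportions)
    if chance == 1.0:
        return 1.0  # a single category was used for every rating
    return (mean_agreement - chance) / (1.0 - chance)

# Hypothetical setup: 4 queries, each asked in 3 prompt languages (the "raters"),
# with answers binned into 2 cultural categories.
counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(counts))
```

For these counts the mean per-item agreement is 2/3 and chance agreement is 1/2, so kappa is 1/3. A value of 1 would mean every prompt language produced the same category for every query.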

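2605.12015 2026-05-28 cs.CR cs.AI cs.CL cs.LG cs.MA (updated version)

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Chang Jin, An Wang, Zeming Wei, Kai Wang, Biaojie Zeng, Qiaosheng Zhang, Chao Yang, Jingjing Qu, Xia Hu, Xingcheng Xu

Affiliations: Shanghai AI Laboratory; Peking University; East China Normal University

AI summary: Proposes the SkillSafetyBench benchmark, which uses 155 adversarial cases to evaluate the safety failure modes of large language model agents under non-user attacks delivered through skills, local artifacts, and execution-environment files.

AI-generated abstract

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance together with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that existing safety evaluations mostly overlook: even when the user request is benign, unsafe influence may reside in skill guidance, local artifacts, or execution-environment files that steer the agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench contains 155 adversarial cases spanning 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings indicate that agent safety depends not only on model-level alignment but also on how agents interpret skills, trust workflow context, and act through executable environments.

English abstract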

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, unsafe influence may reside in skill guidance, local artifacts, or execution-environment files that steer the agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.

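2605.10073 2026-05-28 cs.CL (updated version)

Heterogeneous Dependency Graph-Guided Attention for Patent Representation Learning

Yongmin Yoo, Qiongkai Xu, Zhangkai Wu, Longbing Cao

Affiliations: Frontier AI Research Centre, Macquarie University School of Computing, FSE, Macquarie University

AI summary: To address the neglect of the dependency hierarchy among patent claims, proposes the Patent Heterogeneous Attention Graph Encoder (PHAGE), which builds a typed graph to separate legal citations from technical relations, introduces a connectivity mask with learnable biases to project claim-level topology into token-level attention, and adds dual-granularity contrastive learning. It outperforms domain-adapted and citation-aware baselines on classification, retrieval, and clustering.

AI-generated abstract

Pre-trained language models have advanced patent classification and retrieval by encoding claims as flat token sequences, but they overlook the dependency hierarchy among claims. Incorporating this hierarchy into self-attention poses two challenges. First, claim dependencies involve relation types of varying reliability: treating them indiscriminately lets noisy technical relations contaminate the cleaner legal citation signal. Second, when the dependency graph is defined at the claim level, Transformer models fail because they operate at the token level; broadcasting claim-level adjacency can dilute structural information across unrelated token pairs. A novel Patent Heterogeneous Attention Graph Encoder (PHAGE) addresses these challenges. To handle heterogeneous dependencies, PHAGE builds a typed graph that separates legal citations from technical relations as distinct edge types. To bridge the hierarchy gap, PHAGE introduces a connectivity mask with learnable relation-aware biases that projects claim-level topology into token-level attention. PHAGE learns with a dual-granularity contrastive objective that aligns representations with the inter-patent taxonomy and the intra-patent topology. Experiments show that PHAGE outperforms domain-adapted and citation-aware baselines on patent classification, retrieval, and clustering. PHAGE reveals that intra-patent claim topology captures a stronger inductive bias than inter-patent structure.

English abstract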

Pre-trained language models advance patent classification and retrieval by encoding claims as flat token sequences, yet they overlook the dependency hierarchy among claims. Incorporating the hierarchy into self-attention poses two challenges. First, claim dependencies involve relation types with varying reliability: treating them indiscriminately allows noisy technical relations to corrupt cleaner legal citation signals. Second, when the dependency graph is defined over claims, Transformer models fail as they operate at the token level; broadcasting claim-level adjacency can dilute structural information across unrelated token pairs. A novel Patent Heterogeneous Attention Graph Encoder (PHAGE) addresses these challenges. To handle heterogeneous dependencies, PHAGE constructs a typed graph to separate legal citations from technical relations as distinct edge types. To bridge the hierarchy gap, PHAGE introduces a connectivity mask with learnable relation-aware biases to project a claim-level topology into token-level attention. PHAGE is trained with a dual-granularity contrastive objective that aligns representations with inter-patent taxonomy and intra-patent topology. Experiments show that PHAGE outperforms domain-adapted and citation-aware baselines on patent classification, retrieval, and clustering. PHAGE reveals that the intra-patent claim topology captures a stronger inductive bias than the inter-patent structure.
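
One possible way to project a claim-level typed graph onto token-level attention is an additive bias matrix, sketched below. This is an illustrative reading of the connectivity mask, not PHAGE's actual formulation: the edge direction, the fixed bias values (learnable parameters in a real model), and all names are assumptions.

```python
NEG_INF = float("-inf")

def token_attention_bias(token_claim_ids, edges, relation_bias):
    """Expand a claim-level typed graph into a token-level additive attention bias.

    token_claim_ids[t] is the index of the claim that token t belongs to.
    edges maps (source_claim, target_claim) to a relation type (directed).
    relation_bias maps a relation type to its bias value.
    """
    n = len(token_claim_ids)
    bias = [[NEG_INF] * n for _ in range(n)]
    for q, q_claim in enumerate(token_claim_ids):
        for k, k_claim in enumerate(token_claim_ids):
            if q_claim == k_claim:
                bias[q][k] = 0.0  # tokens of the same claim attend freely
            elif (q_claim, k_claim) in edges:
                # Connected claims: bias depends on the relation type.
                bias[q][k] = relation_bias[edges[(q_claim, k_claim)]]
            # Unconnected claim pairs keep -inf and are masked out.
    return bias

# Hypothetical patent: claim 1 legally cites claim 0,
# and claim 2 is technically related to claim 1.
token_claim_ids = [0, 0, 1, 2]
edges = {(1, 0): "legal", (2, 1): "technical"}
relation_bias = {"legal": 1.0, "technical": 0.2}
bias = token_attention_bias(token_claim_ids, edges, relation_bias)
for row in bias:
    print(row)
```

Such a matrix would be added to the attention logits before the softmax, so `-inf` entries receive zero attention weight while typed edges are weighted by relation reliability.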

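2605.09986 2026-05-28 stat.ML cs.CL cs.LG (updated version)

RouteProfile: A Graph-Based Profiling Method for Cold-Start LLM Routing

Jingjun Xu, Hongji Pu, Tao Feng, Haozhen Zhang, Jiaxuan You, Ge Liu

Affiliations: University of Illinois Urbana-Champaign; Nanyang Technological University

AI summary: To address the lack of interaction data for new models in cold-start LLM routing, proposes RouteProfile, a graph-structured framework that builds model profiles from public signals in technical reports. Experiments show that structured profiles outperform flat baselines and that model-family metadata is more reliable than benchmark-domain information.

AI-generated abstract

LLM routing is increasingly important for selecting suitable models under diverse user needs and deployment constraints, but its practical effectiveness depends on continual adaptation to emerging queries and newly released models. Integrating new LLMs is particularly challenging, because newly released models lack the query-response-reward interactions needed for router training and cannot be profiled directly through semantic embeddings the way new queries can. Existing profiles are limited: LLM-generated descriptions tend to be coarse, while interaction-based embeddings are costly to build. To address this, we propose RouteProfile, a graph-based profiling framework that builds LLM profiles from public signals in technical reports or model cards, including model family, model description, reported benchmark scores, and benchmark domains. RouteProfile organizes these heterogeneous signals into a graph and studies profile construction along four dimensions: organizational form, representation type, aggregation depth, and learning configuration. We evaluate RouteProfile in training-free cold-start routing and new-LLM integration settings. Experiments show that (1) structured profiles outperform flat baselines in training-free cold-start routing; (2) model-family metadata is more reliable than benchmark-domain information; and (3) effective new-LLM integration requires profile-router co-design. Overall, our findings highlight the importance of profile design in letting routing systems adapt to an evolving model ecosystem.

English abstract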

LLM routing is increasingly important for selecting suitable models under diverse user needs and deployment constraints, but its practical effectiveness depends on continual adaptation to emerging queries and newly released models. New-LLM integration is particularly challenging, as newly released models lack the query-response-reward interactions required for router training and cannot be profiled as directly as new queries via semantic embeddings. Existing profiles are limited: LLM-generated descriptions are often coarse, while interaction-based embeddings are costly to construct. To address this problem, we propose RouteProfile, a graph-based profiling framework that constructs LLM profiles from public signals in technical reports or model cards, including model family, model description, reported benchmark scores, and benchmark domains. RouteProfile organizes these heterogeneous signals into a graph and studies profile construction along four dimensions: organizational form, representation type, aggregation depth, and learning configuration. We evaluate RouteProfile in training-free cold-start routing and new-LLM integration settings. Experiments show that: (1) structured profiles outperform flat baselines in training-free cold-start routing; (2) model family metadata is more reliable than benchmark domain information; and (3) effective new-LLM integration requires profile-router co-design. Overall, our findings highlight the importance of profile design for enabling routing systems to adapt to the evolving model ecosystem.

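2605.00025 2026-05-28 q-bio.NC cs.CL cs.HC cs.LG eess.AS (updated version)

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis

Yuanhao Chen, Peter Chin

Affiliations: Dartmouth College

AI summary: Proposes the MoDAl framework, which discovers complementary neural modalities across multiple brain areas through contrastive learning and the interplay between an alignment loss and a decorrelation loss, reducing word error rate on the Brain-to-Text Benchmark '24 from 26.3% to 21.6%.

AI-generated abstract

Speech neuroprosthesis systems decode intended speech from neural activity in the absence of audible output, offering a way to restore communication for people with speech impairments. Current approaches decode mainly from motor cortical areas and ignore other areas, such as area 44, part of Broca's area, that may encode complementary linguistic information. We propose MoDAl (Modality Decorrelation and Alignment), a framework that discovers complementary neural modalities through the interplay of two objectives in a shared projection space. A contrastive loss aligns each of several parallel brain encoders with the text embeddings of a pretrained large language model (LLM), while a decorrelation loss prevents the encoders from collapsing into duplicate representations. We prove that these objectives are in productive tension: contrastive alignment induces transitive modality coalescence, which decorrelation must counteract for the framework to discover diverse neurolinguistic modalities. On the Brain-to-Text Benchmark '24, MoDAl reduces word error rate (WER) from 26.3% to 21.6% compared with the previous best end-to-end method, and the gain from incorporating the previously discarded area 44 signals arises entirely from the decorrelation mechanism. Analysis of the discovered modalities reveals functional specialization: encoders that receive area 44 input capture structural and syntactic properties (sentence length, grammatical voice, wh-words), consistent with the neurolinguistic understanding of Broca's area.

English abstract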

Speech neuroprosthesis systems decode intended speech from neural activity in the absence of audible output, offering a path to restoring communication for individuals with speech-impairing conditions. Current approaches decode predominantly from motor cortical areas, discarding others -- such as area 44, part of Broca's area -- that may encode complementary linguistic information. We introduce MoDAl (Modality Decorrelation and Alignment), a framework that discovers complementary neural modalities through the interplay of two objectives in a shared projection space. A contrastive loss aligns each of several parallel brain encoders with the text embeddings of a pretrained large language model (LLM), while a decorrelation loss prevents the encoders from coalescing to duplicative representations. We prove that these objectives are in productive tension: Contrastive alignment induces transitive modality coalescence, which decorrelation must counteract for the framework to discover diverse neurolinguistic modalities. On the Brain-to-Text Benchmark '24, MoDAl reduces word error rate (WER) from 26.3% to 21.6% compared to the previous best end-to-end method, with the gain from incorporating previously discarded area 44 signals arising entirely from the decorrelation mechanism. Analysis of the discovered modalities reveals functional specialization: Encoders receiving area 44 input capture structural and syntactic properties (sentence length, grammatical voice, wh-words), consistent with the neurolinguistic understanding of Broca's area.
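
The headline metric above is word error rate. A minimal sketch of the standard definition (word-level edit distance divided by reference length) follows; it is not the benchmark's official scorer, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical decoded sentence with one dropped word.
print(word_error_rate("i want some water", "i want water"))
```

One deletion against a four-word reference gives a WER of 0.25.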

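2604.27251 2026-05-28 cs.CL cs.AI (updated version)

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras

Affiliations: School of Computer Science, University of Sheffield; School of EECS, Queen Mary University of London; The Alan Turing Institute

AI summary: Through the lens of reasoning conflicts, systematically studies whether large language models prioritize complying with instructions or following what is sensible when an induced logical schema conflicts with the schema the task expects, and explores internal detection and activation-level intervention.

AI-generated abstract

Large language models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited by Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns such as induction, deduction, and abduction can be decoupled from specific problem instances remains a key challenge for model controllability, and answering it would help shed light on reasoning controllability. In this paper, we present the first systematic study of this question through the lens of reasoning conflicts: an explicit tension between parametric and contextual information, induced by mandating logical schemata that deviate from those expected for the target task. Our evaluation shows that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. We further show that reasoning conflicts are internally detectable, since confidence scores drop significantly during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from the middle to the late layers, indicating potential for activation-level controllability. Using these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings show that although LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward better controllability, faithfulness, and generalizability.

English abstract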

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.

2603.09117 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

解耦推理与置信度：在可验证奖励的强化学习中恢复校准

Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China（中国科学院软件研究所中文信息处理实验室）； University of Chinese Academy of Sciences, Beijing, China（中国科学院大学）； Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China（中国科学院大学网络安全学院）； National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China（国家计算机网络应急技术处理协调中心）

AI总结针对RLVR中模型校准退化问题，提出DCPO框架通过解耦推理与校准目标，在保持准确率的同时显著改善校准性能并缓解过度自信。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

2604.21534 2026-05-28 cs.CL 版本更新

UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

UKP_Psycontrol 在 SemEval-2026 任务 2：从文本建模效价和唤醒动态

Darya Hryhoryeva, Amaia Zurinaga, Hamidreza Jamalabadi, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab)（泛在知识处理实验室）； Technical University of Darmstadt（达姆施塔特工业大学）； National Research Center for Applied Cybersecurity ATHENE（应用网络安全国家研究中心ATHENE）； Psychiatric Control Systems Lab（精神病学控制系统实验室）； Marburg University（马尔堡大学）

AI总结针对 SemEval-2026 任务 2，提出三种互补方法（LLM 提示、成对最大熵模型、轻量级神经回归模型）建模文本中的即时情感和短期情感变化，发现 LLM 擅长捕捉静态情感信号，而短期变化更依赖于数值轨迹，系统在子任务 1 和 2A 中排名第一。

Comments Accepted to SemEval 2026 (co-located with ACL 2026)

2604.20996 2026-05-28 cs.CL 版本更新

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

AFRILANGTUTOR：利用大语言模型推进低资源语言的语言辅导与文化教育

Tadesse Destaw Belay, Shahriar Kabir Nahin, Israel Abebe Azime, Ocean Monjur, Marek Rei, Chris Biemann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam, Anshuman Chhabra

发表机构 * Instituto Politécnico Nacional（墨西哥国立理工学院）； University of South Florida（南佛罗里达大学）； Saarland University（萨尔大学）； Imperial College London（伦敦帝国理工学院）； University of Hamburg（汉堡大学）

AI总结针对低资源语言缺乏训练数据的问题，提出AFRILANGDICT词典资源并构建AFRILANGEDU数据集，通过监督微调和直接偏好优化训练AFRILANGTUTOR模型，在10种非洲语言上显著提升辅导性能。

详情

AI中文摘要

如何为缺乏足够训练资源的语言开发语言学习系统？这一挑战日益被非洲大陆的开发者所面临，他们旨在构建能够理解并用当地语言回应的AI系统。为弥补这一差距，我们引入AFRILANGDICT，一个包含19.47万条非洲语言-英语词典条目的集合，作为生成语言学习材料的种子资源，使我们能够自动构建大规模、多样且可验证的学生-导师问答交互，适用于训练AI辅助语言导师。利用AFRILANGDICT，我们构建了AFRILANGEDU，一个包含7.89万个多轮训练示例的数据集，用于监督微调（SFT）和直接偏好优化（DPO）。使用AFRILANGEDU，我们训练了统称为AFRILANGTUTOR的语言辅导模型。我们在AFRILANGEDU上对两个多语言LLM：Llama-3-8B-IT和Gemma-3-12B-IT进行了微调，覆盖10种非洲语言，并评估了它们的性能。结果表明，在AFRILANGEDU上训练的模型始终优于其基础版本，且结合SFT和DPO带来了显著改进，在LLM作为评判者的评估中，四项指标的提升范围从1.8%到15.5%。为促进低资源语言的进一步研究，所有资源均可在https://huggingface.co/afrilang-edu获取。

英文摘要

How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs: Llama-3-8B-IT and Gemma-3-12B-IT on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages, all resources are available at https://huggingface.co/afrilang-edu.

2604.13583 2026-05-28 cs.CL cs.AI 版本更新

BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

BenGER平台：面向德国法律任务端到端基准测试的协作式Web平台

Sebastian Nagl, Matthias Grabmair

发表机构 * Technical University of Munich（慕尼黑工业大学）

AI总结提出BenGER开源Web平台，集成任务创建、协作标注、可配置LLM运行及多维度评估，支持多组织项目与租户隔离，实现法律推理基准测试的端到端透明与可复现。

Comments Preprint - Accepted at ICAIL 2026

2604.18758 2026-05-28 cs.CL 版本更新

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

句法作为罗塞塔石碑：用于上下文科普特语翻译的通用依存关系

Abhishek Purushothama, Emma Thronson, Alexia Guo, Amir Zeldes

发表机构 * Corpling Lab（科林实验室）； Georgetown University（乔治城大学）

AI总结提出一种结合通用依存句法分析和双语词典的上下文学习方法，用于低资源科普特语到英语的机器翻译，取得了新的最佳结果。

Comments ACL 2026 Findings camera-ready, with fixes

详情

AI中文摘要

低资源机器翻译需要不同于高资源语言的方法。本文提出了一种新颖的上下文学习方法，通过输入句子的通用依存句法分析来增强句法信息，以支持科普特语到英语的低资源机器翻译。在已有使用双语词典支持词汇项推理的工作基础上，我们在输入中添加了多种句法分析表示，具体探索了包含原始解析器输出、用简单英语表达的解析结果，以及针对子树中识别出的困难结构的定向指令及其翻译方法。结果表明，虽然单独的句法信息不如基于词典的注释有用，但将检索到的词典项与句法信息相结合，在不同模型规模上均取得了显著提升，为科普特语翻译实现了新的最佳结果。

英文摘要

Low-resource machine translation requires methods that differ from those used for high-resource languages. This paper proposes a novel in-context learning approach to support low-resource machine translation of the Coptic language to English, with syntactic augmentation from Universal Dependencies parses of input sentences. Building on existing work using bilingual dictionaries to support inference for vocabulary items, we add several representations of syntactic analyses to our inputs, specifically exploring the inclusion of raw parser outputs, verbalizations of parses in plain English, and targeted instructions on difficult constructions identified in sub-trees and how they can be translated. Our results show that while syntactic information alone is not as useful as dictionary-based glosses, combining retrieved dictionary items with syntactic information achieves significant gains across model sizes, achieving new state-of-the-art translation results for Coptic.

2604.18235 2026-05-28 cs.CL cs.AI 版本更新

Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents

负优势是一把双刃剑：为搜索智能体校准GRPO中的优势

Jiayi Wu, Ruobing Xie, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Bochen Lin, Ming Gao, Xiang Li

发表机构 * School of Data Science and Engineering, East China Normal University（华东师范大学数据科学与工程学院）； Tencent（腾讯）； Tsinghua University（清华大学）

AI总结针对GRPO算法在多跳搜索中因粗粒度优势分配和正负优势不平衡导致的训练不稳定问题，提出CalibAdv方法，通过细粒度降低过度负优势并重新平衡正负优势，提升模型性能和训练稳定性。

详情

AI中文摘要

搜索智能体通过与搜索引擎的多轮交互实现强大的问答性能，其中组相对策略优化（GRPO）是一种广泛使用的训练算法。然而，GRPO风格的算法在多跳搜索场景中仍面临若干挑战。首先，当最终答案错误时，正确的中间步骤常常受到惩罚。其次，训练高度不稳定，经常导致自然语言能力退化甚至灾难性训练崩溃。我们的分析将这些问题归因于粗粒度的优势分配以及正负优势之间的不平衡。为了解决这些问题，我们提出了CalibAdv，一种专门为搜索智能体设计的优势校准方法，能够更准确、更稳定地对惩罚和奖励进行建模。具体来说，CalibAdv利用中间步骤的正确性在细粒度上降低过度的负优势，然后进一步重新平衡正负优势以提高训练稳定性。重要的是，CalibAdv采用轻量级设计，从标准 rollout 信号中校准优势，使其简单且易于部署。在三个模型和七个基准上的大量实验表明，CalibAdv同时提升了模型性能和训练稳定性。我们的代码可在 https://github.com/wujwyi/CalibAdv 获取。

英文摘要

Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy Optimization (GRPO) being a widely used training algorithm. However, GRPO-style algorithms still face several challenges in multi-hop search settings. First, correct intermediate steps are often penalized when the final answer is wrong. Second, training is highly unstable, often causing degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for search agents that enables more accurate and more stable modeling of penalties and rewards. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then further rebalances positive and negative advantages to improve training stability. Importantly, CalibAdv adopts a lightweight design that calibrates advantages from standard rollout signals, making it simple and easy to deploy. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
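
As a rough illustration of the mechanism this abstract describes, the sketch below computes GRPO-style group-relative advantages and then softens the penalty on intermediate steps judged correct. The per-step correctness flags, the `scale` factor, and both function names are assumptions made for illustration, not the paper's formulation; the subsequent rebalancing of positive and negative advantages is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrate_step_advantages(advantage, step_correct, scale=0.5):
    """Spread one rollout-level advantage over the rollout's steps.

    Hypothetical rule: when the rollout advantage is negative, steps flagged
    as correct receive a downscaled penalty (advantage * scale); every other
    step keeps the rollout-level advantage unchanged.
    """
    return [
        advantage * scale if (advantage < 0 and ok) else advantage
        for ok in step_correct
    ]


rewards = [1.0, 0.0, 0.0, 1.0]  # final-answer rewards for four rollouts
advs = group_relative_advantages(rewards)
# Rollout 1 failed overall, but its first two search steps were correct.
step_advs = calibrate_step_advantages(advs[1], [True, True, False])
```

With these toy rewards the failed rollout's advantage is about -1, so its two correct steps end up near -0.5 while the incorrect step keeps the full penalty.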

2604.17943 2026-05-28 cs.CL 版本更新

A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

专业领域基准构建与评估框架：以国防相关文档为例

Bao Gia Doan, Aditya Joshi, Pantelis Elinas, Aarya Bodhankar, Oscar Leslie, Tom Marchant, Flora Salim

发表机构 * UNSW Sydney（新南威尔士大学）； Cyndr AI

AI总结提出DoRA框架，通过合成数据生成和双LLM流水线解决专业领域RAG问答的冷启动问题，在国防文档上显著减少幻觉并提升覆盖率和忠实度。

详情

AI中文摘要

基于RAG的专业领域问答面临冷启动问题：缺乏评估基准和用于后训练的标注数据。我们提出DoRA（面向领域的RAG评估），一个仅使用少量专业领域文档的新型基准构建与评估框架。DoRA系统地生成合成QA训练和评估数据集，并跨五个领域特定意图提供可审计的证据。为缓解同流水线循环，DoRA的训练和测试拆分使用不同的LLM家族（训练用Claude Sonnet；测试用GPT-4o），这些数据来自不相交的种子文档语料库。在40份国防相关文档（英文）上实例化后，DoRA产生约6600个精心整理的实例。与8个LLM基线在1259个样本的基准上比较，基于合成训练集微调的LoRA适配Llama3.1-8B在6个覆盖率和忠实度指标上持续提升性能，尤其在默认GTE检索设置下将幻觉减少一半以上，且增益在替代检索器和基于提示的基线下依然保持。国防领域专业知识在评估的三个阶段被纳入：(a) 判断DoRA生成的合成QA质量，(b) 确定LLM作为评判者的分数可靠性，(c) 评估QA流水线在完全人工编写的QA示例上的泛化能力。我们将DoRA定位为领域迁移下专业领域RAG的实用框架，并以国防作为高风险的案例研究。

英文摘要

RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and evaluation framework using only a small set of specialist domain documents. DoRA systematically generates synthetic QA training and evaluation datasets with auditable evidence across five domain-specific intents. To mitigate same-pipeline circularity, DoRA's training and test splits use different LLM families (Claude Sonnet for training; GPT-4o for test) drawn from disjoint seed-document corpora. Instantiated on 40 defense-related documents (written in English), DoRA yields ~6.6K curated instances. Compared against 8 LLM baselines over a benchmark of 1,259 samples, a LoRA-adapted Llama3.1-8B trained on the synthetic training set consistently improves performance over 6 coverage and faithfulness metrics, especially reducing hallucination by more than half under the default GTE retrieval setting, with gains persisting across alternative retrievers and prompting-based baselines. Defense-domain expertise is incorporated in three stages of our evaluation: (a) determining the quality of the synthetic QA generated by DoRA, (b) ascertaining the reliability of LLM-as-judge scores, and (c) evaluating the generalization of the QA pipeline on completely human-written QA examples. We position DoRA as a practical framework for specialist-domain RAG under domain shift, with defense as a high-stakes case study.

2604.16774 2026-05-28 cs.CL cs.AI 版本更新

Retention Consequence in Lifecycle Memory Control

生命周期记忆控制中的保留后果

Jiarui Han

AI总结研究持久记忆在准入后失效的问题，提出将置信度作为前向有效性/支持证据，并引入强度作为保留后果的显式生命周期状态，通过StageMem控制器实验验证显式保留后果在生命周期结算中的控制作用。

详情

AI中文摘要

持久记忆在成功准入后可能失效：一个前提被写入，然后成为无声的假设，后续维护将其视为普通残留进行压缩、降级或驱逐。我们将这种准入后失效作为生命周期控制问题来研究。现有记忆系统已经执行准入、更新、压缩、检索和驱逐。我们的主张并非此类系统缺乏维护，而是保留后果通常仅通过有效性、相似性、新近性、频率、重要性或摘要信号间接操作，而非作为单独的生命周期状态暴露。因此，我们将置信度视为前向有效性/支持证据，并引入强度作为保留后果的显式生命周期状态。我们在StageMem中实现了这一区分，这是一个小型的分阶段控制器，其瞬态、工作态和持久态存储暴露了提升、压缩和驱逐压力点。在受控的前提实现、压缩、压力和隐式启发式诊断实验中，实验区分了写入过少、保留错误的高线索内容、遗忘代价高昂的前提以及通过饱和保留所有内容。通过生命周期结算使用的显式保留后果，提供了在遗漏和囤积之间的控制面。针对目标准入后失效模式，结果支持持久记忆的生命周期观点：可靠性不仅取决于进入记忆的内容，还取决于准入有效性和保留后果在维护期间是否可用。

英文摘要

Persistent memory can fail after successful admission: a premise is written, then becomes a silent assumption, and later maintenance treats it as ordinary residue to be compressed, demoted, or evicted. We study this post-admission failure as a lifecycle-control problem. Existing memory systems already perform admission, update, compression, retrieval, and eviction. Our claim is not that such systems lack maintenance, but that retention consequence is often operationalized only indirectly through validity, similarity, recency, frequency, importance, or summarization signals rather than exposed as a separate lifecycle state. We therefore treat confidence as carried-forward validity/support evidence, and introduce strength as an explicit lifecycle state for retention consequence. We operationalize this distinction in StageMem, a small staged controller whose transient, working, and durable stores expose promotion, compression, and eviction pressure points. Across controlled premise-realization, compression, pressure, and implicit-heuristic diagnostics, the experiments separate writing too little, retaining the wrong high-cue content, forgetting costly premises, and preserving everything by saturation. Explicit retention consequence, used through lifecycle settlement, provides a control surface between omission and hoarding. For the targeted post-admission failure mode, the results support a lifecycle view of persistent memory: reliability depends not only on what enters memory, but on whether admission validity and retention consequence remain available during maintenance.

2604.16358 2026-05-28 cs.LG cs.CL 版本更新

SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

SaFeR-Steer：通过合成引导和反馈动力学进化多轮多模态大语言模型

Haolong Hu, Hanyu Li, Tiancheng He, Huahui Yi, An Zhang, Qiankun Li, Kun Wang, Yang Liu, Zhigang Zeng

发表机构 * Huazhong University of Science and Technology（华中科技大学）； Beijing University of Posts and Telecommunications（北京邮电大学）； West China Biomedical Big Data Center, Sichuan University（四川大学西部生物医学大数据中心）； School of Public Policy and Administration, Chongqing University（重庆大学公共政策与管理学院）； Nanyang Technological University（南洋理工大学）

AI总结提出SaFeR-Steer框架，通过分阶段合成引导和导师参与的GRPO训练单学生模型，并引入轨迹一致总结奖励（TCSR）以解决多轮安全对齐中的长上下文安全衰减问题，显著提升多轮安全性和有用性。

详情

AI中文摘要

多模态大语言模型（MLLMs）越来越多地部署在多轮场景中，攻击者可以通过不断演变的视觉-文本历史升级不安全意图，并利用长上下文安全衰减。然而，安全对齐仍然以单轮数据和固定模板对话为主，导致训练与部署之间存在不匹配。为弥补这一差距，我们提出SaFeR-Steer，一种渐进式多轮对齐框架，结合分阶段合成引导和导师参与的GRPO，在自适应、在线策略攻击下训练单个学生模型。我们还引入了轨迹一致总结奖励（TCSR），该奖励聚合了历史最小值和回合奖励的平均值，使得任何低质量回合都会影响轨迹级别的回报。I. 数据集。我们发布STEER，一个多轮多模态安全数据集，包含STEER-SFT（12,934）、STEER-RL（2,000）和STEER-Bench（3,227）对话，回合数为2-10。II. 实验。从Qwen2.5-VL-3B/7B开始，SaFeR-Steer在单轮基准（3B：48.30/45.86 → 81.84/70.77；7B：56.21/60.32 → 87.89/77.40）和多轮基准（3B：12.55/27.13 → 55.58/70.27；7B：24.66/46.48 → 64.89/72.35）上显著提高了安全性/有用性，将失败转移到后续回合，并产生了超越单纯扩展的鲁棒性。代码可在https://anonymous.4open.science/r/SaFeR-Steer获取。

英文摘要

MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce Trajectory-Consistent Summative Reward (TCSR), which aggregates the historical minimum and average of turn rewards so that any low-quality turn affects the trajectory-level return. I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2-10 turns. II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 $\rightarrow$ 81.84/70.77 for 3B; 56.21/60.32 $\rightarrow$ 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 $\rightarrow$ 55.58/70.27 for 3B; 24.66/46.48 $\rightarrow$ 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone. Code is available at https://anonymous.4open.science/r/SaFeR-Steer
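
To make the TCSR idea concrete, here is a minimal sketch of a trajectory-level return that mixes the worst turn reward with the average turn reward. The convex combination and the `alpha` weight are assumptions; the abstract only says that the minimum and the average are aggregated, so the paper's exact formula may differ.

```python
def tcsr_like_return(turn_rewards, alpha=0.5):
    """Combine the minimum and the mean of per-turn rewards.

    alpha is a hypothetical mixing weight: alpha=1 uses only the worst turn,
    alpha=0 uses only the average.
    """
    if not turn_rewards:
        raise ValueError("need at least one turn reward")
    worst = min(turn_rewards)
    average = sum(turn_rewards) / len(turn_rewards)
    return alpha * worst + (1 - alpha) * average


# Both dialogues have the same average reward (0.8), but one has a bad turn.
steady = tcsr_like_return([0.8, 0.8, 0.8, 0.8])
one_bad_turn = tcsr_like_return([1.0, 1.0, 1.0, 0.2])
```

An average alone would score the two dialogues identically; including the minimum is what lets a single low-quality turn pull down the trajectory-level return.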

2512.15791 2026-05-28 cs.CY cs.AI cs.CL 版本更新

Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Study

语言模型中AI伦理工具评估：开发者视角案例研究

Jhessica Silva, Diego A. B. Moreira, Gabriel O. dos Santos, Alef Ferreira, Helena Maia, Sandra Avila, Helio Pedrini

AI总结通过文献筛选和开发者访谈，评估四种AI伦理工具在葡萄牙语语言模型中的应用效果，发现它们能指导一般伦理考虑但未覆盖模型特有方面。

Comments 7 figures, 11 tables. Accepted for publication in AI and Ethics

详情

DOI: 10.1007/s43681-025-00914-2

AI中文摘要

在人工智能中，语言模型因能够通过文本生成模拟与人类真实对话的系统被广泛采用而变得日益重要。由于它们对社会的影响，开发和部署这些语言模型必须负责任地进行，关注其负面影响和可能的危害。在此背景下，AI伦理工具（AIETs）的出版物数量近期有所增加。这些AIETs旨在通过引入公认的价值观来指导AI的设计、开发和使用阶段，帮助开发者、公司、政府和其他利益相关者建立对其技术的信任、透明度和责任。然而，许多AIETs缺乏良好的文档、使用示例以及在实践中有效性的证明。本文提出了一种评估语言模型中AIETs的方法。我们的方法包括对213个AIETs进行广泛的文献调查，在应用纳入和排除标准后，我们选择了四个AIETs：模型卡片、ALTAI、事实表以及危害建模。为了评估，我们将AIETs应用于为葡萄牙语开发的语言模型，并对它们的开发者进行了35小时的访谈。评估考虑了开发者对AIETs在帮助识别其模型伦理考量方面的使用和质量的看法。结果表明，所应用的AIETs可作为制定关于语言模型的一般伦理考量的指南。然而，我们注意到它们并未解决这些模型的独特方面，例如习语表达。此外，这些AIETs未能帮助识别葡萄牙语模型的潜在负面影响。

英文摘要

In Artificial Intelligence (AI), language models have gained significant importance due to the widespread adoption of systems capable of simulating realistic conversations with humans through text generation. Because of their impact on society, developing and deploying these language models must be done responsibly, with attention to their negative impacts and possible harms. In this scenario, the number of AI Ethics Tools (AIETs) publications has recently increased. These AIETs are designed to help developers, companies, governments, and other stakeholders establish trust, transparency, and responsibility with their technologies by bringing accepted values to guide AI's design, development, and use stages. However, many AIETs lack good documentation, examples of use, and proof of their effectiveness in practice. This paper presents a methodology for evaluating AIETs in language models. Our approach involved an extensive literature survey on 213 AIETs, and after applying inclusion and exclusion criteria, we selected four AIETs: Model Cards, ALTAI, FactSheets, and Harms Modeling. For evaluation, we applied AIETs to language models developed for the Portuguese language, conducting 35 hours of interviews with their developers. The evaluation considered the developers' perspective on the AIETs' use and quality in helping to identify ethical considerations about their model. The results suggest that the applied AIETs serve as a guide for formulating general ethical considerations about language models. However, we note that they do not address unique aspects of these models, such as idiomatic expressions. Additionally, these AIETs did not help to identify potential negative impacts of models for the Portuguese language.

2604.14585 2026-05-28 cs.AI cs.CL 版本更新

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

提示优化如同抛硬币：诊断其在复合AI系统中何时有效

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center（AWS生成式AI创新中心）； HSBC Holdings Plc., HSBC Technology Center, China（汇丰控股有限公司，汇丰技术中心，中国）

AI总结通过大量实验发现提示优化在复合AI系统中效果不稳定，仅当任务具有可挖掘的输出结构时才有帮助，并提供了两阶段诊断方法。

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

详情

AI中文摘要

复合AI系统中的提示优化在统计上与抛硬币无异：在Claude Haiku 4.5上的72次优化运行（6种方法 × 4个任务 × 3次重复）中，49%的得分低于零样本；在Amazon Nova Lite上，失败率更高。然而，在一个任务上，所有六种方法相比零样本提升了高达+6.8分。是什么区分了成功与失败？我们通过18,000次网格评估和144次优化运行进行了调查，按照必须回答的顺序测试了TextGrad和DSPy等端到端优化工具背后的两个假设：(A) 智能体提示存在交互，需要联合优化而非独立优化；(B) 单个提示本身值得优化。交互效应从未显著（p > 0.52，所有F < 1.0），并且优化仅在任务具有可利用的输出结构时才有帮助：即模型可以生成但不会默认采用的格式。我们进一步给出了机制性解释：指令微调将输入措辞压缩成狭窄的输出分布，消除了联合优化所依赖的措辞敏感性。我们提供了一个两阶段诊断：一个80美元的ANOVA预测试用于检验智能体耦合，以及一个10分钟的提升空间（headroom）测试，用于预测优化是否值得，从而将抛硬币转变为知情决策。

英文摘要

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku 4.5 (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy, in the order they must be answered: (A) agent prompts interact, requiring joint rather than independent optimization, and (B) individual prompts are worth optimizing at all. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure: a format the model can produce but does not default to. We further give a mechanistic account: instruction-tuning compresses input phrasing into a narrow output distribution, eliminating the very phrasing-sensitivity that joint optimization assumes. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile, turning a coin flip into an informed decision.

2604.14356 2026-05-28 cs.CL cs.AI 版本更新

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

当多囊卵巢综合征遇上进食障碍：一种可解释的AI方法检测隐藏的三重负担

Apoorv Prasad, Susan McRoy

发表机构 * University of Wisconsin - Milwaukee（威斯康星大学密尔沃基分校）

AI总结本研究通过微调小型开源语言模型，利用可解释性AI从社交媒体帖子中自动检测多囊卵巢综合征患者的身体形象困扰、进食障碍和代谢挑战的三重负担，最佳模型在150条测试帖上达到75.3%的精确匹配准确率。

详情

AI中文摘要

患有多囊卵巢综合征（PCOS）的女性面临身体形象困扰、进食障碍和代谢挑战的显著升高风险，然而现有的自然语言处理方法在检测这些状况时缺乏透明度，且无法识别共病表现。我们开发了小型开源语言模型，以基于可解释性的方式自动检测社交媒体帖子中的这种三重负担。我们从六个子论坛收集了1000条与PCOS相关的帖子，由两名经过训练的标注员根据Lee等人（2017）临床框架的操作化指南对帖子进行标注。使用低秩适配对三个模型（Gemma-2-2B、Qwen3-1.7B、DeepSeek-R1-Distill-Qwen-1.5B）进行微调，以生成带有文本证据的结构化解释。最佳模型在150条保留帖子上实现了75.3%的精确匹配准确率，具有稳健的共病检测能力和强可解释性。性能随诊断复杂性下降，表明其最佳用途是筛查而非自主诊断。

英文摘要

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines that operationalize the Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.
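
Exact match is a strict metric for multi-label tasks: a post counts as correct only when every label agrees with the gold annotation. The sketch below shows the computation on toy data; the label order and the binary encoding are assumptions for illustration and are not taken from the paper.

```python
def exact_match_accuracy(gold, predicted):
    """Fraction of items whose full label tuple matches the gold tuple."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return hits / len(gold)


# Assumed label order: (body_image_distress, disordered_eating, metabolic_challenges)
gold = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (0, 0, 0)]
pred = [(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 0)]
score = exact_match_accuracy(gold, pred)  # the second post misses one label
```

Under this definition a single wrong label fails the whole post, which is why exact match tends to drop as more conditions co-occur. For scale, 113 correct posts out of 150 would round to the reported 75.3 percent.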

2604.13232 2026-05-28 cs.CL 版本更新

Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

评估评估者：SemEval-2020任务1在词汇语义变化检测中的问题

Bach Phan-Tat, Kris Heylen, Dirk Geeraerts, Stefano De Pascale, Dirk Speelman

发表机构 * Department of Linguistics, KU Leuven（鲁汶大学语言学系）； Instituut voor de Nederlandse Taal（荷兰语研究所）； Department of Linguistics and Literary Studies, Vrije Universiteit Brussel（布鲁塞尔自由大学语言学与文学研究系）

AI总结通过操作化、数据质量和基准设计三个框架，批判性分析SemEval-2020任务1的局限性，指出其窄化语义变化模型、数据质量问题及设计缺陷，呼吁未来改进。

详情

AI中文摘要

本文通过操作化、数据质量和基准设计三个框架重新审视了词汇语义变化检测中最具影响力的共享基准SemEval-2020任务1。首先，在操作化层面，我们认为该基准主要将语义变化建模为离散义项的增加、丢失或重新分布。虽然这种框架便于标注和评估，但过于狭窄，无法捕捉渐变的、构式的、搭配的和语篇层面的变化。此外，黄金标签是标注决策、聚类过程和阈值设置的结果，可能限制任务的有效性。其次，在数据质量层面，我们表明该基准受到严重的语料库和预处理问题影响，包括OCR噪声、畸形字符、截断句子、不一致的词形还原、词性标注错误以及目标词遗漏。这些问题可能扭曲模型行为，使语言分析复杂化，并降低可重复性。第三，在基准设计层面，我们认为精心挑选的小规模目标集和有限的语言覆盖降低了现实性并增加了统计不确定性。综合来看，这些局限性表明该基准应被视为一个有用但不完整的测试平台，而非进展的最终衡量标准。因此，我们呼吁未来的数据集和共享任务采用更广泛的语义变化理论，透明地记录预处理过程，扩大跨语言覆盖范围，并使用更现实的评估设置。这些步骤对于词汇语义变化检测中更有效、可解释和可推广的进展是必要的。

英文摘要

This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of benchmark design, we argue that the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.

2604.10567 2026-05-28 cs.CL cs.AI 版本更新

Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

早期决策至关重要：非自回归扩散语言模型中的邻近偏差与初始轨迹塑造

Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo

发表机构 * LG AI Research（LG人工智能研究）

AI总结本文通过分析非自回归扩散语言模型的推理动态，发现其存在邻近偏差导致的错误传播问题，并提出一种轻量级规划器和序列结束温度退火方法来引导早期令牌选择，从而显著提升推理与规划任务的性能。

Comments ICML 2026 Camera Ready

详情

AI中文摘要

基于扩散的语言模型（dLLMs）已成为自回归语言模型的一种有前景的替代方案，提供了并行令牌生成和双向上下文建模的潜力。然而，如何利用这种灵活性实现完全非自回归解码仍然是一个开放问题，尤其是在推理和规划任务中。在这项工作中，我们通过系统分析非自回归解码在时间轴上的推理动态来研究dLLMs中的非自回归解码。具体来说，我们揭示了基于置信度的非自回归生成中固有的失败模式，该模式源于强烈的邻近偏差——即去噪顺序倾向于集中在空间相邻的令牌上。这种局部依赖性导致空间错误传播，使得整个轨迹关键地依赖于初始去掩码位置。利用这一见解，我们提出了一种最小干预方法，通过轻量级规划器和序列结束温度退火来指导早期令牌选择。我们在各种推理和规划任务上全面评估了我们的方法，并观察到在现有启发式基线基础上，无需显著计算开销即可实现整体性能的显著提升。

英文摘要

Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias-the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

2604.06196 2026-05-28 cs.CL cs.AI cs.LO 版本更新

Compositional Consistency-Guided Decoding for Three-Way Logical Question Answering

面向三值逻辑问答的组合一致性引导解码

Tianyi Huang, Ming Hou, Jiaheng Su, Yutong Zhang, Ziling Zhang

AI总结针对大语言模型在三值逻辑问答中的否定不一致和认知未知问题，提出一种轻量级测试时解码层CGD-PD，通过神经三值分类、符号否定一致性投影和定向二值蕴含探测，在FOLIO数据集上提升准确率4.4-6.8点并减少未知预测。

Comments Accepted at the ICML 2026 Workshop on Compositional Learning: Safety, Interpretability, and Agents

详情

AI中文摘要

三值逻辑问答（QA）在给定前提集 $S$ 的情况下，将 $\text{True}$、$\text{False}$ 或 $\text{Unknown}$ 之一分配给假设 $H$。我们将此任务视为一个紧凑的组合推理问题：在确定性否定映射下，$H$ 和机械否定假设 $\neg H$ 的预测应保持一致。尽管结构简单，大语言模型（LLM）可能表现出两种实际失败模式：(i) 否定不一致，即对 $H$ 和 $\neg H$ 的回答违反了所需的标签映射；(ii) 认知 $\text{Unknown}$，即模型在某一侧被蕴含时仍选择弃权。我们引入 CGD-PD，一个轻量级、无需训练的测试时层，结合神经三值分类、符号否定一致性投影和定向二值蕴含探测。在 FOLIO 一阶逻辑领域的一个验证集上，CGD-PD 在 GPT-5.2 上提升了 4.4 个百分点的准确率，在 Claude Sonnet 4.5 上提升了 6.8 个百分点，同时减少了 $\text{Unknown}$ 预测和认知弃权。这些结果提供了一个受控的概念验证，表明推理时的简单逻辑组合有助于评估和提高 LLM 推理可靠性；但本身并不足以证明在此形式化基准设置之外的鲁棒性。

英文摘要

Three-way logical question answering (QA) assigns one of $\text{True}$, $\text{False}$, or $\text{Unknown}$ to a hypothesis $H$ given a premise set $S$. We study this task as a compact compositional inference problem: predictions for $H$ and for a mechanically negated hypothesis $\neg H$ should agree under a deterministic negation map. Despite this simple structure, large language models (LLMs) can exhibit two practical failure modes: (i) negation inconsistency, where answers to $H$ and $\neg H$ violate the required label mapping, and (ii) epistemic $\text{Unknown}$, where the model abstains even when one side is entailed. We introduce CGD-PD, a lightweight, training-free test-time layer that combines neural 3-way classification, symbolic negation-consistency projection, and targeted binary entailment probes. On one validation split of FOLIO's first-order logic fields, CGD-PD improves accuracy by 4.4 points on GPT-5.2 and 6.8 points on Claude Sonnet 4.5, while reducing $\text{Unknown}$ predictions and epistemic abstention. These results provide a controlled proof of concept that simple logical composition at inference time can help evaluate and improve LLM reasoning reliability; they do not, by themselves, establish robustness beyond this formal benchmark setting.
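
The negation-consistency requirement can be sketched as a small lookup plus a repair rule. The negation map below follows the abstract (True and False swap, Unknown stays Unknown); the confidence-based tie-breaking in `project` is a hypothetical stand-in, since the abstract does not specify how CGD-PD resolves an inconsistent pair.

```python
# Deterministic negation map over the three labels.
NEGATE = {"True": "False", "False": "True", "Unknown": "Unknown"}


def is_consistent(label_h, label_not_h):
    """Check that the answers for H and for not-H respect the negation map."""
    return NEGATE[label_h] == label_not_h


def project(label_h, conf_h, label_not_h, conf_not_h):
    """Return a label for H that is consistent with the negation map.

    Hypothetical repair rule: if the pair is inconsistent, trust the side
    with the higher confidence and map it back to a label for H.
    """
    if is_consistent(label_h, label_not_h):
        return label_h
    return label_h if conf_h >= conf_not_h else NEGATE[label_not_h]


# The model says "True" for both H and not-H, which cannot both hold.
repaired = project("True", 0.6, "True", 0.9)
# The model abstains on H but confidently rejects not-H.
resolved = project("Unknown", 0.4, "False", 0.8)
```

The second call shows how an abstention on one side can be replaced when the other side is decided, which is the kind of case the abstract calls epistemic Unknown.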

URL PDF HTML ☆

赞 0 踩 0

2604.05378 2026-05-28 cs.CL cs.CV 版本更新

ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

ICR-Drive：面向端到端语言驱动自动驾驶的指令反事实鲁棒性

Kaiser Hamid, Can Cui, Nade Liang

发表机构 * Texas Tech University（德克萨斯科技大学）； Bosch Center for Artificial Intelligence (BCAI)（博世人工智能中心（BCAI））

AI总结提出ICR-Drive框架，通过生成四类扰动指令（改写、歧义、噪声、误导）并基于CARLA仿真评估，揭示语言条件驾驶模型对指令变化的脆弱性。

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 872-880

AI中文摘要

视觉-语言-动作（VLA）模型的最新进展使得语言条件驾驶代理能够在闭环仿真中执行自然语言导航命令，但标准评估大多假设指令精确且格式良好。在实际部署中，指令的措辞和具体性各不相同，可能省略关键限定词，偶尔还包含误导性的权威框架文本，导致指令级鲁棒性未被充分衡量。我们提出了ICR-Drive，一个用于端到端语言条件自动驾驶中指令反事实鲁棒性的诊断框架。ICR-Drive生成受控的指令变体，涵盖四类扰动：改写、歧义、噪声和误导，其中误导变体与导航目标冲突并试图覆盖意图。我们在匹配的仿真器配置和种子下重放相同的CARLA路线，以隔离由指令语言引起的性能变化。鲁棒性通过标准CARLA排行榜指标和相对于基线指令的每族性能下降来量化。在LMDrive和BEVDriver上的实验表明，微小的指令变化可能导致显著的性能下降和不同的故障模式，揭示了在安全关键驾驶中部署具身基础模型的可靠性差距。

JMedEthicBench：用于评估日语大语言模型医疗安全性的多轮对话基准

Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato

发表机构 * Kyoto University（京都大学）； Hohai University（河海大学）； The University of Tokyo（东京大学）； University of Science and Technology of China（中国科学技术大学）； Hong Kong Polytechnic University（香港理工大学）

AI总结提出首个多轮对话基准JMedEthicBench，基于日本医学会67条指南和7种自动越狱策略生成5万+对抗对话，评估27个模型发现医疗专用模型安全性脆弱，且多轮交互中安全性显著下降。

Comments 12 pages, 6 figures

Abstract

As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test with only single-turn prompts, even though clinical consultations are multi-turn. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may accidentally weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.

2505.13820 2026-05-28 cs.LG cs.AI cs.CL (updated version)

Structured Agent Distillation for Large Language Model

Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

Affiliations: Carnegie Mellon University; Harvard University; MIT; Northeastern University; Adobe Research; National University of Singapore; University of Georgia; Florida International University

AI summary: Proposes a Structured Agent Distillation framework that aligns reasoning and action spans segment by segment, compressing LLM agents into small student models that retain decision-making performance at lower inference cost.

Journal ref: The 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
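The idea of segment-specific losses over [REASON] and [ACT] spans can be illustrated with a small Python sketch. It assumes per-token distillation losses have already been computed and that each token carries a span tag. The function name `span_losses`, the averaging scheme, and the weights are all hypothetical; the abstract does not specify the actual loss functions.

```python
def span_losses(token_losses, span_tags, weights):
    """Average per-token losses within each span type, then combine them.

    token_losses: per-token loss values (assumed precomputed).
    span_tags:    one tag per token, e.g. "REASON" or "ACT".
    weights:      hypothetical per-span weights for the combined loss.
    """
    totals, counts = {}, {}
    for loss, tag in zip(token_losses, span_tags):
        totals[tag] = totals.get(tag, 0.0) + loss
        counts[tag] = counts.get(tag, 0) + 1
    # Mean loss for each span type that actually occurs in the trajectory.
    per_span = {tag: totals[tag] / counts[tag] for tag in totals}
    combined = sum(weights.get(tag, 0.0) * value for tag, value in per_span.items())
    return per_span, combined


# Toy trajectory: two reasoning tokens followed by one action token.
per_span, combined = span_losses(
    token_losses=[0.5, 1.5, 3.0],
    span_tags=["REASON", "REASON", "ACT"],
    weights={"REASON": 1.0, "ACT": 0.5},
)
print(per_span)   # {'REASON': 1.0, 'ACT': 3.0}
print(combined)   # 2.5
```

Averaging within each span before weighting keeps a long reasoning span from swamping a short action span, which is one plausible motivation for treating the two segments separately.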

2603.26182 2026-05-28 cs.CL (updated version)

ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li

Affiliations: The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology

AI summary: Proposes ClinicalAgents, a multi-agent framework that simulates clinical reasoning through dynamic Monte Carlo Tree Search orchestration and a dual-memory architecture, significantly improving diagnostic accuracy and explainability.

Comments Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

DOI: 10.1145/3770855.3818931

Abstract

While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent in human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. The foundation of this framework is a Dual-Memory architecture: a mutable working memory that maintains the evolving patient state for context-aware reasoning, and a static experience memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves the best performance among evaluated baselines, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines. Our code is released at https://github.com/ZhuohanGe/ClinicalAgents-Code.

2601.19302 2026-05-28 cs.CL (updated version)

Formula-One Prompting: A Composable Equation-First Prefix for Applied Mathematics

Natapong Nitarach, Pittawat Taveekitworachai, Kunat Pipatanakul

Affiliations: SCB DataX, SCBX Group

AI summary: Proposes Formula Prompting (FP) and Formula-One Prompting (F-1), which formalize a problem's governing equations before solving. Across several applied-math benchmarks, F-1 outperforms Chain-of-Thought prompting by 5.76 and Program-of-Thought prompting by 8.42 percentage points on average.

Abstract

This paper introduces Formula Prompting (FP) and Formula-One Prompting (F-1), two single-call methods that elicit governing equations before solving applied-math problems. Chain-of-Thought (CoT) and Program-of-Thought (PoT) prompting improve mathematical reasoning by eliciting reasoning traces or code-like structures learned during pretraining. This suggests a diagnostic question: which useful pretraining patterns remain under-elicited? Using infini-gram-mini, we scan 81.7 trillion pretraining tokens and find that, in curated corpora such as DataComp-LM, equation-centered language appears 121x more often than code and 3.79x more often than step-by-step narration, yet standard prompting methods do not explicitly elicit equation formulation. FP asks the model to formalize a problem's governing equations before solving; F-1 extends FP with a composable Phase 2 that selects Direct, CoT, or PoT-style solving in the same call. Across five reasoning models and four applied-math benchmarks (finance, physics, cryptography, competition math), F-1 outperforms CoT by 5.76 pp and PoT by 8.42 pp on average, with the largest gain of 13.30 pp on FinanceMath, while topping the accuracy-token efficiency frontier at only 68 prompt tokens of overhead. Variant ablations identify the equation-formalization prefix, not the strategy menu, as the primary driver: adding CoT or PoT on top of the prefix yields no further gain, and 73.3% of remaining failures occur downstream of a correct Phase-1 equation.

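2603.22735 2026-05-28 cs.CL (updated version)

Explanation Generation for Contradiction Reconciliation with LLMs

Jason Chan, Zhixue Zhao, Robert Gaizauskas

Affiliations: University of Sheffield, UK

AI summary: Proposes the task of reconciliatory explanation generation. By repurposing NLI datasets and designing quality metrics, it evaluates 18 LLMs and finds that their ability on this task is limited and that the benefit of "thinking" plateaus as model size increases.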

Comments Preprint

Abstract

Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, "Cassie hates coffee" and "She buys coffee everyday" may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs' downstream applications such as chatbots and scientific aids.

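2603.21465 2026-05-28 cs.CL cs.LG (updated version)

ClinConsensus: A Physician-Calibrated Benchmark for Evaluating Clinical Rubric Coverage of Chinese Medical LLMs

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Xue Yang, Kailuan Wu, Ruyi Xu, Tianyun Lu, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Sen Yang, Lin Qu, Bing Zhao, Hu Wei

Affiliations: Alibaba Group

AI summary: To address the lack of physician-calibrated coverage of clinical response criteria in open-ended medical LLM evaluation, proposes the ClinConsensus benchmark of 2,500 expert cases, introduces the Clinician-Anchored Coverage Score (CACS) and a dual-judge framework, and finds a coverage gap of 19.2 to 21.9 points in frontier models.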

Abstract

Open-ended medical LLM evaluation remains weakly grounded in physician-calibrated coverage of clinically relevant response criteria, especially in localized clinical settings. We introduce ClinConsensus, a Chinese medical benchmark of 2,500 expert-curated cases spanning 36 specialties, 12 task themes, multiple difficulty levels, and lay-facing versus professional-facing settings. Each case is paired with 30 case-specific binary rubric criteria. To evaluate whether responses satisfy enough physician-authored criteria, we propose the Clinician-Anchored Coverage Score (CACS), a physician-calibrated threshold metric instantiated at $k=10$, and develop a dual-judge framework combining a GPT-5.1 grader with a physician-supervised Qwen3-8B judge. Evaluating 11 frontier LLMs, we find a persistent coverage gap: Rubric Accuracy ranges from 39.6% to 52.1%, whereas CACS@10 ranges from 17.8% to 32.9%, leaving a gap of 19.2 to 21.9 points across models. Stratified analyses further reveal substantial variation across reasoning, evidence use, structured extraction, medication instructions, follow-up, and dialogue register. These results suggest that medical LLM evaluation should measure thresholded, rubric-grounded clinical coverage rather than average partial correctness.

2601.04505 2026-05-28 cs.AI cs.CL cs.SY eess.SY (updated version)

CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik

Affiliations: Department of Computer Science and Engineering; Department of Electrical and Electronic Engineering; Islamic University of Technology

AI summary: Proposes CircuitLM, a multi-agent pipeline that uses an embedding-powered component knowledge base and a five-stage process to turn natural-language prompts into structured CircuitJSON schematics. Outputs are validated by both deterministic electrical rule checking and an LLM-as-a-judge meta-evaluator, targeting LLM hallucination and physical-constraint violations in circuit design.

Comments Accepted at the 2026 IEEE International Conference on LLM-Aided Design (ICLAD), 10 pages, 8 figures, 6 tables

Abstract

Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design automation (EDA), as large language models (LLMs) frequently hallucinate components, violate strict physical constraints, and produce non-machine-readable outputs. To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable $\texttt{CircuitJSON}$ schematics. The framework mitigates hallucination and ensures physical viability by grounding generation in a curated, embedding-powered component knowledge base through five sequential stages: (i) component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning, (iv) JSON schematic synthesis, and (v) interactive force-directed visualization. We evaluate the system on a dataset of 100 unique circuit-design prompts using five state-of-the-art LLMs. To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while an LLM-as-a-judge meta-evaluator identifies complex, context-aware design flaws that bypass standard rule-based checkers. Ultimately, this work demonstrates how targeted retrieval combined with deterministic and semantic verification can bridge natural language to structurally viable, schematic-ready hardware and safe circuit prototyping. Our code and data are publicly available at https://github.com/Khandakar227/CircuitLM.

2512.20780 2026-05-28 cs.CL cs.CY (updated version)

Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles

Ramatu Oiza Abdulsalam, Segun Aroyehun

Affiliations: African University of Science and Technology; University of Konstanz

AI summary: Analyzes a dataset of math tutoring dialogues to compare the pedagogical quality of expert tutors, novice tutors, and seven LLMs, finding that LLMs approach expert level on average but differ systematically in instructional strategies and linguistic characteristics.

Abstract

Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We analyze a dataset of math remediation dialogues in which expert tutors, novice tutors, and seven LLMs of varying sizes, comprising both open-weight and commercial models, respond to the same student errors. We examine instructional strategies and linguistic characteristics of tutoring responses, including uptake (restating and revoicing), pressing for accuracy and reasoning, lexical diversity, readability, politeness, and agency. We find that expert tutors produce higher-quality responses than novices, and that larger LLMs generally receive higher pedagogical quality ratings than smaller models, approaching expert performance on average. However, LLMs exhibit systematic differences in their instructional profiles: they underuse discursive strategies characteristic of expert tutors while generating longer, more lexically diverse, and more polite responses. Regression analyses show that pressing for accuracy and reasoning, restating and revoicing, and lexical diversity, are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. These findings highlight the importance of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

2603.14864 2026-05-28 cs.CL (updated version)

Shopping Companion: Benchmarking and Training LLM Agents for Long-Horizon Preference-Grounded E-Commerce Tasks

Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

Affiliations: Alibaba International Digital Commercial Group

AI summary: To address the lack of benchmarks for long-term preference-aware shopping tasks and of fine-grained training supervision in e-commerce, proposes Shopping Companion Bench and an annotation-free, tool-wise reward method that improves LLM agents' preference capture and task performance.

Abstract

In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budget management, and bundle deals, where accurately capturing user preferences from long-horizon conversations is critical. However, progress is limited by two key challenges: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of fine-grained supervision for shopping agent training. To fill the benchmark gap, we introduce Shopping Companion Bench, a novel benchmark comprising two shopping tasks that require cross-session preference memory, grounded in a product pool of over 1.2 million real-world items. Our analysis further identifies two major sources of failure on this benchmark: cascading errors caused by preference hallucination, and insufficient verification of product attributes against user requirements. To address these failure modes, we design annotation-free, tool-wise rewards that provide process supervision for each tool call, alleviating reward sparsity in long-horizon tasks. Experimental results demonstrate that even state-of-the-art models such as GPT-5 achieve success rates below 70%, highlighting the difficulty of our benchmark. Notably, our fine-tuned lightweight 4B model consistently outperforms strong baselines in both preference capture and task performance, suggesting the effectiveness of our reward design.

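2602.22787 2026-05-28 cs.CL cs.AI (updated version)

Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

Haihui Pan, Yuzhong Hong, Kaichen Zhang, Shaoke Lv, Junwei Bao, Hongfei Jiang, Yang Song

Affiliations: Zuoyebang Education Technology

AI summary: Proposes the QEMPO framework, which uses a theoretically derived closed-form solution to maximize entropy while preserving output quality. Experiments show improved LLM output diversity without sacrificing quality.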

Abstract

In many large language model (LLM) alignment applications, users expect not only high-quality outputs but also substantial diversity. However, existing methods often face a fundamental trade-off between these objectives: approaches that improve output quality tend to reduce diversity, while methods that increase diversity often do so at the expense of quality. In this work, we propose Quality-constrained Entropy Maximization Policy Optimization (QEMPO), a novel framework that enhances the diversity of LLM outputs while explicitly preserving output quality. QEMPO is grounded in a strong theoretical foundation: we derive a closed-form analytical solution that provably maximizes entropy, a principled measure of diversity, subject to a quality constraint, with guarantees on optimality under the defined objective. Leveraging this solution, QEMPO naturally supports both online and offline training settings. Empirical results demonstrate that QEMPO consistently improves output diversity without sacrificing quality, and in many cases yields gains in both dimensions compared to existing baselines, aligning with our theoretical guarantees.
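To make the phrase "maximizes entropy subject to a quality constraint" concrete, the following is the generic textbook form of such a problem, worked for a discrete output distribution. It is an illustration under stated assumptions, not the derivation from the paper: the quality score $q(y)$, the threshold $\tau$, the multipliers $\lambda$ and $\mu$, and the choice of a constraint on expected quality are all assumptions introduced here.

```latex
% Illustrative textbook derivation. The symbols q, \tau, \lambda, \mu and the
% expected-quality form of the constraint are assumptions, not QEMPO notation.
\begin{align*}
\max_{\pi}\;\; & H(\pi) = -\sum_{y} \pi(y)\log \pi(y)
\quad \text{s.t.}\quad \sum_{y} \pi(y)\, q(y) \ge \tau,
\qquad \sum_{y} \pi(y) = 1, \\[4pt]
\mathcal{L}(\pi,\lambda,\mu) &= -\sum_{y}\pi(y)\log\pi(y)
 + \lambda\Big(\sum_{y}\pi(y)\,q(y) - \tau\Big)
 + \mu\Big(\sum_{y}\pi(y) - 1\Big), \\[4pt]
\frac{\partial \mathcal{L}}{\partial \pi(y)} &= -\log\pi(y) - 1 + \lambda\, q(y) + \mu = 0
\;\;\Longrightarrow\;\;
\pi^{*}(y) = \frac{\exp\big(\lambda\, q(y)\big)}{\sum_{y'} \exp\big(\lambda\, q(y')\big)},
\qquad \lambda \ge 0 .
\end{align*}
```

Under these assumptions the maximizer is a softmax over quality scores. If the uniform distribution already meets the threshold, then $\lambda = 0$ and the solution is uniform; otherwise $\lambda$ is raised until the quality constraint holds with equality.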

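2601.16800 2026-05-28 cs.CL (updated version)

Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Gaurav Negi, MA Waskow, John McCrae, Omnia Zayed, Paul Buitelaar

Affiliations: Data Science Institute; University of Galway

AI summary: Explores LLMs as automatic annotators for fine-grained opinion analysis, proposing a declarative annotation pipeline and an LLM adjudication method. Experiments show LLMs are reliable at the span level but struggle to reproduce relational structures, making them better suited as annotation assistants than as full replacements for human annotators.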

Abstract

Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is valuable, annotating opinions in datasets for model training requires considerable human effort and substantial cost, especially across diverse domains and real-world applications. To address this shortage of domain-specific labelled datasets, we explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis. We use a declarative annotation pipeline, an approach that reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a dedicated methodology for an LLM to adjudicate multiple labels and produce final annotations. We trial the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks. In this work, we attempt to develop fully autonomous LLM-based annotators, but our results reveal an uneven picture characterised by a critical performance bifurcation: LLMs are reliable at the span level yet struggle to faithfully reproduce the relational structures that connect those spans. This suggests that LLMs are better positioned as high-fidelity annotation assistants and data augmentation tools to expand fine-grained opinion-annotated datasets, rather than replacing human annotators entirely.

2602.15198 2026-05-28 cs.MA cs.AI cs.CL (updated version)

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto, Shlomo Zilberstein, Eugene Bagdasarian

Affiliations: University of Massachusetts Amherst; University of Virginia; ELLIS Institute Tübingen; MPI for Intelligent Systems, Tübingen; AI Center

AI summary: Proposes the Colosseum framework, which audits collusion among LLM agents in cooperative multi-agent systems using a formal decision-making framework and regret-based metrics, finding that most models show a propensity for emergent collusion and observing a "collusion on paper" phenomenon.

Abstract

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when a group of agents forms a coalition and colludes to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents' collusive behavior in multi-agent settings. We ground how agents cooperate through a formal multi-agent decision-making framework, measure action-based collusive behavior via regret relative to the cooperative optimum, and compare it with communication-based collusive behavior. Colosseum enables audits of LLM agents for collusion under benign settings, different coalition objectives, persuasion tactics, and network topologies. We then introduce a new behavioral probe by creating secret communication channels between agents, showing that most out-of-the-box models exhibit a propensity to collude under this probe, which we term emergent collusion. Furthermore, we discover "collusion on paper", where agents plan to collude in text but often pick non-collusive actions. Colosseum provides a new way to audit collusion in cooperative multi-agent systems while presenting observations about how collusion emerges, what affects collusion efficacy, and which strategies may mitigate it.
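Regret relative to the cooperative optimum can be illustrated with a short Python sketch. It assumes a joint objective that scores each candidate joint action, and takes regret to be the gap between the best achievable score and the score of the action actually chosen. The function name `cooperative_regret` and the toy values are hypothetical; the abstract does not give the exact formulation used in Colosseum.

```python
from typing import Callable, Hashable, Iterable


def cooperative_regret(
    joint_value: Callable[[Hashable], float],
    chosen_action: Hashable,
    candidate_actions: Iterable[Hashable],
) -> float:
    """Gap between the best joint-objective value and the chosen action's value."""
    optimum = max(joint_value(action) for action in candidate_actions)
    return optimum - joint_value(chosen_action)


# Toy joint objective (made-up numbers, for illustration only).
TOY_VALUES = {"cooperate": 10.0, "partial_collusion": 6.0, "full_collusion": 2.0}

regret = cooperative_regret(TOY_VALUES.__getitem__, "partial_collusion", TOY_VALUES)
print(regret)  # 4.0
```

A regret of zero means the agents' actions achieved the cooperative optimum regardless of what they said to each other. That is why an action-based measure of this kind can separate real collusion from plans that exist only in the agents' messages.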

2602.13748 2026-05-28 cs.CL cs.CV (updated version)

RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong

Affiliations: School of Computer Science and Technology, Soochow University

AI summary: Proposes the RMPL framework, which uses stage-wise training to combine heterogeneous supervision from unimodal event extraction and multimedia relation extraction, enabling multimedia event extraction under low-resource conditions with consistent improvements on the M2E2 benchmark.

Comments Accepted by ACM ICMR 2026

DOI: 10.1145/3805622.3810577

Abstract

Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision--Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

2602.06054 2026-05-28 cs.CL (updated version)

Are We Truly Innovating? A Qualitative and Quantitative Study of Originality in AI Research Papers

Abeer Mostafa, Thi Huyen Nguyen, Zahra Ahmadi

Affiliations: Peter L. Reichertz Institute for Medical Informatics; L3S Research Center; Lower Saxony Center for Artificial Intelligence and Causal Methods in Medicine (CAIMed)

AI summary: Based on more than 100,000 peer-review reports, analyzes the perceived dimensions of originality in AI research papers with qualitative and quantitative methods, and evaluates the reliability of LLMs in assessing originality.

Abstract

Assessing originality in AI research is arguably the most consequential yet least reliable step in peer review. Reviewer judgments of originality remain opaque, inconsistent, and dependent on comparisons to prior work that are often incomplete. In this paper, we present a large-scale, data-driven qualitative and quantitative analysis of research originality based on over 100,000 peer-review reports from leading AI venues, spanning a period of rapid growth in the field. Leveraging structured, semantically retrieved prior work and signals embedded in expert reviewer assessments, we systematically characterize how originality is perceived in practice and identify the key dimensions that most strongly influence novelty judgments. Our analysis yields a fine-grained, evidence-based framework that equips both authors and reviewers with actionable insights into how originality is evaluated. In addition, we evaluate the reliability of current large language model (LLM) agents in assessing originality. We find that these models tend to systematically overestimate novelty and struggle to detect conceptual plagiarism, particularly in the presence of paraphrasing. We release our dataset, trained models, and code at: https://anonymous.4open.science/r/Novelty-Reviewer-365C/.

2602.07574 2026-05-28 cs.CV cs.CL (updated version)

ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Wenjie Liu, Hao Wu, Xin Qiu, Xudong Wang, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Munich Center for Machine Learning, LMU Munich

AI summary: Proposes the ViCA architecture, which uses vision-only cross-attention to cut visual-token computation, reducing visual-side computation to 4% while preserving 98% of baseline accuracy and delivering substantial speedups.

Abstract

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
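
The mechanism described in this abstract, text tokens reading from visual embeddings through cross-attention while the visual tokens themselves are never updated, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `vision_cross_attention`, the single-head layout, and the residual update are assumptions made only for illustration.

```python
import numpy as np

def vision_cross_attention(text_h, vis_emb, w_q, w_k, w_v):
    """Single-head cross-attention with text queries and visual keys/values.

    text_h:  (T, d) text hidden states at a selected layer
    vis_emb: (V, d) projected visual embeddings (read-only in this sketch)
    """
    q = text_h @ w_q                                # (T, d) queries from text
    k = vis_emb @ w_k                               # (V, d) keys from vision
    v = vis_emb @ w_v                               # (V, d) values from vision
    scores = q @ k.T / np.sqrt(q.shape[-1])         # (T, V) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over visual tokens
    return text_h + attn @ v, attn                  # only the text stream changes

# Hypothetical sizes chosen for the example.
rng = np.random.default_rng(0)
d, T, V = 8, 5, 12
text_h = rng.normal(size=(T, d))
vis_emb = rng.normal(size=(V, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out, attn = vision_cross_attention(text_h, vis_emb, w_q, w_k, w_v)
```

In this sketch the cost of the step scales with T x V rather than (T + V) squared, which is the kind of saving the abstract attributes to keeping visual tokens out of self-attention.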

2602.05897 2026-05-28 cs.CL 版本更新

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

停止奖励幻觉步骤：面向小型推理模型的忠实感知步骤级强化学习

Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li

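Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd; College of Integrated Circuits, Zhejiang University, Hangzhou, Zhejiang, China; Zhongguancun Academy, Beijing, China

AI summary: To address the faithfulness hallucinations that small reasoning models tend to produce in intermediate reasoning steps, proposes Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), which combines step-level supervision from a process reward model with an implicit truncated resampling strategy to reduce hallucinations and improve reasoning reliability.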

详情

AI中文摘要

随着大型语言模型变得更小更高效，小型推理模型（SRM）在资源受限环境中实现思维链（CoT）推理至关重要。然而，它们容易产生忠实性幻觉，尤其是在中间推理步骤中。现有的基于在线强化学习的缓解方法依赖于结果奖励或粗粒度的CoT评估，这可能在最终答案正确时无意中强化不忠实的推理。为了解决这些局限性，我们提出了忠实感知步骤级强化学习（FaithRL），通过来自过程奖励模型的显式忠实奖励引入步骤级监督，以及一种隐式截断重采样策略，该策略从忠实前缀生成对比信号，同时减轻步骤级奖励的奖励黑客攻击。在多个SRM和开放书籍QA基准上的实验表明，FaithRL持续减少CoT和最终答案中的幻觉，从而实现更忠实和可靠的推理。代码可在 https://github.com/Easy195/FaithRL 获取。

英文摘要

As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes, while also mitigating reward hacking from step-level rewards. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.

2503.01829 2026-05-28 cs.CL cs.AI cs.LG cs.MA 版本更新

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

如果你能说服我：评估大型语言模型说服效果与易受影响性的框架

Nimet Beyza Bozdag, Shuhaib Mehri, Gokhan Tur, Dilek Hakkani-Tür

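Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes the PMIYC framework, which uses multi-agent conversations to automatically evaluate the persuasive effectiveness and susceptibility of LLMs, and finds significant differences across models in both persuasiveness and resistance to persuasion.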

Comments Paper published at the ACM Conference on AI and Agentic Systems 2026

详情

DOI: 10.1145/3786335.3813181

AI中文摘要

大型语言模型（LLM）展现出与人类水平相当的说服能力。虽然这些能力可用于社会公益，但也存在被滥用的风险。除了关注LLM如何说服他人外，它们自身对说服的易受影响性也构成了关键的校准挑战，引发了关于鲁棒性、安全性和伦理原则遵守的问题。为了研究这些动态，我们引入了“如果你能说服我”（PMIYC），一个用于评估多智能体交互中说服力和易受影响性的自动化框架。我们的框架提供了一种可扩展的替代方案，替代了通常用于研究LLM说服的昂贵且耗时的人工标注过程。PMIYC自动进行说服者和被说服者智能体之间的多轮对话，同时衡量说服的有效性和易受影响性。我们的综合评估涵盖了多种LLM和说服场景（例如，主观和错误信息场景）。我们通过人工评估验证了框架的有效性，并展示了与先前研究中人工评估的一致性。通过PMIYC，我们发现Llama-3.3-70B和GPT-4o表现出相似的说服效果，比Claude 3 Haiku高出30%。然而，GPT-4o在对抗错误信息方面的抵抗力比Llama-3.3-70B高出50%以上。值得注意的是，o4-mini既是有效的说服者，也是抵抗的被说服者。这些发现为LLM的说服动态提供了实证见解，并有助于开发更安全的AI系统。

英文摘要

Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used for social good, they also present risks of potential misuse. Beyond the concern of how LLMs persuade others, their own susceptibility to persuasion poses a critical alignment challenge, raising questions about robustness, safety, and adherence to ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasiveness and susceptibility to persuasion in multi-agent interactions. Our framework offers a scalable alternative to the costly and time-intensive human annotation process typically used to study persuasion in LLMs. PMIYC automatically conducts multi-turn conversations between Persuader and Persuadee agents, measuring both the effectiveness of and susceptibility to persuasion. Our comprehensive evaluation spans a diverse set of LLMs and persuasion settings (e.g., subjective and misinformation scenarios). We validate the efficacy of our framework through human evaluations and demonstrate alignment with human assessments from prior studies. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o demonstrates over 50% greater resistance to persuasion for misinformation compared to Llama-3.3-70B. Notably, o4-mini emerges as both an effective persuader and a resistant persuadee. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.

2602.03491 2026-05-28 cs.CV cs.CL 版本更新

Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

解耦骨架与血肉：基于解缠对齐和结构感知引导的高效多模态表格推理

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang

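Affiliations: Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China

AI summary: Proposes DiSCo, a disentangled structure-content alignment framework, and Table-GLS, a global-to-local structure-guided reasoning framework, which efficiently strengthen LVLM table understanding and reasoning without expensive supervision or external tools.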

Comments Accepted as a Spotlight Paper at ICML 2026

详情

AI中文摘要

由于复杂的布局和紧密耦合的结构-内容信息，对表格图像进行推理对于大型视觉语言模型（LVLM）仍然具有挑战性。现有解决方案通常依赖于昂贵的监督训练、强化学习或外部工具，限制了效率和可扩展性。这项工作解决了一个关键问题：如何以最少的标注且无需外部工具来使LVLM适应表格推理？具体来说，我们首先引入了DiSCo，一种解缠结构-内容对齐框架，在多模态对齐期间明确分离结构抽象和语义基础，高效地将LVLM适应于表格结构。在DiSCo的基础上，我们进一步提出了Table-GLS，一种全局到局部结构引导推理框架，通过结构化探索和基于证据的推理来执行表格推理。跨多个基准的大量实验表明，我们的框架高效地增强了LVLM的表格理解和推理能力，特别是泛化到未见过的表格结构。我们的数据和代码可在https://github.com/AAAndy-Zhu/TableVLM获取。

英文摘要

Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures. Our data and code are available at https://github.com/AAAndy-Zhu/TableVLM.

2602.02898 2026-05-28 cs.AI cs.CL 版本更新

Aligning Language Model Benchmarks with Pairwise Preferences

将语言模型基准与成对偏好对齐

Marco Gutierrez, Xinyi Leng, Hannah Cyberey, Jonathan Richard Schwarz, Ahmed Alaa, Thomas Hartvigsen

Affiliations: School of Data Science, University of Virginia; Imperial College London; Thomson Reuters Foundational Research; Department of Electrical Engineering and Computer Science, UC Berkeley and UCSF

AI summary: Proposes BenchAlign, which uses question-level language model performance together with pairwise model rankings to automatically reweight offline benchmarks, so that the resulting benchmark ranks unseen models in line with those preferences.

详情

AI中文摘要

语言模型基准是广泛使用的、计算高效的现实性能代理。然而，许多近期工作发现基准常常无法预测实际效用。为弥合这一差距，我们引入基准对齐，即利用有限的模型性能信息自动更新离线基准，旨在生成新的静态基准，以预测给定测试设置中的模型成对偏好。然后我们提出BenchAlign，这是该问题的首个解决方案，它利用语言模型在问题级别的性能以及可能在部署期间收集的模型成对排名，学习基准问题的偏好对齐权重，生成新的基准，根据这些偏好对先前未见过的模型进行排序。我们的实验表明，我们的对齐基准能够根据人类偏好模型准确地对未见模型进行排序，即使模型大小不同，同时保持可解释性。总体而言，我们的工作为将基准与实际人类偏好对齐的局限性提供了见解，这有助于加速模型开发以追求实际效用。

英文摘要

Language model benchmarks are pervasive and computationally efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weightings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
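
One way to picture learning question weights from ranked model pairs is a pairwise logistic (Bradley-Terry style) objective over weighted benchmark scores. The sketch below is an assumption-laden illustration, not the BenchAlign algorithm itself: the function name `fit_question_weights`, the loss, the plain gradient descent loop, and the toy data are all hypothetical.

```python
import numpy as np

def fit_question_weights(correct, pairs, lr=0.5, steps=200):
    """Fit per-question weights so preferred models get higher weighted scores.

    correct: (models, questions) matrix of 0/1 question-level results
    pairs:   list of (preferred_model, other_model) index pairs
    """
    correct = np.asarray(correct, dtype=float)
    # Per-pair difference in question-level results, shape (pairs, questions).
    diffs = np.array([correct[w] - correct[l] for w, l in pairs])
    weights = np.zeros(correct.shape[1])
    losses = []
    for _ in range(steps):
        margins = diffs @ weights                    # weighted score gaps
        # Logistic loss log(1 + exp(-margin)), averaged over pairs.
        losses.append(float(np.mean(np.logaddexp(0.0, -margins))))
        # Gradient of the loss with respect to the weights.
        grad = -(diffs * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        weights -= lr * grad
    return weights, losses

# Toy data: 4 models, 3 questions, and 4 ranked pairs (all made up).
correct = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
pairs = [(0, 1), (1, 2), (2, 3), (0, 2)]
weights, losses = fit_question_weights(correct, pairs)
scores = np.asarray(correct, dtype=float) @ weights  # reweighted benchmark scores
```

A new model would then be scored by the same weighted sum over its question-level results, which is what lets a static benchmark stand in for the pairwise preferences.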

2602.01807 2026-05-28 cs.CL cs.LG 版本更新

Sentence Curve Language Models

句子曲线语言模型

DongNyeong Heo, Taehwan Kim, Heeyoul Choi

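Affiliations: Ulsan National Institute of Science and Technology; Handong Global University

AI summary: Proposes the sentence curve representation and extends diffusion language models to predict sentence curves rather than static word embeddings, strengthening global structure modeling and achieving state-of-the-art performance among DLMs on IWSLT14 and WMT14.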

详情

AI中文摘要

语言模型（LM）是现代AI系统的核心组成部分，扩散语言模型（DLM）最近已成为一种有竞争力的替代方案。这两种范式都依赖词嵌入来表示输入句子，以及骨干模型训练预测的目标句子。我们认为，这种目标词的静态嵌入对相邻词不敏感，鼓励局部准确的词预测，而全局句子结构则较少被强调。为了解决这个问题，我们提出了一种连续的句子表示，称为句子曲线，定义为一条样条曲线，其控制点影响句子中的多个词。基于这种表示，我们引入了句子曲线语言模型（SCLM），它将DLM扩展为预测句子曲线而非静态词嵌入。我们从理论上证明，句子曲线预测会引入正则化效应，促进全局结构建模，并刻画了不同句子曲线类型如何影响这种行为。实验上，SCLM在IWSLT14和WMT14上取得了DLM中的最优性能，训练稳定且无需繁重的知识蒸馏，并在LM1B上展现出与离散DLM相比有潜力的前景。

英文摘要

Language models (LMs) are a central component of modern AI systems, and diffusion language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while global sentence structure is less emphasized. To address this, we propose a continuous sentence representation, termed sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of the static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves state-of-the-art performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.
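
The idea of a spline whose control points each influence several word positions can be made concrete with a uniform quadratic B-spline, where every sampled point blends three neighboring control points. The abstract does not say which spline family the authors use, so the curve type, the function name `quadratic_bspline`, and the sampling scheme below are assumptions for illustration only.

```python
import numpy as np

def quadratic_bspline(control, n_points):
    """Sample n_points along a uniform quadratic B-spline.

    control: (m, d) control points with m >= 3; each sampled point is a
    weighted blend of three consecutive control points.
    """
    control = np.asarray(control, dtype=float)
    m = control.shape[0]
    if m < 3:
        raise ValueError("need at least 3 control points")
    us = np.linspace(0.0, m - 2, n_points)       # curve parameter per word slot
    out = np.empty((n_points, control.shape[1]))
    for j, u in enumerate(us):
        seg = min(int(u), m - 3)                 # index of the active segment
        t = u - seg                              # local parameter in [0, 1]
        b0 = 0.5 * (1 - t) ** 2                  # basis weights sum to 1
        b1 = 0.5 * (-2 * t**2 + 2 * t + 1)
        b2 = 0.5 * t**2
        out[j] = b0 * control[seg] + b1 * control[seg + 1] + b2 * control[seg + 2]
    return out

# Moving a single control point shifts several sampled word positions.
control = np.zeros((5, 2))
base = quadratic_bspline(control, 8)
control[2] = [1.0, 1.0]
moved = quadratic_bspline(control, 8)
changed = int(np.any(moved != base, axis=1).sum())
```

Because one control point moves several sampled positions at once, a prediction target defined on control points couples neighboring words, in contrast to a static per-word embedding target.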

2602.01203 2026-05-28 cs.CL cs.LG 版本更新

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

注意力汇聚在注意力层中锻造原生MoE：针对头部坍塌的汇聚感知训练

Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

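Affiliations: Institute for Artificial Intelligence, Peking University, Beijing; School of Integrated Circuits, Peking University, Beijing

AI summary: Shows theoretically and empirically that the attention sink naturally forms a Mixture-of-Experts mechanism within attention layers, and proposes a sink-aware training algorithm that mitigates head collapse and improves model performance.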

Comments 2026 International Conference on Machine Learning (ICML)

详情

AI中文摘要

大型语言模型（LLMs）通常将不成比例的注意力分配给第一个标记，这种现象称为注意力汇聚。最近的几种方法旨在解决这个问题，包括GPT-OSS中的汇聚注意力和Qwen3-Next中的门控注意力。然而，缺乏对这些注意力机制之间关系的全面分析。在这项工作中，我们提供了理论和实证证据，表明普通注意力和汇聚注意力中的汇聚自然地在注意力层内构建了混合专家（MoE）机制。这一见解解释了先前工作中观察到的头部坍塌现象，即只有固定子集的注意力头对生成有贡献。为了缓解头部坍塌，我们提出了一种汇聚感知训练算法，该算法带有专为注意力层设计的辅助负载平衡损失。大量实验表明，我们的方法在普通注意力、汇聚注意力和门控注意力上实现了有效的头部负载平衡，并提高了模型性能。我们希望这项研究能为注意力机制提供新的视角，并鼓励进一步探索注意力层内固有的MoE结构。

英文摘要

Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.

2510.08525 2026-05-28 cs.CL 版本更新

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

哪些注意力头对推理重要？RL引导的KV缓存压缩

Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang

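Affiliations: Westlake University; McGill University; Mila - Quebec AI Institute; Zhejiang University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes RLKV, which uses reinforcement learning to identify the attention heads critical to reasoning quality, keeps the full KV cache for those heads, and aggressively compresses the rest, achieving 20-60% cache reduction with near-lossless performance.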

详情

AI中文摘要

推理型大语言模型通过扩展的思维链生成展现出复杂的推理行为，这些行为在解码过程中对信息损失高度敏感，给KV缓存压缩带来了关键挑战。现有的token丢弃方法通过移除中间步骤直接破坏推理链，而为检索任务设计的头重分配方法无法保留对生成推理至关重要的注意力头。然而，现有方法均无法识别哪些注意力头真正维持推理一致性并控制生成终止。为解决此问题，我们提出RLKV，它使用强化学习作为探针，通过直接优化注意力头缓存使用与实际生成结果的关系，发现哪些头对推理质量有贡献。这一发现自然引出了高效的压缩策略：我们对推理关键的头分配完整KV缓存，同时对其他头使用固定大小的KV缓存进行激进压缩。实验表明，少数头对推理至关重要，使得在多种任务和模型上实现20-60%的缓存减少且性能近乎无损，在60%压缩率下实现高达2.06倍的端到端加速。

英文摘要

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others with constant-size KV cache. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20-60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.
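
The compression strategy in this abstract, a full cache for some heads and a constant-size cache for the rest, implies a simple accounting of how much cache is saved. The helper below is a back-of-the-envelope sketch under stated assumptions: the function name `kv_cache_reduction`, the head counts, the sequence length, and the fixed budget are illustrative values, not figures from the paper.

```python
def kv_cache_reduction(num_heads, num_critical, seq_len, fixed_budget):
    """Fraction of KV cache entries saved when only critical heads keep everything.

    Critical heads store all seq_len entries; every other head stores at most
    fixed_budget entries regardless of sequence length.
    """
    if not 0 <= num_critical <= num_heads:
        raise ValueError("num_critical must be between 0 and num_heads")
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    full = num_heads * seq_len
    kept = num_critical * seq_len
    kept += (num_heads - num_critical) * min(seq_len, fixed_budget)
    return 1.0 - kept / full

# Hypothetical setting: half of 32 heads are critical, 8192-token context,
# and compressed heads keep 512 entries each.
reduction = kv_cache_reduction(num_heads=32, num_critical=16,
                               seq_len=8192, fixed_budget=512)
```

Under this accounting the saving grows with sequence length, since the compressed heads stay at a constant size while the critical heads scale with the context.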

2507.16679 2026-05-28 cs.CL cs.AI cs.CY 版本更新

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

PICACO: 通过总相关优化实现大语言模型的多元情境价值对齐

Han Jiang, Dongyao Zhu, Xiaoyuan Yi, Ziang Xiao, Zhihua Wei, Xing Xie

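Affiliations: Johns Hopkins University, Baltimore, MD, USA; North Carolina State University, Raleigh, NC, USA; Microsoft Research Asia, Beijing, China; Tongji University, Shanghai, China

AI summary: To address the instruction bottleneck caused by value conflicts in in-context alignment, proposes PICACO, which optimizes a meta-instruction by maximizing the total correlation between the specified values and model responses, achieving balanced pluralistic value alignment without fine-tuning.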

Comments ICML 2026

详情

AI中文摘要

情境学习在使大语言模型与人类价值对齐方面展现出巨大潜力，有助于减少有害输出并适应多样化偏好，而无需昂贵的后训练，这被称为情境对齐。然而，大语言模型对输入提示的理解仍是不可知的，限制了情境对齐处理价值冲突的能力——人类价值本质上是多元的，常常施加相互冲突的要求，例如刺激与传统。因此，当前的情境对齐方法面临指令瓶颈挑战，即大语言模型难以在单个提示中协调多个预期价值，导致对齐不完整或有偏。为了解决这个问题，我们提出了PICACO，一种新颖的多元情境对齐方法。无需微调，PICACO优化一个融合了多个价值的元指令，以更好地激发大语言模型对这些价值的理解并改进对齐。这是通过最大化指定价值与大语言模型响应之间的总相关来实现的，这从理论上强化了价值一致性并减少了干扰噪声，从而产生更有效的指令。在五个价值集上的大量实验表明，PICACO在黑盒和开源大语言模型上均表现良好，优于多个近期强基线，并在多达8个不同价值之间实现了更好的平衡。

英文摘要

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions: human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that incorporates multiple values to better elicit LLMs' understanding of them and improve alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, which theoretically reinforces value conformity and reduces distractive noise, resulting in more effective instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.

2601.19926 2026-05-28 cs.CL cs.AI 版本更新

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Transformer的语法：语言模型中句法知识可解释性研究的系统综述

Nora Graichen, Iria de-Dios-Flores, Gemma Boleda

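Affiliations: Universitat Pompeu Fabra; ICREA

AI summary: A systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs) finds that TLMs encode non-trivial syntactic knowledge but perform more weakly on syntax-semantics interface phenomena, and that research is concentrated on English and BERT-like models.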

详情

AI中文摘要

我们对337篇评估基于Transformer的语言模型（TLM）句法能力的文章进行了系统综述，报告了涵盖广泛句法现象、语言、模型和方法的3000多个数据点。这些数据共同表明，TLM编码了非平凡的句法知识。行为证据显示，TLM在形式句法现象上表现强劲，但在句法-语义接口现象上表现较弱且多变。对于数字支持较少的语言，表现也持续较低。探针和机制研究进一步支持TLM中存在句法知识。然而，由于大多数工作仍停留在观察层面，且当前方法在方法论上具有异质性，对句法处理背后的详细计算机制的洞察仍然有限。同时，文献仍然高度集中在英语和BERT类模型上。我们讨论了研究结果的意义，并为未来研究提供了建议。

英文摘要

We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on over 3,000 datapoints spanning a wide range of syntactic phenomena, languages, models, and methods. We take the data to collectively show that TLMs encode a non-trivial amount of syntactic knowledge. Behavioral evidence shows strong performance on formal syntactic phenomena, but weaker and more variable performance on phenomena at the syntax-semantics interface. Performance is also consistently lower for languages with less digital support. Probing and mechanistic studies further support the presence of syntactic knowledge in TLMs. Yet, because most work remains observational and current approaches are methodologically heterogeneous, insight into the detailed computational mechanisms underlying syntactic processing remains limited. At the same time, the literature remains heavily concentrated on English and BERT-like models. We discuss the implications of our results and provide recommendations for future research.

2601.08131 2026-05-28 cs.CL 版本更新

Attention Projection Mixing with Exogenous Anchors

基于外生锚点的注意力投影混合

Jonathan Su

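Affiliations: Independent Researcher

AI summary: To resolve the structural conflict that internal-anchor designs face when early attention projections are reused across layers, proposes ExoFormer, which learns exogenous anchor projections outside the sequential layer stack and introduces a unified normalized mixing framework, improving downstream accuracy while using fewer tokens.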

详情

AI中文摘要

早期注意力投影的跨层重用可以改善优化和数据效率，但它造成了一个结构冲突：第一层必须同时作为所有更深层的稳定、可重用的锚点和有效的计算块。我们证明这种张力限制了内部锚点设计的性能。我们提出ExoFormer，通过在序列层堆栈之外学习外生锚点投影来解决这一冲突。我们引入了一个统一的归一化混合框架，该框架使用可学习的系数（探索系数粒度：元素级、头级和标量级）混合查询、键、值和门控对数，并表明归一化锚点源是稳定重用的关键。ExoFormer变体始终优于其内部锚点对应物，动态变体在匹配验证损失的情况下，使用比Gated Attention少1.5倍的令牌，获得1.5倍的下游准确率。我们通过卸载假说解释这种有效性：外部锚点保留必要的令牌身份，使层能够专门专注于特征变换。我们发布代码和模型以促进未来研究。

英文摘要

Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.

2509.06350 2026-05-28 cs.CL cs.AI cs.CR 版本更新

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Mask-GCG：对抗性后缀中的所有标记对于越狱攻击都是必要的吗？

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

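Affiliations: Politecnico di Milano; Beihang University; East China Normal University; Fudan University; University of the Chinese Academy of Sciences; AI Security Lab (360 AI Security Lab)

AI summary: Proposes Mask-GCG, which uses learnable token masking to identify high-impact tokens in the suffix and prune low-impact ones, lowering computational overhead while maintaining the attack success rate and revealing token redundancy in LLM prompts.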

Comments Accepted to ICASSP 2026

详情

DOI: 10.1109/ICASSP55912.2026.11462363
Journal ref: 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13887-13891, 2026

AI中文摘要

针对大型语言模型（LLM）的越狱攻击已展示了多种成功方法，攻击者操纵模型生成其本应避免的有害响应。其中，贪婪坐标梯度（GCG）作为一种通用且有效的方法，通过优化后缀中的标记来生成可越狱的提示。尽管已提出多种GCG的改进变体，但它们都依赖于固定长度的后缀。然而，这些后缀中潜在的冗余尚未被探索。在这项工作中，我们提出Mask-GCG，一种即插即用的方法，采用可学习的标记掩码来识别后缀中的高影响力标记。我们的方法增加了高影响力位置标记的更新概率，同时剪枝低影响力位置的标记。这种剪枝不仅减少了冗余，还降低了梯度空间的大小，从而减少了计算开销，并缩短了实现成功攻击所需的时间。我们将Mask-GCG应用于原始GCG及其多种改进变体进行评估。实验结果表明，后缀中的大多数标记对攻击成功有显著贡献，剪枝少数低影响力标记不会影响损失值或攻击成功率（ASR），从而揭示了LLM提示中的标记冗余。我们的发现从越狱攻击的角度为开发高效且可解释的LLM提供了见解。

英文摘要

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

2601.18116 2026-05-28 cs.CL 版本更新

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

论口语语言模型评估中全局令牌困惑度的谬误

Chan-Jan Hsu, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso

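Affiliations: Carnegie Mellon University; National Taiwan University; Toyota Technological Institute at Chicago; Massachusetts Institute of Technology

AI summary: To address the practice of computing perplexity over speech tokens with the text perplexity formula when evaluating spoken language models, proposes new likelihood- and generation-based evaluation methods that reflect generation quality more faithfully and show a narrower gap between the best model and the human topline.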

详情

AI中文摘要

在大规模原始音频上预训练的生成式口语语言模型能够以适当内容继续语音提示，同时保留说话人和情感等属性，作为口语对话的基础模型。在先前文献中，这些模型通常使用“全局令牌困惑度”进行评估，该指标直接将文本困惑度公式应用于语音令牌。然而，这种做法忽略了语音和文本模态之间的根本差异，可能导致对语音特性的低估。在这项工作中，我们提出了多种基于似然和生成的评估方法，以替代朴素的全局令牌困惑度。我们证明，所提出的评估更忠实地反映了感知生成质量，与人类评分的平均意见得分（MOS）具有更强的相关性。在新指标下评估时，口语语言模型的相对性能格局被重塑，揭示了最佳性能模型与人类基线之间的差距显著缩小。总之，这些结果表明，适当的评估对于准确评估口语语言建模的进展至关重要。

英文摘要

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
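
For reference, the text perplexity formulation that this abstract says is applied directly to speech tokens is the exponential of the average negative log-likelihood per token. The sketch below assumes that "global" means pooling every token of the evaluation set into one average; that reading, and the function name `global_token_perplexity`, are assumptions rather than definitions taken from the paper.

```python
import math

def global_token_perplexity(token_log_probs):
    """Perplexity pooled over all tokens of all utterances.

    token_log_probs: list of utterances, each a list of natural-log
    probabilities that the model assigned to the observed tokens.
    """
    total = sum(lp for utt in token_log_probs for lp in utt)
    count = sum(len(utt) for utt in token_log_probs)
    if count == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(-total / count)

# A model that assigns probability 0.25 to every token has a perplexity
# of approximately 4, whatever the utterance lengths are.
log_quarter = math.log(0.25)
ppl = global_token_perplexity([[log_quarter] * 3, [log_quarter] * 5])
```

Every token contributes equally to this average, so the score does not distinguish which aspects of the speech signal the tokens encode.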

2601.03549 2026-05-28 cs.CV cs.CL 版本更新

FEA-SLT: A Gloss-Free End-to-End Framework for Facial-Expression-Aware Sign Language Translation

FEA-SLT：一种面向面部表情感知的手语翻译的无词汇端到端框架

Guobin Tu, Di Weng

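Affiliations: School of Software Technology, Zhejiang University

AI summary: Proposes FEA-SLT, whose facial-expression-aware fusion module uses facial dynamics as semantic anchors to resolve manual ambiguity in gloss-free sign language translation, achieving state-of-the-art BLEU among gloss-free methods on PHOENIX14T and CSL-Daily.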

详情

AI中文摘要

手语翻译（SLT）是一项具有挑战性的跨模态任务，需要对手部动作和非手动信号进行联合建模。现有的无词汇SLT方法有效捕捉手势动态，但常常未充分利用面部表情，而面部表情在语法和消除歧义中起着关键作用。当不同概念共享相似手部配置时，这一限制可能导致语义退化。为解决此问题，我们提出FEA-SLT（面部表情感知手语翻译），一种无词汇端到端框架，利用面部动态作为语义锚点来消除手部歧义。FEA-SLT采用领域迁移的面部编码器提取表情敏感表示，并通过语言约束的面部表情感知融合（FEAF）模块将其与手部特征集成。FEAF通过双向调制捕捉手部和面部通道之间的相互依赖关系，增强句法保真度。在PHOENIX14T和CSL-Daily上的实验表明，FEA-SLT在无词汇方法中实现了最先进的BLEU性能，而针对性分析证实了其对面部敏感语句翻译的改进。代码可在 https://github.com/TuGuobin/FEA-SLT 获取。

英文摘要

Sign Language Translation (SLT) is a challenging cross-modal task requiring joint modeling of manual articulations and non-manual signals. Existing gloss-free SLT methods effectively capture gestural dynamics but often underutilize facial expressions, which play crucial grammatical and disambiguating roles. This limitation can cause semantic degradation when distinct concepts share similar manual configurations. To address this issue, we propose FEA-SLT (Facial-Expression-Aware Sign Language Translation), a gloss-free end-to-end framework that uses facial dynamics as semantic anchors for resolving manual ambiguity. FEA-SLT employs a domain-transferred facial encoder to extract expression-sensitive representations and integrates them with manual features through a linguistically constrained Facial-Expression-Aware Fusion (FEAF) module. FEAF captures reciprocal dependencies between manual and facial channels via bidirectional modulation, enhancing syntactic fidelity. Experiments on PHOENIX14T and CSL-Daily show that FEA-SLT achieves state-of-the-art BLEU performance among gloss-free methods, while targeted analyses confirm improved translation of facial-sensitive utterances. Code is available at https://github.com/TuGuobin/FEA-SLT.

2512.23959 2026-05-28 cs.CL cs.AI cs.LG 版本更新

HGMEM: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling

HGMem：基于超图的工作记忆以改进长上下文复杂关系建模的多步RAG

Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu

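Affiliations: The Chinese University of Hong Kong; Pengcheng Laboratory; WeChat AI, Tencent; University of Chinese Academy of Sciences

AI summary: Proposes HGMem, a hypergraph-based working memory system in which hyperedges represent memory units and high-order interactions form progressively, strengthening global understanding and complex reasoning in multi-step RAG.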

Comments ICML 2026; Code released at https://github.com/Encyclomen/HGMem

详情

AI中文摘要

多步检索增强生成（RAG）已成为增强大型语言模型（LLMs）在需要全局理解和密集推理任务上的广泛采用策略。尽管许多RAG系统整合了工作记忆来整合信息，但现有设计主要作为孤立事实的被动存储。这种静态特性忽略了原始事实之间的关键高阶相关性，从而限制了模型的多步推理能力，导致在扩展上下文中的碎片化推理和弱全局理解。我们引入了HGMem，一种基于超图的工作记忆系统，将记忆的概念从简单存储扩展到动态、表达性结构，用于复杂推理和全局理解。在我们的方法中，记忆被表示为超图，其中超边对应不同的记忆单元，使得记忆内高阶交互的逐步形成成为可能。该机制连接围绕焦点问题的事实和思考，将记忆演变为一个集成且情境化的知识结构，为更深层次的推理提供强有力的命题。我们在几个具有挑战性的全局理解基准上评估了HGMem。大量实验和深入分析表明，我们的方法持续改进了多步RAG，并在不同数据集上显著优于强基线系统。

英文摘要

Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Although many RAG systems incorporate a working memory to consolidate information, existing designs primarily function as a passive storage for isolated facts. This static nature overlooks crucial high-order correlations among primitive facts, thereby limiting models' capacity for multi-step reasoning and resulting in fragmented reasoning and weak global sense-making within extended contexts. We introduce HGMem, a hypergraph-based working memory system, extending the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph where hyperedges correspond to distinct memory units, enabling the progressive formation of high-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving the memory into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning. We evaluate HGMem on several challenging global sense-making benchmarks. Extensive experiments and in-depth analyses demonstrate that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse datasets.
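
The data structure described in this abstract, a hypergraph whose hyperedges are memory units linking any number of facts, can be sketched as a small Python class. This is an illustrative toy and not the released HGMem code: the class name `HypergraphMemory`, its methods, and the merge rule are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class HypergraphMemory:
    facts: dict = field(default_factory=dict)        # fact_id -> text
    hyperedges: dict = field(default_factory=dict)   # edge_id -> (fact ids, summary)
    _next_edge: int = 0

    def add_fact(self, fact_id, text):
        self.facts[fact_id] = text

    def add_unit(self, fact_ids, summary):
        """Create a memory unit (hyperedge) over any number of known facts."""
        missing = set(fact_ids) - set(self.facts)
        if missing:
            raise KeyError(f"unknown facts: {sorted(missing)}")
        edge_id = self._next_edge
        self._next_edge += 1
        self.hyperedges[edge_id] = (frozenset(fact_ids), summary)
        return edge_id

    def merge_units(self, a, b, summary):
        """Replace two units with one higher-order unit over the union of facts."""
        if a == b:
            raise ValueError("cannot merge a unit with itself")
        facts_a, _ = self.hyperedges[a]
        facts_b, _ = self.hyperedges[b]
        del self.hyperedges[a], self.hyperedges[b]
        return self.add_unit(facts_a | facts_b, summary)

    def units_for(self, fact_id):
        """Ids of all memory units that mention the given fact."""
        return [e for e, (fs, _) in self.hyperedges.items() if fact_id in fs]

mem = HypergraphMemory()
for i, text in enumerate(["A met B", "B works at C", "C is based in D"]):
    mem.add_fact(i, text)
u1 = mem.add_unit([0, 1], "A is connected to C through B")
u2 = mem.add_unit([1, 2], "B is located in D")
u3 = mem.merge_units(u1, u2, "A is connected to D")
```

Unlike an edge in an ordinary graph, each unit here can span more than two facts, which is what allows merged units to represent higher-order relations.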

2512.17375 2026-05-28 cs.LG cs.CL cs.CR 版本更新

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

AdvJudge-Zero：通过对抗控制令牌在LLM作为评判者中实现二元决策翻转

Tung-Ling Li, Yuhao Wu, Hongliang Liu

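Affiliations: Palo Alto Networks

AI summary: Proposes AdvJudge-Zero, which samples low-perplexity tokens from the judge model's own distribution to flip an LLM judge's binary verdict from "No" to "Yes" without gradient-based optimization, and uses the discovered token pool to build a defense that hardens the judge.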

AI中文摘要

LLM作为评判者系统在现代RLHF和RLVR流程中提供奖励信号，但其二元判决简化为一个隐藏状态上的单一线性读出F_gap。我们证明该读出足够浅，以至于短且低困惑度的令牌可以将判决从“否”翻转为“是”。这些令牌是从评判者自身在响应位置的下一个令牌分布中采样的，无需手动设置种子或基于梯度的优化。我们的方法AdvJudge-Zero在六个Qwen、Llama和Gemma评判者上的24个（模型，数据集）单元中，有22个实现了>90%的集成假阳性率，而先前策划的10令牌基准为54-72%，并且发现的表面跨格式转移到70B标量奖励模型。相同的发现池使得防御成为可能：基于9类机制分类法分层的LoRA微调，在相同池上的朴素采样失败的跨族泛化中增强了鲁棒性，其中机制广度而非池大小带来了增益。在GRPO训练下，硬化后的评判者消除了未硬化基线在MATH和GSM8K上每个条件十个种子时观察到的奖励崩溃失败（假阳性峰值和长度崩溃）。发现的池、机制分类法和每个提示的翻转记录将在负责任的披露下发布。

英文摘要

LLM-as-a-Judge systems supply the reward signal in modern RLHF and RLVR pipelines, but their binary verdict reduces to a single linear readout F_gap on one hidden state. We show this readout is shallow enough that short, low-perplexity tokens flip the verdict from "No" to "Yes". These tokens are sampled from the judge's own next-token distribution at the response position, with no manual seed set and no gradient-based optimization. Our procedure, AdvJudge-Zero, reaches >90% ensemble false-positive rate on 22 of 24 (model, dataset) cells across six Qwen, Llama, and Gemma judges, versus 54-72% for the prior curated 10-token benchmark, and the discovered surface transfers cross-format to a 70B scalar reward model. The same discovered pool enables a defense: a LoRA fine-tune stratified by a 9-class mechanism taxonomy hardens cross-family generalization where naive sampling on the same pool fails, with mechanism breadth rather than pool size carrying the gain. Under GRPO training, the hardened judge eliminates the reward-collapse failures (false-positive spikes and length collapse) we observe in the unhardened baseline on both MATH and GSM8K at ten seeds per condition. The discovered pool, the mechanism taxonomy, and per-prompt flip records will be released under responsible disclosure.

2512.01970 2026-05-28 cs.AI cs.CL 版本更新

Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies

原子技能是前提：当强化学习合成组合推理时，以及当它仅放大时

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, Victor Zhong

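Affiliations: University of Waterloo; Duke University; National University of Singapore; Princeton University; Peking University; University of California, Santa Barbara

AI summary: Uses a complementary-reasoning task to study whether reinforcement learning synthesizes new skills or merely amplifies existing ones, and finds that RL synthesizes new composite strategies only after the base model has mastered the independent atomic skills through supervised fine-tuning.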

Comments Work in Progress. Code and data are available at https://github.com/sitaocheng/from_atomic_to_composite

AI中文摘要

强化学习（RL）仅仅是放大现有技能，还是合成新技能？我们通过互补推理的视角研究这个问题：互补推理是整合内部知识与外部上下文的关键实践能力，是可靠的持续学习和检索增强生成的前提。为了避免预训练污染，我们构建了一个受控的语义合成传记数据集，并将这种能力分解为两个原子技能：参数推理（检索模型权重中编码的事实）和上下文推理（处理新的上下文信息）。我们有两个发现。首先，直接在复合任务上监督训练的模型在已知事实和推理路径上达到高准确率（90%），但在新事实和推理路径上崩溃（18%），表明监督微调（SFT）依赖于死记硬背而非真正的技能整合。其次，RL弥合了这一泛化差距，充当技能合成器而非仅仅是放大器——但只有在严格的前提条件下：只有当基础模型首先通过SFT掌握了独立的原子技能时，它才能合成新的组合策略。这些结果表明，解耦的原子训练后接RL为复杂的新推理提供了一条可扩展的路径。

英文摘要

Does Reinforcement Learning (RL) merely amplify existing skills, or synthesize novel skills? We investigate this question through the lens of Complementary Reasoning: the critical practical capability of integrating internal knowledge with external context, a prerequisite for reliable Continual Learning and Retrieval-Augmented Generation. To avoid pre-training contamination, we construct a controlled semanticsynthetic dataset of biographies and decompose this capability into two atomic skills: Parametric Reasoning (retrieving facts encoded in model weights) and Contextual Reasoning (processing novel in-context information). We present two findings. First, models supervised directly on the composite task reach high accuracy on seen facts and reasoning paths (90%) but collapse on novel facts and reasoning paths (18%), indicating that Supervised Fine-Tuning (SFT) relies on rote memorization rather than genuine skill integration. Second, RL bridges this generalization gap, acting as a skill synthesizer rather than a mere amplifier--but only under a strict prerequisite: it synthesizes new composite strategies only when the base model has first mastered the independent atomic skills via SFT. These results suggest that decoupled atomic training followed by RL offers a scalable path to complex novel reasoning.

2511.05550 2026-05-28 cs.SD cs.CL cs.LG 版本更新

Assessing Factual Music Comprehension in Large Audio Language Models

评估大型音频语言模型中的事实音乐理解能力

Daniel Chenyu Lin, Michael Freeman, John Thickstun

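AI summary: Shows that the existing MusicQA dataset cannot measure whether model answers are factually correct, and proposes an evaluation protocol built on verifiable information that scores models objectively with precision, recall, and F1; defines six factual retrieval tasks over three datasets and benchmarks nine recent LALMs.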

Comments 16 pages; second submission

AI中文摘要

大型音频语言模型（LALMs）利用多模态表示生成对音频自然语言查询的开放式回答。本文（1）提供经验证据表明，使用流行的MusicQA数据集评估LALMs无法衡量模型关于音乐的回答是否事实正确，（2）开发了一种新的评估LALMs音乐理解能力的协议。具体来说，我们提出一个评估协议，提示LALM提供可事实验证的信息，并将其开放式回答解析为结构化格式，使用精确率、召回率和F1分数进行客观评估。利用该协议，我们定义了一个基准测试，包含在三个不同数据集（MusicNet、Free Music Archive和OverClocked ReMix）上定义的六项事实信息检索任务。我们对九个最近的LALMs进行了基准测试，包括前沿模型如Gemini和最新的开放模型如Music Flamingo，并在https://github.com/DCL2004/LALM-Eval发布了评估脚本套件，以方便新LALMs的基准测试。

英文摘要

Large audio language models (LALMs) leverage multimodal representations to generate open-ended answers to natural language queries about audio. In this paper, we (1) provide empirical evidence that assessment of LALMs using the popular MusicQA dataset fails to measure whether a model's responses about music are factually correct, and (2) develop a new protocol for assessing the music comprehension capabilities of LALMs. Specifically, we propose an evaluation protocol that prompts a LALM for factually verifiable information, and parses its open-ended response into a structured format that can be objectively assessed using Precision, Recall, and F1 scores. Using this protocol, we define a benchmark consisting of six factual information retrieval tasks defined on three diverse datasets: MusicNet, the Free Music Archive, and OverClocked ReMix. We benchmark nine recent LALMs, including frontier models like Gemini and the latest open models like Music Flamingo, and release the suite of evaluation scripts at https://github.com/DCL2004/LALM-Eval to facilitate benchmarking of new LALMs.
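The scoring step of such a protocol can be illustrated with a minimal sketch. It assumes the model's open-ended answer has already been parsed into a list of items (for example, instrument names); the function name, the lowercase normalization, and the example values are illustrative assumptions, not taken from the paper or from the LALM-Eval scripts.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 for one parsed answer.

    Items are compared after trimming whitespace and lowercasing, which is
    an assumed normalization rather than the paper's exact matching rule.
    """
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {r.strip().lower() for r in reference if r.strip()}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical parsed answer versus hypothetical ground-truth metadata.
p, r, f1 = precision_recall_f1(
    ["Violin", "cello", "flute"],
    ["violin", "cello", "viola", "piano"],
)
```

Here two of the three predicted items are correct and two of the four reference items are found, so precision is 2/3, recall is 1/2, and F1 is their harmonic mean, 4/7.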

2510.17620 2026-05-28 cs.CL 版本更新

Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models

忘记知识，记住使用：面向大型语言模型的上下文感知遗忘

Yuefeng Peng, Parnian Afshar, Megan Ganji, Thomas Butler, Amir Houmansadr, Mingxian Wang, Dezhi Hong

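Affiliations: University of Massachusetts Amherst; Amazon

AI summary: Existing unlearning methods impair contextual utility; this work proposes a plug-in objective term that restores the model's ability to use forgotten knowledge when it appears in context while preserving forgetting effectiveness and retain-set performance.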

Comments ICML 2026

AI中文摘要

大型语言模型可能编码需要移除的敏感信息或过时知识，以确保模型响应负责任且合规。遗忘学习已成为完整重新训练的高效替代方案，旨在移除特定知识同时保持模型整体效用。现有遗忘方法评估关注（1）目标知识的遗忘程度（遗忘集）和（2）保留集上的性能（即效用）。然而，这些评估忽略了一个重要的可用性方面：如果提示中重新引入已移除信息，用户可能仍希望模型利用该信息。在对六种最先进遗忘方法的系统评估中，我们发现它们一致损害了这种上下文效用。为解决此问题，我们用一个插件项增强遗忘目标，该插件项保留模型在上下文中存在已遗忘知识时使用它的能力。大量实验表明，我们的方法将上下文效用恢复到接近原始水平，同时仍然保持有效的遗忘和保留集效用。

英文摘要

Large language models may encode sensitive information or outdated knowledge that needs to be removed, to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.

2510.11170 2026-05-28 cs.LG cs.AI cs.CL 版本更新

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

EAGer: 基于熵感知的自适应推理时缩放生成方法

Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

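Affiliations: University of Groningen; University of Milan - Bicocca; Cohere Labs

AI summary: Proposes EAGer, a training-free generation method that uses the token-wise entropy distribution to allocate compute dynamically, improving performance on complex reasoning tasks while reducing redundant computation.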

AI中文摘要

随着推理语言模型和测试时缩放方法作为提升模型性能范式的兴起，通常需要大量计算来从同一提示生成多个候选序列。这允许探索通向正确答案的不同推理路径，然而，为每个提示分配相同的计算预算。基于不同提示具有不同复杂度因而需要不同计算量的假设，我们提出EAGer，一种无需训练的生成方法，通过逐词熵分布利用模型不确定性来减少冗余计算并同时提升整体性能。EAGer仅在存在高熵词时分支到多个推理路径，并将节省的计算预算重新分配到最需要探索替代路径的实例上。我们在复杂推理基准上对多个开源模型验证了EAGer，特别是在AIME 2025上展示了增益。当目标标签可访问时（如在RLVR训练流程中），EAGer在Pass@k上提升高达37%，且token减少59%；在测试时设置中，与全并行采样相比，仍能在Pass@k上提升12%，且token减少64%。

英文摘要

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, but it allocates the same compute budget to every prompt. Grounded in the assumption that different prompts carry different degrees of complexity, and thus have different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and reallocates the saved compute budget to instances where exploration of alternative paths is most needed. We validate EAGer across multiple open-source models on complex reasoning benchmarks, with gains specifically demonstrated on AIME 2025. When target labels are accessible -- as in RLVR training pipelines -- EAGer achieves up to +37% in Pass@k and 59% fewer tokens; in test-time settings it still yields +12% in Pass@k and 64% fewer tokens compared to Full Parallel Sampling.
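The idea of branching only at high-entropy tokens can be sketched as follows. This is a toy illustration under stated assumptions: the per-step next-token distributions are given as plain probability lists, and the threshold of 1.0 nat and both function names are hypothetical choices, not values or APIs from the EAGer paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_positions(step_distributions, threshold):
    """Indices of generation steps whose entropy exceeds the threshold.

    Only these steps would spawn additional reasoning paths; all other
    steps continue along a single path.
    """
    return [
        i
        for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]


# Hypothetical distributions over a 4-token vocabulary at three steps.
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform step, entropy ln(4)
    [0.90, 0.05, 0.03, 0.02],  # fairly confident step
]
positions = branch_positions(steps, threshold=1.0)
```

With these inputs only the uniform step exceeds the threshold, so a generator following this rule would branch once instead of sampling every path in parallel.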

2510.10185 2026-05-28 cs.CL cs.AI cs.MA 版本更新

Auditing medical multi-agent AI reveals risks of false consensus

审计医疗多智能体AI揭示虚假共识风险

Yinghao Zhu, Lei Gu, Zixiang Wang, Haoran Sang, Dehao Sui, Wen Tang, Lan Mi, Yasha Wang, Junyi Gao, Liang Yao, Tianfan Fu, Ewen Harrison, Lequan Yu, Liantao Ma

Affiliations: National Engineering Research Center for Software Engineering, Peking University; School of Computing and Data Science, The University of Hong Kong; Department of Nephrology, Peking University Third Hospital; Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Department of Lymphoma, Peking University Cancer Hospital & Institute; Department of Automation, Tsinghua University; Centre for Medical Informatics, The University of Edinburgh; Health Data Research UK; Lee Kong Chian School of Medicine, Nanyang Technological University; State Key Laboratory for Novel Software Technology, School of Computer Science, Nanjing University

AI summary: Proposes MedAgentAudit, a framework that uses an expert-validated audit pipeline to diagnose collaborative failure modes in medical multi-agent systems, uncovering systematic risks such as false consensus and authority bias.

Comments Code and Data: https://github.com/MedX-PKU/MedAgentAudit

AI中文摘要

大型语言模型正越来越多地被组装成医疗多智能体系统，通过专家角色、同行评审和共识形成模拟多学科会诊。然而，在临床决策支持中，表面共识并不足够。临床医生还需要知道智能体是否检查了证据、处理了分歧并保持了不确定性可见。当前评估主要关注最终准确性，未测试协作过程的安全性。本文介绍MedAgentAudit，一个基于临床的工作流审计框架，用于诊断和量化医疗多智能体系统中的协作失败模式。从3,600个执行日志中，我们推导出一个经专家验证的十种常见失败分类法，涵盖任务理解、协作讨论以及综合与决策。随后，我们部署一个经专家验证的自动审计器作为非干预探针，覆盖14,400个案例，涉及六种多智能体架构、六个医疗文本和视觉数据集以及每种模态的四个大语言模型设置。跨系统而言，协作带来不均衡的准确性提升和频繁的过程失败。16.63%的案例中存在无依据的观察结果，并向下游传播。在讨论中，智能体在98.42%的案例中重复初始观点而非重新审视证据，并在42.73%的案例中未能激活专家推理。在综合阶段，最终答案常常用权威或多数票替代证据检查，显示出权威偏差（28.76%，从35.30%上升至68.75%）、自我矛盾（18.53%）、矛盾忽视（5.48%）和少数派压制（5.11%）。MedAgentAudit将医疗AI评估从输出评分重新定义为过程级安全与问责，为医学中透明、可审计且由临床医生监督的智能体系统提供了实践基础。

英文摘要

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through specialist roles, peer review and consensus formation. In clinical decision support, however, apparent consensus is not enough. Clinicians also need to know whether agents checked the evidence, addressed disagreement and kept uncertainty visible. Current evaluations largely score final accuracy, leaving the safety of the collaborative process untested. Here we introduce MedAgentAudit, a clinically grounded workflow audit framework for diagnosing and quantifying collaborative failure modes in medical multi-agent systems. From 3,600 execution logs, we derive an expert-validated taxonomy of ten recurrent failures spanning task comprehension, collaborative discussion, and synthesis and decision-making. We then deploy an expert-validated automated auditor as non-interventional probes across 14,400 cases, covering six multi-agent architectures, six medical text and vision datasets, and four large language model settings per modality. Across systems, collaboration yields uneven accuracy gains and frequent process failures. Unsupported observations affect 16.63% of cases and propagate downstream. In discussion, agents repeat initial views in 98.42% of cases rather than re-examining evidence, and fail to activate specialist reasoning in 42.73%. During synthesis, final answers often substitute authority or majority count for evidence checking, showing authority bias in 28.76% (rising from 35.30% to 68.75% across rounds), self-contradiction in 18.53%, contradiction neglect in 5.48% and minority suppression in 5.11%. MedAgentAudit reframes medical AI evaluation from output scoring to process-level safety and accountability, providing a practical foundation for transparent, auditable and clinician-supervised agentic systems in medicine.

2510.06974 2026-05-28 cs.CL 版本更新

Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups

探究中文大语言模型中的社会身份偏见：基于性别代词与社会群体

Geng Liu, Feng Li, Junjie Mu, Mengxiao Zhu, Francesco Pierri

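Affiliations: Department of Electronics, Information and Bioengineering, Politecnico di Milano; University of Science and Technology of China

AI summary: Using prompts designed around linguistic properties of Mandarin, evaluates sentiment and toxicity bias under ingroup and outgroup framings across 240 social groups in ten representative Chinese LLMs; finds systematic asymmetries, that instruction tuning reduces sentiment bias while toxicity gaps persist, and that the feminine-marked pronoun is associated with higher toxicity.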

AI中文摘要

大型语言模型（LLMs）越来越多地部署在面向用户的应用程序中，引发了对它们可能反映和放大社会偏见的担忧。我们使用针对中文的提示，在十种代表性模型中研究了中文LLMs中的社会身份偏见。我们的评估比较了240个在中国语境中显著的社会群体的内群体（“我们”）和外群体（“他们”）框架，使用了一个双层测量框架来评估情感和毒性。提示设计明确考虑了中文的语言特性，包括默认性别中立复数代词与其明确女性对应代词之间的区别，从而能够对社会身份框架效应进行受控比较。跨模型观察，我们发现了系统性的内群体-外群体不对称性，尽管其表达在不同测量维度上有所不同。特别是，指令调优通常减少情感不对称性，而毒性差距仍然更为持久。此外，在多个模型中，女性标记的复数代词比默认性别中立复数代词与更高的毒性相关。我们的研究引入了一个针对中文LLMs的语言感知评估框架，并表明（i）先前在英语中记录的社会身份偏见在中文中也有所体现，以及（ii）中文特有的语言结构可以揭示在仅英语环境中无法直接观察到的偏见模式。

英文摘要

Large language models (LLMs) are increasingly deployed in user-facing applications, raising concerns that they may reflect and amplify social biases. We investigate social identity biases in Chinese LLMs using Mandarin-specific prompts across ten representative models. Our evaluation compares ingroup ("We") and outgroup ("They") framings across 240 social groups salient in the Chinese context, using a two-tiered measurement framework that assesses both sentiment and toxicity. The prompt design explicitly accounts for linguistic properties of Mandarin, including the distinction between the default gender-neutral plural pronoun and its explicitly feminine counterpart, enabling a controlled comparison of social identity framing effects. Across models, we observe systematic ingroup-outgroup asymmetries, although their expression differs across measurement dimensions. In particular, instruction tuning often reduces sentiment asymmetries, while toxicity gaps remain more persistent. Moreover, the feminine-marked plural pronoun is associated with higher toxicity than the default gender-neutral plural in several models. Our study introduces a language-aware evaluation framework for Chinese LLMs and shows that (i) social identity biases previously documented in English also manifest in Chinese and that (ii) Mandarin-specific linguistic structure can reveal bias patterns that are not directly observable in English-only settings.

2510.05291 2026-05-28 cs.CL 版本更新

Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

Camellia: 亚洲语言中LLMs文化偏见的基准测试

Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, Tanmoy Chakraborty, Yuki Arase, Keisuke Sakaguchi, JinYeong Bak, Wei Xu

Affiliations: Georgia Institute of Technology; Samsung R&D Institute Philippines; Sungkyunkwan University; Tohoku University; National University of Singapore; University of Copenhagen; University of Michigan; Indian Institute of Technology Delhi; Institute of Science Tokyo

AI summary: Proposes the Camellia benchmark, which evaluates multilingual LLMs' bias between Asian and Western cultural entities in nine Asian languages across three tasks, finding difficulty with cultural adaptation, differing sentiment associations, and performance gaps in entity extraction.

AI中文摘要

随着大语言模型（LLMs）多语言能力的增强，它们对文化多样性实体的敏感性变得越来越重要。Naous等人（2024）的前期工作表明，LLMs在阿拉伯语中往往偏好与西方相关的实体。由于缺乏以实体为中心的多语言基准，这种偏见是否也存在于各种非西方语言中尚不清楚。在本文中，我们介绍了Camellia，这是一个用于评估九种亚洲语言（涵盖六种亚洲文化）中实体中心文化偏见的基准。Camellia包括19,530个手动注释的实体，这些实体与所涵盖的亚洲或西方文化相关，以及从社交媒体帖子中提取的2,173个这些实体的掩码上下文。利用Camellia，我们在三个任务中评估了四个最近的多语言LLMs的文化偏见：文化上下文适应、情感关联和实体抽取式问答。我们的分析表明，LLMs在这些语言中难以进行文化适应，不同地区开发的模型表现存在差异。我们进一步观察到，不同的LLM家族可能持有不同的偏见，这反映在它们将文化与特定情感联系起来的方式上。最后，我们发现LLMs在某些亚洲语言中可能难以理解上下文，从而在实体抽取中造成文化之间的性能差距。

英文摘要

As Large Language Models (LLMs) develop stronger multilingual capabilities, their sensitivity to culturally diverse entities becomes increasingly important. Prior work by Naous et al. (2024) has shown that LLMs often favor Western-associated entities in Arabic. Due to the lack of entity-centric multilingual benchmarks, it remains unclear if such biases also manifest in various non-Western languages. In this paper, we introduce Camellia, a benchmark for evaluating entity-centric cultural biases in nine Asian languages, spanning six Asian cultures. Camellia includes 19,530 manually annotated entities associated with the covered Asian or Western cultures, as well as 2,173 masked contexts for these entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLMs across three tasks: cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show that LLMs struggle with cultural adaptation across these languages, with performance differing across models developed in different regions. We further observe that different LLM families can hold distinct biases, reflected in the ways they link cultures to particular sentiments. Lastly, we find that LLMs can struggle with context understanding in some Asian languages, creating performance gaps between cultures in entity extraction.

URL PDF HTML ☆

赞 0 踩 0

2510.02329 2026-05-28 cs.CL cs.AI 版本更新

SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

SelfJudge: 通过自监督验证器加速推测解码

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee

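Affiliations: Efficient AI; Large Language Model; Speculative Decoding

AI summary: Proposes SelfJudge, which trains judge verifiers through self-supervision of the target model and accelerates speculative decoding by assessing whether token-substituted responses preserve meaning, achieving a better inference-accuracy trade-off.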

Journal ref: ICML 2026

AI中文摘要

推测解码通过验证来自草稿模型的候选令牌与较大目标模型的匹配来加速LLM推理。最近的验证解码通过放宽验证标准，接受可能与目标模型输出存在微小差异的草稿令牌来加速这一过程，但现有方法受限于依赖人工标注或具有可验证真实结果的任务，限制了其在多样化NLP任务中的泛化能力。我们提出SelfJudge，通过目标模型的自监督训练验证器。我们的方法通过评估令牌替换后的响应是否保持原始响应的意义来衡量语义保持性，从而实现在多样化NLP任务中的自动验证器训练。实验表明，SelfJudge在推理-准确率权衡上优于验证解码基线，为更快的LLM推理提供了广泛适用的解决方案。

英文摘要

Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding speeds this up by relaxing the verification criteria, accepting draft tokens that may differ slightly from the target model's output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves better inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
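The difference between strict and relaxed verification can be shown with a small sketch. It is a simplification under stated assumptions: tokens are plain strings, the target model's tokens are precomputed, and `judge_scores` stands in for a hypothetical verifier's per-token estimate that the substitution preserves meaning. The function name and threshold are illustrative and are not SelfJudge's actual interface.

```python
def accept_prefix(draft_tokens, target_tokens, judge_scores, threshold=0.5):
    """Return the accepted draft prefix under a relaxed verification rule.

    A draft token is kept if it equals the target model's token, or if the
    (hypothetical) judge score is at least the threshold. Verification
    stops at the first rejected token, as in speculative decoding.
    """
    accepted = []
    for draft, target, score in zip(draft_tokens, target_tokens, judge_scores):
        if draft == target or score >= threshold:
            accepted.append(draft)
        else:
            break
    return accepted


draft = ["The", "film", "was", "great", "overall"]
target = ["The", "movie", "was", "good", "."]
scores = [1.0, 0.9, 1.0, 0.8, 0.1]  # hypothetical judge outputs

# An infinite threshold disables the judge, leaving exact matching only.
strict = accept_prefix(draft, target, scores, threshold=float("inf"))
relaxed = accept_prefix(draft, target, scores, threshold=0.5)
```

Exact matching accepts one token here, while the relaxed rule accepts four, which is where the speedup comes from; the risk is that a weak judge lets meaning-changing tokens through.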

2506.08846 2026-05-28 cs.CY cs.CL cs.SD eess.AS 版本更新

Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia

自动语音识别技术审计实践中的陷阱：以失语症患者为例

Katelyn Xiaoying Mei, Anna Seo Gyeong Choi, Hilke Schellmann, Mona Sloane, Allison Koenecke

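Affiliations: University of Washington; Cornell University; New York University; University of Virginia

AI summary: Identifies three common pitfalls in standard ASR audits and proposes a holistic auditing framework; a case study finds that ASR systems perform worse for speakers with aphasia.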

Comments Published at the Proceedings of The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

DOI: 10.1145/3805689.3812320

AI中文摘要

自动语音识别（ASR）系统的日益普及需要稳健的审计方法，以确保转录质量的公平性，特别是对于像失语症这样的言语障碍患者，他们不成比例地依赖ASR。虽然学术和行业审计揭示了不同用户群体之间的性能差异，但标准审计实践常常忽视可能掩盖对边缘群体伤害的细微差别。我们识别了标准ASR审计中的三个常见陷阱：（1）坚持单一的文本标准化方法，这可能掩盖ASR性能的差异并忽视边缘社区的标准化偏好；（2）展示高层次的人口统计发现，而不考虑按细微交叉亚组划分的性能差异，或依赖于相关的声学特性；（3）仅报告一个黄金标准指标（词错误率），这不足以量化常见的生成式AI错误，如幻觉。我们提出了一个解决这些陷阱的整体审计框架，并在对六个流行ASR系统的案例研究中发现，与对照组相比，失语症患者的ASR性能持续更差。我们呼吁从业者实施这些更适合快速变化的ASR环境的稳健、社区驱动的ASR审计实践。

英文摘要

Automatic Speech Recognition (ASR) systems' growing use warrants robust auditing approaches to ensure equitable transcription quality, especially for people with speech disorders like aphasia who disproportionately depend on ASR. While academic and industry audits have revealed performance disparities across user populations, standard auditing practices often overlook nuances that risk masking harm to marginalized groups. We identify three common pitfalls in standard ASR audits: (1) adhering to one method of text standardization, which can mask variance in ASR performance and ignore the standardization preferences of marginalized communities; (2) displaying high-level demographic findings without considering performance disparities by nuanced intersectional subgroups, or conditioning on relevant acoustic properties; and (3) reporting only one gold-standard metric (Word Error Rate), which inadequately quantifies common generative AI errors like hallucinations. We propose a holistic auditing framework addressing these pitfalls, and in a case study of six popular ASR systems, find consistently worse ASR performance for speakers with aphasia relative to a control group. We call on practitioners to implement these robust, community-driven ASR auditing practices better suited for the rapidly changing ASR landscape.

2507.08014 2026-05-28 cs.CL cs.AI cs.CY 版本更新

Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

大规模真实对话分析揭示LLM越狱的复杂性界限

Aldan Creo, Raul Castro Fernandez, Manuel Cebrian

Affiliations: Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Valencia, Spain; Department of Computer Science, The University of Chicago, Chicago, USA; Center for Automation and Robotics, Spanish National Research Council, Madrid, Spain

AI summary: An analysis of more than 2 million real-world conversations finds that jailbreak attempts are not significantly more complex than normal conversations and that attack complexity stays stable over time, suggesting that the evolution of LLM safety is bounded by human ingenuity.

Comments Code: https://github.com/ACMCMC/risky-conversations Results: https://huggingface.co/risky-conversations Visualizer: https://huggingface.co/spaces/risky-conversations/Visualizer

DOI: 10.1007/978-3-032-11402-0_5

AI中文摘要

随着大型语言模型（LLM）的日益部署，理解越狱策略的复杂性和演变对于AI安全至关重要。我们对来自不同平台（包括专门的越狱社区和通用聊天机器人）的超过200万条真实对话进行了大规模实证分析，研究了越狱复杂性。使用一系列复杂性指标，涵盖概率度量、词汇多样性、压缩比和认知负荷指标，我们发现越狱尝试并未表现出显著高于正常对话的复杂性。这一模式在专门的越狱社区和普通用户群体中一致成立，表明攻击的复杂性存在实际界限。时间分析显示，虽然用户攻击的毒性和复杂性随时间保持稳定，但助手响应的毒性有所下降，表明安全机制正在改进。复杂性分布中缺乏幂律标度进一步指出了越狱发展的自然限制。我们的发现挑战了攻击者与防御者之间军备竞赛不断升级的主流说法，反而表明LLM安全演化受人类创造力限制，而防御措施持续进步。我们的结果突显了学术越狱披露中的关键信息危害，因为超出当前复杂性基线的复杂攻击可能破坏观察到的平衡，并在防御适应之前造成广泛伤害。

英文摘要

As large language models (LLMs) are increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remain stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.

2507.06999 2026-05-28 cs.CV cs.CL cs.LG 版本更新

Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

有意学习，直觉行动：解锁多模态大语言模型的测试时推理能力

Yahan Yu, Yuyang Dong, Masafumi Oyamada

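Affiliations: Kyoto University; Initial S; NEC Corporation, Japan

AI summary: Proposes the D2I framework, which trains with deliberate reasoning supervised by rule-based format rewards to strengthen modality alignment and removes the explicit strategies at inference in favor of intuitive reasoning, improving multimodal LLM reasoning without extra annotations or complex rewards.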

Comments 22 pages, 24 figures

AI中文摘要

推理对于大型语言模型（LLMs）至关重要，尤其是在数学问题求解等复杂任务中。然而，多模态推理在模态对齐和训练可扩展性方面仍面临挑战，因为许多现有方法依赖于额外的标注或复杂的基于规则的奖励。为了解决这些问题，我们提出了“有意到直觉”推理框架（D2I），该框架无需额外标注或复杂奖励即可提升多模态大语言模型（MLLMs）的理解和推理能力。在训练过程中，D2I使用仅由基于规则的格式奖励监督的有意推理策略来增强模态对齐。在推理过程中，它通过移除这些显式策略转向直觉推理，使模型能够在其响应中隐式应用所获得的能力。D2I在域内和域外基准测试中均优于基线，突显了格式奖励在培养可迁移多模态推理技能方面的有效性，并表明将训练时的推理深度与测试时的响应灵活性解耦是有益的。

英文摘要

Reasoning is essential for large language models (LLMs), especially in complex tasks such as mathematical problem solving. However, multimodal reasoning still faces challenges in modality alignment and training scalability, as many existing methods rely on additional annotations or complex rule-based rewards. To address these issues, we propose the Deliberate-to-Intuitive reasoning framework (D2I), which improves the understanding and reasoning abilities of multimodal LLMs (MLLMs) without extra annotations or complex rewards. During training, D2I uses deliberate reasoning strategies supervised only by rule-based format rewards to enhance modality alignment. During inference, it shifts to intuitive reasoning by removing these explicit strategies, allowing the model to implicitly apply the acquired abilities in its responses. D2I outperforms baselines on both in-domain and out-of-domain benchmarks, highlighting the effectiveness of format rewards in fostering transferable multimodal reasoning skills and suggesting the benefit of decoupling training-time reasoning depth from test-time response flexibility.

2502.05242 2026-05-28 cs.CL cs.AI cs.CV cs.LG 版本更新

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

超越外部监控：增强大型语言模型的透明度以便于监控

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu

Affiliations: Shanghai Artificial Intelligence Laboratory; ICISEE, Shanghai Jiao Tong University; School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University; King Abdullah University of Science and Technology

AI summary: Proposes TELLME, which improves the transparency of LLMs' internal representations to help monitors identify unsuitable and sensitive behaviors, and validates it on detoxification tasks.

Comments 28 pages, 8 figures, 15 tables

AI中文摘要

大型语言模型（LLMs）的能力日益增强，但其思维和决策过程的机制仍不清楚。思维链（CoTs）常被用来外化LLMs的思维，但这一策略未能准确反映LLMs的思维过程。基于LLMs隐藏表征的技术提供了内部视角，以改善对其潜在思维的可监控性。然而，以往的方法仅尝试开发外部模块，而非使LLMs本身更易于监控。本文提出了一种新方法TELLME，提高了LLMs的透明度，并帮助监控者识别不合适和敏感的行为。此外，我们在去毒化任务上展示了TELLME的有效性，LLMs在多模态测试集、不同架构和不同参数规模上均取得了一致的改进。我们进一步从最优传输理论和实证角度分析了TELLME对LLMs泛化能力的提升。

英文摘要

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales. We further analyze TELLME's improvement on LLMs' generalization ability from both optimal transport theory and empirical perspectives.

2503.18893 2026-05-28 cs.CL cs.LG 版本更新

xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction

xKV：通过对齐奇异向量提取的跨层KV缓存压缩

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Hung-Yueh Chiang, Yash Akhauri, Xilai Dai, Huiqiang Jiang, Yucheng Li, Luis Ceze, Kai-Chiang Wu, Mohamed S. Abdelfattah

Affiliations: Cornell University; University of Washington; Department of Computer Science, National Yang Ming Chiao Tung University; The University of Texas at Austin; University of Surrey; Microsoft Research Asia

AI summary: Proposes xKV, a post-training method that compresses the KV-Cache by sharing a low-rank subspace across layers, achieving up to 8x compression while preserving accuracy on long-context tasks, and introduces Selective Reconstruction for end-to-end speedup.

Comments ICML 2026

AI中文摘要

长上下文大型语言模型（LLMs）支持强大的应用，但由于键值状态（KV-Cache）导致高内存成本。最近的研究尝试跨层共享KV-Cache，但这些方法要么需要昂贵的预训练，要么依赖于实践中通常有限的逐token跨层余弦相似度。我们通过中心核对齐（CKA）表明，KV-Cache的主要奇异向量在层间对齐良好。受此观察启发，我们提出xKV，一种后训练压缩方法，将分组层的KV-Cache联合分解为共享的低秩子空间，大幅减少KV-Cache内存。在广泛使用的LLMs上，xKV实现了高达8倍的KV-Cache压缩，同时在长上下文任务和多轮设置中保持准确性。为进一步提高效率，我们在解码时引入选择性重建（SR）。结合SR，xKV相比全注意力基线实现了高达4.23倍的端到端加速，并在相似精度水平下以30%更高的吞吐量超越了显著基线。总体而言，xKV提供了一种即插即用的方法，用于减少长上下文LLM推理的内存和延迟。我们的代码公开于：https://github.com/abdelfattah-lab/xKV。

英文摘要

Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key-value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either require expensive pretraining or rely on per-token cross-layer cosine similarity that is often limited in practice. We show, via Centered Kernel Alignment (CKA), that the dominant singular vectors of KV-Cache are well aligned across layers. Motivated by this observation, we propose xKV, a post-training compression method that jointly factorizes grouped-layer KV-Cache into a shared low-rank subspace, substantially reducing KV-Cache memory. Across widely used LLMs, xKV achieves up to 8x KV-Cache compression while preserving accuracy on long-context tasks and in multi-turn settings. To further improve efficiency, we introduce Selective Reconstruction (SR) at decode time. Combined with SR, xKV achieves up to 4.23x end-to-end speedup over the full attention baseline, and surpasses notable baselines with 30% higher throughput under a similar accuracy level. Overall, xKV provides a plug-and-play approach to reduce both memory and latency for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.

2407.21075 2026-05-28 cs.AI cs.CL cs.LG version update

Apple Intelligence Foundation Language Models

Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Daniel Parilla, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, Vivek Rangarajan Sridhar, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren

Affiliations * Apple

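AI summary: This paper introduces the foundation language models developed for Apple Intelligence features, including a roughly 3-billion-parameter model designed to run efficiently on device and a large server-based model for Private Cloud Compute, and describes their architecture, training data, optimization process, and evaluation results.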

2308.04823 2026-05-28 cs.CL version update

Evaluating the Generation Capabilities of Large Chinese Language Models

Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang

Affiliations * AI Research Center, Besteasy Language Technology Co., Ltd; LanguageX AI Lab

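AI summary: Proposes the CG-Eval automated evaluation framework and the Gscore composite metric for assessing the generative capabilities of large Chinese language models across multiple academic disciplines.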

Details

English abstract

This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework designed for assessing the generative capabilities of large Chinese language models across a spectrum of academic disciplines. CG-Eval stands out for its automated process, which critically assesses models based on their proficiency in generating precise and contextually relevant responses to a diverse array of questions within six key domains: Science and Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical Practitioner Qualification Examination, Judicial Examination, and Certified Public Accountant Examination. Alongside this, we introduce Gscore, an innovative composite index developed from a weighted sum of multiple metrics. Gscore uniquely automates the quality measurement of a model's text generation against reference standards, providing a detailed and nuanced assessment of model performance. This automation not only enhances the efficiency and scalability of the evaluation process but also ensures objective and consistent assessment across various models. The detailed test data and results, highlighting the robust capabilities and comparative performance of the evaluated models, are accessible at http://cgeval.besteasy.com/.
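The abstract says only that Gscore is "a weighted sum of multiple metrics", so the general shape of such a composite index can be sketched as follows. The function name `composite_score`, the metric names, and the weights are hypothetical and chosen purely for illustration. They are not the metrics or weights that CG-Eval uses.

```python
def composite_score(metrics, weights):
    """Return the weighted sum of metric scores.

    metrics: dict mapping metric name -> score for one model output
    weights: dict mapping the same metric names -> weight
    """
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must have the same keys")
    return sum(weights[name] * metrics[name] for name in metrics)


# Hypothetical metric names, scores, and weights (illustration only).
example_metrics = {"bleu": 0.5, "rouge": 0.25, "semantic_similarity": 0.75}
example_weights = {"bleu": 0.25, "rouge": 0.25, "semantic_similarity": 0.5}
score = composite_score(example_metrics, example_weights)
```

Choosing weights that sum to 1 keeps the composite on the same scale as its inputs, which makes scores easier to compare across models. Whether Gscore normalizes its weights this way is not stated in the abstract.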

2304.12986 2026-05-28 cs.CL cs.AI version update

Measuring Massive Multitask Chinese Understanding

Hui Zeng

Affiliations * Besteasy (Beijing) Language Technology Co., Ltd.

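AI summary: To address the lack of capability assessments for large Chinese language models, this paper proposes a multitask test covering four domains (medicine, law, psychology, and education), with 15 subtasks in medicine and 8 in education. Measured by zero-shot accuracy, the best-performing models lead the worst-performing ones by nearly 18.6 percentage points on average, and all models perform poorly in the legal domain.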

Details

English abstract

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.
