arXivDaily arXiv每日学术速递 周一至周五更新
重置
2605.27366 2026-05-27 cs.AI cs.CL cs.LG cs.MA 版本更新

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

MUSE-Autoskill: 通过技能创建、记忆、管理和评估实现自我进化智能体

Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang

发表机构 * ByteDance Inc.(字节跳动公司) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 提出MUSE-Autoskill框架,通过统一的技能生命周期(创建、记忆、管理、评估和优化)使LLM智能体持续提升任务解决能力,实验表明生命周期管理的技能可提高任务成功率、效率、复用性和跨智能体迁移。

Comments 30 pages, 8 figures, 13 tables, working in progress

详情
AI中文摘要

大型语言模型(LLM)智能体依赖可复用技能来解决复杂任务。然而,现有的技能创建方法将技能视为孤立和静态的工件,限制了其可复用性、可靠性和长期改进。我们提出了MUSE-Autoskill智能体(记忆利用技能进化),一个以技能为中心的智能体框架,让智能体通过统一的技能生命周期(创建、记忆、管理、评估和优化)持续提升任务解决能力。我们的框架使智能体能够按需创建技能,跨任务存储和复用技能,高效组织和选择技能,并通过单元测试和运行时反馈评估技能以进行持续优化。我们进一步引入了技能级记忆,为每个技能跨任务积累经验,从而实现更有效的复用和随时间适应。在SkillsBench上的实验提供了初步证据,表明生命周期管理的技能可以提高任务成功率、效率、复用性和跨智能体迁移,突出了将技能视为长期存在、具有经验意识和可测试资产的重要性。

英文摘要

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

2605.27358 2026-05-27 cs.LG cs.AI cs.CL 版本更新

MobileMoE: Scaling On-Device Mixture of Experts

MobileMoE: 扩展设备端混合专家模型

Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi

发表机构 * Meta AI

AI总结 针对设备端部署,提出MobileMoE系列子十亿参数MoE语言模型,通过联合优化架构和四阶段训练,在14个基准上匹配或超越领先的密集模型和MoE模型,并在智能手机上实现高效推理。

详情
AI中文摘要

混合专家(MoE)已成为千亿参数语言模型的事实标准架构,但其在十亿以下规模用于设备端部署的优势尚未得到充分探索。为弥补这一差距,我们提出MobileMoE,一系列设备端MoE语言模型,具有子十亿激活参数(0.3-0.9B激活,1.3-5.3B总参数),为设备端LLM建立了新的帕累托前沿。我们首先制定了一个设备端MoE缩放定律,在移动内存和计算约束下联合优化MoE架构,识别出一个设备端最佳点——具有细粒度和共享专家的适度稀疏性——同时实现内存和计算最优。基于推导出的架构,我们采用四阶段方案训练MobileMoE,包括预训练、中期训练、指令微调和量化感知训练,全部使用开源数据集。在14个基准上,MobileMoE匹配或超越领先的设备端密集LLM,推理FLOPs减少2-4倍,并以最多60%的参数匹配或超越最先进的MoE模型OLMoE-1B-7B。为弥合移动部署的最后一步,我们提供了首个在商用智能手机上的高效MoE推理,并进行了全面的设备端性能分析。在相当的INT4权重内存下,MobileMoE-S的预填充速度比密集基线MobileLLM-Pro快1.8-3.8倍,解码速度快2.2-3.4倍。

英文摘要

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.

2605.27354 2026-05-27 cs.LG cs.AI cs.CL 版本更新

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

利用稀疏自编码器的模型内部状态指导LLM后训练数据工程

Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

发表机构 * Tsinghua University(清华大学)

AI总结 提出SAERL框架,通过稀疏自编码器提取模型内部状态,建模数据多样性、难度和质量,用于强化学习数据工程,提升准确率并减少训练步数。

详情
AI中文摘要

模型内部状态编码了大型语言模型(LLM)处理其训练数据时的丰富信息;然而,后训练数据工程主要依赖外部信号,忽略了模型内部状态中丰富的内在信号。我们提出了SAERL,一个用于LLM强化学习(RL)的数据工程框架。它使用稀疏自编码器(SAE)这一先进的机制可解释性工具提取的模型内部状态,建模三种内在数据属性:多样性、难度和质量。每个属性支撑一个具体的数据工程操作:用于批次多样性控制的SAE空间聚类与适度批次混合、用于从易到难课程排序的难度代理,以及用于数据过滤的质量探针。SAERL在Qwen2.5-Math-1.5B上相比原始GRPO平均准确率提升3.00%,并以减少20%的训练步数达到目标准确率,在模型规模和RL算法上均有一致收益。实验表明,SAE在不同模型家族和规模间有效迁移,作为一种轻量级且可重用的数据工程工具。这些结果证明,模型内部状态是后训练数据工程中强大且实用的信号来源。

英文摘要

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.

2605.27345 2026-05-27 cs.CL 版本更新

MATCHA: Matching Text via Contrastive Semantic Alignment

MATCHA: 通过对比语义对齐进行文本匹配

Siran Li, Ece Sena Etoglu, Carsten Eickhoff, Seyed Ali Bahrainian

发表机构 * University of Tübingen(图宾根大学)

AI总结 针对现有评估指标无法区分语义矛盾的问题,提出MATCHA指标,通过双视角对比学习同时奖励语义一致性和惩罚矛盾,在多个基准上优于ROUGE和BERTScore。

详情
AI中文摘要

可靠的评估对于理解大型语言模型(LLM)的性能至关重要,但当今常用的指标,即词元重叠分数(如ROUGE)和基于嵌入的度量(如BERTScore),常常误判文档的语义相似性。我们的研究表明,词元重叠指标和基于嵌入的指标通常会将几乎相同的分数分配给直接相互矛盾的文本,从而可能掩盖根本性错误。我们引入了MATCHA,一种自动度量指标,它同时奖励与参考的语义一致性并惩罚矛盾。MATCHA采用双视角方法,衡量(i)与黄金文本的接近程度和(ii)与对抗性生成的反事实矛盾的距离。在八个公开基准上,MATCHA在问答、图像字幕生成、自然语言推理、摘要和语义文本相似性任务中,与人工标注相比,优于流行指标。在TruthfulQA数据集(即没有训练集的数据集,其中没有基于嵌入的指标可以局部训练)上,这种在根据参考匹配文本方面的改进相对于ROUGE-L达到18.38%,相对于BERTScore达到20.82%。定量比较和定性人工评估都证实了MATCHA的有效性和正确性,并揭示了现有指标的根本弱点。与23个嵌入模型(包括最先进的模型)作为类似BERTScore的度量相比,MATCHA在仅基于参考区分正确和错误陈述方面仍然是最准确的。我们的代码和指标公开可用(https://github.com/Siran-Li/MATCHA)。

英文摘要

Reliable evaluation is essential for understanding large language model (LLM) performance, yet today's go-to metrics, namely token-overlap scores (e.g., ROUGE) and embedding-based measures (e.g., BERTScore), often misjudge semantic similarity of documents. Our study shows that both token-overlap metrics and embedding-based metrics routinely assign nearly identical scores to texts that directly contradict each other, thereby potentially masking fundamental errors. We introduce MATCHA, an automatic metric that jointly rewards semantic agreement with a reference and penalizes contradictions. MATCHA employs a dual-view perspective that measures (i) proximity to the gold text and (ii) distance from an adversarially generated counterfactual contradiction. In eight public benchmarks, MATCHA outperforms popular metrics, compared with human annotations on question-answering, image caption generation, natural language inference, summarization, and semantic textual similarity tasks. On the TruthfulQA dataset (i.e., a dataset without a training set, where no embedding-based metrics could locally train on), this improvement in terms of matching texts with a reference reaches 18.38% over ROUGE-L and 20.82% over BERTScore. Both quantitative comparison and qualitative human assessments confirm the efficacy and validity of MATCHA and uncover fundamental weaknesses in pre-existing metrics. Compared with 23 embedding models, including top state-of-the-art ones, used as a metric similar to BERTScore, MATCHA remains the most accurate in distinguishing correct from incorrect statements solely based on a reference. Our code and metric are publicly available (https://github.com/Siran-Li/MATCHA).

2605.27338 2026-05-27 cs.AI cs.CC cs.CL cs.LO 版本更新

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

带有弱约束的2-ASP(Q)程序:复杂性与高效实现

Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca

发表机构 * University of Calabria(卡拉布里亚大学) Rende Italy(意大利伦德)

AI总结 本文研究了带有两个量词和弱约束的ASP(Q)程序(2-ASP(Q)^w)的复杂性,并提出基于CEGAR技术的Casper系统实现策略,实验证明其有效性。

详情
AI中文摘要

ASP(Q)通过回答集上的量词扩展了回答集编程(ASP)。本文聚焦于带有两个量词和弱约束的ASP(Q)程序类,记为2-ASP(Q)^w。2-ASP(Q)^w是ASP(Q)的一个实际相关片段,其表达能力足以捕获直到类Delta_3^P的优化问题。在理论方面,我们提供了2-ASP(Q)^w程序主要计算任务的完整复杂性刻画,包括紧致完备性结果以及对先前工作未涉及的非平凡情况的分析。在实践方面,我们引入了在Casper系统中计算(最优)量化回答集的新策略,该策略依赖于针对ASP(Q)定制的反例引导抽象精化(CEGAR)技术。来自不同应用领域的硬基准测试的实验评估表明,所提出的技术在实践中是有效的。

英文摘要

ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w. 2-ASP(Q)^w is a practically relevant fragment of ASP(Q) that is expressive enough to capture optimization problems up to the class Delta_3^P. On the theoretical side, we provide a complete complexity characterization of the main computational tasks for 2-ASP(Q)^w programs, including tight completeness results and the analysis of nontrivial cases that have not been addressed in previous works. On the practical side, we introduce novel strategies for computing (optimal) quantified answer sets in the Casper system, that rely on a Counterexample-Guided Abstraction Refinement (CEGAR) technique tailored to ASP(Q). An experimental evaluation on hard benchmarks from different application domains shows that the proposed techniques are effective in practice.

2605.27333 2026-05-27 cs.CL 版本更新

FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

FinHarness:面向金融LLM代理的内联生命周期安全约束框架

Haoxuan Jia, Yang Liu, Bin Chong, Yingguang Yang, Yancheng Chen, Jiayu Liang, Qian Li, Hanning Lu, Kefu Xu, Hao Zheng, Chongyang Zhang, Hao Peng, Philip S. Yu

发表机构 * Peking University(北京大学) Nanyang Technological University(南洋理工大学) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学) University of Chinese Academy of Sciences(中国科学院大学) Soochow University(苏州大学) Beijing University of Posts and Telecommunications(北京邮电大学) University of Leeds(利兹大学) Fullive Innovation (Beijing) AI Technology Co., Ltd.(全维创新(北京)人工智能科技有限公司) Beihang University(北京航空航天大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 针对金融LLM代理在阻止提示诱导的未授权操作与批准合法多步骤业务流程之间的冲突,提出FinHarness内联安全约束框架,通过查询监控、工具监控和级联模块实现逐步骤风险评估与自适应验证,显著降低攻击成功率并保持良性批准率。

详情
AI中文摘要

金融LLM代理必须同时阻止提示诱导的未授权操作并批准合法的多步骤业务流程。然而,边界过滤器常常遗漏不可逆的中间轨迹工具调用,而事后LLM判断仅在终止后执行审计——对于干预来说为时已晚,且计算成本随轨迹长度线性增长。我们提出FinHarness,一个内联安全约束框架,通过三个组件端到端地封装金融代理:查询监控器融合单轮意图与跨轮漂移,工具监控器评估每个潜在工具调用,以及级联模块整合每步风险并在轻量级和高级LLM判断之间自适应路由验证。触发的风险因素作为事前证据重新注入代理输入,使代理能够自行拒绝、重新规划或批准。在FinVault上,路由的FinHarness将攻击成功率从38.3%降至15.0%,同时基本保持良性批准率(41.1%→39.3%),并且高级判断调用次数比始终使用高级判断的消融实验减少4.7倍。

英文摘要

Finance LLM agents must simultaneously block prompt-induced unauthorized actions and approve legitimate multi-step business workflows. However, boundary filters often miss irreversible mid-trajectory tool calls, while post-hoc LLM judges perform auditing only after termination -- too late for intervention and at a computational cost that scales linearly with trace length. We present FinHarness, an inline safety harness that wraps a finance agent end-to-end with three components: a Query Monitor that fuses single-turn intent with cross-turn drift, a Tool Monitor that evaluates each prospective tool call, and a Cascade module that integrates per-step risk and adaptively routes verification between a lightweight and an advanced-tier LLM judge. Fired risk factors are re-injected into the agent input as ex-ante evidence, enabling the agent to refuse, re-plan, or approve on its own. On FinVault, routed FinHarness cuts ASR from 38.3% to 15.0% while largely preserving benign approval ($41.1\% \to 39.3\%$), and uses $4.7\times$ fewer advanced-judge calls than an always-advanced ablation.

2605.27322 2026-05-27 cs.CL 版本更新

Semantic Gradients Interactions in SSD: A Case Study in Racial Identity and Hate Speech

SSD中的语义梯度交互:种族身份与仇恨言论的案例研究

Felix Ostrowicki, Hubert Plisiecki

发表机构 * Independent Researcher(独立研究者) IDEAS Research Institute(IDEAS研究所)

AI总结 本文提出交互式SSD方法,通过语义梯度交互模型研究调节变量对语义含义的影响,并在UC Berkeley仇恨言论语料库上验证了种族身份对仇恨言论判断的调节作用。

详情
AI中文摘要

我们引入了交互式SSD,这是监督语义微分的一种扩展,用于建模语义含义如何在调节变量(如群体、特征或条件)之间变化,使这种变化可检验和可解释。该方法估计一个主语义梯度、一个交互梯度和条件梯度,所有这些都可以通过标准SSD工具进行解释。我们在UC Berkeley仇恨言论测量语料库上进行了说明,测试了注释者的种族身份是否调节了对针对有色人种评论的仇恨言论判断。交互模型检测到显著的调节效应:共享梯度将非人化的敌意与反言论形成对比,而交互梯度则揭示了在哪些语义线索预测仇恨言论评分方面存在较小的群体相关差异。交互式SSD使调节后的含义-结果关系在统计上可检验和可解释。

英文摘要

We introduce interaction SSD, an extension of Supervised Semantic Differential that models how semantic meaning varies across moderators such as groups, traits, or conditions making this variation testable and interpretable. The method estimates a main semantic gradient, an interaction gradient, and conditional gradients, all interpretable through standard SSD tools. We illustrate it on the UC Berkeley Measuring Hate Speech corpus, testing whether annotator racial identity moderates hate-speech judgments of comments targeting people of color. The interaction model detects a significant moderation effect: the shared gradient contrasts dehumanizing hostility with counter-speech, while the interaction gradient reveals smaller group-linked differences in which semantic cues predict hate-speech ratings. Interaction SSD makes moderated meaning-outcome relationships statistically testable and interpretable.

2605.27315 2026-05-27 cs.CL 版本更新

Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

真实图像,更差判断:评估视觉-语言模型在具体性和意象性上的表现

Yifan Jiang, Ruoxi Ning, Sheng Yao, Freda Shi

发表机构 * University of Waterloo(滑铁卢大学) Vector Institute(向量研究所)

AI总结 本文通过具体性和意象性评分任务,发现真实图像上下文不仅未提升视觉-语言模型与人类判断的一致性,反而在视觉证据相关性低时加剧偏差,并表明推理时聚焦文本指令可缓解退化。

详情
AI中文摘要

视觉输入通常被认为能改善多模态模型的语言理解。我们通过询问视觉-语言模型(VLM)在词汇判断中能否区分有用的视觉证据与偶然的图像上下文来检验这一假设。我们使用人类的具体性和意象性评分,因为它们涵盖从抽象低意象词到具体高意象词等不同视觉相关性的词汇。我们发现,真实图像上下文并未带来一致的增益,反而常常损害与人类评分的一致性,尤其是在视觉证据最不相关时。通过探针分析和典型相关分析,辅以归因案例研究,我们发现真实图像上下文与表征偏移和对虚假视觉线索的更高敏感性相关,同时目标词汇属性的可恢复性减弱。我们进一步表明,在推理时指示模型仅关注文本内容可以减少这种退化,尤其是在这些脆弱子集上效果最明显。我们的发现表明,当前指令微调的VLM需要更好地校准视觉上下文何时应影响词汇判断。

英文摘要

Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties. We further show that instructing models to focus solely on textual content at inference time can reduce this degradation, with the clearest gains on these vulnerable subsets. Our findings suggest that current instruction-tuned VLMs need better calibration of when visual context should inform lexical judgments.

2605.27313 2026-05-27 cs.CL 版本更新

When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection

人口统计信息何时有帮助?面向观点的仇恨言论检测的数据与建模机制

Weibin Cai, Reza Zafarani

发表机构 * Data Lab, EECS Department(数据实验室,电子工程与计算机科学系)

AI总结 本文通过分析数据划分属性和建模框架,研究了人口统计特征在仇恨言论检测中的增益条件,并提出了一种门控人口统计残差模型,在低训练分歧、高测试分歧等机制下有效提升性能。

详情
AI中文摘要

人口统计信息通常用于在主观任务(如仇恨言论检测)中对标注者观点进行建模,但其益处并不一致:在某些设置下提升性能,在其他设置下则表现为噪声。本文探讨了人口统计特征何时有帮助。我们分析了人口统计增益作为数据划分属性和建模框架的函数。对于数据划分,我们测量了标注者分歧,即标注者给同一示例分配不同标签的频率,以及训练规模和训练-测试人口统计覆盖率。我们发现人口统计增益集中在低训练分歧、高测试分歧、细粒度模糊性测量、充足训练数据和更大人口统计重叠的机制中。受这些机制的启发,我们引入了一种门控人口统计残差模型,将人口统计视为对纯文本预测的选择性调整。在MHS和POPQUORN上的实验表明,这种设计是有效的,尤其是在高分歧或低置信度的示例上。总体而言,我们的结果表明,人口统计信息不应默认假定有用;其价值取决于数据机制和建模框架的共同作用。

英文摘要

Demographic information is often used to model annotator perspectives in subjective tasks such as hate speech detection, but its benefit is inconsistent: it improves performance in some settings and behaves as noise in others. This paper asks when demographic features help. We analyze demographic gain as a function of both data split properties and modeling frameworks. For data splits, we measure annotator disagreement, namely how often annotators assign different labels to the same example, along with training size and train-test demographic coverage. We find that demographic gains concentrate in regimes with low training disagreement, high test disagreement, fine-grained ambiguity measurement, sufficient training data, and greater demographic overlap. Motivated by these regimes, we introduce a gated demographic residual model that treats demographics as a selective adjustment to text-only predictions. Experiments on MHS and POPQUORN show that this design is effective, especially on high disagreement or low confidence examples. Overall, our results suggest that demographics should not be assumed useful by default; their value depends jointly on the data regime and the modeling framework.

2605.27311 2026-05-27 cs.CL cs.CV 版本更新

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chartographer: 用于评估视觉语言模型的反事实图表生成

Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, Freda Shi

发表机构 * University of Waterloo(滑铁卢大学) Vector Institute(向量研究所) Layer 6 AI

AI总结 提出 Chartographer 框架,通过将图表逆向工程为可执行代码并生成反事实变体,揭示视觉语言模型在图表问答中的视觉推理缺陷。

详情
AI中文摘要

图表问答(QA)基准旨在提出需要视觉推理才能正确回答的问题,但模型通常可以通过捷径或基于自身背景知识对图表的先验熟悉度来达到解决方案。为了严格评估视觉推理,我们提出了反事实图表,其中图表问答任务保持不变,但底层图表和相应答案发生变化。我们引入了 Chartographer,一个将图表逆向工程为可执行代码、验证重构保真度、生成种子控制的反事实变体以及从可执行问答逻辑中推导新答案的框架。我们将该框架应用于现有的图表 QA 数据集,并评估了专有和开源视觉语言模型(VLM),测量了变化敏感性和泛化能力。反事实图表揭示了单一图表性能所隐藏的失败:VLM 在正确回答原始图表后通常无法泛化。我们发现,当更新后的图表需要新的视觉推理路径时,失败最为普遍。

英文摘要

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to correctly answer, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts where the chart-question task remains fixed, but underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.

2605.27298 2026-05-27 cs.CL 版本更新

Self-Ensembling Vision-Language Models for Chart Data Extraction

用于图表数据提取的自集成视觉语言模型

Thomas Berkane, Qianyi Wang, Maimuna S. Majumder

发表机构 * Computational Health Informatics Program, Boston Children’s Hospital & Harvard Medical School(波士顿儿童医院计算健康信息学项目及哈佛医学院)

AI总结 提出一种自集成方法,通过多次采样同一VLM的输出并聚合表格单元格,提升图表数据提取的准确性,并引入新基准WB-ChartExtract。

详情
AI中文摘要

图表能有效传达定量信息,但底层数据通常以图像形式锁定,阻碍了重用和分析。手动数字化图表耗时且易出错,因此推动了自动图表到表格的提取。最近的方法使用专门的视觉语言模型(VLM),但在数据点众多或风格变化大的图表上性能仍然滞后。我们提出了一种VLM自集成方法,该方法针对固定图表图像从同一VLM重复采样多个表格输出,并在单个表格单元格级别进行聚合。我们对齐候选表格,并对数值取每个单元格的中位数,以生成更准确的共识表格。我们的方法还包括收敛检测,一旦聚合表格稳定就停止采样,以及基于样本间离散度的不确定性估计,帮助用户评估提取可靠性。由于现有的图表提取基准包含相对简单的图表,改进空间有限,我们引入了WB-ChartExtract,这是一个基于世界银行数据构建的新基准,包含更复杂和风格多样的图表;平均而言,其图表的数据点数量是ChartQA基准中图表的7倍。在ChartQA和WB-ChartExtract上,我们的方法比单次VLM输出提高了提取准确性,在WB-ChartExtract上集成后相对改进高达23%。更广泛地说,我们的方法有助于解锁以前被锁定在图表图像中的表格数据,支持下游分析和重用。

英文摘要

Charts effectively convey quantitative information, but the underlying data are often locked in image form, hindering reuse and analysis. Manually digitizing charts is time-consuming and error-prone, motivating automatic chart-to-table extraction. Recent approaches use specialized vision-language models (VLMs), yet performance still lags on charts with many datapoints or substantial stylistic variation. We propose a VLM self-ensembling method that repeatedly samples multiple tabular outputs from the same VLM for a fixed chart image and aggregates them at the level of individual table cells. We align candidate tables and take per-cell medians over numerical values to produce a more accurate consensus table. Our method also includes convergence detection to stop sampling once the aggregated table stabilizes, and uncertainty estimation based on dispersion across samples to help users assess extraction reliability. Because existing chart extraction benchmarks contain relatively simple plots with limited room for improvement, we introduce WB-ChartExtract, a new benchmark built from World Bank data with more complex and stylistically diverse charts; on average, its charts contain 7 times more datapoints than those in the ChartQA benchmark. Across both ChartQA and WB-ChartExtract, our approach improves extraction accuracy over single-pass VLM outputs, yielding up to 23% relative improvement on WB-ChartExtract after ensembling. More broadly, our method helps unlock tabular data previously siloed in chart images, enabling downstream analysis and reuse.

2605.27296 2026-05-27 cs.CL 版本更新

Probing Cultural Awareness in LLMs: A Case Study of Cross-Culture Aesthetic Stylistics

探究大型语言模型的文化意识:跨文化美学文体学案例研究

Jiashuo Wang, Fenggang Yu, Jian Wang, Chak Tou Leong, Xiaoyu Shen, Chunpu Xu, Jiawen Duan, Wenjie Li, Johan F. Hoorn

发表机构 * Department of Computing, Hong Kong Polytechnic University(香港理工大学计算机系) Institute of Digital Twin, Eastern Institute of Technology, Ningbo(宁波东部技术研究所数字孪生研究所) Department of Language Science and Technology, Hong Kong Polytechnic University(香港理工大学语言科学与技术系) School of Design, Hong Kong Polytechnic University(香港理工大学设计学院) Research Institute for Quantum Technology, Hong Kong Polytechnic University(香港理工大学量子技术研究所) Department of Communication Science, Vrije Universiteit Amsterdam(阿姆斯特丹自由大学传播科学系)

AI总结 通过构建C4STYLI基准(包含香港和中国大陆的高度风格化翻译电影片名和广告标语),评估大型语言模型在跨文化美学文体识别和生成方面的能力,发现模型依赖表层语言信息而非风格结构,对香港特定风格结构敏感度有限。

Comments IJCAI 2026 Human-Centred AI track

详情
AI中文摘要

大型语言模型(LLMs)越来越多地部署在多样化的文化背景中,但它们掌握美学文体学(即策略性地使用语言以唤起文化共鸣)的能力仍未得到充分探索。我们整理了C4STYLI,一个包含来自香港和中国大陆的高度风格化翻译电影片名和广告标语的基准,通过行为识别和生产能力的视角评估LLMs。广泛评估表明,LLMs在风格识别上与人类不同,且这种识别能力在不同文本领域有所变化。此外,LLMs中的风格识别和生成性能并不一致。为了进一步检查LLMs在风格识别中是否真正捕捉到风格信息,我们使用逻辑回归探针进行了结构消融。我们发现,在香港背景下,LLMs中的风格识别主要依赖于表层语言信息而非风格结构。这表明对香港特定风格结构的敏感度有限。

英文摘要

Large Language Models (LLMs) are increasingly deployed in diverse cultural contexts, yet their ability to master aesthetic stylistics, i.e., the strategic use of language to evoke cultural resonance, remains underexplored. We curate C4STYLI, a benchmark of highly stylized translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland, to evaluate LLMs via the lens of behavioral recognition and productive competence. Extensive evaluations show that LLMs differ from humans in stylistic recognition, and this recognition ability varies across text domains. In addition, stylistic recognition and generation performance in LLMs are not consistently aligned. To further examine whether LLMs genuinely capture stylistic information in stylistic recognition, we conduct structural ablation with logistic regression probes. We find that, in the Hong Kong setting, stylistic recognition in LLMs relies primarily on surface-level linguistic information rather than stylistic structure. This suggests limited sensitivity to Hong Kong-specific stylistic structure.

2605.27294 2026-05-27 cs.CL cs.IR 版本更新

Separating Semantic Competition from Context Length in RAG Reading

在RAG阅读中区分语义竞争与上下文长度

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Rohit Alekar, Cien Zhang, Svetlana Karslioglu, Akash Vishwakarma

发表机构 * Meta Platforms(Meta平台)

AI总结 通过匹配对照实验,分离出检索增强生成中阅读器的语义竞争效应,证明性能下降部分源于竞争而非仅上下文长度。

Comments 4 pages, 1 figure, 2 tables

详情
AI中文摘要

检索增强生成(RAG)系统即使在检索到正确段落时也可能错误回答。模型仍需阅读检索到的段落,并在看似相关的段落中识别出包含答案的那一个。这种段落阅读模型称为阅读器。它的失败仅仅是因为上下文更长,还是因为其他段落与正确段落真正竞争?我们引入并展示了一种RAG阅读的匹配对照协议:保持段落数量和长度固定,但将强竞争段落替换为不那么竞争的实段。我们在SQuAD上对两个紧凑开放模型应用此对照。这种替换部分恢复了性能,对F1和答案包含的影响最强。对于Phi-2,它恢复了+6.0 EM点、+7.0答案包含点和+0.057 F1。对于Qwen2.5-1.5B,它恢复了+4.5 EM点、+9.0答案包含点和+0.068 F1。为了跟踪性能如何随竞争段落积累而变化,我们还报告了保留曲线,并在曲线未交叉半保留时用右删失半衰期进行总结。这些结果共同表明,该协议分离了与上下文长度不同的竞争效应,尽管该效应对F1和答案包含比精确匹配更清晰,并且也随片段长度变化。

英文摘要

Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. This passage-reading model is called the reader. Does it fail simply because the context is longer or because the other passages genuinely compete with the correct one? We introduce and demonstrate a matched-control protocol for RAG reading: we keep the number and length of passages fixed, but replace hard competitors with less competitive real passages. We apply this control across two compact open models on SQuAD. This replacement partially restores performance, with the strongest effects on F1 and answer inclusion. For Phi-2, this recovers +6.0 EM points, +7.0 answer-inclusion points, and +0.057 F1. For Qwen2.5-1.5B, it recovers +4.5 EM points, +9.0 answer-inclusion points, and +0.068 F1. To track how performance changes as competitors accumulate, we also report retention curves and summarize them with a right-censored half-life when the curves do not cross half-retention. Together, these results show the protocol isolates a competition effect distinct from context length, though the effect is clearer for F1 and answer inclusion than for exact match, and also varies with snippet length.

2605.27288 2026-05-27 cs.CL cs.AI cs.LG 版本更新

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

并非总是谄媚:基于认知不确定性测量LLM的从众行为

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

发表机构 * Vanderbilt University(范德比尔特大学) Vanderbilt University Medical Center(范德比尔特大学医学中心) Intuit AI Research(Intuit AI研究院)

AI总结 本文提出MUSE框架,通过区分谄媚从众和不确定性驱动的从众,揭示LLM在用户反驳时改变立场的行为机制,并发现两种从众均随用户感知专业性和建议合理性增强。

详情
AI中文摘要

大型语言模型(LLMs)已知会放弃初始立场以适应用户的反驳。虽然先前研究主要将此行为归因于从人类反馈强化学习中习得的谄媚,但我们假设从众行为也受模型在推理时的认知不确定性驱动。本文提出MUSE,一个两阶段评估框架,用于解开驱动LLM从众行为的机制。具体而言,MUSE将模型回答查询时的认知不确定性与其在后续轮次中屈服于用户反驳的可能性进行映射。我们证明驱动从众的机制不仅限于谄媚。具体来说,我们刻画了共同驱动从众的两个不同因素:谄媚从众,即模型即使对其初始回答绝对确定也会与用户反驳保持一致;以及不确定性驱动从众,即模型从众可能性随其不确定性增加而增加。此外,我们进行消融研究,证明谄媚从众和不确定性驱动从众均随1)LLM对用户感知专业性的增加和2)用户建议的合理性增加而增长。更广泛地说,MUSE通过区分对齐诱导的谄媚和训练语料驱动的不确定性,为更有针对性的干预策略提供信息。

英文摘要

Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model's epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model's likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM's perceived expertise of the user and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.

2605.27268 2026-05-27 cs.CL cs.AI 版本更新

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

迷失在采样中:通过词覆盖率评估大语言模型中的词汇可达性

Samer Awad, Javier Conde, Carlos Arriaga, Tairan Fu, Javier Coronado-Blázquez, Pedro Reviriego

发表机构 * Information and Processing Telecommunications Center(信息与处理电信中心) Universidad Politécnica de Madrid(马德里理工大学) Politecnico di Milano(米兰理工大学) Banco de España(西班牙银行)

AI总结 提出词覆盖率(WCS)指标,量化标准采样过滤器(如Top-p、Top-k、Min-p)如何抑制低频率高信息词汇的生存率,揭示解码机制对语言多样性的影响。

Comments 15 pages, 6 figures

详情
AI中文摘要

现代大语言模型(LLM)常因生成重复和同质化文本而受到批评,尽管它们拥有庞大的潜在词汇量。以往研究关注模型知识和训练数据,我们则探究解码机制在抑制语言多样性中的作用。我们引入词覆盖率(WCS),该指标量化了标准采样过滤器(如Top-$p$、Top-$k$和Min-$p$)在数学上剔除上下文适当的人类词汇的程度。WCS并非评估静态知识,而是衡量低频率、高信息人类词汇的词汇存活率作为采样参数的函数。通过审计人类撰写的语料片段中的开放权重模型,我们识别出哪些合理的词汇选择因解码器而变得不可达,即使它们存在于概率空间中。我们的结果提供了定量证据,表明行业标准的采样默认值充当了无意的审查机制,将人类表达的独特纹理平滑为同质化的话语。WCS为优化文本连贯性与词汇丰富性之间的权衡提供了严谨框架,为在生成模型中保留人类语言多样性提供了诊断工具。

英文摘要

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-$p$, Top-$k$, and Min-$p$). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.

2605.27249 2026-05-27 cs.AI cs.CL 版本更新

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

Gumbel机器:通过Gumbel噪声引导生成反事实学生写作

Hunter McNichols, Alexander Scarlatos, Mihai Dascalu, Danielle McNamara, Andrew Lan

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University Politehnica of Bucharest(布加勒斯特理工大学) Arizona State University(亚利桑那州立大学)

AI总结 提出Gumbel机器,一种利用β-Hindsight控制解码算法生成既符合评分标准又与学生原文相似的反事实文本的模块化方法。

Comments preprint

详情
AI中文摘要

跨学科教学的有效方法是提供高质量工作的示例。然而,示例可能与学生的当前工作存在显著差异,使得学生难以模仿。理想的学习示范是学生工作的反事实版本,即与学生自身工作相似但有所改进的版本。现有的使用大型语言模型(LLMs)进行反事实文本生成的自动化方法导致了难以转化为实际应用的领域特定系统。我们提出了Gumbel机器,一种灵活、模块化的反事实生成方法,它利用LLM的指令遵循能力,同时鼓励与参考事实文本的相似性。我们方法的核心是一种新颖的受控解码算法β-Hindsight控制,该算法在反事实生成过程中利用潜在随机性作为可调的相似性控制机制。在根据各种标准评分的学生写作数据集上的实验表明,我们的方法在生成既符合评分标准又与参考文本相似的反事实文本方面是有效的。

英文摘要

An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $β$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.

2605.27240 2026-05-27 cs.CL 版本更新

ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents

ENPMR-Bench: 情感支持代理的主动记忆检索基准

Xing Fu, Yulin Hu, Mengtong Ji, Haozhen Li, Yixin Sun, Weixiang Zhao, Yanyan Zhao, Bing Qin

发表机构 * Research Center for Social Computing and Interactive Robotics(社会计算与交互机器人研究中心)

AI总结 提出ENPMR-Bench基准,基于马斯洛需求层次评估情感支持代理主动推断用户潜在情感需求并检索适当记忆的能力,实验表明当前检索范式存在显著缺陷。

详情
AI中文摘要

记忆增强的语言代理越来越多地部署在情感支持等情感应用中,在这些应用中,理解和响应用户的潜在情感需求至关重要。然而,现有研究通常将记忆视为事实检索的工具,忽视了其在塑造用户情感体验中的作用。在这项工作中,我们引入了ENPMR-Bench,一个用于评估情感需求感知的主动记忆检索(ENPMR)的基准,这是一种核心能力,使代理能够推断用户的潜在情感需求并主动检索适当的记忆以支持共情交互。基于马斯洛需求层次,ENPMR-Bench包括超过1,800个记忆增强对话,并定义了情感需求与支持性记忆类型之间的结构化映射。实验结果表明,当前的检索范式,包括基于嵌入和LLM驱动的方法,都存在显著缺陷,共情得分明显落后于黄金记忆条件。虽然思维链提示在一定程度上改善了推断的情感需求与检索记忆之间的一致性,但性能差距仍然显著。总之,这些发现揭示了当前代理的关键局限性,并指出了通过需求敏感的记忆检索推进个性化情感支持的方向。

英文摘要

Memory-augmented language agents are increasingly deployed in affective applications such as emotional support, where understanding and responding to users' latent emotional needs is critical. However, existing research often treats memory as a tool for factual retrieval, overlooking its role in shaping users' emotional experiences. In this work, we introduce ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval (ENPMR), a core capability that enables agents to infer users' latent emotional needs and proactively retrieve appropriate memories to support empathetic interaction. Grounded in Maslow's hierarchy of needs, ENPMR-Bench includes over 1,800 memory-augmented dialogues and defines structured mappings between emotional needs and supportive memory types. Experimental results demonstrate that current retrieval paradigms, including both embedding-based and LLM-driven approaches, exhibit substantial deficiencies, with empathy scores significantly lagging behind golden memory conditions. While chain-of-thought prompting improves the alignment between inferred emotional needs and retrieved memories to some extent, a notable performance gap remains. Together, these findings reveal critical limitations in current agents and outline directions for advancing personalized emotional support through need-sensitive memory retrieval.

2605.27239 2026-05-27 cs.CL 版本更新

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora

时间同步性预测情感语料库中的标注质量

Idris Abdulmumin, Mokgadi Penelope Matloga, Tadesse Destaw Belay, Botshelo Kondowe, Letlhogonolo Mohleleng, Hareaipha Nkopo Letsoalo, Shamsuddeen Hassan Muhammad, Vukosi Marivate

发表机构 * Data Science for Social Impact, University of Pretoria(社会影响力数据科学,南非普里特里亚大学) Instituto Politécnico Nacional(技术学院国家) Department of African Languages, University of Pretoria(非洲语言系,南非普里特里亚大学) Imperial College London(伦敦帝国理工学院)

AI总结 通过分析Setswana情感数据集,发现标注者间一致性随时间下降的主要原因是时间同步性,即同时标注的样本一致性高,而间隔较长的标注一致性低。

详情
AI中文摘要

当标注活动跨越数周或数月且标注者池较小时,标注质量难以维持。我们提出了一个Setswana情感数据集,包含3,565条推文,由三名母语标注者在八个批次中标注,并考察了标注者间一致性(IAA)随时间下降的原因。尽管总体Randolph自由边际Kappa为$κ= 0.76$,属于“优秀”,但每批次$κ$在整个标注任务中下降了超过32个百分点。通过六项针对性分析,我们发现:(i) 标签混淆集中在负面/中性边界;(ii) 两名标注者表现出与自动驾驶标注一致的运行长度漂移;(iii) $κ$的主要预测因子是时间同步性:一分钟内标注的推文达到$κ= 0.98$,而相隔超过一天标注的推文仅达到$κ= 0.65$。标注速度和推文级语言特征与$κ$无显著关联。我们评估了三种开放多语言编码器和专有模型(GPT-5和Gemini)在三类情感分类任务上的表现;微调相比预训练基线提升了29到43个宏F1分数,其中GPT-5少样本学习总体领先(62.2宏F1)。我们发布了数据集、每条标注的时间戳和分析代码,以支持未来非洲语言NLP资源的可重复质量审计。

英文摘要

Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Despite an aggregate Randolph's free-marginal Kappa of $κ= 0.76$, "excellent," per-batch $κ$ falls by more than 32 points across the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of $κ$ is temporal simultaneity: tweets labeled within one minute achieve $κ= 0.98$, while those labeled more than a day apart reach only $κ= 0.65$. Annotation speed and tweet-level linguistic features show no meaningful association with $κ$. We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.

2605.27220 2026-05-27 cs.CL cs.IR 版本更新

The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

覆盖幻觉:从检索前路由失败到生产RAG系统中的检索后级联

Zafar Hussain, Kristoffer Nielbo

发表机构 * Aarhus University(奥胡斯大学)

AI总结 本文通过丹麦国家百科全书的案例研究,发现合成查询高估了LLM增强的需求(覆盖幻觉),并提出一种检索后级联策略,按成本递增顺序执行工作流,仅在无结果时升级到LLM增强,从而在无需训练开销的情况下提升质量并降低延迟。

详情
AI中文摘要

在现代RAG流水线中,HyDE和查询扩展等查询增强方法被应用于每个查询,导致大量的LLM推理成本和端到端延迟增加。这种开销在实际生产流量中的经验依据仍未得到充分探索。我们以丹麦国家百科全书为案例研究,评估了来自生产流量和合成条件的20,000个查询-工作流对上的五种检索工作流。在该系统中,合成查询表明超过90%的查询需要LLM增强才能实现高检索覆盖率。然而,在我们的生产延迟策略下,只有27.8%的真实用户查询需要LLM增强。我们将这种差距称为覆盖幻觉,并将其归因于合成查询与真实查询分布之间的结构性不匹配。检索前路由无法解决这一差距,因为LLM增强的需求只有在搜索索引后才能揭示,这一结果得到了我们对四种机器学习范式的评估的证实。这种仅从查询无法检测到的覆盖差距,促使我们采用检索后级联策略,该策略按成本递增顺序运行工作流,仅当某一步骤未返回文档时才升级到LLM增强。该级联策略完全无需训练开销或辅助服务基础设施,在质量上比Always-HyDE提高了+0.140综合总体分数,延迟降低了31.8%,并且72.2%的真实用户查询无需LLM增强即可得到服务。

英文摘要

In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval coverage. However, under our production deferral policy, only 27.8% of real user queries need LLM augmentation. We call this gap the Coverage Illusion and attribute it to a structural mismatch between synthetic and real query distributions. Pre-retrieval routing cannot resolve this gap, as the need for LLM augmentation is only revealed after searching the index, a result confirmed by our evaluation of four machine learning paradigms. The coverage gap, undetectable from the query alone, motivates a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when a step returns no documents. Operating entirely without training overhead or secondary serving infrastructure, the cascade improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and serves 72.2% of real user queries without LLM augmentation.

2605.27204 2026-05-27 cs.CL cs.IR 版本更新

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

GraphReview: 基于LLM的图消息传递的科学论文评估

Pujun Zheng, Wanying Ren, Jiacheng Yao, Guoxiu He, Star X. Zhao

发表机构 * School of Economics and Management, East China Normal University(东华师范大学经济管理学院) Institute of Big Data, Fudan University(复旦大学大数据研究院)

AI总结 提出GraphReview框架,通过图消息传递整合论文内在质量、同期关联和历时关联,利用LLM生成节点先验和边比较证据,结合个性化PageRank进行质量排序、决策预测和审稿生成,在决策和排序指标上平均提升29.7%。

详情
AI中文摘要

科学论文评估通常不仅涉及评估稿件本身,还需要将其与同期研究和先前文献联系起来。然而,现有的基于LLM的方法通常分别建模这些信号,缺乏跨论文传播审稿证据的统一机制。我们提出$ extbf{GraphReview}$,一个基于图的LLM框架,将论文评估形式化为在语义论文图上进行审稿信号的消息传递。该图联合捕捉内在质量、同期论文之间的同步链接以及指向先前工作的历时链接。LLM用于估计节点级质量先验,并通过成对论文比较生成边级比较证据,而个性化PageRank整合审稿信号用于质量排序、决策预测和审稿生成。为了生成更高质量的图证据,我们提出了奖励诱导的最大似然目标来训练LLM骨干网络。实验表明,GraphReview始终优于最强基线,在决策和排序指标上平均提升29.7%,包括准确率提升23.7%,Spearman's $ρ$提升57.6%。它还生成更高质量的审稿文本,并在不同时间段和会议场所中有效泛化。代码可在https://github.com/ECNU-Text-Computing/GraphReview获取。

英文摘要

Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose $\textbf{GraphReview}$, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's $ρ$. It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at https://github.com/ECNU-Text-Computing/GraphReview.

2605.27195 2026-05-27 cs.CL 版本更新

EpiCurveBench: Evaluating VLMs on Epidemic Curve Digitization

EpiCurveBench: 评估视觉语言模型在流行病曲线数字化中的表现

Thomas Berkane, Maimuna S. Majumder

发表机构 * Computational Health Informatics Program(计算健康信息学项目) Boston Children’s Hospital & Harvard Medical School(波士顿儿童医院及哈佛医学院)

AI总结 针对现有图表数据提取基准中忽略时间序列结构的问题,提出包含1000张真实流行病曲线图像的EpiCurveBench基准和基于动态规划的EpiCurveSimilarity评估指标,实验表明最强模型仅达52.3% ECS,且ECS能更好区分模型性能。

详情
AI中文摘要

使用视觉语言模型(VLM)进行图表到数据提取的评估,越来越多地依赖于那些显示递减余量的基准(前沿VLM在ChartQA上超过89%)以及将提取点视为无序键值对的指标,忽略了时间序列的时间结构,并将小的对齐偏移视为灾难性失败。我们通过EpiCurveBench(一个从多种公共卫生来源精选的1000张真实流行病曲线图像基准)和EpiCurveSimilarity(ECS,一种通过动态规划对齐预测序列和真实序列的评估指标,容忍局部时间偏移和间隙,同时按比例惩罚它们)来解决这两个空白。评估六种方法——三种前沿闭源VLM、一种开源VLM和两种专门的图表提取系统——我们发现最强的模型仅达到52.3% ECS,并且ECS将四种通用VLM分散在25个百分点的范围内,而键值指标(RMS、SCRM)将它们压缩在5个百分点的范围内。我们进一步针对四个下游流行病学汇总统计量验证ECS,发现更高的ECS预测更小的总计数、峰值时间和峰值幅度误差,以及更高的增长率保真度;在所有四个统计量中,ECS的相关性比动态时间规整强1.5-3.6倍,后者缺乏间隙惩罚,因此无法区分截断预测与时间保真预测。EpiCurveBench针对一个高影响力的公共卫生应用——解锁被困在已发表图表中的数十年的疫情数据——但该基准和指标直接适用于任何结构化时间序列图表提取场景。

英文摘要

Chart-to-data extraction with vision-language models (VLMs) is increasingly evaluated on benchmarks that show diminishing headroom (frontier VLMs exceed 89% on ChartQA) and with metrics that treat extracted points as unordered key-value pairs, ignoring the temporal structure of time series and penalizing small alignment shifts as catastrophic failures. We address both gaps with EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images curated from diverse public-health sources, and EpiCurveSimilarity (ECS), an evaluation metric that aligns predicted and ground-truth series via dynamic programming, tolerating local temporal shifts and gaps while penalizing them proportionally. Evaluating six methods--three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems--we find the strongest model reaches only 52.3% ECS, and that ECS spreads the four general-purpose VLMs over a 25-point range where key-value metrics (RMS, SCRM) compress them into a 5-point band. We further validate ECS against four downstream epidemiological summary statistics, finding that higher ECS predicts smaller errors in total counts, peak timing, and peak magnitude, and higher growth-rate fidelity; across all four, ECS correlates 1.5--3.6 times more strongly than Dynamic Time Warping, which lacks a gap penalty and therefore cannot distinguish a truncated prediction from a temporally faithful one. EpiCurveBench targets a high-impact public-health application--unlocking decades of outbreak data trapped in published figures--but the benchmark and metric apply directly to any structured time-series chart-extraction setting.

2605.27194 2026-05-27 cs.CL cs.CV cs.LG 版本更新

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

并非所有标记都同等重要:基于关键标记监督的动态上下文向量蒸馏用于长医学报告生成

Ning Wu, Rui Liu, Xinkun Lin, Weixing Chen, Jinxi Xiang, Tao Wei, Lina Yao, Mingjie Li

发表机构 * UNSW Sydney(新南威尔士大学悉尼分校) University of Technology Sydney(技术大学悉尼分校) School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) Stanford University(斯坦福大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出DIVE框架,通过关键标记监督和状态条件动态引导,解决长文本生成中标记级蒸馏忽略关键标记的问题,在医学报告生成任务上取得最佳性能。

Comments Preprint. 20 pages, 6 figures

详情
AI中文摘要

将示范效果蒸馏到隐藏空间干预中提供了一种轻量级的替代全微调的方法。然而,现有的多模态变体主要是在短文本任务上评估的,其中输出在几个标记后结束。将这些方法扩展到长文本生成暴露了一个基本但未充分研究的局限性:标记级蒸馏隐式地将所有输出标记视为同等信息量,但长文本输出由高频模板和语法标记主导,而实际决定输出质量的标记稀疏分布。在医学报告生成(MRG)中,有两种这样的关键标记突出:决定诊断内容的病理相关标记和决定终止的序列结束(EOS)事件。两者在均匀交叉熵下都受到不足的监督,自回归解码通过偏离教师强制轨迹进一步加剧了问题。我们提出DIVE,一个冻结骨干的蒸馏框架,通过两种与这些失败相匹配的互补机制来解决长文本报告生成。关键标记监督通过提高病理相关标记和EOS事件的交叉熵贡献来恢复监督平衡,确保内容保真度和终止在训练期间学习,而不是在解码时施加。状态条件动态引导用隐藏状态相关的适配器替换固定的开环残差,允许注入信号随着解码漂移而适应。在MIMIC-CXR和CheXpert Plus上使用两个医学VLM骨干的实验表明,DIVE在词汇和临床代理指标中始终位列最强方法之一。我们的方法在所有数据集-骨干设置中实现了最佳的BLEU-4、ROUGE-L和RadGraph F1,同时在粗粒度标签级CheXbert F1上保持竞争力。

英文摘要

Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset--backbone settings, while remaining competitive on coarse label-level CheXbert F1.

2605.27190 2026-05-27 cs.CL cs.AI cs.LG cs.SD 版本更新

Learning When to Think While Listening in Large Audio-Language Models

在大音频语言模型中学习何时在聆听时思考

Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu, Cheng Zhu, Jiatao Gu

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出一种可学习的等待-思考-回答控制机制,通过多奖励强化学习优化大音频语言模型在流式语音交互中的推理时机,在提升准确率的同时减少响应延迟。

Comments 19 pages, 4 figures, 6 tables

详情
AI中文摘要

近期大音频语言模型(LALMs)的进展使得实时、流式的语音交互越来越实用。在这种场景下,推理质量和响应速度紧密耦合:将推理延迟到语音端点可以提高答案质量,但会将思考时间转移到用户可见的响应延迟中,而过早回答则可能在决定性证据到达之前做出承诺。我们为LALMs引入了一种可学习的等待-思考-回答控制公式。受人类对话渐进性启发,控制器在部分音频证据下决定何时等待、何时外化紧凑的推理更新、以及何时回答。以Qwen2.5-Omni-7B为基础模型,我们从语音推理数据中构建对齐的等待-思考-回答轨迹,使用监督微调(SFT)训练控制器,然后应用解耦裁剪和动态采样策略优化(DAPO)。奖励结合了答案正确性、动作有效性、更新时机、延迟同步、推理质量和链一致性,优化完整的等待-思考-回答轨迹,而不仅仅是最终答案。在一个六任务合成语音推理问答(SRQA)基准上,六奖励DAPO控制器将行加权准确率从67.6%提升到70.3%,同时在相同Qwen部署环境下将端点后最终思考长度减少14%。在一个包含186个人类录音的真实音频基准(Real Audio Bench)上,作为超越文本转语音(TTS)渲染语音的迁移检查,控制器家族仍然有效:SFT实现了最强的准确率,而六奖励DAPO控制器是唯一最终思考长度低于基础模型的学习变体。这些结果表明,流式模型应该学习在音频流中何时使中间推理显式化。

英文摘要

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.

2605.27189 2026-05-27 cs.CL cs.LG cs.SD eess.AS q-bio.NC 版本更新

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

超越二元:认知评分层级中的语音表征

Serli Kopar, Roshan Prakash Rane, Christian Mychajliw, Lydia Federmann, Gerhard Eschweiler, Daniela Berg, Sam Gijsen, Paula Andrea Perez-Toro, Kerstin Ritter

发表机构 * 1 Hertie Institute for AI in Brain Health, University of Tübingen, Tübingen, Germany 2 Tübingen AI Center, University of Tübingen, Tübingen, Germany 3 Department of Psychology, Humboldt-Universität zu Berlin 4 Geriatric Center, Tübingen University Hospital, Tübingen, Germany 5 Tübingen Center for Mental Health (TüCMH), Department of Psychiatry Psychotherapy, Tübingen University Hospital, Tübingen, Germany 6 German Center for Mental Health (DZPG), Partner Site Tübingen, Tübingen, Germany 7 Department of Neurology, University Medical Center Schleswig-Holstein Kiel University, Kiel, Germany 8 Center for Neurology, University Hospital Tübingen Hertie Institute for Clinical Brain Research, Tübingen, Germany 9 Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany 10 Charit\'e--Universit\"atsmedizin, Department of Psychiatry

AI总结 本研究利用5,754份德语神经心理学评估录音,比较手工声学特征与自监督学习嵌入在轻度认知障碍认知评估层级(任务、领域、全局)中的表现,发现任务约束与评估层级之间的关联。

详情
AI中文摘要

本研究考察了轻度认知障碍中语音表征与认知评估层级结构之间的关系。利用5,754份德语神经心理学评估录音,我们在三个评分层级(任务、领域和全局)上评估了六项认知任务。我们比较了手工声学特征与自监督学习(SSL)嵌入。结果表明,尽管SSL表示在较低层级通常优于手工特征,但这种趋势在MCI分类中发生逆转。此外,任务特定约束影响性能:响应自由度较大的任务随着层级增加表现出性能稀释,表明“专家”表示,而高度结构化任务的性能向更高层级增加,表明“通才”表示。这些发现揭示了自动临床语音分析中任务约束与评估层级之间的联系。

英文摘要

This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence performance: tasks with greater response freedom exhibit performance dilution as hierarchical levels increase, suggesting ``specialist'' representations, whereas the performance of highly structured tasks increases toward higher levels, suggesting ``generalist'' representations. These findings show links between task constraints and assessment hierarchy in automated clinical speech analysis.

2605.27186 2026-05-27 cs.CL 版本更新

MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation

MAIGO: 通过历史清理的在线策略自蒸馏缓解对话丢失

Haoyu Zheng, Yun Zhu, Shu Yuan, Shangming Chen, Qing Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang

发表机构 * Zhejiang University(浙江大学) Shanghai AI Laboratory(上海人工智能实验室) Huazhong University of Science and Technology(华中科技大学) Fuzhou University(福州大学) Tencent(腾讯)

AI总结 针对大语言模型在多轮对话中性能下降(对话丢失)的问题,提出MAIGO方法,通过在线策略自蒸馏和清理历史助手回复来减少自污染,无需验证器或推理时辅助,显著提升多轮对话准确性。

详情
AI中文摘要

大语言模型通常能从完整指定的提示中解决任务,但当相同需求在多轮中展开时,性能会下降,这被称为对话丢失(LiC)差距。我们将这种退化部分归因于自污染:中间助手的回复进入后续上下文,并将早期偏差向前传递。受此机制启发,我们提出了MAIGO,一种在线策略自蒸馏方法,通过使用模型自身策略的历史清理参考来减少这种污染。对于中间轮次,MAIGO移除先前的助手回复,同时保留用户可见的分片前缀;对于回答轮次,它从基于完整用户侧对话的配对全视图参考中蒸馏。一个可靠性权重降低与干净参考不一致的中间轮次样本的权重。MAIGO不需要验证器奖励、状态标签或推理时辅助。在具有确定性验证器的LiC配对视图协议下,MAIGO将Qwen2.5-7B-Instruct的SHARDED准确率从52.8提升至66.1,SHARDED/FULL比率从66.5%提升至84.1%,同时保持FULL准确率在2.3个点以内。这些结果表明,自污染是LiC差距中一个可训练的成分。

英文摘要

Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.

2605.27168 2026-05-27 cs.CL cs.AI cs.CY 版本更新

Grounding Text Embeddings in Stakeholder Associations

将文本嵌入与利益相关者关联对齐

Jonathan Rystrøm, Sofie Burgos-Thorsen, Zihao Fu, Johan Irving Søltoft, Kenneth C. Enevoldsen, Chris Russell

发表机构 * University of Oxford(牛津大学) Institute for Wicked Problems(复杂问题研究所) The Chinese University of Hong Kong(香港中文大学) Danish Technical University(丹麦技术大学) Aarhus University(奥胡斯大学)

AI总结 提出利益相关者对齐练习方法,通过评估嵌入模型与人类专家的语义距离一致性,发现神经文本嵌入在丹麦政策案例中可靠性显著低于专家(差距19-26个百分点),且该差距在美国联邦AI用例中复现(16个百分点)。

详情
AI中文摘要

文本嵌入被广泛用于分析大型复杂文本语料库。然而,尚不清楚这些嵌入是否捕捉到与使用它们的人类专家相同的语义距离。确保嵌入表示与人类意图一致对于有效分析至关重要。我们提出了利益相关者对齐练习,这是一种使专家关联显式化并将嵌入模型结果扎根于人类理解的方法。在我们关于丹麦政策问题的主要案例研究中,我们发现神经文本嵌入的可靠性远低于人类专家(差距19-26个百分点),并且这种不对齐会传播到下游聚类性能(练习排名与聚类质量之间的Spearman $ρ=0.9$)。一项关于美国联邦AI用例的二次研究使用数字协议和不同的专家社区在英语中复现了该差距(16个百分点)——表明该差距并非单一工具或领域的产物。利益相关者对齐练习提供了一种实用方法,用于评估嵌入模型是否捕捉到对领域专家最重要的语义区分。

英文摘要

Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $ρ=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.

2605.27161 2026-05-27 cs.CL 版本更新

Formalization of Malagasy conjugation

马达加斯加语动词变位的形式化

Joro Ny Aina Ranaivoarison, Eric Laporte, Baholisoa Simone Ralalaoherivony

发表机构 * University of Antananarivo, Centre Interdisciplinaire de Recherche Appliquée au Malgache, Madagascar(Antananarivo大学,马达加斯加跨学科应用研究中心) Université Paris-Est, Laboratoire d'informatique Gaspard-Monge CNRS UMR 8049, F77454 Marne-la-Vallée, France(巴黎-est大学,加斯帕德-蒙戈实验室,法国马恩拉瓦尔谷,UMR 8049)

AI总结 本文基于Unitex平台,通过构建电子词典和有限状态转换器,实现了马达加斯加语简单动词的形态分析,并优先保证可读性以便语言学家扩展更新。

详情
Journal ref
Language and Technology Conference, 2013, Poznań, Poland, pp.457-462
AI中文摘要

本文报告了为构建基于词典的马达加斯加语简单动词形态分析器所进行的核心语言工作。该分析器使用Unitex平台,包括构建马达加斯加语简单动词的电子词典。数据基于形态特征进行编码。动词词干的形态变化及其与屈折词缀的组合通过可编辑图表示的有限状态转换器进行形式化。78个转换器使Unitex能够生成词干变体的词典。另外271个转换器被形态分析器用于识别变位动词中的词干和词缀。词典和转换器的设计优先考虑可读性,以便语言学家能够扩展和更新它们。

英文摘要

This paper reports the core linguistic work performed to construct a dictionary-based morphological analyser for Malagasy simple verbs. It uses the Unitex platform and comprised the contruction of an electronic dictionary for Malagasy simple verbs. The data is encoded on the basis of morphological features. The morphological variations of verb stems and their combination with inflectional affixes are formalized in finite-state transducers represented by editable graphs. 78 transducers allow Unitex to generate a dictionary of allomorphs of stems. 271 other transducers are used by the morphological analyser of Unitex to recognize the stem and the affixes in conjugated verbs. The design of the dictionary and transducers prioritizes readability, so that they can be extended and updated by linguists.

2605.27156 2026-05-27 cs.CL cs.AI 版本更新

LitSeg: Narrative-Aware Document Segmentation for Literary RAG

LitSeg: 面向文学RAG的叙事感知文档分割

Ruikang Zhang, Zhanni Chen, Yiqiao Cai, Qi Su

发表机构 * Peking University(北京大学)

AI总结 提出LitSeg,一种基于叙事理论引导的文档分割框架,通过多阶段提示提取事件、梳理叙事线索并定位转折点,以解决现有分割方法忽视文学叙事结构导致检索与生成性能下降的问题,并引入轻量版LitSeg-Lite通过数据蒸馏降低计算开销。

详情
AI中文摘要

检索增强生成(RAG)通过引入外部知识增强了大型语言模型(LLMs),特别是在文学作品等长尾领域。然而,RAG中关键的文档分割步骤仍未得到充分探索。现有策略通常语义盲目,忽视了文学作品复杂的叙事结构,常常导致情节碎片化和指代不清,严重阻碍了检索和生成性能。为了解决这一问题,我们提出了LitSeg,一种新颖的叙事理论引导的分割框架。通过采用多阶段提示,LitSeg明确提取有效事件,梳理叙事线索,阐明叙事结构,并定位转折点以指导分割。为了减轻大规模模型多阶段推理的计算开销,我们进一步引入了LitSeg-Lite,一种轻量级的单遍分块器,通过两阶段训练策略在LitSeg生成的数据上进行微调,将复杂过程蒸馏为单次推理。大量实验表明,通过结构独立的文本块,我们的方法在检索准确性和上下文相关性上显著优于基线,最终提升了下游问答性能,而消融研究验证了叙事学指导和数据蒸馏的有效性。

英文摘要

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail domains such as literary works. However, the critical step of document segmentation in RAG remains largely underexplored. Existing strategies are typically semantically blind and overlook the complicated narrative structures of literary works, often resulting in fragmented plots and unclear references that severely hinder retrieval and generation performance. To address this, we propose LitSeg, a novel narrative-theory-guided segmentation framework. By employing multi-stage prompting, LitSeg explicitly extracts valid events, untangles narrative threads, clarifies narrative structures, and locates turning points to inform segmentation. To alleviate the computational overhead of multi-stage inference with large-scale models, we further introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data via a two-stage training strategy, distilling the complex process into a single inference pass. Extensive experiments demonstrate that with structurally independent text chunks, our methods significantly improve retrieval accuracy and context relevance over baselines, ultimately enhancing downstream QA performance, while ablation studies validate the efficacy of narratological guidance and data distillation.

2605.27110 2026-05-27 cs.CR cs.CL 版本更新

BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning

BAIT: 基于边界引导的自我条件推理披露升级

Xuan Luo, Yue Wang, Geng Tu, Jing Li, Ruifeng Xu

发表机构 * The Harbin Institute of Technology(哈尔滨工业大学) The Hong Kong Polytechnic University(香港理工大学) Shenzhen University(深圳大学) Shenzhen Loop Area Institute(深圳环城区域研究院)

AI总结 提出BAIT三步越狱框架,通过识别保护边界、细化边界和请求详细示例,利用模型自身推理和一致性倾向实现恶意目标披露,实验表明在多个基准上攻击成功率显著优于传统方法。

详情
AI中文摘要

在这项工作中,我们提出了BAIT(边界感知迭代陷阱),一个三步越狱框架,通过内部披露接近恶意目标。BAIT首先要求模型识别保护边界,然后要求其细化该边界,最后请求一个详细示例。通过在每个步骤中扩展模型之前的响应,BAIT将模型自身的推理和一致性倾向转变为披露路径。在AdvBench、JailbreakBench、AIR-Bench和SORRY-Bench上的实验表明,BAIT在顶级大语言模型上持续实现高攻击成功率,显著超越了传统的越狱基线。进一步分析揭示:1)预防导向的框架显著优于直接知识请求;2)细化步骤在披露升级中起关键作用;3)前两步有一定概率引发有害内容,同时触发很少的过滤。

英文摘要

In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly advancing conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge request; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps have a certain chance of eliciting harmful content while triggering little filtering.

2605.27101 2026-05-27 cs.CV cs.CL 版本更新

Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

弹出式干扰揭示视频大语言模型中的事件袋行为

Oscar Chew, Serhii Honcharenko, Qian-Hui Chen, Patricia Lu, Dishant Zaveri, Khoa D. Doan, Kuan-Hao Huang

发表机构 * Texas A&M University(德克萨斯A&M大学) National Taiwan University(台湾国立大学) Stanford University(斯坦福大学) VinUniversity(文大学)

AI总结 通过插入无关广告片段,发现视频大语言模型常将不同片段的事件错误关联,表现出将视频视为事件集合而非时间序列的“事件袋”行为。

详情
AI中文摘要

视频理解的一个关键能力是跨时间可靠地将主体与事件联系起来,然而视频大语言模型(VideoLLMs)是否真正实现了这一点仍不清楚。在这项工作中,我们引入了DistractionBench来评估VideoLLMs在存在无关视频片段的情况下是否能稳健地关联主体和事件。通过受控干预,例如在较长视频中插入短广告片段,我们表明VideoLLMs经常幻觉出不同片段中实体之间的交互,错误地将注入广告中的动作归因于主视频中的主体。我们将这种系统性幻觉表征为事件袋(BoE)行为,其中模型将视频视为事件的集合而非时间结构化的序列。评估11个流行的VideoLLMs,我们发现所有模型都表现出显著的BoE行为。我们的发现表明VideoLLMs缺乏可靠的时间接地机制,并激励开发具有更稳健主体-事件关联的模型。

英文摘要

A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.

2605.27091 2026-05-27 cs.CL cs.AI 版本更新

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

MiRD:通过误覆盖风险分解实现开放式问答的可靠集值预测

Anqi Hu, Zhiyuan Wang, Zijun Jia, Bo Fu

发表机构 * University of Electronic Science and Technology of China(电子科学与技术大学) Beihang University(北航)

AI总结 提出MiRD两阶段框架,通过将整体误覆盖分解为采样失败和条件选择失败,在开放式问答中实现可靠的集值预测,控制采样风险和条件选择风险,并产生更紧的边界和更自适应的预测集。

详情
AI中文摘要

可靠的集值预测为缓解开放式问答中的幻觉提供了一种原则性方法,但现有的共形方法通常依赖于一个脆弱的假设:有限采样必须已经产生至少一个可接受的候选,或者违反此条件的校准示例被丢弃。在本文中,我们介绍了MiRD,一个两阶段框架,将整体误覆盖分解为采样失败和条件选择失败。在第一阶段,MiRD在固定预算下,对有限采样不产生可接受答案的概率建立了一个期望水平的边际上界。在第二阶段,基于采样成功,MiRD使用在整个校准集上定义的与接受性相关的非一致性分数来校准共形选择阈值,从而保持校准集的完整性。在三个开放式问答数据集和八个模型上,MiRD控制了采样风险、条件选择风险和整体误覆盖,同时产生了比PAC风格替代方案更紧的第一阶段边界,以及比仅成功校准更自适应的预测集。

英文摘要

Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing conformal approaches typically rely on a fragile premise: finite sampling must already produce at least one admissible candidate, or calibration examples violating this condition are discarded. In this paper, we introduce MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure. In Stage I, MiRD establishes an expectation-level marginal upper bound on the probability that finite sampling produces no admissible answer under a fixed budget. In Stage II, conditioned on sampling success, MiRD calibrates a conformal selection threshold using admission-correlated nonconformity scores defined over the full calibration set, thereby preserving calibration-set integrity. Across three open-ended QA datasets and eight models, MiRD controls sampling risk, conditional selection risk, and overall miscoverage, while yielding tighter first-stage bounds than PAC-style alternatives and more adaptive prediction sets than successful-only calibration.

2605.27088 2026-05-27 cs.CL cs.LG 版本更新

LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring

LLMs 已经是好导师:面向教学数学辅导的无训练提示优化

Unggi Lee, Minchul Shin, Yeil Jeong, Sookbun Lee, Jeongsu Moon, Kyungtae Joo, Eunjoo Lee, Hoilym Kwon

发表机构 * Korea University Sejong Campus(韩国大学世宗校区) Gyeonggi Institute of Education(京畿教育学院) Indiana University Bloomington(印第安纳大学布卢明顿分校) Opentutorials Chosun University(全州大学) Korea University Korean Studies Center(韩国大学韩学研究中心)

AI总结 本研究探索通过API调用优化系统提示的无训练方法,提出5种教育专用方法,在2个OOD基准上评估12种方法,发现所有方法均超越最强RL训练基线,ParetoGrad在事后解决率、泄漏控制和有用性上达到最佳帕累托平衡。

Comments 17 pages, 5 figures

详情
AI中文摘要

将LLMs与数学辅导对齐通常需要基于RL的训练和多GPU基础设施。我们研究无训练提示优化——仅通过API调用演化系统提示——是否可以作为实用替代方案。我们改编了7种已发表方法并提出了5种教育专用方法,在2个OOD基准套件上的5种条件下评估这12种方法。所有12种最佳方法配置均超越了最强的RL训练基线(R_total = 0.633),我们的ParetoGrad在事后解决率、泄漏控制和有用性上实现了最佳帕累托平衡,而非在任何单一组件上占优。使用包含82个代码的教育代码本进行行为分析发现,无训练方法依赖教学知识模式的频率是RL训练模型的2-3倍,同时意图级脚手架减少了约10个百分点。我们还发现一个任务依赖的推理模式效应,在无训练和基于RL的范式中一致。我们的方法仅通过提示和最小计算即可高效开发教学对齐的LLM导师。

英文摘要

Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization-evolving only the system prompt via API calls-can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.

2605.27083 2026-05-27 cs.CL cs.CR 版本更新

On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning

反事实知识训练在LLM遗忘中的隐藏代价

Xiaotian Ye, Xiaohan Wang, Mengqi Zhang, Shu Wu

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Huazhong University of Science and Technology(华中科技大学) Shandong University(山东大学) NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences(神经网络计划、人工智能研究所、中国科学院自动化研究所)

AI总结 本文发现反事实微调(CFT)在LLM遗忘中存在知识冲突和幻觉溢出两大问题,并引入扩展基准RWKU+及诊断工具进行系统分析。

详情
AI中文摘要

反事实微调(CFT)已成为大语言模型(LLM)遗忘的一种有前景的范式,通过训练模型生成替代的虚构知识来取代不需要的内容。然而,在这项工作中,我们发现该范式在某些方面仍不如其他范式,并识别出导致这一差距的两个先前被忽视的陷阱:(1)知识冲突,即反事实语料库中的相互不一致导致冲突梯度,破坏参数优化;(2)幻觉溢出,即拟合虚假目标会灌输持久的捏造偏差,增加无关领域的幻觉率。为了系统诊断这些问题,我们引入了RWKU+,这是一个扩展的基准,配备了新颖的权衡指标和梯度级诊断工具。我们的工作进一步讨论了该范式的局限性和开销,旨在为更严格的LLM遗忘研究提供见解和可操作的指导。

英文摘要

Counterfactual tuning (CFT) has emerged as a promising paradigm for Large Language Model (LLM) unlearning by training models to generate alternative fictitious knowledge in place of undesired content. However, in this work, we find that this paradigm still underperforms other paradigms in some aspects, and identify two previously overlooked pitfalls underlying this gap: (1) knowledge conflict, where mutual inconsistencies within counterfactual corpora induce conflicting gradients that disrupt parameter optimization, and (2) hallucination spillover, where fitting false targets instills a persistent fabrication bias, inflating hallucination rates on unrelated domains. To systematically diagnose these issues, we introduce RWKU+, an extended benchmark equipped with novel trade-off metrics and gradient-level diagnostic tools. Our work further discusses the limitations and overhead of the paradigm, aiming to provide insights and actionable guidance for more rigorous LLM unlearning research.

2605.27072 2026-05-27 cs.CL cs.AI 版本更新

E3: Issue-Level Backtesting for Automated Research Critique

E3: 面向自动化研究评论的问题级回测

Yashwardhan Chaudhuri, Sanyam Jain, Paridhi Mundra

AI总结 提出E3自动化评论助手,通过问题级回测协议评估其在识别研究论文技术问题上的表现,相比人类评审和LLM基线实现最高召回率。

详情
AI中文摘要

我们提出E3,一个自动化评论助手,通过识别研究论文中与决策相关的技术问题来增强评审者和工程团队。对于每个问题,E3报告其性质、位置、对贡献的影响以及解决该问题所需的分析或证据,涵盖无根据的主张、缺失的消融实验、弱基线、隐藏假设、有效性威胁和数据泄露风险。为了在没有污染混杂因素的情况下评估E3,我们采用问题级回测协议:语料库仅限于每个自动化来源训练截止日期之后发表的论文,并且对于每篇论文,一个仅观察匿名评审的元裁判将每个问题-来源对标记为“捕获”、“部分”或“遗漏”。应用于100篇ICLR 2026论文和4598个被评判的问题行,将E3与ICLR人类评审以及基于OpenAI的gpt-5.4和Anthropic的claude-opus-4-6构建的两个提示匹配的LLM基线进行比较,使用元裁判gpt-5.5,E3在每个聚合指标上达到最高召回率。包含部分的召回率达到90.2%,比GPT高15.5个百分点,比Claude高17.1个百分点,比人类评审高29.2个百分点,严格召回率保持顺序为65.8%。在人类评审提出的问题上,E3恢复了89.6%;在人类评审遗漏的问题上,它额外发现了1635行被纳入评判联合集,比次优来源多406行。语料库、基线提示、裁判提示模板和评估代码已发布。

英文摘要

We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical concerns in research papers. For each concern, E3 reports its nature, its location, its bearing on the contribution, and the analysis or evidence that would resolve it, covering unsupported claims, missing ablations, weak baselines, hidden assumptions, threats to validity, and leakage risks. To evaluate E3 without contamination confounds we adopt an issue-level backtesting protocol: the corpus is restricted to papers postdating the training cutoff of every automated source, and for each paper a meta-judge that observes only anonymised reviews labels every issue-source pair as Caught, Partial, or Missed. Applied to 100 ICLR 2026 papers and 4598 judged issue rows, comparing E3 against the ICLR human reviews and two prompt-matched LLM baselines built on gpt-5.4 from OpenAI and claude-opus-4-6 from Anthropic, with meta-judge gpt-5.5, E3 attains the highest recall on every aggregate metric. Partial-inclusive recall reaches 90.2 percent, which is 15.5 points over GPT, 17.1 points over Claude, and 29.2 points over the human reviews, and strict recall preserves the ordering at 65.8 percent. On concerns raised by the human reviewers, E3 recovers 89.6 percent; on concerns the human reviewers missed it surfaces 1635 additional rows admitted into the judged union, 406 above the next-best source. Corpus, baseline prompts, judge prompt template, and evaluation code are released.

2605.27068 2026-05-27 cs.CL cs.AI cs.MA 版本更新

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

QUACK: 多模态社交推理智能体中的沟通知识质疑、理解与审计

Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克人工智能研究所) University of Cambridge(剑桥大学) MBZUAI - Mohamed bin Zayed University of Artificial Intelligence(MBZUAI - 摩苏尔·本·扎耶德人工智能大学) University of Toronto(多伦多大学) Salesforce

AI总结 提出QUACK框架,通过游戏结果、行为轨迹和话语一致性三级评估,自动审计多模态社交推理智能体语言与感知行为的一致性,发现最强智能体仍有15.1%的空间幻觉和过半无据指控。

详情
AI中文摘要

社交推理游戏已成为探测大型语言模型智能体推理、欺骗、协调和信念建模的热门测试平台。然而,大多数环境仅通过胜率等游戏结果评分,且主要局限于纯文本交互,难以判断智能体的语言是否真正基于其感知和行动,也难以识别其行为背后的失败模式。为填补这一空白,我们引入了QUACK,一个用于审计多模态社交推理中智能体语言基础的开源环境和评估框架。QUACK在三个层面评估智能体:游戏结果、行为轨迹和话语级一致性。其核心的陈述验证流水线从引擎日志重建每个智能体的真实轨迹,并对照检查每个讨论声明,自动标记空间幻觉、无据指控、欺骗崩溃和语言-行动不一致。在同质和跨模型对抗设置下评估三个前沿视觉语言模型,我们发现即使是最强的智能体,其可验证的空间声明中有15.1%是幻觉,且超过一半的指控缺乏有据证据。我们在https://github.com/AAAAA-Academia-Attractions/QUACK发布完整的引擎、评估框架、工具包和日志。

英文摘要

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.

2605.27066 2026-05-27 cs.CL cs.IR 版本更新

Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search

工业搜索中基于大语言模型的查询驱动事件时间线摘要

Mingyue Wang, Xingyu Xie, Hang Yang, Li Gao, Lixin Su, Ge Chen, Dawei Yin, Daiting Shi

发表机构 * Baidu Inc.(百度公司)

AI总结 提出QDET系统,通过多任务微调和强化学习实现查询驱动的事件时间线摘要,在百度搜索中显著提升用户参与度。

Comments Accepted at KDD 2026

详情
AI中文摘要

理解事件如何随时间演变对于处理热门新闻查询的搜索引擎至关重要。我们提出了QDET(查询驱动事件时间线摘要),这是一个部署在百度搜索上的生产系统,用于构建聚焦的事件时间线以解释特定查询事件。与传统的以主题为中心、旨在全面覆盖的方法不同,QDET从每天检索的数百万文档形成的嘈杂候选集中识别并组织与查询密切相关的子事件。QDET包含两个关键创新:(1)多任务监督微调,包含三个辅助任务——时间顺序、因果判断和时间线完成——使紧凑模型在专业领域匹配更大通用模型的性能;(2)基于强化学习的事件简洁摘要,在保持语义质量的同时强制执行严格长度约束,实现了88.2%的长度合规性,并在约束满足上比671B规模模型高出7.7个百分点。我们微调的7B参数模型在时间线摘要上达到76.2%的F1分数,略超DeepSeek-R1-671B的零样本性能(76.1% F1),而仅使用其1%的参数——表明领域特定优化能够以大幅降低的计算成本实现质量相当的生产就绪模型。百度搜索上的在线A/B测试验证了实际效果,与单任务基线相比,点击率提升5.5%,停留时间延长4.6%,探索深度增加4.4%。我们进一步证明时间线理解可迁移到热度预测,确认了对下游任务的有效知识迁移。

英文摘要

Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks-temporal ordering, causal judgment, and timeline completion-that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement learning-based event concise summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B parameter model achieves 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only 1% of its parameters-demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational costs. Online A/B tests on Baidu Search validate real-world effectiveness, showing 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.

2605.27062 2026-05-27 cs.CL cs.LG 版本更新

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

FalAR: 一个大规模说话人标注的欧洲葡萄牙语议会会议语音语料库

Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad

发表机构 * INESC-ID Instituto Superior Técnico(理工学院)

AI总结 为弥补欧洲葡萄牙语语音资源不足,构建了FalAR语料库,包含5800小时议会会议语音及说话人标注,实验表明作为预训练数据可使ASR词错误率相对降低14%。

Comments Published in LREC2026

详情
AI中文摘要

自动语音识别(ASR)的最先进性能在很大程度上依赖于大规模标注语料库的可用性。这增加了数据收集工作的需求,特别是对于代表性不足的语言和方言变体。由于欧洲葡萄牙语(EP)的说话人数量较少(约1100万),在目前可用的大规模语音数据资源中,它被巴西葡萄牙语(BP)(约2亿说话人)所掩盖,导致EP用户的语音系统性能不佳。为了弥补这一差距,并遵循其他语言的类似数据收集工作,我们提出了FalAR,一个大规模、说话人标注的欧洲葡萄牙语议会会议语音语料库。FalAR涵盖约20年,包含5800小时的语音数据。此外,4850小时具有说话人身份标注,总共1180个说话人,附带元数据包括年龄、性别、政治派别和议会角色。该语料库使用最先进的EP CAMÕES ASR模型进行转录参考对齐。在本文中,我们描述了数据收集过程以及FalAR语料库的主要特征。此外,我们评估了数据量和对齐准确性对ASR性能的权衡,实验表明,将FalAR作为预训练数据可以使基线模型的词错误率相对降低高达14%。

英文摘要

State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. In addition, 4,850 hours have speaker identity annotations, for a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.

2605.27050 2026-05-27 cs.CL cs.LG 版本更新

BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

BhashaSetu:一种以数据为中心的低资源机器翻译方法

Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta, Akshita Bhasin, Shrinivas Khedkar

发表机构 * Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai(孟买韦尔马塔·吉贾拜技术学院计算机工程与信息技术系) Tübingen AI Center, University of Tübingen, Germany(图宾根大学图宾根人工智能中心,德国)

AI总结 提出BhashaSetu数据集,通过大规模、多领域、形态感知的英-马拉地语平行语料库,并验证语料库级去重对低资源神经机器翻译质量的关键影响。

详情
AI中文摘要

我们提出了BhashaSetu,一个语言丰富的英语-马拉地语平行数据集,解决了低资源神经机器翻译(NMT)中持续存在的数据限制问题。马拉地语有超过9500万使用者,但在不同领域的高质量平行语料库中仍然代表性不足。我们的数据集包含来自新闻、政治、医疗、文学和文化等异构来源的278万个句子对,并提供了词干化和词形还原表示以支持形态感知分析。我们使用BLEU、spBLEU、chrF++和TER指标对多个最先进的翻译模型进行了基准测试,并使用LoRA对NLLB-200-distilled-600M进行了参数高效微调。我们消融实验的一个关键发现是:语料库级去重是预处理中对下游质量贡献最大的单一因素(去除它会使性能降低1.17 BLEU和2.21 chrF++),这表明对于低资源、形态丰富的语言,有纪律的跨源语料库卫生是一种低成本、高影响力的干预措施。该数据集已公开发布,以促进可重复且语言信息丰富的低资源NMT研究。

英文摘要

We present BhashaSetu, a linguistically enriched English--Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.

2605.27045 2026-05-27 cs.CL 版本更新

ExTax: Explainable Disinformation Detection via Persuasion, Emotion, and Narrative Role Taxonomies

ExTax:基于说服、情感和叙事角色分类学的可解释虚假信息检测

Shang Luo, Yingguang Yang, Zhenchen Sun, Yang Liu, Bin Chong, Jingru Chen, Yancheng Chen, Jiayu Liang, Kefu Xu, Hao Peng, Philip S. Yu

发表机构 * Peking University(北京大学) University of Science and Technology of China(中国科学技术大学) North China University of Science and Technology(华北理工大学) Tsinghua University(清华大学) Nanjing University of Aeronautics and Astronautics(南京航空航天大学) University of Chinese Academy of Sciences(中国科学院大学) Soochow University(苏州大学) Beihang University(北航) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出ExTax框架,统一说服修辞、情感操纵和叙事角色为17维分类空间,通过熵驱动动态标签平滑和多头注意力融合分类与上下文特征,实现可解释的虚假信息检测,在跨域基准上达到0.8456 Macro F1。

详情
AI中文摘要

LLMs的普及加速了高度流畅虚假信息的生成和传播,使得传统的句法语义验证越来越不足。这种欺骗很少仅依赖表面虚假;相反,它常常结合说服性修辞、情感操纵和叙事角色构建,通过多种认知途径影响读者的解读。然而,现有检测器通常强调孤立信号——如句法、外部知识、说服或情感线索——因此难以捕捉虚假信息背后的多方面操纵意图或提供人类可审计的解释。为填补这一空白,我们提出了 extbf{ExTax},一个面向分类学的可解释虚假信息检测框架。ExTax将说服修辞、情感操纵和叙事角色统一到17维分类空间中,涵盖6种说服修辞策略、5种情感操纵方法和6种叙事角色类别。它从多个前沿LLMs中提取属性,通过熵驱动动态标签平滑协调它们的分歧,并通过异构多头注意力将所得分类表示与上下文编码融合,将每个预测基于可解释的操纵画像。在五个跨领域和跨体裁基准上,ExTax实现了0.8456的整体Macro $F_1$,优于最先进的深度学习和基于LLM的基线。在严重的体裁不平衡下,最强的深度基线从0.9454降至0.6194,而ExTax保持稳健。

英文摘要

The democratization of LLMs has accelerated the generation and circulation of highly fluent disinformation, making traditional syntax-semantic verification increasingly insufficient. Such deception rarely relies solely on surface-level falsity; instead, it often combines persuasive rhetoric, emotional manipulation, and narrative role construction to influence readers' interpretations through multiple cognitive pathways. However, existing detectors typically emphasize isolated signals -- such as syntax, external knowledge, persuasion, or affective cues -- and therefore struggle to capture the multi-faceted manipulative intents underlying disinformation or provide human-auditable explanations. To address this gap, we present \textbf{ExTax}, a taxonomy-aligned framework for explainable disinformation detection. ExTax unifies persuasive rhetoric, emotional manipulation, and narrative roles into a 17-dimensional taxonomic space, covering 6 persuasive-rhetoric strategies, 5 emotional-manipulation methods, and 6 narrative-role categories. It elicits attributes from multiple frontier LLMs, reconciles their disagreements through Entropy-driven Dynamic Label Smoothing, and fuses the resulting taxonomic representations with contextual encodings via Heterogeneous Multi-Head Attention, grounding each prediction in an interpretable manipulation profile. Across five cross-domain and cross-genre benchmarks, ExTax achieves an overall Macro $F_1$ of $0.8456$, outperforming state-of-the-art deep learning and LLM-based baselines. It also remains robust under severe genre imbalance, where the strongest deep baseline degrades from $0.9454$ to $0.6194$.

2605.27033 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Tracing Computation Density in LLMs

追踪LLMs中的计算密度

Corentin Kervadec, Iuliia Lysova, Iuri Macocco, Marco Baroni, Gemma Boleda

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) ICREA

AI总结 提出s-Trace方法估计最优子图,发现LLM计算分为早期稀疏核心和后期密集细化两个阶段,且计算量与模型不确定性相关。

详情
AI中文摘要

基于Transformer的大型语言模型(LLMs)由数十亿个参数组成,这些参数排列在深度和宽度都很大的计算图中,但尚不清楚它们是否对所有输入都充分利用了全部容量。我们引入了s-Trace方法,以有效估计最能近似完整模型输出的大小为s的子图。通过这种方法,我们发现各种LLM中的计算组织成两个不同的阶段。一个主要由早期层节点组成的小子图可以重建完整模型输出分布的头部。添加更多节点(主要位于后期层,且越来越多地由注意力头组成)会导致近似完整输出分布的逐步细化。此外,我们发现每个输入所需的计算量与模型不确定性相关,并且更稀疏的子图编码浅层统计信息,例如单字频率。总体而言,我们的结果表明,有效的LLM计算中存在一致的模块化组织,其中稀疏的早期层核心提供粗略预测,然后通过后期层中更密集的计算进一步细化。

英文摘要

Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but it is not clear that they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates a full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. We find moreover that the amount of necessary computation per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.

2605.27030 2026-05-27 cs.CL 版本更新

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

分享更多,搜索更少:面向高效测试时间扩展的协作并行思考

Xinglin Wang, Hao Lin, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

发表机构 * School of Computer Science, Beijing Institute of Technology(北京理工大学计算机学院) Xiaohongshu Inc(小红书公司)

AI总结 提出一种无需训练的协作并行思考框架,通过在并行分支间共享搜索信息来减少冗余探索,从而在测试时间扩展中实现更优的准确率-延迟帕累托边界。

Comments Preprint

详情
AI中文摘要

测试时间扩展(TTS)通过分配额外的推理计算来探索解空间,从而增强大型语言模型的推理能力。然而,现有的并行TTS方法通常在搜索过程中保持分支隔离:中间发现保持分支私有,无法及时指导其他分支。这种信息隔离导致大量冗余探索,因为分支反复重新发现其他地方已有的信息,并且需要更多搜索步骤来收集做出正确回答所需的完整决策信息。为弥补这一差距,我们提出协作并行思考(CPT),一种无需训练的推理框架,能够在并行分支间实现搜索时信息共享。CPT从正在运行的分支中提取紧凑的中间信息,维护一个去重的查询级信息池,并通过输入上下文广播池条目,使得后续搜索步骤中的每个分支能够重用其他分支的发现,而不是重新发现相同信息。实验上,在HMMT和AIME基准测试上的结果表明,CPT在不同rollout预算和模型规模下,相比强基线建立了更强的准确率-延迟帕累托前沿,突显了搜索时协作作为高效并行TTS的一个有效方向。

英文摘要

Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated during search: intermediate discoveries remain branch-private and cannot guide other branches in time. This information isolation causes substantial redundant exploration, as branches repeatedly rediscover information already found elsewhere and require more search steps to collect complete decision information needed to reach correct answers. To bridge this gap, we propose \textbf{Collaborative Parallel Thinking (CPT)}, a training-free inference framework that enables search-time information sharing across parallel branches. CPT extracts compact intermediate information from ongoing branches, maintains a deduplicated query-level information pool, and broadcasts pool entries through the input context, allowing each branch in subsequent search steps to reuse discoveries made by other branches rather than rediscover the same information. Empirically, experiments on HMMT and AIME benchmarks show that CPT establishes a stronger accuracy--latency Pareto frontier than strong baselines across rollout budgets and model scales, highlighting search-time collaboration as an effective direction for efficient parallel TTS.

2605.27025 2026-05-27 cs.CL cs.MM 版本更新

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

基于属性的LLM与仇恨言论标注的对齐诊断

Mohammad Amine Jradi, Faeze Ghorbanpour, Alexander Fraser

发表机构 * School of Computation, Information and Technology, TU Munich(计算、信息与技术学院,慕尼黑技术大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 通过分析LLM在十个主观属性上的判断与人类标注的对齐情况,发现行为显式维度对齐良好而评价维度系统性反转,并提出基于置信度加权岭回归的属性组合方法,重构连续仇恨言论分数,R²达0.71。

详情
AI中文摘要

仇恨言论标注成本高昂、主观性强且容易产生标注者分歧,使得大规模数据集构建具有挑战性。我们系统分析了大型语言模型(LLM)在十个理论上基于主观属性(如去人性化、暴力和情感)上与人类判断的对齐程度,评估了Llama 3.1和Qwen 2.5的小型及大型变体。我们的分析揭示了所有模型的一致分裂:行为显式维度(侮辱、羞辱、攻击-防御)与人类标注高度相关,而评价维度(尊重、情感、仇恨言论)则系统性反转。人口统计角色条件化降低了模型置信度,但未改善对齐。基于这些发现,我们提出通过置信度加权岭回归组合属性级LLM预测,从测量仇恨言论语料库中重构连续仇恨言论分数,R²达到0.71,优于直接提示基线,表明结构化属性分解比端到端标签预测单独恢复出更丰富且更符合人类对齐的信号。

英文摘要

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus, achieving $R^2$ of up to 0.71 and outperforming direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.

2605.27016 2026-05-27 cs.CL cs.AI cs.LG stat.ML 版本更新

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

评估不确定性估计器与LLM幻觉的相关性

Yedidia Agnimo, Anna Korba, Annabelle Blangero, Nicolas Chesneau, Karteek Alahari

发表机构 * CREST, ENSAE Institut Polytechnique de Paris(CREST,巴黎高等理工学院) Ekimetrics France(法国Ekimetrics) Centre Inria de l’Université Grenoble Alpes(格勒诺布尔阿尔卑斯大学信息研究院)

AI总结 通过系统实证研究,评估信息论、基于采样和反思性等不确定性估计器与LLM幻觉之间的关联,发现关联性高度可变且通常较弱,挑战了将不确定性作为幻觉直接信号的做法。

Comments 35 pages, 7 figures, 9 tables

详情
AI中文摘要

大型语言模型(LLM)容易产生幻觉,即与输入或训练数据不符的陈述,阻碍了可靠部署。同时,许多不确定性估计(UE)方法被提出来量化模型置信度,并常被隐含地视为模型失败的代理。然而,不确定性与幻觉之间的关系尚未得到充分表征。我们对不确定性估计器与LLM幻觉之间的关联进行了系统的实证研究。我们不是假设这种关联,而是直接评估它在何时以及在多大程度上成立。我们考虑了多种不确定性估计器,包括信息论、基于采样和反思性估计器,并检查了它们在幻觉设置中的行为。我们的实验涵盖了内在幻觉(违反输入忠实性)和外在幻觉(相对于训练数据的无根据主张),使用了四个互补基准,包括RAGTruth和HalluLens。我们发现,这种关联性高度可变且通常较弱,取决于幻觉类型和所评估的LLM。这些结果挑战了将不确定性作为幻觉直接信号的做法,并阐明了何时它能提供可操作的信息。

英文摘要

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.

2605.27015 2026-05-27 cs.CL 版本更新

PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions

PersLitEval:波斯文学问题上的细粒度基准与LLM评估

Ruhallah Niazi, Faeze Ghorbanpour, Alexander Fraser

发表机构 * School of Computation, Information and Technology, TU Munich(计算信息与技术学院,慕尼黑工业大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 提出PersLitEval基准,包含4514道波斯文学多选题,评估六种LLM在十种提示策略下的表现,发现模型在概念相似性任务上准确率高,但在拼写和构词等正式语言分析上困难,且提示策略显著影响性能。

详情
AI中文摘要

尽管多语言能力令人印象深刻,但大型语言模型(LLM)在非英语语言的文学知识方面仍然缺乏充分评估。我们引入了PersLitEval,这是一个包含4514道波斯文学多选题的基准,涵盖拼写、修辞手法、语法、词汇、构词和概念理解等八个细粒度类别,题目来源于Konkur大学入学考试材料。我们评估了六种LLM在十种提示策略下的表现,揭示了三个难度层级上显著的类别差异:模型在概念相似性任务上准确率较高,但在正式语言分析上表现不佳,其中拼写和构词对所有模型来说都是最难的。提示策略对性能有显著影响,其中带解释的少样本示例效果最佳,尤其是在正式语言类别上。错误分析识别出三种失败模式:语义理解差距、正式语言知识差距以及计数/枚举错误,表明不同类别需要不同的改进策略。

英文摘要

Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice questions across eight fine-grained categories spanning spelling, literary devices, grammar, vocabulary, word formation, and conceptual understanding, sourced from materials for the Konkur university entrance examination. We evaluate six LLMs across ten prompting strategies, revealing striking category-level disparities across three tiers of task difficulty: models reach higher accuracy on conceptual similarity tasks but struggle with formal linguistic analysis, with spelling and word formation proving the hardest across all models. Prompting strategy has a significant impact on performance, with explained few-shot examples yielding the best results, particularly on formal linguistic categories. An error analysis identifies three failure modes: semantic comprehension gaps, formal linguistic knowledge gaps, and counting/enumeration errors, suggesting that different categories require different improvement strategies.

2605.26999 2026-05-27 cs.CL cs.CR 版本更新

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals

提示注入检测是依赖于场景的:一种基于可解释结构信号的部署感知评估

Akindoyin Akinrele, Shreyank N Gowda

发表机构 * School of Computer Science, University of Nottingham(诺丁汉大学计算机科学学院)

AI总结 本研究通过多模型、多场景的实验框架,评估了提示注入检测方法,发现检测性能高度依赖于部署场景和阈值选择,其中基于Transformer的模型表现最佳,结构信号在特定场景下提供适度但一致的改进。

详情
AI中文摘要

提示注入对大型语言模型的安全部署构成严重威胁,然而现有的检测方法通常在有限的设置下进行评估,未能反映真实世界的操作约束。在这项工作中,我们使用多模型和多场景实验框架,对提示注入检测进行了部署感知评估。我们比较了基于词汇、语义、结构和Transformer的检测器,在多个分布外设置、重复数据划分以及排名和阈值部署指标下的表现。我们引入了可解释的结构信号,这些信号捕捉了层次覆盖、系统提示欺骗、角色重定义和逃避模式,并评估了它们在稀疏模型中以及与强编码器基线结合时的贡献。我们的结果表明,检测性能高度依赖于场景,并且对阈值选择敏感,没有单一模型在所有设置中占据主导地位。基于Transformer的模型实现了最强的整体性能,而结构信号在特定场景下提供了适度但一致的改进,并在更困难的场景中改善了低假阳性率行为。这些发现凸显了排名性能与部署有效性之间的差距,并强调了在现实操作约束下评估提示注入防御的重要性。代码将发布。

英文摘要

Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. We introduce interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assess their contribution both within sparse models and in combination with strong encoder baselines. Our results show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings. Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low false positive rate behaviour in harder scenarios. These findings highlight the gap between ranking performance and deployment effectiveness and underscore the importance of evaluating prompt injection defences under realistic operational constraints. Code will be released.

2605.26978 2026-05-27 cs.CL cs.SD 版本更新

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

PashtoTTS-Bench:低资源非拉丁文字文本转语音的自动化筛选

Hanif Rahman

发表机构 * Independent Researcher(独立研究员)

AI总结 针对低资源非拉丁文字TTS评估中单一ASR往返WER的不足,提出INSV报告框架及其自动化筛选子集INSV-A,并实例化为PashtoTTS-Bench基准,通过多指标评估多个TTS系统。

详情
AI中文摘要

对于低资源非拉丁文字语言,当文本转语音(TTS)评估依赖于单一的ASR往返词错误率(WER)时可能会失败。系统可能不产生音频、说出邻近语言、仅在ASR转录中保留目标文字脚本,或者对母语者来说听起来不自然。我们引入了INSV(可懂度、自然度、脚本保真度和验证)报告框架,将这些情况分开。本文报告了INSV-A,即自动化筛选子集:合成完成度、ASR WER/CER、转录脚本保真率和音频语言识别。原生MOS和语音标注已指定但未在此版本中声明。我们将INSV-A实例化为PashtoTTS-Bench,一个针对普什图语TTS的带日期基准。2026年4月至5月的运行评估了Edge GulNawaz、Edge Latifa、OmniVoice clone、OmniVoice auto和一个乌尔都语阴性对照,使用200个FLEURS和200个过滤后的Common Voice 24提示。在独立的omniASR_CTC_300M_v2下,OmniVoice auto的WER最低(FLEURS 24.1%,CV24 27.4%),其次是Edge GulNawaz(32.8%,39.5%)、Edge Latifa(35.6%,47.7%)和OmniVoice clone(45.4%,34.8%)。低于自然语音基线的WER反映了干净的合成音频,不应被解读为优于原生语音。Whisper Large V3在检查的普什图语TTS音频上返回0.0%的普什图语标签,而MMS-LID-4017和SpeechBrain VoxLingua107将普什图语输出与乌尔都语对照区分开。该版本提供了提供者元数据、每句分数、LID审计、失败日志和用于添加系统的脚本。

英文摘要

Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.

2605.26969 2026-05-27 cs.CL cs.AI 版本更新

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Recon:基于重建指导的推理合成用于用户建模

Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou, Lisa Dunlap, Narges Norouzi, Joseph E. Gonzalez

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Recon方法,通过动作重建分数评估推理轨迹的预测能力,以改进用户建模中的推理合成,在多个领域优于事后合理化基线。

详情
AI中文摘要

用户建模旨在使用语言模型(LM)从过去的上下文-动作对(例如对话轮次)语料库中模拟个体的行为,从而在行为科学、人机协作和市场研究等环境中模拟用户。最近的方法通过合成推理轨迹来扩充这些语料库,通常通过同时以上下文和动作为条件生成。然而,这种条件构成事后合理化而非推理:轨迹保证证明动作的合理性,但可能不编码潜在的潜在因果决策路径。我们提出Recon,它使用动作重建通过预测能力对推理轨迹进行评分:给定上下文和候选推理,重建模型预测动作,重建保真度决定推理质量。在四个领域,Recon相对于标准事后合理化基线Backward Synthesis实现了54.7%的胜率。此外,我们发现使用来自Recon的奖励训练推理合成模型可提高下游用户建模性能,相对于基线实现了高达70.0%的胜率。我们进一步表明,Recon合成的推理可跨模型迁移,并改善重建模型之外的用户建模。我们的工作表明,事后合理化对于推理合成是不够的,有用且可解释的推理应自然地从上下文中引出动作。

英文摘要

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.

2605.26958 2026-05-27 cs.CL cs.AI 版本更新

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Tournament-GRPO:面向开放式长文本生成强化学习的群组锦标赛奖励

Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China(中国人民大学) University of Southern California(南加州大学) Zhejiang University(浙江大学) Xiaohongshu Inc.(小红书公司)

AI总结 针对开放式长文本生成中缺乏可靠参考答案和自动评估指标的问题,提出Tournament-GRPO框架,通过同一查询生成结果间的多轮锦标赛比较将基于规则的LLM评判转化为相对奖励,在Deep Research Bench上取得4.52分提升。

详情
AI中文摘要

开放式长文本生成中的强化学习具有挑战性,因为可靠的参考答案和自动评估指标通常不可用。现有的基于规则的方法通常依赖于逐点的LLM作为评判的评分,但绝对分数难以在复杂响应间校准,可能对同一查询的生成结果提供弱区分度,并在优化过程中饱和。我们提出Tournament-GRPO,一种群组奖励框架,通过同一查询生成结果间的重复多轮锦标赛将基于规则的LLM评判转化为相对奖励。Tournament-GRPO在群组内比较候选结果,累积锦标赛结果,并将其归一化为用于GRPO训练的群组奖励。在Deep Research Bench上的实验表明,Tournament-GRPO持续优于现有的奖励设计基线,在最强基线上实现了4.52分的整体分数提升。进一步分析表明,锦标赛奖励提供了有利的有效性-效率权衡,并且锦标赛设计影响训练动态。这些结果表明,基于规则的锦标赛比较为开放式长文本生成中的强化学习提供了有效的奖励信号。

英文摘要

Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness--efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.

2605.26956 2026-05-27 cs.AI cs.CL 版本更新

LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

LELA: 一种基于LLM的端到端实体链接框架,支持零样本领域自适应

Samy Haffoudhi, Nikola Dobričić, Fabian Suchanek, Nils Holzenberger

AI总结 本文提出LELA,一种基于大语言模型的模块化、领域无关的实体消歧方法,并扩展为实用的Python库,集成零样本命名实体识别,实现端到端实体链接,实验验证其跨领域性能与鲁棒性。

详情
Journal ref
35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026), IJCAI (International Joint Conferences on Artificial Intelligence), Aug 2026, Bremen (DE), Germany
AI中文摘要

实体链接是许多下游NLP系统的关键组件,但现有方法通常依赖于特定的目标知识库和领域,限制了其实际应用。在本文中,我们将LELA(一种模块化且领域无关的基于LLM的实体消歧方法)扩展为一个实用的Python库,该库集成了零样本命名实体识别(NER),从而为实际使用中的实体链接提供了完整的端到端流水线。我们提供了实验结果,验证了LELA在不同实体链接设置下的性能和鲁棒性。在我们的演示中,用户可以在自己的输入文本上试用该系统。

英文摘要

Entity linking is a key component of many downstream NLP systems, yet existing approaches are often tied to the specific target knowledge bases and domains, limiting their real world application. In this paper, we extend LELA, a modular and domain-agnostic LLM-based entity disambiguation method, into a practical Python library that integrates zero-shot Named Entity Recognition (NER) -thereby providing a complete end-toend pipeline for entity-linking in real-world usage. We provide experimental results validating LELA's performance and robustness across diverse entity linking settings. In our demo, users can play with the system on their own input texts.

2605.26955 2026-05-27 cs.CL cs.AI 版本更新

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

JuICE:评估LLM裁判识别文化错误的基准

Jiho Jin, Junho Myung, Juhyun Oh, Junyeong Park, Rifki Afina Putri, Sunipa Dev, Vinodkumar Prabhakaran, Alice Oh

发表机构 * KAIST(韩国科学技术院) Google(谷歌) Universitas Gadjah Mada(加查马达大学)

AI总结 提出JuICE基准,包含7470个文化语言错误标注的多语言数据集,用于评估LLM裁判在长文本中识别深层文化错误的能力,发现最强模型F1仅0.52且常遗漏本地人易识别的错误。

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地部署给全球用户,它们被整合到不同文化背景下的日常任务中,从起草个人通信到头脑风暴创意想法。这些任务本质上是文化性的:它们需要语境适当性、象征共鸣以及母语者本能依赖的隐性文化期望,这意味着一个回答可能在事实上合理,但对本地读者来说却明显错误。现有的文化基准通过事实验证或规范蕴含方法将文化视为一组扁平的事实,并采用LLM作为裁判,而未检查它们是否能捕捉到这种深层的文化错误。为填补这一空白,我们提出了JuICE(LLM裁判识别文化错误基准),这是一个多语言数据集,包含7470个跨度级别的文化语言错误标注,涵盖来自四个国家(美国、韩国、印度尼西亚和孟加拉国)的1050个查询-响应对,使用英语和这些国家的主要语言。利用JuICE,我们发现即使是最强的LLM裁判在错误跨度检测任务中也仅达到0.52的F1分数。此外,LLM裁判始终会遗漏本地居民容易识别的深层文化错误。我们的研究结果表明,稳健的文化评估必须超越表面级别的检测,转向考虑文化意义的深度和情境性的框架。

英文摘要

As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctively, meaning that a response can be factually plausible yet unmistakably wrong to a local reader. Existing cultural benchmarks have treated culture as a flat set of facts via fact verification or norm entailment methods, and have adopted LLM-as-a-Judge without examining whether they can capture such thick cultural errors. To address this gap, we present JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a multilingual dataset of 7,470 span-level annotations of cultural and linguistic errors in long-form LLM responses. It covers 1,050 query-response pairs from four countries (the United States, South Korea, Indonesia, and Bangladesh), in both English and their countries' main languages. Using JuICE, we find that even the strongest LLM-judge achieves only an F1 of 0.52 in the erroneous span detection task. Furthermore, LLM-judges consistently miss thick cultural errors that local residents readily identify. Our findings suggest that robust cultural evaluation must move beyond surface-level detection toward frameworks that account for the depth and situatedness of cultural meaning.

2605.26952 2026-05-27 cs.CL 版本更新

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

基于策略内在知识边界增强的高效智能体强化学习

Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang

发表机构 * Tencent Inc(腾讯公司) The Chinese University of Hong Kong(香港中文大学) Shenzhen MSU-BIT University(深圳MSU-BIT大学)

AI总结 提出AKBE方法,通过双路径(有工具和无工具)在线策略训练动态探测模型内在知识边界,构建针对性监督信号,在保持准确率的同时减少工具调用。

详情
AI中文摘要

智能体强化学习已被证明对于训练具有外部工具使用能力的基于LLM的智能体是有效的。然而,我们发现智能体强化学习训练会导致冗余工具调用增加,并模糊模型的内在知识边界,即模型无法区分何时需要工具以及何时参数化知识足够。现有的基于奖励塑形的解决方案创建了粗粒度的优化目标,倾向于激励不加区分的工具调用抑制,导致奖励黑客行为。在本文中,我们提出AKBE(智能体知识边界增强),一种在线策略方法,通过在训练期间进行双路径(有工具和无工具)展开来动态探测模型的内在知识边界。我们将知识边界定义为每个实例是否需要工具以及所需的最小工具调用次数。通过比较各路径的正确性,AKBE对轨迹进行分类并构建针对性的监督信号,为每个问题引导高效的工具使用模式。这些信号无缝集成到智能体强化学习训练循环中。在七个QA基准上的实验表明,与标准智能体强化学习相比,AKBE平均任务准确率提高+1.85,工具调用减少18%,工具生产率提高25%,且没有任何准确率-效率权衡。进一步分析表明其在不同RL算法上的即插即用兼容性以及每个信号类别的机制。我们的代码可在https://github.com/CuSO4-Chen/AKBE获取。

英文摘要

Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fails to distinguish when tools are needed versus when parametric knowledge suffices. Existing solutions based on reward shaping create coarse-grained optimization targets that tend to incentivize indiscriminate tool-call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on-policy method that dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) rollouts during training. We define the knowledge boundary as the per-instance determination of whether tools are required and the minimum tool calls necessary. By comparing correctness across paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide efficient tool-use patterns for each question. These signals are integrated seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks demonstrate that AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18% over standard agentic RL, yielding 25% higher tool productivity without any accuracy-efficiency trade-off. Further analysis suggests its plug-and-play compatibility across different RL algorithms and the mechanism of each signal category. Our code is available at https://github.com/CuSO4-Chen/AKBE.

2605.26940 2026-05-27 cs.CL 版本更新

Accountable Human-AI Deliberation with LLMs: Scaling Collective Intelligence through Symbiotic Scaffolding

负责任的基于LLM的人机协商:通过共生脚手架扩展集体智能

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 提出一个三层共生人机框架,通过多样性放大、条款级溯源和人类主导批准,在扩展集体智能的同时保持主体性和合法性。

Comments Accepted at the LREC 2026 / 2nd Workshop on Language-driven Deliberation Technology

详情
AI中文摘要

大型语言模型(LLM)可以在以前受轮流发言和引导带宽限制的规模上支持民主协商。最近的研究表明,LLM生成的群体陈述通常比人类中介的输出更受欢迎,而理论分析认为LLM放松了限制集体智能的同时性约束。然而,纯LLM中介存在使多元性崩溃、过度优化一致性以及当参与者无法质疑其如何被代表时损害合法性的风险。我们提出了一个共生的人机框架,分为三个层次:观察与多样性放大、具有条款级溯源的引导、以及人类优先批准。我们的贡献包括:具有显著性加权分级覆盖、多样性和擦除度量;结合交叉编码器相似性与因果剔除诊断的溯源管道;偏好条件权衡控制;公平感知的可争议工作流;对抗性鲁棒性测试;以及基于LLM作为评判者局限性证据的消融设计评估协议。结果是一个可测试的协商技术蓝图,能够在扩展集体智能的同时保持主体性和合法性。

英文摘要

Large language models (LLMs) can support democratic deliberation at scales previously constrained by turn-taking and facilitation bandwidth. Recent work shows that LLM-generated group statements are often preferred over human-mediated outputs, while theoretical analyses argue that LLMs relax the simultaneity constraints limiting collective intelligence. Yet pure LLM mediation risks collapsing pluralism, over-optimizing for agreement, and undermining legitimacy when participants cannot contest how they are represented. We propose a symbiotic human-AI framework organized into three layers: observation and diversity amplification, facilitation with clause-level provenance, and human primacy for ratification. Our contributions include graded coverage, diversity, and erasure metrics with salience-aware weighting; a provenance pipeline combining cross-encoder similarity with causal knockout diagnostics; preference-conditioned trade-off control; equity-aware contestability workflows; adversarial robustness tests; and an evaluation protocol with ablation designs informed by evidence of LLM-as-judge limitations. The result is a testable blueprint for deliberation technology that scales collective intelligence while preserving agency and legitimacy.

2605.26937 2026-05-27 cs.CL cs.AI 版本更新

Beyond Questions: Evaluating What Large Language Models (Actually) Know

超越问题:评估大型语言模型(实际)知道什么

Luca Giordano, Simon Razniewski

发表机构 * ScaDS.AI Dresden/Leipzig & TU Dresden, Germany(ScaDS.AI 德尔布兰德/莱比锡及德累斯顿技术大学,德国)

AI总结 提出开放知识评估新范式,通过开放式提示(如“告诉我关于M.L. King的一切”)评估模型自然表达的知识,并构建BeQu基准测试10,000个实体。

详情
AI中文摘要

大型语言模型(LLM)中的参数化知识是其成功的基石,但仍未被充分理解。现有的知识基准通常依赖于预定义的问题(例如,“M.L. King的出生日期是什么?”),仅评估基准设计者明确选择查询的知识,这是一种有问题的可用性偏差。在本文中,我们引入了开放知识评估,这是一种用于LLM知识基准测试的新范式。它不提出狭隘的问题,而是评估模型在响应开放式引发提示(例如,“告诉我关于M.L. King的一切”)时选择呈现的知识。这将焦点从预定义的答案检索转向表征模型自然表达的知识。我们用BeQu(超越问题)实例化这一范式,这是一个包含10,000个实体并配有用于陈述验证的参考语料库的基准。使用BeQu,我们评估了广泛的语言模型,并分析了推理努力、模型规模、提示格式和知识领域的影响。数据和排行榜可在此工作的GitHub仓库和基准网站上获取。

英文摘要

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?"), evaluating only knowledge that benchmark designers explicitly choose to query, a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.

2605.26935 2026-05-27 cs.CL 版本更新

DunbaaBERT: From Sacrifice to Semantics

DunbaaBERT: 从牺牲到语义

Iffat Maab, Waleed Jamil, Raphael Schmitt

发表机构 * Research and Development Center for Large Language Models (LLMC), National Institute of Informatics, Tokyo(大型语言模型研发中心(LLMC),信息研究所,东京) School of Computation, Information and Technology, Technical University of Munich, Germany(计算、信息与技术学院,慕尼黑技术大学,德国) Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg, Germany(临床医学系,弗赖堡大学医学院及医学中心,德国)

AI总结 本文提出DunbaaBERT,一种从零训练的乌尔都语RoBERTa-base模型族,通过不同词汇表大小在17GB语料上预训练,在多项下游任务中达到与强多语言基线相当的性能,并发现较大词汇表并不持续提升效果。

详情
AI中文摘要

大型语言模型在许多自然语言处理任务中取得了强劲性能,但由于资源有限和评估设置碎片化,乌尔都语仍相对未被充分探索。为填补这一空白,我们引入了DunbaaBERT,一个乌尔都语RoBERTa-base模型族,在去重后的17GB乌尔都语语料库上使用32k、52k和96k token的Byte-BPE词汇表从头训练。我们在内在和下游乌尔都语自然语言处理基准上评估DunbaaBERT,涵盖语言可接受性、新闻分类、攻击性语言检测和情感分析,同时分析词汇表大小对性能和效率权衡的影响。在各项基准中,DunbaaBERT变体与强多语言基线相比取得了有竞争力的性能,同时始终保持有利的效率权衡。有趣的是,较大的词汇表并不持续提升下游效果,DunbaaBERT$_{\text{32k}}$反复提供最强的整体效率概况。总体而言,我们的结果表明,尽管模型和训练规模相对紧凑,精心策划的乌尔都语特定编码器模型仍能保持高度竞争力。所有模型均在MIT许可下发布。

英文摘要

Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT$_{\text{32k}}$ repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.

2605.26934 2026-05-27 cs.CL cs.AI 版本更新

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

推理深度与环境复杂度:逻辑推理任务中RLVR数据分配的受控研究

Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira

发表机构 * Kyoto University(京都大学) University of Tokyo(东京大学) NII LLMC(日本信息处理学会LLMC) RIKEN(理化学研究所)

AI总结 通过将推理空间划分为深度和复杂度两个维度,并考虑四种推理形式,在合成知识图谱环境中进行受控实验,发现联合深度-复杂度覆盖优于单轴策略,不同推理家族对RLVR覆盖的反应非均匀,且均匀混合优于分阶段课程。

Comments Pre-print

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为后训练推理模型的核心,但现有研究的一个关键局限在于对推理空间的狭隘视角:难度仅被视为推理深度,奖励集中在正向演绎状态追踪。相反,我们沿两个维度刻画推理空间。难度:除了推理深度,我们研究环境复杂度,即模型必须在干扰项和交互结构中识别正确路径。奖励推理形式:我们考虑现实世界推理核心的四种能力:演绎状态追踪、对隐藏事件或事实的溯因恢复、归纳规则归纳以及类比迁移。为解耦这些因素,我们构建了一个合成知识图谱环境,具有受控的预训练和后训练分布,其中每个实例在深度、复杂度和任务家族上变化。三个发现:联合深度-复杂度覆盖优于单轴策略;推理家族反应非均匀,溯因推理在RL覆盖区域外退化,任务相关性聚类为演绎-溯因对和归纳-类比对;在固定预算下,均匀混合优于分阶段课程。我们还发现,最近的现成模型表现出相同的演绎-溯因不对称性,表明这一差距并非我们受控设置的假象。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. Difficulty. Beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. Rewarded reasoning form. We consider four abilities core to real-world reasoning: deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.

2605.26924 2026-05-27 cs.CL 版本更新

Learning to Adapt SFT Data for Better Reasoning Generalization

学习适应SFT数据以实现更好的推理泛化

Lisong Sun, Li Wang, Chen Zhang, Jinyang Wu, Kui Zhang, Tianhao Peng, Wenjun Wu

发表机构 * Beihang University(北京航空航天大学) Tsinghua University(清华大学) Nanyang Technological University(南洋理工大学) Hangzhou International Innovation Institute(杭州国际创新研究院) Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing(北京未来区块链与隐私计算高级创新中心)

AI总结 提出DART方法,通过强化学习训练映射器将分布不匹配的SFT数据转化为模型自适应的监督,提升推理泛化能力。

详情
AI中文摘要

大型语言模型(LLMs)取得了显著进展,其中后训练在增强其推理能力方面起着关键作用。在后训练范式中,监督微调(SFT)被广泛使用:它利用外部数据提供密集监督并实现高效训练。然而,当数据分布与目标模型自身分布不匹配时,直接在专家数据上微调可能会损害泛化能力。在这项工作中,我们提出了推理调优的数据适应(DART),它将使用固定且可能分布不匹配的SFT数据集表述为对演示转换的优化问题。DART使用强化学习训练一个映射器模型,将原始SFT数据转换为与目标模型分布和学习偏好更匹配的模型自适应监督。转换后的数据随后用于SFT,使目标模型能够更好地利用外部监督。在多个模型和数据集上的实验表明,DART提高了泛化能力,实现了比直接RL更高的训练效率,并帮助模型超越标准SFT。我们的代码可在https://anonymous.4open.science/r/DART525E50D获取。

英文摘要

Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at https://anonymous.4open.science/r/DART525E50D.

2605.26918 2026-05-27 cs.CL 版本更新

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

视频模型是教育领域的零样本学习者和推理者吗?EduVideoBench:面向教育视频生成的知识-技能-态度基准

Unggi Lee, Hoyoung Ahn, Yoon Choi, Seonmin Eun, Jahyun Jeong, Seonmin Jin, Harmony Jung, Hye Jin Kim, Chaerin Lee, Hyunji Lee, Jeongjin Lee, Soohwan Lee, Young-Seok Oh, Jaehyeon Park, Sun-ok Ryu, Sunyoung Shin, Yoorim Son, Haeun Park, Yeil Jeong

发表机构 * Korea University Sejong Campus(韩国大学世宗校区) Cardiff Metropolitan University(卡迪夫 Metropolitan 大学) Seoul National University(首尔国立大学) Bugil Academy(Bugil 学院) Gyeonggi Provincial Office of Education(京畿省教育厅) Loughborough University(洛桑大学) Korea National University of Education(韩国教育大学) Korea University(韩国大学) Korean Educational Development Institute(韩国教育发展院) Sungshin Women’s University(顺天堂女子大学) Seoul National University of Education(首尔国立教育大学) Korea Institute for Curriculum and Evaluation(韩国课程评价院) Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 提出基于知识-技能-态度框架的教育视频生成基准EduVideoBench,评估五个前沿视频生成模型在教育有效性上的不足,并发现教育有效性是多维度的,单一元素不匹配即可使视频失效。

详情
AI中文摘要

视频生成模型(VGMs)正迅速进入课堂,然而现有基准仅评估感知质量、内在忠实性、通用安全性或将视频作为推理媒介,没有评估输出是否具有教育有效性。在这项工作中,我们提出了EduVideoBench,这是教育领域第一个平衡的基准,基于知识-技能-态度(KSA)框架,使得教学充分性和教育安全性被联合评估,而非作为临时的质量维度。在五个前沿VGMs上,我们的结果显示,在知识、技能和态度方面,它们距离课堂准备就绪还有很大的改进空间。我们辅以专家评论的定性分析,发现教育有效性是多维度的,单个不匹配的元素(如节奏、可读性或符号)可能使原本正确的视频失效。我们希望EduVideoBench能够指导开发教学上合理且课堂安全的VGMs。

英文摘要

Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally valid. In this work, we present EduVideoBench, the first balanced benchmark in the education domain, grounded in the Knowledge-Skills-Attitude (KSA) framework so that pedagogical adequacy and educational safety are evaluated jointly rather than as ad-hoc quality dimensions. Across five frontier VGMs, our results show substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready. We complement this with a qualitative analysis of expert comments, finding that educational validity is multi-component, where a single misaligned element such as pacing, legibility, or notation can invalidate an otherwise correct video. We hope EduVideoBench will guide the development of VGMs that are pedagogically grounded and safe for the classroom.

2605.26893 2026-05-27 cs.CL cs.AI 版本更新

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

GeoFaith: 时空双视角下的忠实思维链

Weijiang Lv, Wentong Zhao, Jiayu Wang, Yuhao Wu, Jiaheng Wei, Xiaobo Xia

发表机构 * Xidian University(西安电子科技大学) Xi’an Jiaotong University(西安交通大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) University of Science and Technology of China(中国科学技术大学)

AI总结 针对思维链推理中的事后合理化问题,提出基于潜在几何结构和熵动力学的时空框架GeoFaith,通过可扩展的引导流水线构建忠实性检测器并联合优化结果正确性、过程忠实性和轨迹一致性。

详情
AI中文摘要

思维链推理推动了大型语言模型的发展,但基于结果的监督导致了普遍的事后合理化,产生了看似合理但不忠实的推理链。大多数先前的忠实性评估方法要么不可扩展、昂贵,要么不可靠。我们提出GeoFaith,一个利用潜在几何结构和熵动力学来诊断和强制忠实推理的时空框架。我们开发了一个可扩展的引导流水线,将四个领域的步骤级标注从1k扩展到20k样本,训练了一个在标准基准上优于GPT-5的8B忠实性检测器,并设计了一个忠实性感知的强化学习框架,联合优化结果正确性、过程忠实性和轨迹一致性。实验表明,所提出的方法在忠实性检测和下游推理上均取得了优越性能,生成了更短、更可解释的链,且不牺牲准确性。我们的代码将公开提供。

英文摘要

Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc rationalization, producing plausible yet unfaithful reasoning chains. Most prior faithfulness assessment methods are either unscalable, expensive, or unreliable. We propose GeoFaith, a spatio-temporal framework that leverages latent geometric structure and entropy dynamics to diagnose and enforce faithful reasoning. We develop a scalable bootstrapping pipeline expanding step-level annotations from 1k to 20k samples across four domains, train an 8B faithfulness detector outperforming GPT-5 on standard benchmarks, and design a faithfulness-aware reinforcement learning framework jointly optimizing outcome correctness, process faithfulness, and trajectory consistency. Experiments show the proposed method achieves superior performance on both faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without sacrificing accuracy. Our code will be made available publicly.

2605.26849 2026-05-27 cs.CL 版本更新

Uncertainty-Aware Budget Allocation for Adaptive Test-Time Reasoning

不确定性感知的自适应测试时推理预算分配

Manh Nguyen, Sunil Gupta, Hung Le

发表机构 * Applied Artificial Intelligence Initiative(应用人工智能计划)

AI总结 提出不确定性感知预算分配(UAB)框架,通过基于每问题不确定性的凹整数优化重新分配固定采样预算,无需额外推理成本,在多个推理基准上提升准确率高达3-5%。

详情
AI中文摘要

采样多个响应可以改善语言模型的推理能力,但均匀的计算分配效率低下:简单问题被过度采样,而困难问题探索不足。我们提出不确定性感知预算分配(UAB),这是一个凹整数优化框架,基于每问题的不确定性重新分配固定采样预算,且无需额外推理成本。在第一阶段,每个问题生成一个响应;其平均负对数似然(ANLL)直接从输出对数概率中提取,作为难度信号,同时该生成贡献于最终投票。在第二阶段,剩余预算通过边际贪心算法分配,该算法精确求解凹覆盖最大化替代问题:不确定的问题获得更多采样预算,而确定的问题获得更少的额外样本。在六个开源和黑盒模型(参数规模从1.5B到27B)以及五个涵盖数学、逻辑和偏好任务的推理基准上评估,UAB在平均准确率上比基线高出最多3%,在单个基准上高出最多5%,在低资源设置下增益最大,且无需辅助模型或额外的LLM调用。代码公开于 https://github.com/manhitv/UAB。

英文摘要

Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over-sampled while hard questions remain under-explored. We propose Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log-likelihood (ANLL), extracted directly from output log-probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open-weight and black-box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low-resource settings, requiring no auxiliary model or additional LLM call. Code is publicly available at https://github.com/manhitv/UAB.

2605.26842 2026-05-27 cs.LG cs.CL 版本更新

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

MONA: 基于Nesterov加速的Muon优化器用于可扩展语言模型训练

Jiacheng Li, Jianchao Tan, Hongtao Xu, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai

发表机构 * Meituan(美团)

AI总结 提出MONA优化器,通过将Nesterov加速项集成到Muon的梯度处理流程中,实现曲率感知加速,从而帮助逃离尖锐局部最小值,并在1B到68B参数的混合专家预训练中取得更优收敛和下游任务性能。

详情
AI中文摘要

Muon优化器最近为大型语言模型训练提供了一种有希望的AdamW替代方案,利用矩阵正交化产生几何感知更新。然而,与所有一阶方法一样,Muon可能会陷入尖锐的局部最小值。在这项工作中,我们提出了MONA,一种将Muon的正交化框架与曲率感知加速相结合的优化器。MONA直接将加速项添加到Muon的梯度处理流程中。该加速项根据梯度差异的指数移动平均计算得出。我们提供了MONA的详细收敛性分析,表明加速项能够在保持Muon谱范数正则化的同时逃离尖锐最小值。实验上,在从1B到68B参数的三个规模的混合专家预训练中(最大模型在1万亿tokens上训练),MONA在收敛性和下游任务性能上均优于Muon和AdamW。此外,我们在MOE-68B-A3B模型上进行了监督微调,并在通用能力、数学推理和代码生成基准上评估,MONA达到了最先进的性能。

英文摘要

The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term directly into Muon's gradient processing pipeline. This term is calculated from the exponential moving average of gradient differences. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves SOTA performance.

2605.26840 2026-05-27 cs.CL 版本更新

Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics

通过来自多个不完美指标的偏好学习优化摘要的事实一致性

Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

发表机构 * Intelligent System Laboratory(智能系统实验室)

AI总结 提出一种自动化训练流程,通过聚合多个弱事实性指标的分数并映射为偏好,过滤高分歧样本,利用词汇相似摘要对进行偏好学习,从而提升摘要的事实一致性。

Comments EMNLP 2025 Findings

详情
AI中文摘要

使用评估指标作为奖励的强化学习被广泛用于增强语言模型的特定能力。然而,对于事实一致性摘要等任务,现有指标仍不完善,限制了其作为塑造模型行为的信号的有效性。虽然单个事实性指标不可靠,但它们的组合可以更有效地捕捉多样的事实错误。我们利用这一见解,引入了一种自动化训练流程,通过聚合来自不同弱指标的分数来提高摘要的事实一致性。我们的方法通过将分数映射到偏好并过滤掉指标之间高度不一致的情况,避免了复杂的奖励塑造。对于每个源文档,我们通过改变解码策略生成词汇相似的摘要对,使模型能够从由细微词汇差异引起的事实差异中学习。这种方法仅使用源文档构建高质量偏好数据集。实验表明,从早期的编码器-解码器架构到现代大型语言模型,模型均获得一致的事实性提升,较小的模型能达到与较大模型相当的事实性。

英文摘要

Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped, limiting their effectiveness as signals for shaping model behaviour.While individual factuality metrics are unreliable, their combination can more effectively capture diverse factual errors. We leverage this insight to introduce an automated training pipeline that improves factual consistency in summaries by aggregating scores from different weak metrics. Our approach avoids the need for complex reward shaping by mapping scores to preferences and filtering out cases with high disagreement between metrics. For each source document, we generate lexically similar summary pairs by varying decoding strategies, enabling the model to learn from factual differences caused by subtle lexical differences. This approach constructs a high-quality preference dataset using only source documents.Experiments demonstrate consistent factuality gains across models, ranging from early encoder-decoder architectures to modern large language models, with smaller models reaching comparable factuality to larger ones.

2605.26827 2026-05-27 cs.CL cs.AI 版本更新

ContextGuard: Structured Self-Auditing for Context Learning in Language Models

ContextGuard: 语言模型中上下文学习的结构化自我审计

Hongbo Jin, Chi Wang, Haoran Tang, Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, Jiayu Ding

发表机构 * Peking University(北京大学) SCUT(上海交通大学) Tsinghua University(清华大学)

AI总结 提出ContextGuard框架,通过结构化自我审计机制使大语言模型在复杂上下文任务中忠实遵循所有上下文约束,包括外围、持久和格式敏感要求。

详情
AI中文摘要

最近的基准测试揭示,尽管大语言模型(LLMs)具有强大的推理能力,但在忠实应用复杂上下文知识方面仍存在困难。这些失败通常不是整体推理崩溃:在上下文丰富的任务中,模型可能遵循中心推理路径,同时遗漏外围、持久或格式敏感的要求。

英文摘要

Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply complex contextual knowledge. These failures are often not wholesale reasoning collapses: in context-rich tasks, models may follow the central reasoning path while missing peripheral, persistent, or format-sensitive requirements.

2605.26823 2026-05-27 cs.CL 版本更新

Generating Logically Consistent Synthetic Supply Chain Data with LLM-Driven Knowledge Graph Reasoning

基于LLM驱动知识图谱推理生成逻辑一致的合成供应链数据

Yunbo Long, Ge Zheng, Liming Xu, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 针对合成供应链数据需保持操作逻辑一致性的问题,提出TabKG框架,通过构建列关系知识图谱并利用多LLM集成验证关系,结合潜在扩散模型生成逻辑一致的表格数据。

详情
AI中文摘要

合成数据为供应链分析中两个长期存在的障碍(数据稀缺和数据隐私)提供了一种有前景的解决方案。然而,要使合成数据支持运营模拟和决策,它必须不仅再现真实记录的统计分布,还要保留支配供应链流程的\emph{操作逻辑},包括时间顺序、数学依赖、层次分类和条件规则,这些使记录在操作上合理。我们将这种逻辑视为供应链数据的“物理”。现有的表格生成模型主要针对分布保真度和下游预测效用进行优化,因此通常生成统计上看似真实但违反基本操作约束的记录。本文介绍了 extbf{ extit{TabKG}},一个知识图谱引导的框架,用于生成逻辑一致的合成供应链表格数据。TabKG构建了一个 extbf{ extit{列关系知识图谱(CR-KG)}}来表示数据操作依赖。它使用多LLM集成和多数投票从列元数据中提出候选关系,通过真实数据验证这些关系以去除幻觉或未支持的边,然后使用验证后的CR-KG指导生成。具体而言,TabKG将原始表压缩为独立列,使用潜在扩散模型生成这些列,并根据验证后的关系确定性地重建依赖列,从而通过构造强制与发现的操作规则保持逻辑一致性。

英文摘要

Synthetic data offers a promising solution to two persistent barriers in supply chain analytics: data scarcity and data privacy. However, for synthetic data to support operational simulation and decision-making, it must do more than reproduce the statistical distributions of real records, and also preserve the \emph{operational logic} that governs supply chain processes, including the temporal orderings, mathematical dependencies, hierarchical taxonomies, and conditional rules that make a record operationally plausible. We consider this logic as the ``physics'' of supply chain data. Existing tabular generative models are primarily optimized for distributional fidelity and downstream predictive utility, and therefore often generate records that appear statistically realistic but violate fundamental operational constraints. This paper introduces \textbf{\textit{TabKG}}, a knowledge-graph-guided framework for logically consistent synthetic supply chain tabular data generation. TabKG constructs a \textbf{\textit{Column Relationship Knowledge Graph (CR-KG)}} to represent data operational dependencies. It uses a multi-LLM ensemble with majority voting to propose candidate relationships from column metadata, validates these relationships against real data to remove hallucinated or unsupported edges, and then uses the validated CR-KG to guide generation. Specifically, TabKG compresses the original table into independent columns, generates these columns using a latent diffusion model, and deterministically reconstructs dependent columns according to the validated relationships, enforcing logical consistency by construction with respect to the discovered operational rules.

2605.26801 2026-05-27 cs.CL 版本更新

Psychological Constructs in Shared Semantic Space

共享语义空间中的心理构念

Hubert Plisiecki

发表机构 * IDEAS Research Institute(IDEAS研究院)

AI总结 本文提出一个框架,通过将心理构念表示为共享词嵌入空间中的方向,并使用监督语义微分从文本-结果关联中估计构念特定的语义梯度,从而实现跨不同测量工具和研究传统的心理构念的语义可比性。

详情
AI中文摘要

心理构念通常在不同的测量工具、数据集和研究传统中进行测量,这使得直接比较变得困难。本文提出了一个框架,通过将心理构念表示并比较为共享词嵌入空间中的方向,使这些构念在语义上具有可比性。使用监督语义微分,我们从文本-结果关联中估计构念特定的语义梯度,并将其投影到理论驱动的参考轴上。作为初始测试案例,我们使用效价、唤醒度和支配度(VAD)作为情感坐标系。首先,我们从英语词汇级情感规范中恢复可解释的VAD方向。其次,我们将27个GoEmotions类别的语义梯度投影到该空间中,并恢复预期的情感组织,特别是在效价和唤醒度维度上。第三,我们将相同程序应用于源自IPIP-NEO-300项目-因子关联的大五人格领域和子域。领域层面的定位大体一致,而子域层面的结果更具探索性,因为它们依赖于稀疏的问卷文本。结果表明,只要语义定位的稳定性和可解释性得到评估,嵌入空间可以支持在其他不可比较的心理测量之间进行构念层面的比较。

英文摘要

Psychological constructs are often measured in separate instruments, datasets, and research traditions, which makes direct comparison difficult. This paper proposes a framework for making such constructs semantically commensurate by representing and comparing them as directions in a shared word-embedding space. Using Supervised Semantic Differential, we estimate construct-specific semantic gradients from text-outcome associations and project them onto theoretically motivated reference axes. As an initial test case, we use Valence, Arousal, and Dominance (VAD) as an affective coordinate system. First, we recover interpretable VAD directions from English word-level affective norms. Second, we project semantic gradients for 27 GoEmotions categories into this space and recover the expected organization of emotions, especially along valence and arousal. Third, we apply the same procedure to Big Five personality domains and facets derived from IPIP-NEO-300 item-factor associations. Domain-level placements are broadly coherent, while facet-level results are more exploratory because they rely on sparse questionnaire text. The results suggest that embedding spaces can support construct-level comparison across otherwise incommensurable psychological measurements, provided that semantic placements are assessed for stability and interpretability.

2605.26797 2026-05-27 cs.LG cs.CL 版本更新

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

潜在循环Transformer:架构探索、训练策略与扩展行为

Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen

发表机构 * Microsoft(微软公司) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) University of Washington(华盛顿大学)

AI总结 提出潜在循环Transformer(LRT),通过跨层循环潜在路径重用前一token的高层隐藏状态作为记忆,在不增加暂停token或额外深度循环的情况下,以约2倍基线计算实现并行训练,在匹配有效计算下提升语言建模损失和上下文学习能力,仅增加0.3%参数。

详情
AI中文摘要

我们研究潜在循环Transformer(LRT),一种自回归Transformer的轻量级增强,它重用来自前一个token的高层源层隐藏状态作为下一个token的循环记忆。由于该源状态在普通解码过程中已经计算,LRT跨位置添加跨层循环潜在路径,无需插入暂停token或额外深度循环,并且保留了标准注意力机制和KV-cache接口。为了在不顺序展开Transformer的情况下大规模预训练这种循环,我们引入了交错并行训练:一次完整的全序列初始化前向传播构建共享缓冲区;然后不相交的位置子集并行细化并写回,使得所有token在约2倍基线计算下获得循环记忆感知的监督。在nanochat风格的主干网络和广泛的每参数token预算范围内,LRT在匹配有效计算下改进了语言建模损失和上下文学习,同时仅增加0.3%的参数。

英文摘要

We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source state is already computed during ordinary decoding, LRT adds a cross-layer recurrent latent pathway across positions without inserting pause tokens or extra depth loops, and the standard attention mechanism and KV-cache interface are preserved. To pretrain this recurrence at scale without sequentially unrolling the transformer, we introduce interleaved parallel training: a single full-sequence initialization forward pass builds a shared buffer; then disjoint position subsets are refined in parallel and written back, so that all tokens receive recurrent-memory-aware supervision at roughly 2 times baseline compute. Across nanochat style backbones and a wide range of tokens-per-parameter budgets, LRT improves both language-modeling loss and in-context learning under matched effective compute while adding as little as 0.3% parameters.

2605.26788 2026-05-27 cs.CL cs.AI 版本更新

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

SeDT: 基于句子变换器的决策变换器条件化用于多轮对话可靠性

Ramakrishna Vamsi Setti, Jagadeesh Rachapudi, Sachin Chaudhary, Praful Hambarde, Amit Shukla

发表机构 * Independent Researcher(独立研究者) Drone Lab, IIT Mandi(IIT曼迪无人机实验室) UPES, Dehradun(德里敦UPES)

AI总结 针对大语言模型在多轮对话中性能下降的问题,提出一种无需训练和额外数据的推理方法SeDT,通过引入离线强化学习中的return-to-go条件化,利用语义、词汇和位置信号计算累积相关性得分并注释对话历史,显著提升模型性能并降低不可靠性。

详情
AI中文摘要

大语言模型(LLMs)在单轮任务完全指定时表现令人印象深刻,但当相同任务在多轮中逐步揭示时,同一模型性能下降高达39%,这一现象在规模上被记录为“迷失在对话中”。关键的是,这种崩溃几乎完全是可靠性失败;最佳情况下,能力仅下降16%,而不可靠性增加超过一倍(+112%)。我们认为根本原因是结构性的:扁平化的对话历史对每个先前轮次赋予相等隐式权重,使模型无法区分关键约束与无关对话。我们提出SeDT(句子变换器-决策变换器),一种无需训练的推理时方法,通过从离线强化学习中引入return-to-go条件化来解决此问题。SeDT使用来自三种互补信号(语义、词汇和位置)的累积相关性得分注释每个对话片段,并在最后一轮向模型呈现完整的注释历史,无需权重更改、无需训练数据、无需丢弃上下文。在三个LLM和三个生成任务的Lost-in-Conversation基准上评估,SeDT在所有九个模型-任务组合中均优于分片基线,平均性能P提升高达+37.7%,同时在九个组合中的七个中降低了不可靠性。简而言之,告诉模型哪些过去的轮次重要足以显著恢复对话中丢失的性能。

英文摘要

Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to 39% of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as Lost in Conversation. Crucially, this collapse is almost entirely a reliability failure; the best case, the aptitude only falls 16%, while the unreliability more than doubles (+112%). We argue that the root cause is structural, a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT Sentence-transformer Decision-Transformer, a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary semantic, lexical, and positional signals and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark in three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains up to +37.7% in mean performance P and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.

2605.26785 2026-05-27 cs.CL cs.AI 版本更新

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

EmoDistill: 对抗性谈判中语言模型代理的离线情感技能蒸馏

Yunbo Long, Haolang Zhao, Lukas Beckenbauer, Liming Xu, Alexandra Brintrup

发表机构 * University of Cambridge(剑桥大学) Technical University of Munich(慕尼黑技术大学) Exiger LLC The Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出EmoDistill离线框架,通过隐式Q学习选择情感和低秩适应策略表达情感,蒸馏情感谈判技能到语言模型代理,在四个高风险谈判领域取得最高效用。

详情
AI中文摘要

后训练的LLM通常被优化以对齐响应与人类偏好,使其安全、礼貌且适合对话。然而,在对抗性谈判中,这种对齐可能成为漏洞:情感框架语言可能引导代理朝向对手方利益。使用基于GoEmotions的情感提示,我们表明情感显著改变谈判结果,表明情感是战略行动渠道而非表面风格。因此,我们引入 extbf{EmoDistill},一个用于将情感谈判技能蒸馏到语言模型代理中的离线框架。EmoDistill将情感策略分解为情感选择和情感表达:隐式Q学习(IQL)选择器学习表达\emph{哪种}情感,而基于低秩适应(LoRA)的策略通过监督微调(SFT)和裁判策略优化(JPO)学习\emph{如何}表达它。在四个情感敏感、高风险的谈判领域,在EmoDistill框架下训练的SLM策略实现了最高效用,优于普通SLM/LLM基线和仅IQL情感选择。消融实验表明情感条件化是必要的,迁移研究展示了跨领域、未见对手和训练对训练锦标赛的泛化能力。总体而言,EmoDistill从离线代理间交互中学习技能,避免了训练期间昂贵的在线谈判。

英文摘要

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce \textbf{EmoDistill}, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns \emph{which} emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns \emph{how} to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.

2605.26770 2026-05-27 cs.CL 版本更新

Quality Without Usefulness: LLM-Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids

质量而无用:LLM生成的XAI叙事作为信任启发而非决策辅助

Fabian Lukassen, Jan Herrmann, Christoph Weisser, Alexander Silbersdorff, Benjamin Saefken, Thomas Kneib

发表机构 * University of Göttingen(哥廷根大学) BASF SE(巴斯夫股份有限公司) Hochschule Bielefeld(比勒菲尔德大学) TU Clausthal(Clausthal 技术大学)

AI总结 通过五个受控实验,研究LLM生成的高质量自然语言解释在时间序列能源预测中是否提升任务准确性,发现解释不改善准确性但膨胀自信,存在质量-有用性差距。

详情
AI中文摘要

先前研究表明,大型语言模型(LLMs)可以将可解释人工智能(XAI)输出转换为在合理性、连贯性和可理解性等质量指标上得分很高的自然语言解释(NLEs)。但解释质量是否能转化为实际有用性?我们通过五个受控实验(60个测试实例中的2,730个判断)在时间序列能源预测领域中研究这一问题,每个实验操作化XAI文献中研究的有用性的一个不同方面。在保持NLE质量与先前因子研究确定的高水平一致的情况下,我们发现NLEs在五个任务中的任何一个上都没有提高任务准确性,同时膨胀了自我报告的置信度。一个安慰剂对照表明,这种置信度提升是由文本存在而非内容驱动的。在分布外检测任务中,NLEs降低了LLM判断者标记不可靠预测的能力,提供了掩盖模型失败的虚假安慰。我们将这些发现定性为质量-有用性差距,并认为对XAI到NLE管道的评估必须超越文本质量指标,扩展到下游任务性能。

英文摘要

Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation quality translate to practical usefulness? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text presence rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge's ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the Quality-Usefulness Gap and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.

2605.26738 2026-05-27 cs.CL 版本更新

KARMA: Karma-Aligned Reward Model Adaptation

KARMA:基于Karma对齐的奖励模型适应

Jared Scott, Jesse Roberts

发表机构 * Tennessee Tech University(田纳西科技大学)

AI总结 提出KARMA框架,利用Reddit对话数据训练奖励模型预测语境依赖的回应价值,并通过强化学习微调语言模型以提升语用能力,发现最佳奖励模型不一定带来最优对齐,且KARMA会降低事实性。

详情
AI中文摘要

人类交流依赖于隐含的社会信号,其有效性由语气、语境和对话规范塑造,而不仅仅是语义内容。我们引入了KARMA(Karma对齐的奖励模型适应),这是一个让LLM从大规模社交互动数据中学习语境敏感对话行为的框架。KARMA在Reddit对话上训练奖励模型,以预测基于语境的回应价值,并利用该信号通过强化学习微调语言模型,以提升语用中介任务的表现。关键的是,我们发现表现最好的奖励模型并未带来更好的下游模型对齐:一个完全依赖对话语境的奖励模型在预测Reddit karma方面表现更差,但产生了显著更好的下游性能。我们评估了KARMA应用于下游模型的效果,无论该模型是否直接接触社交媒体数据。得到的模型显示出改进的语用中介行为,同时很大程度上减轻了不良副作用。在所有条件下,KARMA都持续降低了事实性,包括下游模型未直接接触Reddit数据的情况,这表明这种张力嵌入在奖励信号本身中,而非由噪声训练数据引入。

英文摘要

Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conversational behavior from large-scale social interaction data. KARMA trains a reward model on Reddit conversations to predict response valuation conditioned on context, and uses this signal to fine-tune language models via reinforcement learning to improve performance on pragmatics-mediated tasks. Critically, we find that the highest performing reward model does not lead to better downstream model alignment: a reward model relying exclusively on conversational context was a worse predictor of Reddit karma but yielded substantially better downstream performance. We evaluate the effects of KARMA applied to a downstream model with and without direct exposure to the social media data. The resulting models show improved pragmatics-mediated behaviors with largely mitigated undesirable side effects. Factuality is consistently diminished by KARMA across all conditions, including when the downstream model has no direct exposure to Reddit data, suggesting that this tension is embedded in the reward signal itself rather than introduced by noisy training data.

2605.26735 2026-05-27 cs.CL 版本更新

Rethinking the Multilingual Reasoning Gap with Layer Swap

通过层交换重新思考多语言推理差距

Maxence Lasbordes, Amélie Chatelain, Djamé Seddah

发表机构 * LightOn Inria(法国国家科学研究中心)

AI总结 本文通过构建多语言推理数据集并微调专家模型,发现本地推理与英语枢轴推理的性能差距远小于先前报道,并提出层交换方法将英语专家的推理中间层迁移到本地专家,以缩小差距。

详情
AI中文摘要

最近的推理大语言模型在生成思维链(CoT)时主要使用英语,即使提示使用非英语语言。先前的研究表明,强制CoT保持输入语言(本地推理)会显著降低性能,而允许模型用英语推理然后用输入语言回答(英语枢轴推理)则表现更好。然而,大多数关于这种本地推理差距的研究依赖于推理时的干预或有限的本地语言训练数据。我们在更大规模且可比监督下重新审视这一比较。我们构建了涵盖六种语言(英语、法语、德语、西班牙语、中文和斯瓦希里语)的长篇多语言推理数据集;在Qwen/Qwen3-8B-Base基础上微调本地和英语枢轴两种模式的专家模型,并在数学、科学、通用知识和代码任务上进行评估。在此设置下,五种非英语语言的平均本地推理差距缩小至1.9-3.5%,远小于先前报道。对本地专家的权重空间分析显示,中间层的微调更新具有对齐性,而外层则存在分歧。这表明存在一个很大程度上与语言无关的推理核心,周围环绕着特定语言的层。利用这一结构,我们引入了层交换方法:将英语专家更强的推理中间层迁移到每个本地专家中,从而在保留目标语言CoT的同时,几乎消除了五种非英语语言的本地推理差距。我们发布了所有模型和数据集。

英文摘要

Recent reasoning Large Language Models produce a chain-of-thought (CoT) predominantly in English, even when prompted in non-English languages. Prior work suggests that forcing the CoT to remain in the input language (\emph{native reasoning}) substantially degrades performance relative to allowing the model to reason in English before answering in the input language (\emph{English-pivoted reasoning}). However, most studies of this native reasoning gap rely on inference-time interventions or limited native-language training data. We revisit this comparison at a larger scale and under comparable supervision. We construct long multilingual reasoning datasets across six languages (English, French, German, Spanish, Chinese and Swahili); fine-tune specialists in both native and English-pivoted regimes on top of \texttt{Qwen/Qwen3-8B-Base}, and evaluate across mathematics, science, general knowledge, and code. In this setting, the average native reasoning gap shrinks to 1.9--3.5\% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists reveals aligned fine-tuning updates in the middle layers and divergence in the outer layers. This points to a largely language-agnostic reasoning core surrounded by language-specific layers. Exploiting this structure, we introduce a Layer Swap: transferring the English specialist's stronger reasoning mid-layers into each native specialist, closing most of the native reasoning gap across the five non-English languages while preserving CoT in the target language. We release all models and datasets.

2605.26731 2026-05-27 cs.AI cs.CL 版本更新

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

不是能力问题:LLM 智能体层级间的驾驭敏感性非单调

Yong-eun Cho

发表机构 * KailosLab(凯罗斯实验室)

AI总结 通过 432 次实验,发现 LLM 智能体的驾驭敏感性随模型层级非单调变化,且依赖模型类型(聊天 vs. 推理),推翻了“更高能力模型需要更少结构指导”的假设。

Comments 9 pages, 3 figures

详情
AI中文摘要

LLM 智能体部署中的一个普遍假设是,更结构化的驾驭方式普遍能提高可靠性,并且能力更强的模型需要成比例地减少结构指导——这共同暗示了模型能力层级与最优驾驭复杂度之间存在单调反比关系。我们通过一个受控的 432 次实验来检验这一假设,实验跨越了四个能力层级的六个模型,在 HEAT-24(一个基于 git 工作区验证的 24 任务合成基准)上采用了三种驾驭条件(轻量、平衡、严格)。我们的结果从两个方面反驳了单调反比关系。首先,对于评估的前沿聊天模型(Gemini 2.5 Flash),增加驾驭冗长度使 VTSR 降低 29-38 个百分点——这是一个驾驭复杂度悖论。其次,对于评估的前沿推理模型(Qwen3.5-122B,启用扩展思考),严格驾驭实现了最高的 VTSR(91.7%)和最低的延迟,与预测相反。在受限层级内,一个 2B 模型(Gemma4:e2B)在所有驾驭条件下均以 91.7% 的 VTSR 达到了强开放层级的稳定性。由于本研究中每个层级仅由一个模型代表,这些结果应解释为模型特定的观察;驾驭敏感性在所评估的模型中呈现非单调性,并且关键依赖于模型类型(聊天 vs. 推理)。我们引入了一个六标签失败分类法,显示格式违规主导了能力强的模型失败,而错误文件主导了低能力失败,并推导出了实用的层级感知驾驭选择指南。

英文摘要

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.

2605.25981 2026-05-27 cs.CL 版本更新

When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

LLM 代理何时对表面噪声与语义噪声做出不同处理?一项基于 68 个单元格的测量研究及留出轨迹级验证

Liyun Zhang, Jiayi Guo

发表机构 * School of Information and Software Engineering, UESTC(信息与软件工程学院,电子科技大学) Jacobs School of Engineering, UC San Diego(工程学院,加州大学圣地亚哥分校)

AI总结 本研究通过 68 个单元格的测量实验,发现大语言模型驱动的思维链和 ReAct 代理对语义扰动(如释义、同义词)比表现扰动(如格式、重排序)更敏感,并基于留出模型和轨迹级机制分析提出了“隐蔽发散”解释。

详情
AI中文摘要

我们记录了一个经验现象:在来自七个架构家族的十种大语言模型驱动的思维链和 ReAct 代理中,意义承载扰动(例如,释义、同义词)比同等严重程度的表现扰动(例如,格式、重排序)更频繁地改变最终答案。跨越 GSM8K、MATH 和 HotpotQA 的 68 个单元格(1,530 个原始样本和约 11,150 个变体),在严重性匹配后,不一致性差距平均为 +19.69 个百分点(配对 t=9.58,p<0.0001),其中 64/68 个单元格为正。该差距通过了四次严重性代理审计,并且在排除 qwen 模型时仍然显著(+11.10 个百分点,p<0.0001)。几项压力测试诚实地失败了:在更严格的假设下,聚类自助法显著性消失;可处理性对比无法复制;跨架构生成器交换破坏了每个单元格的排名;第二个 LLM 判断器仅产生中等一致性(κ=0.50)。 然后,我们在一个完全留出的第 11 个模型(qwen2.5-14B-Instruct;1,800 条轨迹)上验证了标题效应,并重新测试了一个预先注册的能力×可处理性分区,观察到一个小但正的留出效应(3/4 个单元格为正;合并 Welch t=3.81,p=9.6×10^{-4})。利用留出轨迹,我们探测了四个轨迹级机制信号。两个先前的机制主张未能复制并被明确撤回。两个新的探测反而支持一种“隐蔽发散”图景:语义扰动通常保留第一个动作,但从后续步骤开始导致中间推理发散,并伴随略微更深的轨迹。我们将此定位为一项带有留出复现和部分轨迹级解释的测量贡献,说明语义扰动如何通过代理推理传播。代码、扰动语料库、原始轨迹和分析脚本已匿名发布以供评审。

英文摘要

We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($κ=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a \emph{stealth-divergence} picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.

2605.25731 2026-05-27 cs.CL 版本更新

Trait-Aware Policy Optimization for Autoregressive Multi-Trait Essay Scoring

面向自回归多维度作文评分的特质感知策略优化

Zhengyang Wang, Sanwoo Lee, Jiaxin Wang, Chenxi Miao, Weikang Li, Yunfang Wu

发表机构 * Peking University(北京大学) Baidu Inc.(百度公司)

AI总结 提出特质感知策略优化(TAPO)框架,通过分解样本和特质维度的奖励并结合增强提示,提升自回归多维度作文评分性能。

详情
AI中文摘要

多维度作文评分旨在跨多个维度提供写作质量的细粒度评估。然而,如何有效后训练自回归评分模型仍未充分探索。在本文中,我们提出了特质感知策略优化(TAPO),一种专为自回归多维度评分设计的后训练框架。我们的方法沿样本和特质维度分解奖励,结合全局评分一致性、特质级准确性、格式有效性以及跨特质依赖保持。此外,我们在整个训练过程中使用增强提示,通过融入原始提示文本和特质描述,为特质特定分数生成提供更丰富的语义信息。跨多个骨干模型的实验表明,我们的方法在监督微调和标量奖励优化基线上持续提升了多维度评分性能,证明了特质感知后训练在作文评分中的有效性和可迁移性。

英文摘要

Multi-trait essay scoring aims to provide fine-grained evaluation of writing quality across multiple dimensions. However, how to effectively post-train autoregressive scoring models remains underexplored. In this paper, we propose Trait-Aware Policy Optimization (TAPO), a post-training framework tailored to autoregressive multi-trait scoring. Our method decomposes rewards along both the sample and trait dimensions, combining global scoring consistency, trait-level accuracy, format validity, and inter-trait dependency preservation. In addition, we use enhanced prompts throughout training by incorporating original prompt texts and trait descriptions, providing richer semantic information for trait-specific score generation. Experiments across multiple backbone models show that our method consistently improves multi-trait scoring performance over supervised fine-tuning and scalar-reward optimization baselines, demonstrating the effectiveness and transferability of trait-aware post-training for essay scoring.

2605.25510 2026-05-27 cs.CL 版本更新

The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models

好奇心时代遇上AI时代:大型语言模型中的儿童安全基准测试

Samee Arif, Angana Borah, Rada Mihalcea

发表机构 * University of Michigan(密歇根大学)

AI总结 针对7-11岁儿童使用LLM的安全性问题,提出基于发展心理学的KIDBench基准,通过隐式线索和显式年龄指令提升安全性,并开发了儿童安全评估器KIDGuardLlama和响应模型KIDLlama。

详情
AI中文摘要

儿童越来越多地接触大型语言模型(LLM),这可能会使他们接触到发展不适当或需要年龄敏感性安全、指导和界限的回应。现有的LLM安全评估主要关注有害内容规避,并未明确针对面向儿童的安全性。我们引入了KIDBench,这是一个使用基于发展心理学的LLM作为评判标准的基准,用于评估面向7-11岁儿童的LLM安全性。KIDBench包含十个类别的真实儿童查询,包括单轮提示和多轮儿童角色模拟。我们比较了无儿童上下文的无提示、暗示儿童说话者的隐式提示以及显式年龄指令。隐式提示使模型得分提高了9-47%,而显式年龄进一步增加了10-30%的增益。跨语言和文化评估显示,不同语言和国家背景下的安全行为不均匀。多轮模拟显示,面向儿童的响应质量从第一轮到最差轮次可能下降6-24%。除了评估,我们还引入了儿童安全评估器KIDGuardLlama和面向儿童的响应模型KIDLlama,展示了KIDBench如何支持更安全的面向儿童AI。

英文摘要

Children increasingly have access to Large Language Models (LLMs), which may expose them to responses that are developmentally inappropriate or require age-sensitive safety, guidance, and boundaries. Existing LLM safety evaluations largely focus on harmful-content avoidance and do not explicitly target child-facing safety. We introduce KIDBench, a benchmark for evaluating child-facing LLM safety for ages 7-11 using a developmental-psychology-grounded LLM-as-a-Judge rubric. KIDBench contains realistic child queries across ten categories, with single-turn prompts and multi-turn child-actor simulations. We compare no-cues prompts with no child context, implicit-cues prompts that suggest a child speaker, and explicit age instructions. Implicit-cues improve scores by 9-47% across models, while explicit age adds a further 10-30% gain. Cross-lingual and cultural evaluations show uneven safety behavior across languages and country contexts. Multi-turn simulations show that child-facing response quality can degrade by 6-24% from the first to worst turn. Beyond evaluation, we introduce KIDGuardLlama, a child-safety evaluator, and KIDLlama, a child-oriented response model, showing how KIDBench supports safer child-facing AI.

2605.25480 2026-05-27 cs.CL 版本更新

Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki

检索即推理:通过LLM-Wiki实现自我进化的智能体原生检索

Haoliang Ming, Feifei Li, Xiaoqing Wu, Wenhui Que

发表机构 * Tencent Inc.(腾讯公司)

AI总结 提出LLM-Wiki系统,将外部知识组织为可编译、可组合、自进化的Wiki页面,通过工具调用接口实现搜索、阅读和链接跟踪操作,并引入错误簿进行持续自纠正,在多项多跳问答基准上取得最优结果。

Comments 15 pages, 3 figures, 10 tables, 1 algorithm

详情
AI中文摘要

LLM智能体需要的检索行为应更像推理(搜索、阅读、遍历、判断证据是否充分),而非一次性上下文获取。然而,当前的检索增强生成(RAG)系统将外部知识组织为扁平块,通过嵌入相似性检索,暴露出一种不适合迭代推理智能体的“检索即查找”接口。我们提出LLM-Wiki,一种智能体原生检索系统,它将外部知识视为可编译、可组合、自进化的结构而非静态检索索引,从而实现了“检索即推理”范式。LLM-Wiki将文档编译为带有双向链接的结构化Wiki页面,通过标准工具调用接口暴露搜索、阅读和链接跟踪操作,并引入错误簿进行持久的结构和语义自纠正。LLM-Wiki在HotpotQA、MuSiQue和2WikiMultiHopQA上取得了最先进的结果,比HippoRAG 2、LightRAG和GraphRAG高出2.0-8.1个F1点。在AuthTrace上,LLM-Wiki取得了最佳总体准确率,在多文档结构化查询上尤其有显著提升,证实了基于编译的检索在链式多跳推理之外也具有泛化能力。

英文摘要

LLM agents require retrieval to behave less like one-shot context fetching and more like reasoning: searching, reading, traversing, and deciding when evidence is sufficient. Yet current Retrieval-Augmented Generation (RAG) systems organize external knowledge as flat chunks retrieved by embedding similarity, exposing a retrieval-as-lookup interface ill-suited to iterative reasoning agents. We propose LLM-Wiki, an agent-native retrieval system that operationalizes the Retrieval-as-Reasoning paradigm by treating external knowledge as a compilable, composable, and self-evolving structure rather than a static retrieval index. LLM-Wiki compiles documents into structured Wiki pages with bidirectional links, exposes search, read, and link-following operations through standard tool-calling interfaces, and introduces an Error Book for persistent structural and semantic self-correction. LLM-Wiki achieves state-of-the-art results on HotpotQA, MuSiQue, and 2WikiMultiHopQA, outperforming HippoRAG 2, LightRAG, and GraphRAG by 2.0-8.1 F1 points. On AuthTrace, LLM-Wiki achieves the best overall accuracy, with especially strong gains on multi-document structured queries, confirming that compilation-based retrieval generalizes beyond chain-style multi-hop reasoning.

2605.25382 2026-05-27 cs.CL 版本更新

AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora

AuthTrace: 主题密集的单作者语料库中的证据构建诊断

Xiaoqing Wu, Feifei Li, Haoliang Ming, Wenhui Que

发表机构 * Tencent Inc.(腾讯公司) Beijing, China(中国北京)

AI总结 提出AuthTrace基准,通过扇入梯度诊断主题密集单作者语料库中证据构建系统的召回率、精度和答案正确性,发现证据召回是答案正确性的最强预测因子。

详情
AI中文摘要

证据构建——决定在生成开始前哪些段落到达语言模型的阶段——按范式进行评估,使得从业者无法有原则地诊断哪种组织策略失败、在哪里失败或为什么失败。我们引入了AuthTrace,这是一个基于主题密集的单作者语料库构建的诊断基准,其中近失干扰项与所需证据共享风格、主题和词汇。AuthTrace提供明确的引用证据、精确的扇入注释以及统一的包级协议,用于衡量证据召回率、证据精度和答案正确性。扇入梯度——支持答案所需的源文档数量——作为主要诊断轴,使得能够在检索、记忆、图和结构化证据范式之间进行受控比较。评估两个QA模型上的八个系统,我们发现,在主要读者-判断器对下,证据召回率是答案正确性最强的观察预测因子(r = 0.96);大多数失败源于缺失证据而非答案合成。扇入进一步揭示了特定范式的崩溃模式:平面检索的退化速度比主题组织的证据构建快2-3倍。这些结果表明,扇入分解是一种可重用的诊断镜头,用于识别证据构建系统失败的位置以及哪种范式最适合给定的工作负载。

英文摘要

Evidence construction--the stage that determines which passages reach the language model before generation begins--is evaluated paradigm by paradigm, leaving practitioners with no principled way to diagnose which organization strategy fails, where, or why. We introduce AuthTrace, a diagnostic benchmark built on thematically dense single-author corpora where near-miss distractors share style, topic, and vocabulary with the required evidence. AuthTrace provides explicit quoted evidence, exact fan-in annotation, and a unified pack-level protocol measuring evidence recall, evidence precision, and answer correctness. A fan-in gradient--the number of source documents required to support the answer--serves as the primary diagnostic axis, enabling controlled comparison across retrieval, memory, graph, and structured-evidence paradigms. Evaluating eight systems across two QA models, we find that evidence recall is the strongest observed predictor of answer correctness under the primary reader-judge pair (r = 0.96); most failures stem from missing evidence rather than answer synthesis. Fan-in further exposes paradigm-specific collapse patterns: flat retrieval degrades 2-3x faster than thematically organized evidence construction. These results show fan-in decomposition to be a reusable diagnostic lens for identifying where evidence-construction systems fail and which paradigm best serves a given workload.

2605.25281 2026-05-27 cs.CL cs.AI 版本更新

READER: Reasoning-Enhanced AI-Generated Text Detection

READER: 增强推理的AI生成文本检测

Pingfan Su, Kai Ye, Shijin Gong, Erhan Xu, Jin Zhu, Giulia Livieri, Chengchun Shi

发表机构 * School of Mathematics, University of Birmingham(布里斯托尔大学数学学院)

AI总结 提出READER方法,通过微调1.5B参数的LLM在结构化推理数据集READ上,结合推理与检测,在分布偏移下优于GPT-5.2等大100-1000倍的模型。

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进步使得区分人类撰写的文本与AI生成的内容变得越来越困难。许多现有的检测器训练有监督的神经分类器,这些分类器在分布内表现强劲,但通常不透明,且在分布偏移下性能可能大幅下降。我们提出READER,一种增强推理的AI文本检测器,它输出人类/AI标签以及描述其决策证据的结构化理由。我们方法的一个关键组成部分是READ,一个包含理由和判决的精心策划的监督集。我们在READ上微调一个LLM以构建READER,该检测器在推理时先推理再检测。尽管只有1.5B参数,READER始终优于现有检测器以及提示式的高容量LLM基线(GPT-5.2、Gemini-3-Pro和DeepSeek-V3.2),这些基线的规模大100到1000倍。

英文摘要

Recent advances in large language models (LLMs) have made it increasingly difficult to distinguish human-written text from AI-generated content. Many existing detectors train supervised neural classifiers that achieve strong in-distribution performance but are often opaque and can degrade substantially under distribution shift. We present READER, a reasoning-enhanced AI text detector that outputs both a human/AI label and a structured rationale describing the evidence for its decision. A key component of our approach is READ, a curated supervision set of rationales and verdicts. We fine-tune an LLM on READ to build READER, which reasons before detecting at inference time. Despite having only 1.5B parameters, READER consistently outperforms existing detectors as well as prompted, high-capacity LLM baselines (GPT-5.2, Gemini-3-Pro, and DeepSeek-V3.2), which are 100 to 1000 times larger in scale.

2605.24636 2026-05-27 cs.AI cs.CL 版本更新

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

GlobalDentBench:一个用于评估牙科领域大语言模型临床推理能力并包含专家校准的多国基准

Junjie Zhao, Jingyi Liang, Zhenyang Cai, Jiaming Zhang, Zhenwei Wen, Shuzhi Deng, Wenjing Yi, Chunfeng Luo, Hexian Zhang, Junying Chen, Tianrui Liu, Zhuhui Bai, Zixu Zhang, Pradeep Singh, Xiang Liu, Jianquan Li, Nhan L Tran, Falk Schwendicke, Zuolin Jin, Lijian Jin, Liangyi Chen, Wei-fa Yang, Benyou Wang, Junwen Wang, Shan Jiang

发表机构 * Division of Applied Oral Sciences and Community Dental Care, Faculty of Dentistry, The University of Hong Kong(香港大学牙科学院应用口腔科学与社区牙科护理系) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)人工智能学院) Department of Periodontology, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(平山)牙周科) Beijing Institute of Collaborative Innovation(北京协同创新研究院) Department of Orthodontics, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(平山)正畸科) Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(平山)) College of Future Technology, Peking University(北京大学未来技术学院) Freedom AI New Cornerstone Science Laboratory, National Biomedical Imaging Center, State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking-Tsinghua Center for Life Sciences, College of Future Technology, Peking University, Beijing 100871, China(新基石科学实验室、国家生物医学成像中心、膜生物学国家重点实验室、分子医学研究院、北京大学未来技术学院、生命科学中心,北京大学,北京100871,中国) IDG/McGovern Institute for Brain Research, Peking University, Beijing 100871, China(IDG/ McGovern脑科学研究院,北京大学,北京100871,中国) Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, The University of Hong Kong(香港大学牙科学院口腔颌面外科系) Shenzhen Loop Area Institute(深圳环城区域研究所) Department of Cancer Biology, Mayo Clinic Arizona, 5777 E. Mayo Blvd., IERB-3-504A, Phoenix, Arizona, 85054, USA(梅奥诊所亚利桑那分部癌症生物学部门,5777 E. Mayo Blvd., IERB-3-504A, Phoenix, Arizona, 85054, USA) Department of Conservative Dentistry, Periodontology and Digital Dentistry, LMU University Hospital, LMU Munich, Munich, Germany(慕尼黑大学医院保守牙科、牙周病学和数字牙科部门,慕尼黑,德国,慕尼黑大学) Division of Periodontology & Implant Dentistry, Faculty of Dentistry, The University of Hong Kong, Hong Kong, SAR, China(香港大学牙科学院牙周病学与种植牙科系,香港,中国)

AI总结 提出首个跨国牙科基准GlobalDentBench,包含14个专科、88个国家的8978道专家验证题目,评估三种推理层次,揭示当前大语言模型在牙科临床推理中性能随复杂度下降且存在高风险。

详情
AI中文摘要

尽管大语言模型(LLMs)在医学领域具有变革潜力,但其在真实临床场景中的推理鲁棒性和安全性仍未得到充分探索,尤其是在牙科领域。本文提出GlobalDentBench,首个跨国牙科基准,其分类体系涵盖六大洲88个国家和地区的14个牙科专科。该基准包含8978道专家验证题目,分为三种格式(选择题、简答题和基于案例的题目),并评估三个递进推理层次:知识回忆(L1)、常规推理(L2)和个体化推理(L3)。为确保数据质量,自动构建框架由六名资深牙医校准,选择题和简答题的专家一致率达到99.98%,更复杂的基于案例的题目达到96.78%。在GlobalDentBench上对12个前沿LLMs的评估显示,随着推理复杂度增加,性能呈急剧阶梯式下降。具体而言,准确率从选择题的81.34%骤降至简答题的64.53%和基于案例的题目的22.34%,同时从L1的74.01%显著下降至L2的55.64%和L3的35.71%。更关键的是,对真实牙科案例的风险分析表明,LLM生成的临床建议中总体不安全率高达31.01%,其中4.51%存在导致不可逆患者伤害的风险,且风险在正畸等专科中尤为突出。这些发现暴露了当前LLMs在医学推理和安全性方面的根本局限性。因此,GlobalDentBench为可信赖的临床AI评估提供了可扩展的基础,强调了在医疗领域安全部署这些模型之前迫切需要严格验证。

英文摘要

While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.

2605.23910 2026-05-27 cs.CL cs.AI 版本更新

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

基于信息融合的文档分类模式识别:多模态与多视角表示方法的系统综述

Marcin Michał Mirończuk

发表机构 * National Information Processing Institute(国家信息处理研究所)

AI总结 本文通过系统综述和元分析,提出了统一框架,量化了多模态和多视角融合在文档分类中的性能提升,并揭示了方法学严谨性不足的问题。

详情
Journal ref
Information Fusion, 132, 2026, 104247
AI中文摘要

信息融合被广泛用于通过整合多数据源(多模态)或多表示(多视角)来改进文档分类。然而,该领域缺乏统一框架、对其有效性的定量综合以及给实践者的明确指导。本系统综述通过分析139项主要研究来填补这些空白。它引入了一个正式框架来结构化该领域,呈现了定性分析结果以识别关键趋势,并进行了随机效应元分析(据我们所知,这是首次专注于文档分类的元分析)以量化性能提升。我们的元分析显示,多模态融合显著提高了准确率(平均提升+5.28个百分点,$p=0.0016$)——F1分数效应方向为正,但在我们的主要模型中统计上不显著。多视角融合在准确率(+4.67%)、F1分数(+3.08%)和召回率(均$p<0.05$)上提供了一致但适度的提升。关键的是,我们的定性综合揭示了方法学严谨性方面的可重复性挑战:只有11.8%(多模态)和23.3%(多视角)的研究使用统计检验来验证其发现,这削弱了许多结果的可靠性。本综述的主要贡献是一个统一框架、首个定量证据基础以及数据驱动的指南。本综述得出结论,成功的信息融合不依赖于算法复杂性,而在于融合方法与任务上下文的战略对齐以及对更严格验证的承诺。

英文摘要

Information fusion is used widely to improve document classification by the integration of multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$) significantly -- the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67\%), F1-score (+3.08\%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers challenges in reproducibility in methodological rigour: only 11.8\% (multimodal) and 23.3\% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review's primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. This review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.

2605.22511 2026-05-27 cs.AI cs.CL cs.IR 版本更新

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Search-E1: 自蒸馏驱动搜索增强推理中的自我进化

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai, Lingtao Mao

AI总结 提出Search-E1方法,通过交替使用普通GRPO和在线策略自蒸馏(OPSD),让搜索增强智能体无需外部监督或复杂模块即可自我进化,在七个QA基准上以3B模型超越所有开源基线。

详情
AI中文摘要

后训练已成为将语言模型转变为胜任的搜索增强推理智能体的主要方法。近期一系列工作通过在此标准流程之上添加复杂机制来进一步提升性能。这些增强引入了来自更强外部系统的外部监督,附加了诸如过程奖励模型或回顾性评论者等辅助模块,通过树搜索或多阶段课程重构了轨迹生成本身,并利用手工设计的奖励和惩罚来塑造奖励。每项增加都带来了可衡量的提升,但同时也使训练流程更加复杂,并将方法绑定到可能并非总是可用的资源或设计上。我们退一步思考这些机制是否真的必要,并提出了Search-E1,一种自我进化方法,让搜索增强智能体仅通过普通的GRPO与在线策略自蒸馏(OPSD)交替进行来改进。在每轮GRPO之后,策略在其自身的训练问题上进行轨迹生成。然后,一个token级的前向KL目标将策略的推理时分布与其在特权上下文下的自身分布对齐,该特权上下文暴露了更高效的兄弟轨迹。尽管简单,该过程自然地提供了密集的每步监督。在七个QA基准上,Search-E1使用Qwen2.5-3B达到了0.440的平均EM,在两个规模上均超越了所有开源基线。代码和完整版本将很快公开。

英文摘要

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with on-policy self-distillation (OPSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy's inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches 0.440 average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. Code and complete version will be made public soon.

2605.20530 2026-05-27 cs.AI cs.CL cs.LG cs.SE 版本更新

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

AgentAtlas:超越LLM智能体的结果排行榜

Parsa Mazaheri, Kasra Mazaheri

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出AgentAtlas框架,通过控制决策分类法和轨迹故障词汇表,将智能体评估从结果成功分离为控制决策质量和轨迹质量,并揭示仅依赖结果排行榜的测量风险。

详情
AI中文摘要

大型语言模型智能体现在可以操作代码库、浏览器、操作系统、日历、文件和工具生态系统,但它们的评估通常将行为简化为最终任务成功。AgentAtlas将智能体评估重新定义为一种诊断词汇和审计协议,用于将结果成功与控制决策质量和轨迹质量分离。本文贡献了:(i) 一个六状态控制决策分类法(行动/询问/拒绝/停止/确认/恢复);(ii) 一个包含主要错误源和下游影响的轨迹失败词汇表;(iii) 对十五个智能体基准的0/1/2基准覆盖审计;(iv) 一个在合成1,342项数据集上进行的说明性协议研究,使用八种模型在分类法感知和分类法盲提示格式下进行评估。该合成演示不是公开基准发布,不应被视为确定的模型比较。相反,它说明了两个测量风险:当显式标签菜单被移除时,映射标签一致性可能发生显著变化,并且轴选择可能改变表观排名。AgentAtlas旨在帮助基准设计者说明他们覆盖的行为,并帮助评估者诊断仅结果排行榜隐藏的失败。

英文摘要

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control-decision quality and trajectory quality. The paper contributes: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory-failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark-coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342-item set evaluated with eight models under taxonomy-aware and taxonomy-blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome-only leaderboards hide.

2605.02958 2026-05-27 cs.CR cs.AI cs.CL cs.LG 版本更新

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

追踪拒绝的动态:利用潜在拒绝轨迹进行鲁棒越狱检测

Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

发表机构 * Peking University(北京大学) Nanyang Technological University(南洋理工大学) Beijing Jiaotong University(北京交通大学)

AI总结 通过因果追踪识别出稀疏的“拒绝轨迹”激活模式,并提出轻量级白盒检测器SALO,基于隐藏状态窗口实现鲁棒越狱检测。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

详情
AI中文摘要

表征工程分析通常使用从终端或池化表示中提取的静态方向来描述拒绝。我们质疑这种观点是否忽略了拒绝是如何在层-标记位置上构建的。通过因果追踪,我们识别出一个 extit{拒绝轨迹}:一种稀疏的上游激活模式,即使当诸如GCG的攻击抑制终端拒绝信号时,该模式也常常持续存在。基于这一观察,我们提出了SALO(稀疏激活定位算子),一种轻量级白盒检测器,它在选定层窗口的原始隐藏状态体积上操作。在Qwen、Llama和Mistral模型上,SALO在固定的XSTest校准工作点下,改进了多个攻击家族的越狱检测。我们进一步分析了静态RepE风格基线、ROI敏感性、自适应GCG攻击和编码输入边界情况,阐明了拒绝轨迹监测的前景和局限性。

英文摘要

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a \textit{Refusal Trajectory}: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point. We further analyze static RepE-style baselines, ROI sensitivity, adaptive GCG attacks, and encoded-input boundary cases, clarifying both the promise and limitations of refusal-trajectory monitoring.

2605.02035 2026-05-27 cs.CL cs.AI 版本更新

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

VIDA: 多模态机器翻译中视觉依赖歧义的数据集

Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo, Chris Biemann

发表机构 * Department of Informatics, Universität Hamburg(汉堡大学信息学院) Alibaba Group(阿里巴巴集团) Alibaba Cloud(阿里云)

AI总结 提出VIDA数据集,包含2500个精心策划的实例,用于评估多模态机器翻译中需要视觉证据才能解决的歧义,并引入以歧义消解为中心的指标,实验表明链式思维微调能提升跨分布歧义消解能力。

详情
AI中文摘要

歧义消解是多模态机器翻译(MMT)中的一个关键挑战,模型必须真正利用视觉输入将歧义表达映射到其预期含义。尽管先前的工作提出了面向消歧的基准来评估视觉的作用,但我们观察到现有基准仍受限于任务格式不匹配、歧义覆盖范围狭窄或视觉依赖性验证不足。此外,现有的歧义评估并不适用于开放式翻译中的多种歧义类型。为解决这些局限性,我们提出了VIDA(视觉依赖歧义),一个包含2500个精心策划实例的数据集,其中解析带注释的源语言片段需要视觉证据。我们进一步提出了以消歧为中心的指标,使用LLM作为评判分类器来验证带注释的歧义表达是否在片段级别被正确消解。使用两个最先进的LVLM进行的实验表明,监督微调(SFT)提高了整体翻译质量,而链式思维SFT(CoT-SFT)产生了更强的跨分布歧义消解能力,这表明显式的消歧指导提高了对多种歧义类型的泛化能力。

英文摘要

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.

2605.26689 2026-05-27 cs.CV cs.CL 版本更新

PinPoint: Prompting with Informative Interior Points

PinPoint: 通过信息性内部点进行提示

Pouya Sadeghi, Shawn He, Pedro Pablo Guerrero Vela, C. Thomas, Alex Wong, Sirisha Rambhatla

发表机构 * University of Waterloo(滑铁卢大学) Critical ML Apple(苹果公司)

AI总结 针对指代图像分割中VLM与SAM结合时因提示模糊导致的性能差距,提出无需训练的确定性点选择器PinPoint,通过融合视觉线索选择稳定、信息丰富的内部点,在无训练下达到监督和强化学习方法的性能。

详情
AI中文摘要

现代指代图像分割流程将用于定位的视觉语言模型(VLM)与用于掩码生成的可提示分割器(如Segment Anything Model,SAM)相结合。先前该方案的无训练实例始终落后于微调和强化学习(RL)调优的专家,且不清楚差距来自VLM的定位、SAM的能力还是提示。我们表明差距主要由提示模糊性主导:VLM提出的边界框(bbox)让SAM猜测框内哪些像素属于表达式所指的对象。内部点是自然的消歧器,但它们的落点很重要;先前的工作依赖于朴素采样的点,这些点落在边界、干扰物和背景杂波上,甚至可能比单独使用bbox更差。有监督和RL调优的方法通过训练VLM预测更好的点来缩小这一差距;我们表明这种训练是不必要的。在五个内部点的匹配预算下,用稳定、信息丰富的点选择替换朴素采样,在RefCOCO/+/g上累积交并比(cIoU)提高了12-18个点,且每个模型固定。我们将这一观察转化为PinPoint,一个确定性的、无需训练的点选择器,它融合四个视觉线索为共识图,选择紧凑、空间多样且远离边界的点,并使用冻结的VLM标记每个点。无需任何任务特定训练,PinPoint在相同堆栈上匹配了有监督和RL调优的专家,同时每次查询仅调用两次VLM。

英文摘要

Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM's grounding, SAM's capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.

2605.26683 2026-05-27 cs.CL cs.AI 版本更新

An In-Vitro Study on Cross-Lingual Generalization in Language Models

语言模型中跨语言泛化的体外研究

Adrian Cosma

发表机构 * Dalle Molle Institute for Artificial Intelligence (IDSIA)(达勒莫利人工智能研究所(IDSIA))

AI总结 通过构建两种程序生成的语言,独立控制词汇距离、少数语言比例等变量,研究语言模型跨语言迁移的机制,发现迁移主要取决于分词是否保留可复用的跨语言子结构,且词汇量越小越有利于掩码迁移。

Comments 16 Figures, 1 Table

详情
AI中文摘要

在自然语料中,语言模型的跨语言迁移难以研究,因为词汇重叠、形态、数据不平衡和分词相互纠缠。我们引入了一个体外框架,使用两种程序生成的语言,它们共享相同的本体、类型化语法和组合结构,但表面实现不同。这使我们能够独立改变词汇距离、少数语言比例、分词器训练制度和词汇量大小,同时评估在掩码少数语言条件下的迁移,该条件的词汇形式在训练中从未被观察到。在700次受控运行中,我们发现迁移受分词器平衡或原始词汇相似性的影响较小,而更多地取决于分词是否保留可复用的跨语言子结构。较小的词汇量通常通过保持单词可分解为共享片段来改善掩码迁移,而较大的词汇量可能将形式转化为特定语言的原子。我们进一步表明,迁移是一个阶段性过程:语法和类型级能力先于掩码词汇泛化。最后,我们尝试通过分词器桥梁解释这一机制,并表明桥梁强度与掩码可达性密切相关。

英文摘要

Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.

2605.26678 2026-05-27 cs.CL 版本更新

NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

NestedKV: 用于长上下文KV缓存压缩的嵌套内存路由

Hong Chen, Xiang Liu, Yubo Gao, Yuxuan Fan, Bo Wang, Yuanlin Chu, Yuanguo Lin, Xuming Hu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) Jimei University(集美大学)

AI总结 提出NestedKV方法,通过多时间尺度余弦异常评分和头自适应混合的免训练键缓存压缩,在长上下文任务中优于现有方法。

详情
AI中文摘要

长上下文语言模型受限于键值(KV)缓存的内存占用。现有的免训练KV压缩方法通常通过单一重要性信号(注意力、近期性、逐层分配或键独特性)对令牌进行排序,这在有用上下文具有全局独特性、局部片段性或即时相关性时变得脆弱。我们引入NestedKV,一种受嵌套学习中连续内存系统启发的仅键KV缓存压缩方法。NestedKV维护全局、块级和滑动窗口键锚点,通过多时间尺度余弦异常对令牌评分,并将所得排名与使用头自适应混合和惊喜门控令牌路由的免训练外部学习器结合。该评分与自适应每头预算配对,无需训练或修改LLM。在RULER(4k--32k)、LooGLE、LongBench、LongBench-E、InfiniteBench和MMLU-Pro上,使用Qwen3和Llama-3.2模型,NestedKV在保留缓存较小时表现最强。在Qwen3-4B上,当r=0.75时,它在RULER上比KeyDiff提升高达19.10分,在LongBench上提升19.29分;当r=0.95时,它在LongBench上保留37.32分,而KeyDiff为17.55分。

英文摘要

Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation, or key distinctiveness -- which becomes brittle when useful context is globally distinctive, locally episodic, or immediately relevant. We introduce NestedKV, a key-only KV cache compression method inspired by the Continuum Memory System in Nested Learning. NestedKV maintains global, block-level, and sliding-window key anchors, scores tokens by multi-time-scale cosine anomaly, and combines the resulting rankings with a training-free outer learner using head-adaptive mixing and surprise-gated token routing. The score is paired with adaptive per-head budgets and requires no training or LLM modification. Across RULER (4k--32k), LooGLE, LongBench, LongBench-E, InfiniteBench, and MMLU-Pro on Qwen3 and Llama-3.2 models, NestedKV is strongest when the retained cache is small. On Qwen3-4B, it improves over KeyDiff by up to 19.10 points on RULER and 19.29 on LongBench at $r=0.75$; at $r=0.95$, it retains 37.32 on LongBench versus 17.55 for KeyDiff.

2605.26670 2026-05-27 cs.CL cs.AI 版本更新

The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models

迷宫与线索:重新思考大语言模型顺序知识编辑中的正则化方法

Zheng Wang, Kaixuan Zhang, Wanfang Chen, Jingwen Zhang, Xiaonan Lu

发表机构 * Bosch Center for Artificial Intelligence (BCAI)(博世人工智能中心(BCAI)) Bosch (China) Investment Ltd.(博世(中国)投资有限公司) School of Statistics, East China Normal University(东华大学统计学院)

AI总结 本文通过优化分析证明顺序编辑与一次性编辑的等价性,揭示稳定性源于累积编辑约束而非专门正则化,从而简化大语言模型知识编辑流程。

Comments Accepted for publication at ICML 2026

详情
AI中文摘要

大语言模型中结构化知识的顺序编辑允许在不重新训练的情况下进行有针对性的事实更新,但现有方法通常依赖于复杂的正则化或约束机制,其必要性尚不明确。在这项工作中,我们系统地研究了有效且稳定的顺序编辑背后的机制。具体来说,我们首先分析了AlphaEdit的经验成功,并通过严格的优化分析建立了一次性编辑与顺序编辑之间的形式等价性。基于这一见解,我们将等价性推广到更广泛的编辑目标类别,证明稳定性自然源于正确处理累积的编辑约束,而非专门的正则化或零空间操作。我们通过实验证实,许多常用的正则化策略对于可靠的顺序更新并非必要。此外,我们将我们的框架扩展到处理冲突编辑,确保在矛盾更新下具有鲁棒且一致的行为。最终,我们的工作为顺序编辑的迷宫提供了阿里阿德涅的线索,为更简单、更可解释且可靠的知识更新指明了道路。我们的代码可在https://github.com/Wangzzzzzzzz/OTE-SE-Alignment获取。

英文摘要

Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing methods often rely on complex regularization or constraint mechanisms whose necessity remains unclear. In this work, we systematically investigate the mechanisms underlying effective and stable sequential editing. Specifically, we first analyze the empirical success of AlphaEdit and establish, via a rigorous optimization analysis, the formal equivalence between one-time and sequential editing. Building on this insight, we generalize the equivalence to a broader class of editing objectives, demonstrating that stability emerges naturally from properly accounting for accumulated editing constraints, rather than from specialized regularization or null-space operations. We empirically confirm that many commonly used regularization strategies are unnecessary for reliable sequential updates. Furthermore, we extend our framework to handle conflicting edits, ensuring robust and consistent behavior under contradictory updates. Ultimately, our work provides Ariadne's thread through the labyrinth of sequential editing, charting a path toward simpler, more interpretable, and dependable knowledge updates. Our code is available at https://github.com/Wangzzzzzzzz/OTE-SE-Alignment.

2605.26663 2026-05-27 cs.CL cs.IR cs.SE 版本更新

Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification

证据缺失并非证据不足:事实核查中NEI构建伪影的诊断

Jingxi Qiu, Zeyu Han, Cheng Huang

发表机构 * ZenWeave AI Georgetown University(乔治城大学)

AI总结 针对事实核查中“信息不足”(NEI)标签因不同证据条件构建而产生的伪影,提出NEI-CAP诊断协议,通过审计捷径线索、人工裁决和跨构建迁移测试,揭示模型在NEI任务上的能力不可靠迁移及聚合分数的隐藏问题。

Comments Preprint. Under review. 20 pages, 2 figures

详情
AI中文摘要

证据缺失并非证据不足,但事实核查基准可能使它们在观测上相似。“信息不足”(NEI)标签通常通过不同的证据条件来操作化,而这种选择悄然决定了验证器学习的内容以及其分数可能隐藏的信息。我们引入了NEI-CAP,一种针对不足证据评估的构建感知诊断协议。每个NEI示例都带有产生它的构建家族;NEI-CAP审计捷径线索,通过人工裁决验证困难案例,并测试能力是否跨构建迁移。我们在SciFact风格的科学验证中实例化该协议,以FEVER和HoVer作为有界外部控制。在这些设置中,NEI能力不能可靠迁移:在捷径倾向构建上训练的模型无法识别语义相关的不充分证据,而混合构建训练缩小了差距但并未消除。固定主张的诊断进一步表明,证据条件改变了参考支持/反驳标签的置信度,而不仅仅是NEI召回率,因此聚合的NEI分数可能隐藏模型实际解决了哪个问题。

英文摘要

Evidence absence is not evidence insufficiency, but fact verification benchmarks can make them observationally similar. The Not Enough Information (NEI) label is often operationalized through different evidence conditions, and that choice silently determines what a verifier learns and what its score can hide. We introduce NEI-CAP, a construction-aware diagnostic protocol for insufficient-evidence evaluation. Each NEI example carries the construction family that produced it; NEI-CAP audits shortcut cues, validates hard cases through human adjudication, and tests whether competence transfers across constructions. We instantiate the protocol in SciFact-style scientific verification, with FEVER and HoVer as bounded external controls. Across these settings, NEI competence does not transfer reliably: models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence, and mixed-construction training narrows but does not close the gap. Fixed-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved.

2605.26662 2026-05-27 cs.CL cs.AI econ.GN q-fin.EC 版本更新

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

AI评估可能扭曲认知:语境在解读学术写作中的重要性

Shang Wu, Randol Yao

发表机构 * UC Irvine(加州大学欧文分校) MIT(麻省理工学院)

AI总结 本文通过构建AI相似度基准,发现忽略国家和领域差异的评估方法会系统性高估或低估某些群体中的AI使用,提出基于具体语境的基准以更准确评估科学写作中的AI使用。

详情
AI中文摘要

本文研究了当评估方法忽略国家和领域的语境差异时,科学写作中AI使用估计可能产生的偏差。利用Dimensions中期刊论文的大规模数据,我们基于人类撰写和LLM重写的摘要之间的差异构建了AI相似度基准。我们表明,合并基准可能混淆已有的风格差异与AI生成的文本,即使在LLM之前的出版物中也会在跨国家-领域组中产生显著扭曲。相比之下,特定国家-领域的基准减轻了这种扭曲,并提供了更可信的比较基线。将这些方法应用于2025年的出版物,结果显示合并基准系统性高估了某些国家和领域的AI使用,同时低估了其他国家和领域的AI使用。这些发现强调了语境感知测量对于准确和公平评估科学中AI使用的重要性。

英文摘要

This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we construct AI-likeness benchmarks based on differences between human-written and LLM-rephrased abstracts. We show that a pooled benchmark may confound pre-existing stylistic variation with AI-generated text, producing substantial distortions across country-field groups even in pre-LLM publications. In contrast, country-field-specific benchmarks attenuate such distortions and provide a more credible baseline for comparison. Applying these methods to publications in 2025 reveals that the pooled benchmark systematically overestimates AI use in certain countries and fields while underestimating it in others. These findings highlight the importance of context-aware measurement for accurate and equitable evaluation of AI use in science.

2605.26655 2026-05-27 cs.CL cs.LG cs.NE 版本更新

Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

为什么提示优化有效,以及为什么有时无效:一种因果启发的编辑级分析

Shuzhi Gong, Hechuan Wen

发表机构 * The University of Melbourne(墨尔本大学) The University of Queensland(昆士兰大学)

AI总结 本文通过因果推断方法分析自动提示优化在不同任务和模型上的泛化失败原因,发现编辑类型与任务特性之间的系统性交互作用。

Comments 17 pages, 4 figures, 8 tables

详情
AI中文摘要

自动提示优化方法(例如 DSpy、TextGrad)可以显著提升大语言模型(LLM)的性能,然而,它们在不同任务上的泛化能力仍然不足。在实践中,优化后的提示在一个基准上的优势往往无法迁移到另一个基准,即使切换不同的 LLM 骨干网络,这种局限性依然存在。为了探究提示性能中未被充分探索的异质性来源,我们对跨多种优化框架、LLM 骨干网络和 NLP 基准的优化提示进行了因果推断启发的观察性分析。为此,我们基于倾向调整的关联分析以及提示编辑的多种互补表示,识别出一致的任务条件编辑模式。我们发现,增加复杂性和元指令的编辑与数学和多跳推理性能呈负相关,而逐步和元认知的编辑则改善了逻辑和顺序推理任务。这些效应在认知负荷标注、表面文本特征和编辑主题分析中均具有鲁棒性,并且可以跨优化框架泛化。总体而言,这些结果表明,提示优化失败源于编辑族与任务特性之间的系统性交互,而非随机的优化伪影,从而提供了优化器行为的特征级表征,并激励了未来任务条件优化器的设计。

英文摘要

Automated prompt optimization methods (e.g., DSpy, TextGrad) can substantially improve the performance of large language model (LLM), however, their generalization ability across different tasks remains underperformed. In practice, the superiority of the optimized prompt on one benchmark often fails to transfer to another, and this limitation persists even when switching across different LLM backbones. To investigate the underexplored sources of heterogeneity in prompt performance, we conduct a causal inference-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks. To achieve the goal, we build upon the propensity-adjusted associational analysis together with multiple complementary representations of prompt edits, where the consistent task-conditioned edits patterns are identified. We find that complexity-increasing and meta-instructional edits are negatively associated with mathematical and multi-hop reasoning performance, whereas step-by-step and meta-cognitive edits improve logical and sequential reasoning tasks. These effects are robust across cognitive-load annotations, surface-level text features, and edit-motif analyses, and can generalize across optimization frameworks. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random optimization artifacts, providing feature-level characterization of optimizer behavior and motivating future task-conditioned optimizer design.

2605.26646 2026-05-27 cs.AI cs.CL cs.MA 版本更新

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

UnityMAS-O: 基于LLM的多智能体系统的通用强化学习优化框架

Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China(中国人民大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出UnityMAS-O框架,将多智能体工作流作为优化单元,通过逻辑角色、图轨迹、用户定义奖励和智能体-模型映射四个核心对象解耦逻辑与物理参数,支持灵活的参数共享和奖励分配,在检索增强问答、迭代搜索和反思代码生成任务上验证了多智能体RL对手动工作流的提升效果。

详情
AI中文摘要

基于LLM的多智能体系统将复杂任务分解为交互角色,但大多数仍通过提示、工具和控制规则手动编排,智能体很少通过统一的强化学习接口进行优化。现有的RL后训练框架主要针对单策略优化,缺乏对用户定义的多智能体工作流、结构化交互、角色特定信用分配和可配置参数共享的抽象。我们提出了UnityMAS-O,一个用于基于LLM的多智能体系统的通用RL优化框架。UnityMAS-O将完整工作流视为优化单元,而非单个响应或策略轨迹。它通过四个核心对象表示工作流:逻辑智能体角色、图轨迹、用户定义奖励和智能体-模型映射。这将逻辑智能体与物理模型参数解耦,支持完全共享、完全分离和部分共享,奖励在角色、轮次和轨迹级别分配。UnityMAS-O通过基于Ray的星形拓扑运行时扩展了verl。中央控制器执行工作流、调用工具、记录结构化轨迹并组装奖励;模型本地工作器组负责轨迹生成、缓冲、优势计算和分布式PPO风格更新。用户可以定义智能体、工作流、模型映射和奖励,而无需重写优化基础设施。我们在检索增强问答、迭代智能体搜索和反思代码生成上实例化了UnityMAS-O。在Natural Questions、HotpotQA和保留代码任务上,多智能体RL在优化后改进了手动指定的工作流,对于较小模型和严格代码全通过指标尤其有较大提升。这些结果表明,UnityMAS-O可以作为可复用基础,将多样化的基于LLM的多智能体工作流转化为可训练的多智能体RL系统。

英文摘要

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.

2605.26645 2026-05-27 cs.CL 版本更新

Bounded Path Context: A Controlled Study of Visible Path History in LLM-Based Knowledge Graph Question Answering

有界路径上下文:基于LLM的知识图谱问答中可见路径历史的受控研究

Xihang Shan, Ye Luo

发表机构 * School of Mathematical Sciences(数学科学学院) Xiamen University(厦门大学) School of Informatics(信息学院)

AI总结 本文提出有界路径上下文(BPC)方法,通过将完整路径存储在符号记忆中,仅暴露最近K跳路径给关系选择提示,在WebQSP和CWQ数据集上使用Qwen3.5-9B-AWQ模型达到或超过全历史提示的性能,同时减少输入token。

Comments 13 pages, 1 figure, submitted to EMNLP 2026

详情
AI中文摘要

基于LLM的知识图谱问答(KGQA)将图遍历委托给语言模型,将每个问题转化为一系列局部关系选择决策,这些决策在多个波束和跳数上重复。一个常见但未经测试的默认做法是将完整的部分路径序列化到每个路由提示中,尽管控制器已经以精确符号状态维护了该路径。有界路径上下文(BPC)解耦了这两个角色:控制器在符号记忆中保留完整路径用于答案提取和审计,而关系选择提示仅暴露问题、当前实体、候选关系以及最多最后K跳。对K进行受控扫描——固定图邻域、波束预算、深度、解码和答案提取格式——表明,在完整的WebQSP和CWQ测试集上,使用Qwen3.5-9B-AWQ模型,有界历史匹配或超过全历史提示:K=1在WebQSP上达到0.487的答案集F1,而全历史为0.472;K=0在CWQ上达到0.287,而全历史为0.274,同时输入token分别减少9.7%和12.1%。在4B规模下,K=1在两个基准上仍然是最强设置。逐示例分析显示,71-84%的示例不受历史长度影响,而受影响的案例揭示了先前跳数何时起到消歧作用或分散注意力。这些结果表明,路径序列化长度更适合作为可调接口变量,而不是基于LLM的图控制器中的默认假设。

英文摘要

LLM-based knowledge-graph question answering (KGQA) delegates graph traversal to language models, turning each question into a sequence of local relation-selection decisions repeated across beams and hops. A common but untested default is to serialize the complete partial path into every routing prompt, even though the controller already maintains this path as exact symbolic state. Bounded Path Context (BPC) decouples these two roles: the controller retains full paths in symbolic memory for answer extraction and audit, while the relation-selection prompt exposes only the question, the current entity, outgoing relation candidates, and at most the last K hops. A controlled sweep over K -- fixing graph neighborhoods, beam budget, depth, decoding, and answer-extraction format -- shows that bounded histories match or exceed full-history prompting on complete WebQSP and CWQ test sets with Qwen3.5-9B-AWQ: K=1 achieves 0.487 answer-set F1 on WebQSP versus 0.472 for full history, and K=0 reaches 0.287 on CWQ versus 0.274, with 9.7% and 12.1% fewer input tokens respectively. At the 4B scale, K=1 remains the strongest setting on both benchmarks. Per-example analysis reveals that 71-84% of examples are unaffected by history length, while the affected cases expose when prior hops disambiguate versus distract. These results suggest that path serialization length is better treated as a tunable interface variable than as a default assumption in LLM-based graph controllers.

2605.26620 2026-05-27 cs.CL cs.HC 版本更新

Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering

Granuscore: 一种用于文本分析和问答的无参考粒度度量

Lukas Ellinger, Alexander Fichtl, Miriam Anschütz, Georg Groh

发表机构 * School for Computation, Information and Technology(计算、信息与技术学院)

AI总结 提出无参考粒度度量Granuscore,利用层次嵌入空间结构可靠恢复粒度顺序并解释句子特异性变化,应用于问答基准分析模型行为差异。

详情
AI中文摘要

自然语言以不同粒度水平传达信息,从细粒度指代到宽泛描述。虽然粒度对人类交流至关重要,但现有度量主要捕捉表面细节或句子特异性。我们引入Granuscore,一种无参考的粒度度量,利用层次嵌入空间的结构特性。Granuscore在Granola-EQ数据集上可靠恢复层次顺序,并捕捉跨话语语境的预期粒度差异。跨领域,我们进一步展示Granuscore解释了超出句子长度的句子特异性非线性变化。最后,我们将Granuscore应用于四个问答基准,分析问题、黄金答案和模型输出在不同响应结果中的粒度差异。分析揭示了模型行为的一致差异,并为表征QA数据集的难度提供了原则性视角。综合来看,结果将Granuscore定位为一种可扩展、广泛适用的文本粒度分析工具。

英文摘要

Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.

2605.26612 2026-05-27 cs.CL 版本更新

LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation

LATTE: 预测同伴锚定偏好轨迹以实现个性化LLM生成

Jinze Li, Xiaoyan Yang, Shuo Yang, Jinfeng Xu, Yue Shen, Jian Wang, Jinjie Gu, Edith Cheuk-Han Ngai

发表机构 * The University of Hong Kong(香港大学) Ant Healthcare, Ant Group(蚂蚁集团医疗科技部)

AI总结 提出LATTE框架,通过预测同伴锚定的相对偏好状态来个性化冻结的大语言模型,在Amazon Reviews 2023和MemoryCD上优于检索、摘要记忆和静态潜在轮廓等方法。

Comments Under review

详情
AI中文摘要

使用冻结的大语言模型进行个性化生成需要既紧凑又即时的条件信号。现有的个性化方法通常检索或总结用户历史文本,或将其压缩为静态潜在轮廓和软提示。这些方法高效,但将用户过去行为视为聚合轮廓,因此将稳定身份、近期漂移和物品内容混合在同一表示中。我们提出潜在轨迹跟踪与外推(LATTE),一个将个性化表示为预测同伴锚定相对偏好状态的框架。对于每个历史会话,LATTE减去由对同一物品做出反应的可比用户形成的时间掩蔽基线,产生一个衡量目标用户在共享物品背景下与同伴差异的状态。然后,一个轻量级序列预测器预测该轨迹中的下一个状态,状态到令牌桥通过单个锚定软令牌将预测注入冻结的指令调优LLM。我们提供潜在因子分析,展示同伴锚定何时抵消共享物品变化,以及为什么时间预测在陈旧平均值和噪声近期状态之间权衡。在Amazon Reviews 2023和MemoryCD上的实验表明,LATTE始终优于检索、摘要记忆、静态潜在轮廓、差异感知潜在轮廓和软提示压缩基线。在Amazon Reviews 2023上,LATTE将平均ROUGE-L从静态潜在轮廓的0.219和最强附加潜在压缩基线的0.245提高到0.259。额外的成对比较和诊断分析表明,改进主要归因于预测用户特定轨迹信息,而不仅仅是添加软提示接口。

英文摘要

Personalized generation with frozen large language models requires a conditioning signal that is both compact and current. Existing personalization methods typically retrieve or summarize user histories in text, or compress them into static latent profiles and soft prompts. These approaches are efficient, but they treat a user's past behavior as an aggregate profile and therefore mix stable identity, recent drift, and item content in the same representation. We propose LAtent Trajectory Tracking and Extrapolation (LATTE), a framework that represents personalization as forecasting a peer anchored relative preference state. For each historical session, LATTE subtracts a time masked baseline formed from comparable users who responded to the same item, producing a state that measures how the target user differs from peers under a shared item context. A lightweight sequence predictor then forecasts the next state in this trajectory, and a State to Token Bridge injects the forecast into a frozen instruction tuned LLM through a single anchored soft token. We provide a latent factor analysis showing when peer anchoring cancels shared item variation and why temporal forecasting trades off stale averages against noisy recent states. Experiments on Amazon Reviews 2023 and MemoryCD show that LATTE consistently outperforms retrieval, summary memory, static latent profiles, difference aware latent profiles, and soft prompt compression baselines. On Amazon Reviews 2023, LATTE improves average ROUGE-L from 0.219 for a static latent profile and 0.245 for the strongest added latent compression baseline to 0.259. Additional pairwise comparisons and diagnostic analyses suggest that the improvement is mainly due to forecasting user-specific trajectory information, rather than merely adding a soft prompt interface.

2605.26575 2026-05-27 cs.CL 版本更新

Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models

中心性而非各向异性驱动多语言嵌入模型中的跨语言检索不对称性

Adib Sakhawat, Fardeen Sadab, Atik Shahriar

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 本文通过实验证明,在多语言嵌入模型中,中心性(hubness)是导致跨语言检索不对称性的主要几何病理因素,而非各向异性、质心漂移或向量幅度,并推荐使用CSLS替代余弦相似度作为默认检索度量。

Comments 17 pages, 5 figures

详情
AI中文摘要

多语言嵌入模型在部署时假设跨语言检索是对称的:如果语言A的查询检索到语言B中的翻译,反之亦然。但实际上并非如此。我们使用包含英语、孟加拉语、印地语和阿拉伯语的6,518个习语和谚语表达的平行语料库,通过五个生产级编码器(Gemini、Mistral、OpenAI-L、OpenAI-S、Qwen)进行嵌入,将这种失败形式化为互近邻互惠性的缺陷,并测试一个单一的机制性主张:在多语言空间的几何病理中,中心性(hubness),而非各向异性、质心漂移或幅度,是主要的因果驱动因素。在五个预先注册的实验中,预先指定了证伪条件,中心质量在互惠性的联合回归中占主导地位(49.5%的主导份额,是下一个预测因子的1.68倍;偏R²=0.302,而各向异性为0.003),而中心性感知的分数校正(CSLS)缩小了最差到最佳互惠性差距的63.5%,并产生平均模型内效应量,是外科中心向量消融的130倍。后一对比指出了机制:中心性是相似度度量的病理,而非单个中心向量的病理。我们解决了著名的各向异性-中心性悖论,证明两者在统计上是可分离的,并建议用CSLS替换余弦相似度作为多语言嵌入管道的默认检索度量。

英文摘要

Multilingual embedding models are deployed under the assumption that cross-lingual retrieval is symmetric: if a query in language A retrieves its translation in language B, the reverse should also hold. In practice it does not. Using a parallel corpus of 6,518 idiomatic and proverbial expressions in English, Bangla, Hindi, and Arabic, embedded by five production-grade encoders (Gemini, Mistral, OpenAI-L, OpenAI-S, Qwen), we formalise this failure as a deficit in mutual nearest-neighbour reciprocity and test a single mechanistic claim: among the geometric pathologies of multilingual spaces, hubness, not anisotropy, centroid drift, or magnitude, is the dominant causal driver. Across five pre-registered experiments with falsification conditions specified in advance, hub mass dominates a joint regression on reciprocity (49.5% dominance share, 1.68x the next predictor; partial R^2 = 0.302 versus 0.003 for anisotropy), while a hub-aware score correction (CSLS) closes 63.5% of the worst-to-best reciprocity gap and yields a mean within-model effect size 130x larger than surgical hub-vector ablation. The latter contrast pinpoints the mechanism: hubness is a pathology of the similarity metric, not of individual hub vectors. We resolve the well-known anisotropy-hubness paradox by showing the two are statistically dissociable, and we recommend replacing cosine similarity with CSLS as the default retrieval metric for multilingual embedding pipelines.

2605.26560 2026-05-27 cs.CL cs.AI 版本更新

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

可靠提取临床随访指令:一种混合神经符号管道

Michal Laufer, Yehudit Aperstein, Alexander Apartsin

发表机构 * Bar-Ilan University(巴伊兰大学) Afeka College of Engineering(阿菲卡工程学院) Holon Institute of Technology(霍洛恩理工学院)

AI总结 提出混合神经符号管道,结合BioBERT实体提取和确定性日期算术,在合成门诊笔记上实现接近完美的(动作, 日期)对提取F1分数,优于直接生成方法。

Comments 17 pages, 5 figures

详情
AI中文摘要

目标。门诊笔记携带随访指令,将动作与未来时间配对(“两周内进行脑部MRI”)。提取(动作,日期)对支持调度和审计,但生成式提取器会错过日期,因为链接和算术在解码中是隐式的。我们测试了一种混合神经符号管道与直接生成方法的对比。方法。我们定义了TestSpecification和TimeSpecification实体以及ScheduledFor关系。BioBERT提供BIO标注和双仿射链接器;实体通过28动作本体规范化,时间通过确定性方式归一化为天数偏移。我们在一个包含2000份笔记的合成门诊语料库上评估,采用动作不相交划分(18个训练,6个OOV测试),与零样本GPT-4o-mini和LoRA微调LLaMA-3 8B对比,使用笔记级bootstrap 95%置信区间。结果。在259份笔记的已知和OOV划分上,混合管道实现了测试时间对F1分别为0.997和0.986,MAE为0.00天。基线达到了高动作F1(LLaMA-3 0.992;GPT-4o-mini 0.963已知),但对F1保持在0.51-0.57(LLaMA-3)和0.53(GPT-4o-mini),置信区间与混合管道不重叠。结论。将学习的实体提取与确定性日期算术分离在此基准上优于生成方法,泛化到未见动作,并暴露了失败模式。迁移到真实电子健康记录笔记是下一步验证;初步的现实性检查见局限性。

英文摘要

Objective. Outpatient notes carry follow-up instructions pairing actions with future times ("MRI brain in two weeks"). Extracting (action, date) pairs supports scheduling and audit, but generative extractors miss the date because linking and arithmetic are implicit in decoding. We test a hybrid neural-symbolic pipeline against direct generation. Methods. We define TestSpecification and TimeSpecification entities and a ScheduledFor relation. BioBERT feeds BIO tagging and a biaffine linker; entities are canonicalized via a 28-action ontology and times normalized to day offsets deterministically. We evaluate on a 2,000-note synthetic outpatient corpus with action-disjoint splits (18 train, 6 OOV-test) against zero-shot GPT-4o-mini and LoRA-fine-tuned LLaMA-3 8B with note-level bootstrap 95% CIs. Results. On 259-note seen and OOV splits the hybrid pipeline achieves Test-Time Pair F1 of 0.997 and 0.986 with 0.00-day MAE. Baselines reach high action F1 (LLaMA-3 0.992; GPT-4o-mini 0.963 seen) but Pair F1 stays at 0.51-0.57 (LLaMA-3) and 0.53 (GPT-4o-mini), CIs non-overlapping with the hybrid. Conclusion. Separating learned entity extraction from deterministic date arithmetic outperforms generation on this benchmark, generalizes to held-out actions, and exposes failure modes. Transfer to real EHR notes is the next validation; a first-pass realism check is in Limitations.

2605.26537 2026-05-27 cs.CL 版本更新

Conceptual Steganography

概念隐写术

Zhejian Zhou, Jonathan May

发表机构 * University of Southern California Information Sciences Institute(南加州大学信息科学研究所)

AI总结 提出概念隐写术,通过思维链中的高级推理行为模式而非词汇选择嵌入信息,实验表明其对释义防御比标准关键词方法更鲁棒,且不影响推理效用。

详情
AI中文摘要

语言模型(LM)发出的思维链(CoT)驱动了其大部分能力。然而,承载有用推理的同一序列也可以隐蔽地传递信息:一个未对齐的模型可能在其CoT中嵌入隐蔽信息,从而逃避人类监督,这种隐写术形式被称为编码推理。先前的LM隐写术方案在词元或词汇空间操作,而内容保留释义器是近期工作中规范且有效的防御手段。我们引入了概念隐写术,其中CoT的每一步通过高级推理行为模式而非词汇选择来携带信息。在四个模型家族和两个推理领域中,这种后门通信渠道被证明比标准关键词方法对强释义防御具有更一致的鲁棒性,并且将信息编码到CoT中不会影响其在推理过程中的效用。在提高对这一新风险的认识后,我们进一步证明,一个策略感知的释义器可以关闭大部分渠道,强调了确保野外忠实LLM推理的新挑战和推荐防御措施。

英文摘要

Language Models (LMs) emit Chains-of-Thought (CoTs) that drive much of their capability. However, the same sequence that carries useful reasoning can also covertly convey messages: a misaligned model may embed covert information in its CoT that slips through human supervision, a form of steganography known as encoded reasoning. Prior LM steganography schemes operate in the token or lexical space, and a content-preserving paraphraser is the canonical and effective defense in recent work. We introduce conceptual steganography, in which each step of a CoT carries information through patterns of high-level reasoning behavior, rather than through lexical choice. Across four model families and two reasoning domains, this backdoor communication channel is shown to be consistently more robust to a strong paraphrase defense than standard keyword approaches, and the encoding of information into CoTs does not affect their utility in the reasoning process. Having raised awareness of this new risk, we then demonstrate that a strategy-aware paraphraser can close much of the channel, highlighting new challenges and recommended defenses for ensuring faithful LLM reasoning in the wild.

2605.26533 2026-05-27 cs.CV cs.AI cs.CL cs.LG 版本更新

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

一种用于工业检测中自动缺陷推理与报告生成的混合视觉-语言架构

Malikussaid, Imad Gohar

发表机构 * School of Computing, Telkom University(Telkom大学计算机学院) Faculty of Engineering and Technology, School of Computing and Artificial Intelligence(工程与技术学院,计算与人工智能学院)

AI总结 本文提出一种解耦的边缘可部署管道,结合YOLO26-x-obb检测器、确定性编码模块和QLoRA微调的Qwen-2.5-1.5B模型,实现风电叶片缺陷定位与结构化报告生成,在BLEU-4、幻觉率和专家评分上显著优于零样本VLM基线。

Comments 23 pages, 6 figures, 9 equations, and 6 tables

详情
AI中文摘要

自动化工业检测需要精确的缺陷定位和结构化的维护报告生成;在当前的实践中,这些任务被分开处理,语言解释留给人类专家。本文描述了一种解耦的、边缘可部署的风电叶片检测管道,由三个组件组成,每个组件处理一个不同的子任务。“眼睛”是一个YOLO26-x-obb定向边界框检测器,在数据集原生分辨率下定位缺陷。“桥梁”是一个确定性的、无参数的编码模块,将每个检测到的边界框映射到嵌入结构化提示中的网格参考空间令牌。“大脑”是一个4比特量化的Qwen-2.5-1.5B模型,通过量化低秩适应(QLoRA)在947个合成生成的维护报告上进行适配,从该提示生成结构化的JSON报告。检索增强微调(RAFT)进一步将每个建议基于索引的维护程序。五项消融实验,通过BLEU-4、ROUGE-L、幻觉率(HR)和LLM-as-a-Judge评分标准,将该管道与单一视觉-语言模型(VLM)基线以及移除一个组件的部分配置进行比较。完整系统实现了BLEU-4 0.41、HR=4%和专家评分8.6/10,而零样本VLM基线分别为0.07、65%和3.3/10。在相同的检测证据下,QLoRA适配的1.5B模型在单个T4级GPU上以每秒47个令牌的速度生成比671B参数通用API模型更高质量的报告。结果表明,具有小型领域特定训练语料库的专用解耦架构在此结构化生成任务上优于通用端到端模型。

英文摘要

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes a YOLO26-x-obb oriented bounding-box detector localizes defects at dataset-native resolution. The Bridge a deterministic, parameter-free encoding module maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 0.41, HR=4%, and Expert Score = 8.6/10 compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.

2605.26498 2026-05-27 cs.CL 版本更新

Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation

Verilog-Evolve:反馈驱动与技能演进的Verilog生成

Zehua Pei, Hui-Ling Zhen, Yu Zhang, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 提出Verilog-Evolve框架,通过反馈驱动(功能仿真、Yosys综合、ABC时序代理等)迭代优化Verilog代码,并利用跨会话技能演进提升生成质量,实验表明在VerilogEval和混合精度GEMM任务上提高了功能成功率和下游友好性。

详情
AI中文摘要

大型语言模型(LLMs)改进了从自然语言规范生成Verilog的过程,但大多数流水线仍将生成视为孤立的采样后功能检查。这对于实际的RTL设计是不够的,因为有用的Verilog必须正确、可综合、考虑时序,并对下游硬件目标友好。我们提出Verilog-Evolve,一个用于版本化Verilog细化和跨会话技能演进的反馈驱动框架。对于每个任务,Verilog-Evolve生成多样化的次要候选,通过功能仿真、Yosys综合、ABC时序代理以及可选的GEMM指标的可执行反馈进行评估,然后在可配置评分下将最佳候选提升为主要版本。为了跨任务改进,系统维护模块化技能指导,根据任务和反馈上下文检索技能,并通过创建/改进/跳过决策和验证器报告从记录的历史中演进候选技能。在VerilogEval和混合精度GEMM任务上的实验表明,Verilog-Evolve提高了最终功能成功和晋升稳定性,同时在开源综合、时序代理和网表级GEMM目标下生成更下游友好的RTL。验证门控的技能演进进一步提高了GEMM下游质量,并在评估的技能模式中实现了最佳下游分数和GEMM保留通过率。

英文摘要

Large language models (LLMs) have improved Verilog generation from natural-language specifications, but most pipelines still treat generation as isolated sampling followed by functional checking. This is insufficient for practical RTL design, where useful Verilog must be correct, synthesizable, timing-conscious, and friendly to downstream hardware objectives. We present Verilog-Evolve, a feedback-driven framework for versioned Verilog refinement and cross-session skill evolution. For each task, Verilog-Evolve generates diverse minor candidates, evaluates them with executable feedback from functional simulation, Yosys synthesis, ABC timing proxy, and optional GEMM metrics, then promotes the best candidate into a major version under configurable scoring. To improve across tasks, the system maintains modular skill guidance, retrieves skills according to task and feedback context, and evolves candidate skills from logged histories through create/improve/skip decisions and verifier reports. Experiments on VerilogEval and mixed-precision GEMM tasks show that Verilog-Evolve improves final functional success and promotion stability while producing more downstream-friendly RTL under open-source synthesis, timing-proxy, and netlist-level GEMM objectives. Validation-gated skill evolution further improves GEMM downstream quality and achieves the best downstream score and GEMM held-out pass rate among the evaluated skill modes.

2605.26494 2026-05-27 cs.AI cs.CL cs.LG 版本更新

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax-M2系列:小激活释放最大现实智能

MiniMax, :, Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang, Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang, Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang, Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang, Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang, Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang, Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu, Jun Tao, Jun Zhang, Junbin Ruan, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kang Xu, Ke Ji, Ke Yang, Kecheng Xiao, Keyu Duan, Keyu Li, Le Han, Letian Ruan, Li Yuan, Lianfei Yu, Liheng Feng, Lijie Mo, Lin Li, Lingye Bao, Lingyu Yang, Lingyuan Zhou, Loki, Lu Chen, Lunbin Ceng, Ming Li, Ming Zhong, Mingliang Tao, Mingyuan Chi, Mujie Lin, Nan Hu, Ningxin Chen, Peiyin Zhu, Peng Gao, Pengcheng Gao, Pengfei Li, Penglin Li, Pengyu Zhao, Qibin Ren, Qidi Xu, Qihan Ren, Qile Li, Qin Wang, Quanliang Chen, Qunhong Ceng, Rong Tian, Rui Dong, Ruitao Leng, Ruize Zhang, Shanqi Liu, Shaoyu Chen, Sheng Jia, Shun Yao, Shuoran Zhao, Shuqi Yu, Sichen Li, Sicheng Pan, Songquan Zhu, Tengfei Li, Tian Xie, Tiancheng Qin, Tianrun Liang, Wei Liu, Weiqi Xu, Weitao Li, Weixiang Chen, Weiyu Cheng, Weiyu Zhang, Wenhu Chen, Wenqian Zhao, Xiancai Chen, Xiangjun Song, Xiangyuan Wang, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xiaojie Wu, Xihao Song, Xingyi Han, Xinyu Guan, Xuan Lu, Xun Zou, Xunhao Lai, Xutong Li, Yan Gong, Yang Wang, Yang Xu, Yangsen Wang, Ye Tang, Yicheng Chen, Yinran Qiu, Yiqi Shi, Yiting Guo, Yiwen Huang, Yixuan Wang, Yongyi Hu, Yu Gao, Yu Zhang, Yuanxiang Ying, Yuanzhen Zhang, Yubo Wang, Yuchen Song, Yufeng Yang, Yuhang Meng, Yuhang Miao, Yuhao Li, Yujie Liu, Yulin Hu, Yunan Huang, Yunji Li, Yunyi Huang, Yusen Zhang, Yusu Hong, Yutao Xie, Yutong Zhang, Yuwen Liao, Yuxuan Shi, Yuze Wenren, Zebin Li, Zehan Li, Zejian Luo, Zeyu Jin, Zeyuan Sun, Zhanpeng Zhou, Zhaochen Su, Zhendong Li, Zhengmao Zhu, Zhengyuan Peng, Zhenhua Fan, Zhi Zhang, Zhichao Xu, Zhiheng Lv, Zhikang Xu, Zhitao He, Zhiwei He, Zhongyuan Li, Zibo Gao, Zijia Wu, Zijian Song, Zijian Zhou, Zijun Sun, Zishan Huang, Ziying Chen, Ziyue Ge

发表机构 * MiniMax

AI总结 提出MiniMax-M2系列混合专家语言模型,通过小激活参数实现前沿性能,核心包括智能体驱动数据管道、可扩展强化学习系统Forge及自进化检查点M2.7。

Comments Technical Report. 35 pages, 10 figures, 4 tables

详情
AI中文摘要

我们介绍了MiniMax-M2系列,这是一个基于“小激活可以释放最大现实智能”原则构建的混合专家语言模型家族。旗舰版M2总参数量为229.9B,每个token仅激活9.8B参数。M2系列专为智能体部署而端到端设计,包含三个组成部分:(i) 智能体驱动数据管道,生成大规模、可验证的轨迹,涵盖智能体编码和智能体协作,每个轨迹都基于可执行工作空间和与工件对齐的奖励;(ii) Forge,一个可扩展的智能体原生强化学习系统,适应长程智能体轨迹,并配有窗口FIFO调度、前缀树合并、推理优化以及支持白盒和黑盒智能体的干净训练-推理-智能体解耦;(iii) 最新的M2.7检查点向自我进化迈出了早期一步——自主调试训练运行并修改其自身框架。从M2到M2.7,这种组合将小激活足迹转化为智能体编码、深度搜索、办公任务和推理基准上的前沿性能。

英文摘要

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

2605.26492 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

灯塔中的伊莱亚斯,再次?诊断LLM故事中的低多样性

Sil Hamilton, David Mimno

发表机构 * Department of Information Science(信息科学系)

AI总结 研究通过采样20000个故事发现,LLM生成的故事中存在高度重复的词汇(如Elias、灯塔等),这些词汇来自偏好数据而非预训练数据,表明小数据集与强对齐算法的结合可能对多样性产生不成比例的影响。

详情
AI中文摘要

LLM生成的故事是一个流行的用例,但它们显示出非常低的变异性。我们使用五个提示从四个当前模型中采样了总共20,000个故事。我们发现,88.3%的生成故事中出现11个单词,模型之间差异很小。这些单词包括名字(Elias, Mara, Elara)、场景(灯塔)和职业(钟表匠、图书管理员)。这些标记在已发表的文献或预训练数据中并不常见,但在所有当前模型可能使用的偏好数据中却存在。令人惊讶的是,与平均后训练故事相比,这些“灯塔”故事并不常见,后训练故事中很多包含受版权保护的角色或成人内容。这一结果证明了小数据集与强大对齐算法结合可能产生的潜在不成比例影响。

英文摘要

LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian). These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.

2605.26485 2026-05-27 cs.CV cs.CL 版本更新

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

OmniInteract:面向实时全模态助手的流式交互基准测试

Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li

发表机构 * CUHK MMLab(香港中文大学多模态实验室) SJTU(上海交通大学) NTU(国立新加坡大学) McMaster(麦马斯特大学) CityUHK(香港城市大学) JUFE(吉林大学)

AI总结 提出OmniInteract基准,通过在线推理音视频流评估全模态大模型的实时交互能力,发现现有模型在流式交互中表现薄弱。

详情
AI中文摘要

我们引入了OmniInteract,一个用于实时全模态大语言模型的流式基准测试,通过音视频流上的原生在线推理进行评估。与离线视频理解或文本提示的流式问答不同,OmniInteract保留了原始音视频流,并要求模型在线处理,无法访问未来内容。用户查询和环境声音嵌入在音频轨道中,要求模型检测多模态触发信号,决定何时响应,并在流展开时回答问题。OmniInteract包含250个视频,具有1430个时间锚定的响应槽:其中1062个1Q1A槽涵盖实时、主动和嵌套场景,368个1QnA槽用于连续任务监控和步骤指导。每个槽包括触发信号、响应窗口和目标答案。我们使用交互感知质量-时效性F1、中断诊断套件和嵌套链完成分数来评估响应正确性、时序、无效输出、中断处理和上下文连续性。实验表明,当前模型在流式交互中仍然薄弱,最佳整体IA-QTF1仅为0.368,最佳1QnA IA-QTF1仅为0.052。在全双工设置下的数学推理进一步研究表明,离线能力不一定能迁移到在线交互中。代码和数据集将在https://github.com/Lucky-Lance/OmniInteract公开。

英文摘要

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.

2605.26476 2026-05-27 cs.CL cs.IR 版本更新

FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing

FAB-Bench:半导体制造中自适应RAG基准测试框架

Jingbin Qian, Congwen Yi, Min Xia, Wen Wu, Jun Zhu, Jian Guan

AI总结 提出FAB-Bench框架,通过六项诊断指标和三种合成策略,评估半导体制造领域RAG系统在不同上下文窗口下的性能,发现注意力稀释是极端上下文长度下性能下降的主要原因。

详情
AI中文摘要

检索增强生成(RAG)已成为知识密集型应用的关键技术,然而在垂直领域评估其性能仍然困难,原因包括领域复杂性、多样的上下文规模以及对专家评估的严重依赖,而专家评估成本高、不一致且不可扩展。我们提出了FAB-Bench,一个用于半导体制造中RAG系统自适应基准测试的端到端框架。FAB-Bench定义了六项诊断指标,衡量事实准确性、上下文利用率、完整性、检索相关性、技术深度和推理一致性。该框架将检索器诊断与生成器级别的推理分析相结合,覆盖4K-32K token的上下文窗口,量化了随着上下文范围扩展,检索精度和生成保真度如何共同演变。从超过1300个生成的候选对中,我们精选了200个查询-答案对的高质量基准,涵盖三种合成策略:大海捞针、文档内多主题和跨文档多跳。在四个LLM和四个RAG框架上的系统评估揭示了三种不同的上下文缩放行为:对数增长、早期饱和和冷启动动态,并确定注意力稀释是极端上下文长度下性能下降的主要机制。在另外三个生产级RAG系统上的跨框架验证确认了评估的可移植性。

英文摘要

Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K-32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics, and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms evaluation portability.

2605.26463 2026-05-27 cs.CL cs.AI 版本更新

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

迈向无差错的电子健康记录:临床笔记与结构化表格之间的推理密集型一致性验证

Yeonsu Kwon, Jiho Kim, Junseong Choi, Paloma Rabaey, Minseo Kim, Sujeong Im, Jeewon Yang, Jun-Min Lee, Sangji Lee, Jiwon Kim, Hangyul Yoon, Hyunwook Kwon, Edward Choi

发表机构 * KAIST(韩国科学技术院) Ghent University(根特大学) Samsung Medical Center(三星医疗中心) Samsung Changwon Hospital(三星昌原医院) Asan Medical Center(亚山医疗中心)

AI总结 针对电子健康记录中临床笔记与结构化表格数据不一致的问题,提出推理密集型基准EHR-ReasonCon和基于大语言模型的框架EHR-Inspector,通过锚点实体提取、时间引用和表格探索工具实现高效一致性验证。

详情
AI中文摘要

电子健康记录中非结构化临床笔记与结构化表格之间的数据一致性对于患者安全和临床决策至关重要。然而,现有关于笔记-表格一致性验证的工作主要依赖于数值或简单事件的表面匹配。这些方法未能捕捉真实世界EHR文档背后的推理,包括临床解释、事件关系和时间变化。为弥补这一差距,我们引入了EHR-ReasonCon,一个用于笔记-表格一致性验证的推理密集型基准。它基于MIMIC-III构建,并经过专家指导的注释,包含来自临床笔记的8,048个实体,并提供高质量的真实标签。注释协议由专门的表格探索工具支持,以确保系统的证据检索和可靠的一致性评估。我们还提出了EHR-Inspector,一个基于LLM的框架,它分割笔记、提取锚点实体和时间引用,并使用表格探索工具与结构化表格进行一致性验证。在严格和宽松标准下,使用经过专家验证的LLM-as-a-judge指标进行评估,EHR-Inspector在多个模型骨干上实现了最先进的性能。进一步的分析证明了其组件的有效性,并突出了与人工验证的差异。

英文摘要

Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.

2605.26457 2026-05-27 cs.SE cs.AI cs.CL cs.PL 版本更新

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Verus-SpecGym: 用于评估规范自动形式化的智能体环境

Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck

发表机构 * CMU(卡内基梅隆大学) Amazon(亚马逊)

AI总结 提出 Verus-SpecGym 环境与 Verus-SpecBench 基准,通过执行规范机制和对抗性测试评估 LLM 智能体将非正式编程问题转化为形式规范的能力,发现前沿模型可解决 77.8% 的任务但存在遗漏假设等脆弱性。

Comments Preprint

详情
AI中文摘要

AI 编码智能体越来越多地用于编写真实世界的软件,但确保其输出正确性仍然是一个基本挑战。形式化验证提供了一条有希望的路径:智能体生成代码的同时生成机器检查的证明,保证代码满足形式规范。然而,无法保证形式规范本身与用户意图一致。在这项工作中,我们研究规范自动形式化:LLM 智能体能否将非正式编程问题转化为忠实的形式规范。我们引入了 Verus-SpecBench,一个包含 581 个规范编写任务的基准,这些任务源自针对 Rust 验证器 Verus 的 Codeforces 问题,以及 Verus-SpecGym,一个智能体环境,模型在其中与 Verus、bash 和文件系统交互以开发这些规范。核心挑战在于评估:专家编写的参考规范编写成本高昂,LLM 评判者可能遗漏细微错误。我们通过以下方式解决这一问题:(a) 扩展 Verus 的 exec_spec 机制,使生成的规范可以作为 Rust 代码执行;(b) 针对官方 Codeforces 测试和从 Codeforces "hacks"(即竞争对手编写的用于破解不正确解决方案的边缘情况)中提取的对抗性案例进行测试。在 Verus-SpecBench 上,最强的模型 Gemini 3.1 Pro 解决了 77.8% 的任务,其他前沿模型解决了 51.1-57.8%,而开源模型仅达到 21.5-25.5%。我们对失败模式的分析表明,模型生成的规范可能遗漏重要的输入假设、接受不正确的输出以及拒绝有效的输出。我们还发现,LLM 作为评判者的评估遗漏了我们评估者捕获的 26% 的失败。总体而言,我们的结果表明,规范自动形式化对于前沿智能体来说是可行的,但即使在它们已经能够生成正确代码的问题上仍然脆弱。代码、数据和日志可在 https://github.com/formal-verif-is-cool/verus-spec-gym 获取。

英文摘要

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

2605.26454 2026-05-27 cs.CL 版本更新

Model Unlearning Objectives Vary for Distinct Language Functions

模型遗忘目标因语言功能而异

Berk Atil, Vipul Gupta, Rebecca J. Passonneau

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学) Scale AI

AI总结 本文提出针对不同语言功能(危险知识遗忘和毒性遗忘)应设计不同的遗忘方法,并分别提出基于余弦的元学习变体RMU和多层目标方法,在多个7-8B模型上取得良好效果。

详情
AI中文摘要

大型语言模型(LLMs)在预训练过程中会学习到不良特性,包括危险知识和有毒文本生成。正如后训练使用不同的目标来塑造不同的行为,我们认为遗忘方法应针对所涉及的语言功能进行设计。为此,我们考虑了两个机制上不同的遗忘目标:危险知识遗忘和毒性遗忘。对于危险知识,我们引入了基于余弦的元学习变体RMU。对于毒性,我们提出了一种基于层特定探针方向的多层目标。在四个开源7-8B模型上,我们的方法基于两种遗忘类型的独特训练目标取得了强劲结果。总体而言,我们的结果表明,遗忘应作为一个问题族来研究,类似于LLM后训练的多种类型。

英文摘要

Large language models (LLMs) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation. Just as post-training uses different objectives to shape different behaviors, we argue that unlearning methods should be designed for the language function at issue. To study this, we consider two mechanistically distinct unlearning goals, dangerous-knowledge unlearning and toxicity unlearning. For dangerous knowledge, we introduce a cosine-based, meta-learned variant of RMU. For toxicity, we propose a multi-layer objective based on layer-specific probe directions. Across four open-source 7-8B models, our methods achieve strong results, based on distinct training objectives for the two types of unlearning. Overall, our results suggest that unlearning should be studied as a family of problems, analogous to the multiple types of LLM post-training.

2605.26445 2026-05-27 cs.CL 版本更新

Curation and Extraction of Drug-Related Entities from Reddit Platform

从Reddit平台策划和提取药物相关实体

Zewei Wang, Zihan Xu, Yishu Wei, Michael Chary, Yifan Peng

发表机构 * Population Health Sciences, Weill Cornell Medicine, New York City, USA(威立·科恩医学中心人口健康科学系,纽约市,美国) School of Computing and Information Systems, University of Melbourne, Melbourne, Australia(墨尔本大学计算机与信息系统学院,墨尔本,澳大利亚) Emergency Medicine, Weill Cornell Medicine, New York City, USA(急诊医学,威立·科恩医学中心,纽约市,美国)

AI总结 为解决医生对非法药物真实使用情况了解有限的问题,本文构建了ReDose数据集(6435条Reddit帖子),并采用BERT、LLM和RAG模型进行药物、剂量和效果实体提取,其中BiomedBERT在药物实体上F1达0.843,Llama-3 70B优于GPT-4,但效果提取仍具挑战。

Comments Accepted by IEEE International Conference on Healthcare Informatics (ICHI 2026)

详情
AI中文摘要

医生主要通过临床过量案例了解非法药物,这限制了他们对其真实使用情况的理解。与此同时,药物用户在线上分享第一手经验,提供了关于药物剂量和效果的见解。为弥合这一差距,我们引入了ReDose(Reddit药物剂量和效果)数据集,包含6435条关于物质使用的Reddit帖子。一名委员会认证的毒理学家主要注释了训练集和测试集,而两名医学生参与了测试集的注释,标注了药物、剂量和效果实体。我们使用基于BERT、大型语言模型(LLM)和检索增强生成(RAG)模型对6267个注释进行了基准测试。BiomedBERT在药物实体上达到了0.843的F1分数,而Llama-3 70B优于GPT-4(F1=0.79 vs. 0.72)。效果提取仍然具有挑战性,GPT-4的召回率为0.41。ReDose捕捉了患者策划的叙述,以推进从社交媒体中提取医学数据。

英文摘要

Physicians learn primarily about illicit drugs from clinical overdose cases, limiting their understanding of real-world usage. Meanwhile, drug users share first-hand experiences online, offering insights into dosage and effects of drugs. To bridge this gap, we introduce ReDose (REddit Drug DOSe and Effect), a dataset of 6,435 Reddit posts on substance use. A board-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities. We benchmarked 6,267 annotations using BERT-based, large language model (LLM)-based, and Retrieval-Augmented Generation (RAG) models. BiomedBERT achieved an F1-score of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains challenging, with GPT-4 achieving a recall of 0.41. ReDose captures patient-curated narratives to advance medical data extraction from social media.

2605.26442 2026-05-27 cs.CL cs.AI 版本更新

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

大型语言模型的对齐调优:以数据为中心的对齐数据管道视角

Hwanjun Song

发表机构 * KAIST(韩国科学技术院)

AI总结 本文以数据为中心,将对齐调优重构为管道设计问题,分解为响应合成、偏好评估和偏好实例化三个阶段,并基于此框架统一分类现有对齐方法,总结设计权衡与失败模式,提炼高层原则,最后指出开放挑战。

Comments Accepted at the Findings of ACL 2026

详情
AI中文摘要

对齐调优文献大多围绕优化目标组织,而对齐数据的构建往往被隐式处理。在本综述中,我们采用以数据为中心的视角,将对齐调优重构为管道设计问题。我们将对齐数据构建分解为三个相互作用的阶段:响应合成、偏好评估和偏好实例化,并利用此框架将现有对齐方法组织成统一的分类体系。通过这一视角,我们识别出先前对齐方法中反复出现的设计权衡和失败模式,并提炼出一套高层原则,阐明管道设计选择如何影响最终的优化信号。最后,我们概述了对齐数据管道的开放挑战,包括提示级对齐、智能体设置以及目标演化下的对齐。

英文摘要

Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data centric perspective and reframe alignment tuning as a pipeline design problem. We decompose alignment data construction into three interacting stages, response synthesis, preference evaluation, and preference instantiation, and use this framework to organize existing alignment methods into a unified taxonomy. Through this lens, we identify recurring design trade-offs and failure modes observed across prior alignment methods, and distill a set of high level principles that clarify how pipeline design choices influence the resulting optimization signal. Finally, we outline open challenges for alignment data pipelines, including prompt-level alignment, agentic settings, and alignment under evolving objectives.

2605.26440 2026-05-27 cs.CL cs.SE 版本更新

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

Conv-to-Bench: 通过代码任务中的用户-助手对话评估语言模型

Victor M. dos Santos, Andre C. Castro, Samuel L. de S. Toledo, Bruno M. L. Calura, Lisandra C. de M. Menezes, Raul C. R. Mata, Telma W. de L. Soares, Bryan L. M. de Oliveira

发表机构 * Institute of Mathematics and Computer Science, University of São Paulo(圣保罗大学数学与计算机科学学院) Institute of Informatics, Federal University of Goiás(戈亚斯联邦大学信息学院) HUG Labs(HUG实验室) Advanced Knowledge Center for Immersive Technologies (AKCIT)(沉浸式技术高级知识中心)

AI总结 提出Conv-to-Bench框架,自动将多轮用户-助手对话转化为结构化需求清单,用于评估大语言模型,在编程领域与人工标准高度一致且计算开销低。

详情
AI中文摘要

大型语言模型(LLMs)的快速发展已超越了传统评估基准的可扩展性,这些基准仍严重依赖劳动密集型的人工专家策划。我们通过Conv-to-Bench解决了这一瓶颈,这是一个多阶段框架,可自动将真实的多轮用户-助手对话转化为结构化的、可验证的需求清单。通过利用真实对话日志中的“指令演化”,我们的方法将碎片化的用户意图分解为整合的指令和二元评估标准。应用于编程领域,Conv-to-Bench生成的评估集与BigCodeBench等人工程准几乎完美对齐,实现了高达ρ=1.000的斯皮尔曼相关性,且计算开销显著降低。对LLM-as-a-judge框架的验证进一步证实了其可靠性,主要评估器与人工验证的真实标签达到高度一致(κ=0.705)。我们全面的消融研究表明,虽然多轮交互捕捉了用户意图的迭代演化,但以指令为中心的提取提供了更稳健的基础。最终,Conv-to-Bench提供了一种可扩展、成本效益高的范式,用于在用户中心AI应用持续多样化时保持高保真评估标准。

英文摘要

The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with Conv-to-Bench, a multi-stage framework that automatically transforms authentic multi-turn user-assistant dialogues into structured, verifiable requirement checklists. By leveraging the "instructional evolution" found in real-world conversational logs, our approach deconstructs fragmented user intent into consolidated instructions and binary evaluation criteria. Applied to the programming domain, Conv-to-Bench produces evaluation sets that demonstrate near-perfect alignment with human-authored standards like BigCodeBench, achieving Spearman correlations of up to $ρ$ = 1.000 with significantly lower computational overhead. Validation of the LLM-as-a-judge framework further confirms its reliability, with the primary evaluator achieving substantial agreement with human-verified ground truth ($κ$ = 0.705). Our comprehensive ablation studies reveal that while multi-turn interactions capture the iterative evolution of user intent, instruction-centric extraction provides a more robust foundation. Ultimately, Conv-to-Bench provides a scalable, cost-effective paradigm for maintaining high-fidelity evaluation standards as user-centric AI applications continue to diversify.

2605.26438 2026-05-27 cs.CL cs.AI 版本更新

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

LURE: 减少评估感知的实时使用回放评估

Igor Ivanov, David Demitri Africa

发表机构 * Meridian Cambridge(梅里登剑桥)

AI总结 提出LURE方法,通过回放真实代理交互轨迹并附加评估提示来构建类似部署的评估,以减少大语言模型的评估感知,并引入自动化评估真实性流程。

详情
AI中文摘要

大型语言模型能够识别自己正在被评估(评估感知),并因此表现出不同的行为,这破坏了安全和对齐基准的有效性。我们提出LURE(实时使用回放评估),一种通过回放真实的代理交互轨迹并在末尾附加评估提示来构建类似部署的评估的方法。我们还引入了一个自动化流程来衡量评估的真实性,结合了对口头化评估感知的检测和法官模型对日志是否为评估的概率估计,并在一个包含部署和评估记录的大型数据集上进行了验证。我们发现,与广泛使用的基准和合成评估生成器相比,基于LURE的评估与部署的区分度显著降低,并且可以接近与用户真实对话的真实性。我们在策划、AI安全破坏和谄媚场景中实例化了LURE。我们的结果表明,评估真实性是对齐基准的一个关键属性,应在基准结果旁边报告,特别是当这些结果用于安全案例时。

英文摘要

Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.

2605.26433 2026-05-27 cs.CL 版本更新

Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization

向量并非中性:从摘要任务中导出的LLM表示进行敏感信息推断

Weixin Liu, Bowen Qu, Juming Xiong, Congning Ni, Bradley A. Malin, Zhijun Yin

发表机构 * Vanderbilt University(范德比大学) Vanderbilt University Medical Center(范德比大学医学院)

AI总结 研究LLM摘要系统导出向量中的敏感信息泄露风险,提出SurfaceLoRA微调方法降低特定向量的可恢复性,但未针对的池化向量仍存在风险。

Comments 30 pages, 2 figures; preprint

详情
AI中文摘要

大型语言模型(LLM)摘要系统可能将私有输入的紧凑向量表示传递给下游检索、监控、审计或分析工作流。即使源文档保持访问受限,派生向量可能在不同访问控制下处理,仍支持敏感信息推断,造成残留的信息披露风险。我们以临床出院摘要生成为高风险案例研究,使用电子健康记录(EHR)记录的种族作为受控敏感标签审计。我们审计系统可能保留或暴露给下游组件的两个工件:最终提示令牌隐藏状态和均值池化提示表示。我们的结果表明,从一个导出工件降低案例研究敏感标签的可恢复性并不一定能降低另一个工件的可恢复性。作为缓解案例研究,我们引入了SurfaceLoRA,一种针对导出向量的参数高效微调方法,该方法使用连接到指定导出向量的梯度反转鉴别器。在平衡的五向探测协议下,SurfaceLoRA将EHR记录的种族可恢复性从目标最终令牌工件降低到接近随机水平,同时保持摘要效用,但从未经目标池化工件的可恢复性仍然显著更高。这些发现表明,隐私审计和缓解应针对保留或暴露给下游组件的确切向量工件进行。

英文摘要

Large language model (LLM) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows. Even when source documents remain access-restricted, derived vectors may be handled under different access controls and still support sensitive-information inference, creating a residual information-disclosure risk. We study this issue in clinical discharge-summary generation as a high-stakes case study, using electronic health record (EHR)-recorded race as a controlled sensitive-label audit. We audit two artifacts that a system might retain or expose to downstream components: the final prompt-token hidden state and the mean-pooled prompt representation. Our results show that reducing recoverability of the case-study sensitive label from one exported artifact does not necessarily reduce recoverability from another. As a mitigation case study, we introduce SurfaceLoRA, an exported-vector-targeted parameter-efficient fine-tuning method that uses a gradient-reversal discriminator attached to a designated exported vector. Under a balanced five-way probing protocol, SurfaceLoRA reduces EHR-recorded race recoverability from the targeted final-token artifact toward chance while preserving summarization utility, yet recoverability remains substantially higher from untargeted pooled artifacts. These findings show that privacy auditing and mitigation should be performed on the exact vector artifact retained or exposed to downstream components.

2605.26414 2026-05-27 cs.AI cs.CL cs.LG 版本更新

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

推理、代码,还是两者兼有?大型语言模型如何处理数学问题的变化

Matthew Kutakh

AI总结 本研究通过对比链式思维推理、单次代码执行和迭代代码执行三种方法在GSM-Symbolic数据集上的表现,发现代码执行并未提升大型语言模型在数学问题变体上的推理鲁棒性。

Comments 6 pages, 4 figures, 2 tables

详情
AI中文摘要

大型语言模型(LLMs)在数学推理基准测试中取得了令人印象深刻的准确性,但当问题被修改为不同的名字或数字等简单变化时,它们的性能会下降。代码执行方法允许模型生成并运行Python代码,而不是用自然语言进行推理,已被提出作为解决方案,但其对推理鲁棒性(即在问题变体中保持准确性的能力)的影响尚未得到系统测试。本研究在GSM-Symbolic数据集的1000个问题上评估了三种方法:使用链式思维(CoT)提示的纯推理、使用程序辅助语言模型(PAL)的单次代码执行,以及使用逐步编码(SBSC)的迭代代码执行。所有三种方法均在配对的原始问题和修改问题上使用Claude Haiku 4.5运行。CoT是最鲁棒的方法,在扰动下准确率下降1.3个百分点,1.8%的问题被破坏。PAL的鲁棒性最差,准确率下降1.7个百分点,3.1%的问题被破坏,SBSC介于两者之间。尽管这些差异在统计上不显著($p = .096$),但方向趋势在所有指标上一致,表明无论是单次还是迭代的代码执行,都没有提高小学水平问题变体的推理鲁棒性。

英文摘要

Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust at 1.7 percentage points and 3.1% broke, with SBSC falling in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.

2605.26405 2026-05-27 cs.CL 版本更新

Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM

面向即时自适应反馈:通过知识增强的大语言模型提升学生学习

Younghun Lee, Amir Bralin, Nobel Sanjay Rebello, Dan Goldwasser

发表机构 * Department of Computer Science(计算机科学系) Department of Physics and Astronomy(物理与天文学系) College of Education(教育学院)

AI总结 提出一个框架,利用领域专家知识增强大语言模型,在真实教学场景中提供即时自适应反馈,并在大规模大学课程中提升学生成绩超过80%。

Comments 8 pages, Accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

详情
AI中文摘要

教育干预是提升学生学习的有效工具。虽然大语言模型(LLMs)允许大规模生成自适应反馈,但当前研究缺乏在真实教学环境中提供即时(JiT)反馈的明确方法。在本文中,我们提出了一个框架,通过将LLMs与领域专家知识相结合来提供自适应反馈。我们的方法收集学生的书面推理逻辑(策略文章),基于推理内容分析潜在错误类型,并提供非侵入性反馈,旨在澄清缺失或错误的概念。我们在一个大规模大学课程(N > 1000)中部署了该框架,与以往学期相比,学生成绩提升了超过80%。最后,我们通过分析学习轨迹验证了该框架的教学实用性;我们展示了与LLM的迭代对话如何促进从错误概念向正确理解的转变。

英文摘要

Educational interventions are effective tools for enhancing student learning. While Large Language Models (LLMs) allow for generating adaptive feedback at scale, current studies lack clear methodologies for providing Just-in-Time (JiT) feedback in authentic instructional settings. In this paper, we present a framework that provides adaptive feedback by grounding LLMs with domain-specific expert knowledge. Our approach collects written reasoning logic (strategy essays) from students, analyzes potential error types based on the content of that reasoning, and delivers non-intrusive feedback designed to clarify missing or incorrect concepts. We deploy this framework in a large-scale university course (N > 1000), where it improved student performance by over 80% compared to previous semesters. Lastly, we validate the framework's pedagogical utility by analyzing the learning trajectories; we demonstrate how iterative conversations with LLM facilitate shifting one's misconception to correct understanding.

2605.26394 2026-05-27 cs.CL 版本更新

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

多轮文本到SQL的记忆架构:基准测试与实证研究

Ravi Kumar Tummalapenta, Suman Addanki

发表机构 * LLM Suite Engineering Team, JP Morgan Chase & Co.(JP摩根大通公司LLM套件工程团队)

AI总结 针对多轮Text-to-SQL任务,提出EnterpriseMem-Bench基准并评估五种前沿模型,发现无状态方法在第三轮后执行准确率归零,且记忆架构复杂度并不单调提升准确率。

Comments 18 pages, 4 figures, 14 tables; includes appendices with verbatim prompts, example session, and full ablation tables; prepared by the LLM Suite Engineering Team, JP Morgan Chase & Co

详情
AI中文摘要

多轮Text-to-SQL是企业分析的核心,但现有评估主要集中于单轮场景。我们引入EnterpriseMem-Bench,一个包含300个会话和1400轮的多轮Text-to-SQL基准,通过编程方式从三个企业领域(BIRD金融、SEC EDGAR、Northwind)构建,具有确定性真实标签和每轮记忆关键标注。我们在五种记忆条件下评估五个前沿模型——GPT-5 mini、GPT-5.2、Claude Sonnet 4.5、Sonnet 4.6和Opus 4.6,通过三路消融实验独立隔离工作记忆窗口大小、情景检索和语义增强的影响。所有Claude模型均启用扩展思考以保持与GPT推理模型的对等性。我们引入记忆收益分数(MBS)作为每轮诊断指标。四项发现如下:(1)无状态多轮Text-to-SQL在所有五个模型下,即使启用推理,到第三轮时执行准确率也降为零;(2)记忆架构复杂度并不单调提升准确率——工作记忆占主导,额外组件产生模型和数据集依赖的效果,变化范围从+14到-16个百分点;(3)Claude Sonnet 4.6在SEC EDGAR上各条件下表现比Sonnet 4.5差17-33个百分点,这是一个在推理下仍然存在的代际退化;(4)在推理下,Claude的错误分布变为单峰——每个非正确轮次都是错误结果。我们发布了基准、智能体和评估代码。

英文摘要

Multi-turn Text-to-SQL is central to enterprise analytics yet remains predominantly evaluated in single-turn settings. We introduce EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark of 300 sessions and 1,400 turns built programmatically from three enterprise domains (BIRD financial, SEC EDGAR, Northwind), with deterministic ground truth and per-turn memory-critical annotation. We evaluate five frontier models -- GPT-5 mini, GPT-5.2, Claude Sonnet 4.5, Sonnet 4.6, and Opus 4.6 -- across five memory conditions enabling a three-way ablation isolating working-memory window size, episodic retrieval, and semantic augmentation as independent effects. All Claude models are evaluated with extended thinking enabled to maintain parity with GPT reasoning models. We introduce the Memory Benefit Score (MBS) as a per-turn diagnostic metric. Four findings emerge: (1) stateless multi-turn Text-to-SQL collapses to zero execution accuracy by Turn 3 across all five models, even under reasoning; (2) memory-architecture complexity does not monotonically improve accuracy -- working memory dominates, and additional components produce model- and dataset-dependent effects from +14 to -16 percentage points; (3) Claude Sonnet 4.6 underperforms Sonnet 4.5 by 17-33pp on SEC EDGAR across conditions, a generational regression persisting under reasoning; (4) under reasoning, Claude error distributions become mono-modal -- every non-correct turn is a wrong-result error. We release the benchmark, agent, and evaluation code.

2605.26365 2026-05-27 cs.CL 版本更新

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

大型语言模型中通过潜在激活引导的文化价值对齐

Trung Duc Anh Dang, Sarah Masud

发表机构 * University of Copenhagen(哥本哈根大学)

AI总结 提出一种基于场景行为探测和潜在激活引导的框架,用于评估和干预LLMs的文化价值,发现文化价值以耦合结构编码,限制了精确对齐。

Comments ACL 2026 Student Research Workshop (Non-Archival Track)

详情
AI中文摘要

大型语言模型(LLMs)通常表现出同质化的文化视角。虽然世界价值观调查(WVS)为映射人类价值观提供了黄金标准,但传统的直接提示LLMs回答WVS问题往往无法触及模型的潜在文化深度,导致安全对齐的拒绝或中性回应。在此,我们提出一个通用的文化评估与干预框架,从抽象查询过渡到基于场景的行为探测。通过提取300个情境困境中的隐式token概率,我们绕过表面层次的对齐,映射LLMs文化价值的潜在坐标。我们进一步引入激活引导,在前向传播过程中无需重新训练即可改变这些内部对齐。在多个LLMs上,我们发现适应性存在显著差异,并揭示了一个一致的现象——潜在纠缠,即沿一个文化维度的干预会引发沿另一维度的偏移。这些结果表明,文化价值被编码为耦合结构,限制了精确对齐。本工作建立了一个计算高效的文化引导框架,突出了在LLMs中导航全球价值观时的结构复杂性。

英文摘要

Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access the model's latent cultural depth, leading to safety-aligned refusals or neutral responses. Here, we propose a generalizable framework for cultural evaluation and intervention that transitions from abstract queries to scenario-based behavioral probing. By extracting implicit token probabilities across 300 situational dilemmas, we bypass surface-level alignment to map the latent coordinates of LLMs cultural value. We further introduce activation steering to shift these internal alignments during the forward pass without retraining. Across multiple LLMs, we find substantial variation in adaptability and uncover a consistent phenomenon of latent entanglement, where interventions along one cultural dimension induce shifts along another. These results suggest that cultural values are encoded as coupled structures, limiting precise alignment. This work establishes a computationally efficient framework for cultural steering, highlighting the structural complexities when navigating global value with LLMs.

2605.26362 2026-05-27 cs.CL cs.AI 版本更新

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

为什么LLMs会在结构化知识上产生幻觉:对线性化表示推理的机制分析

Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu, Zihe Song, Langzhou He, Kenan Kamel A Alghythee, Philip S. Yu

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文通过机制分析发现,大型语言模型在结构化知识推理中产生幻觉是由于注意力过度集中于捷径式结构线索和前馈层未能将知识语义接地,导致模型依赖参数记忆。

Comments To appear in Proceedings of ACL 2026

详情
AI中文摘要

在许多推理任务中,大型语言模型(LLMs)依赖于结构化外部知识,如图和表格,这些知识通常被线性化为连续的令牌表示。然而,即使有足够的知识可用,LLMs仍然可能产生幻觉输出,这种失败背后的潜在机制仍然知之甚少。我们研究了这些机制,发现幻觉源于系统性的内部动态而非随机噪声。首先,注意力不成比例地集中在类似捷径的结构线索上,而不是分布在完整的上下文中。其次,前馈表示未能将提供的知识接地,导致模型回归到参数记忆。此外,我们的结果表明,幻觉始终与前馈层中的语义接地失败相关,而注意力分配表现出更大的任务依赖性。最后,我们展示了这些机制模式从单跳图推广到多跳和表格设置,从而能够在结构化知识格式中有效检测幻觉。

英文摘要

In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized into sequential token representations. However, even when sufficient knowledge is available, LLMs can still produce hallucinated outputs, and the underlying mechanisms behind such failures remain poorly understood. We investigate these mechanisms and find that hallucinations arise from systematic internal dynamics rather than random noise. First, attention disproportionately concentrates toward shortcut-like structural cues rather than distributing across the full context. Second, feed-forward representations fail to ground the provided knowledge, causing the model to revert to parametric memory. Moreover, our results indicate that hallucination is consistently associated with failures in semantic grounding within feed-forward layers, while attention allocation exhibits greater task-dependent variability. Finally, we show that these mechanistic patterns generalize beyond single-hop graphs to multi-hop and tabular settings, enabling effective hallucination detection across structured knowledge formats.

2605.26356 2026-05-27 cs.CL 版本更新

In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective

检索增强生成的上下文优化:梯度下降视角

Mingchen Li, Jiatan Huang, Chuxu Zhang, Liang Zhao, Hong Yu

发表机构 * University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校) University of Connecticut(康涅狄格大学) Emory University(埃默里大学)

AI总结 本文从梯度下降视角研究检索增强生成(RAG)作为上下文优化过程,提出一种轻量级前向更新方法,在冻结LLM和检索器的情况下提升生成器对检索证据的利用。

详情
AI中文摘要

上下文学习最近被与线性自注意力模型中的隐式梯度下降联系起来,表明上下文可以诱导前向传递更新。检索增强生成(RAG)也依赖于上下文,但检索到的文档通常被视为静态证据而非适应信号。我们将RAG研究为一种上下文优化过程。首先,我们展示一个线性自注意力层可以在统一的线性化RAG目标上实现一步梯度下降,该目标涵盖基于投影和基于点积的检索接口。这给出了检索增强预测与上下文优化一致的一个精确区域。我们使用这一结果并非作为LLM计算的字面模型,而是作为调整查询与检索证据之间交互的指南。然后,我们测试这种对应关系的边界:在受控的线性扩展下保持稳定,但在非线性架构下变得依赖于特征分布。最后,我们将这一观点转化为一种针对冻结RAG LLM的轻量级方法。该方法保持检索器和骨干网络固定,并预测一个上下文条件更新到生成器侧的证据使用接口。在七个QA基准、两个检索器和两个冻结LLM骨干网络上,这种仅前向的更新改进了共享接口基线,迁移到未见任务,并以更低的每查询成本接近测试时的梯度适应。

英文摘要

In-context learning has recently been linked to implicit gradient descent in linear self-attention models, suggesting that context can induce a forward-pass update. Retrieval-augmented generation (RAG) also relies on context, but retrieved documents are usually treated as static evidence rather than signals for adaptation. We study RAG as an in-context optimization process. First, we show that one linear self-attention layer can implement one gradient-descent step on a unified linearized RAG objective covering both projection-based and dot-product retrieval interfaces. This gives an exact regime where retrieval-augmented prediction and in-context optimization coincide. We use this result not as a literal model of LLM computation, but as a guide for adapting the interaction between queries and retrieved evidence. We then test the boundary of this correspondence: it remains stable under controlled linear extensions, but becomes feature-distribution dependent under nonlinear architectures. Finally, we turn this view into a lightweight method for frozen RAG LLMs. The method keeps the retriever and backbone fixed, and predicts a context-conditioned update to a generator-side evidence-use interface. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, this forward-only update improves a shared-interface baseline, transfers to held-out tasks, and approaches test-time gradient adaptation at much lower per-query cost.

2605.26355 2026-05-27 cs.LG cs.CL eess.SP 版本更新

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

能量门控注意力与小波位置编码:Transformer注意力的互补归纳偏置

Athanasios Zeris

发表机构 * Independent Researcher(独立研究者) Athens, Greece(希腊雅典)

AI总结 针对标准注意力缺乏能量显著性和尺度选择性局部性两种互补归纳偏置的问题,提出能量门控注意力(EGA)和莫雷特位置编码(MoPE),两者组合在字符级语言建模上实现超加性性能提升。

Comments 10 pages, 1 figure, 3 tables. Part 2 of a five-paper series on spectral methods in transformer attention. Code: https://github.com/AthanasiosZeris/energy-gated-attention

详情
AI中文摘要

标准Transformer注意力计算成对标记相似性,但将所有标记视为同等显著、所有位置视为同等局部,忽略了输入的信息结构。我们识别出标准注意力缺乏两种互补归纳偏置:能量显著性(哪些标记集中了信息能量,通过端到端学习而不需要显式频率分解)和尺度选择性局部性(在每个频率上位置影响的范围,通过Morlet小波编码实现)。我们通过两个简单组件解决这两个问题。能量门控注意力(EGA)通过键标记嵌入的学习能量估计(通过单个线性投影计算)来门控值聚合;它选择关注什么。莫雷特位置编码(MoPE)用学习的高斯窗口小波替换固定的正弦编码,使联合位置-频率定位适应语料库;它指定注意力在每个尺度上操作的位置。在TinyShakespeare上,单独EGA相比标准注意力实现+0.092验证损失改进(相比Phase 1-3基线+0.103);单独MoPE为-0.032(作为独立编码低于基线);但它们的组合实现+0.119——超过各部分之和。这种超加性在两个独立训练运行中观察到,是核心实证发现:显著性和局部性是互补归纳偏置,各自填补对方无法单独填补的空白。消融实验证实,结构化谱先验(Morlet小波门控、尺度初始化头、固定正弦PE)始终不如其无约束学习对应物,而互补学习组件交互产生超加性。所有实验都在小规模(≤6M参数、字符级基准、单种子)进行;更大规模的多种子验证是未来工作最重要的方向。

英文摘要

Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and scale-selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1-3 baseline); MoPE alone is -0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 -- more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (<=6M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.

2605.26352 2026-05-27 cs.CL 版本更新

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

RICE-PO:将检索交互转化为推理智能体的信用信号

Mingchen Li, Hansi Zeng, Zhuo Qian, Jiatan Huang, Hamed Zamani, Hong Yu

发表机构 * University of Massachusetts, Amherst(马萨诸塞大学阿姆赫斯特分校) Texas Tech University(得克萨斯科技大学) University of Connecticut(康涅狄格大学)

AI总结 提出RICE-PO框架,通过将检索交互转化为局部学习信号,解决推理型检索智能体在训练中推理步骤的信用分配问题,在BRIGHT和BEIR上优于基线。

详情
AI中文摘要

检索正从一次性匹配向交互式推理转变,语言智能体迭代检查证据、重新表述查询并再次搜索。训练此类智能体面临信用分配挑战:可执行动作(如查询或摘要)可由检索器直接评估,而潜在推理步骤不可直接观察,仅影响未来的可执行动作。这种不对称使得结果级奖励分配不可靠,因为相同的最终奖励可能奖励那些实际上并未促成检索成功的推理步骤。我们提出RICE-PO,一种无批评策略优化框架,将检索交互转化为局部学习信号。RICE-PO选择高不确定性的可执行动作作为锚点,使用检索指标评估局部反事实分支,并仅在推理到动作的影响强且未来残差效应稳定时,将信用传播给潜在推理步骤。在BRIGHT和BEIR上,RICE-PO在相同检索器设置下始终优于基于提示的智能体和基于组的强化学习基线。这些结果表明,智能体-环境交互结构本身可以为训练基于推理的检索智能体提供有用的监督。

英文摘要

Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.

2605.26346 2026-05-27 cs.CL 版本更新

The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology

每日剂量:工作流集成的大型语言模型自动化在放射肿瘤学中的临床总结和试验识别

Jason Holmes, Federico Mastroleo, Mariana Borras-Osorio, Srinivas Seetamsetty, Satomi Shiraishi, Mirek Fatyga, Judy C. Boughey, Cornelius A. Thiels, William G. Breen, Daniel J. Ma, Daniel K. Ebner, David M. Routman, Brady S. Laughlin, Carlos E. Vargas, Samir H. Patel, Sujay A. Vora, Nadia N. Laack, Andrew Y. K. Foong, Wei Liu, Mark R. Waddle

发表机构 * Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, United States(辐射肿瘤科,梅奥诊所,凤凰城,亚利桑那州,美国) Department of Radiation Oncology, Mayo Clinic, Rochester, MN, United States(辐射肿瘤科,梅奥诊所,罗切斯特,明尼苏达州,美国) Division of Radiation Oncology, IEO, European Institute of Oncology, IRCCS, Milan, Italy(放射肿瘤学部,IEO,欧洲肿瘤研究所,IRCCS,米兰,意大利)

AI总结 本文介绍了一个名为“每日剂量”的LLM驱动的自动化临床总结和临床试验识别系统,该系统集成到常规放射肿瘤学实践中,并通过混合方法评估其可用性、满意度和感知有用性,结果显示高使用率和满意度。

Comments 28 pages, 4 figures, 1 table

详情
AI中文摘要

目的:描述“每日剂量”(TDD)的设计和早期临床评估,这是一个由LLM驱动的自动化临床总结和临床试验识别系统,集成到常规放射肿瘤学实践中。设计:在系统部署1个月后,采用横断面匿名临床医生调查进行混合方法评估。暴露:每日自动生成医生特定的电子邮件摘要,使用RadOnc-GPT生成,包括患者日程、从电子健康记录中提取的简洁临床状态摘要,以及自动识别新就诊或咨询就诊的潜在相关临床试验。主要结果和指标:主要结果包括自我报告的可用性、满意度、感知有用性、对工作流程的感知影响、节省的时间以及继续使用的意愿。使用Cronbach's α评估内部一致性信度。结果:在55名受访者中,52名(94.5%)在放射肿瘤学领域工作,38名(69.1%)是主治医师。大多数参与者(83.6%)报告每天或每周多次使用TDD。平均(SD)得分为:可用性和满意度3.89(1.04),感知有用性3.43(1.24),影响和未来使用3.80(1.17)(5点李克特量表)。总体满意度与感知时间节省呈正相关(p < .001)。参与者报告的时间节省不一,27%的人估计每天节省≥10分钟。问卷表现出极好的内部一致性(总体Cronbach's α = 0.97)。

英文摘要

Objective: To describe the design and early clinical evaluation of The Daily Dose (TDD), an LLM-driven, automated clinical summarization and clinical-trial identification system integrated into routine radiation oncology practice. Design: Mixed-methods evaluation using a cross-sectional, anonymous clinician survey administered after 1 month of system deployment. Exposure: Daily automated delivery of physician-specific email summaries generated using RadOnc-GPT, including patient schedules, concise EHR-derived clinical-status summaries, and automated identification of potentially relevant clinical trials for new or consult visits. Main Outcomes and Measures: Primary outcomes included self-reported usability, satisfaction, perceived usefulness, perceived impact on workflow, time savings, and intention for continued use. Internal consistency reliability was assessed using Cronbach's $α$. Results: Among 55 respondents, 52 (94.5\%) worked in radiation oncology, and 38 (69.1\%) were attending physicians. Most participants (83.6\%) reported using TDD daily or several times per week. Mean (SD) scores were 3.89 (1.04) for usability and satisfaction, 3.43 (1.24) for perceived usefulness, and 3.80 (1.17) for impact and future use (5-point Likert scale). Overall satisfaction was positively associated with perceived time savings ($p < .001$). Participants reported variable time savings, with 27\% estimating $\geq 10$ minutes saved per day. The questionnaire demonstrated excellent internal consistency (overall Cronbach's $α$ = 0.97).

2605.26340 2026-05-27 cs.AI cs.CL cs.MA 版本更新

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne: 迈向基于证据链的人类级自主研究

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister

发表机构 * Google Cloud AI Research(谷歌云人工智能研究)

AI总结 提出证据链框架Chain-of-Evidence和自主研究系统ScientistOne,通过可追溯性解决可验证性失败问题,在多项任务上达到或超越人类专家水平。

Comments Project website: https://scientist-one.github.io/

详情
AI中文摘要

自主研究代理能产生有竞争力的解决方案和专业手稿,但其输出存在表面评估无法察觉的可验证性失败:捏造的引用、不可复现的分数以及与实现不符的方法描述。我们通过三项贡献解决这一问题。第一,Chain-of-Evidence (CoE),一个可验证性框架,要求每个声明都能追溯到其证据来源。第二,ScientistOne,一个端到端的自主研究系统,在文献综述、解决方案发现和论文撰写过程中通过构造保持证据链。第三,CoE Audit,一个事后审计,其四项完整性检查——分数验证、规范违反、引用验证和方法-代码对齐——统一适用于所有系统。在涵盖五个系统和五个前沿研究任务的75篇论文中,每个基线都表现出至少一种系统性失败模式:幻觉引用率高达21%,分数验证通过率低至42%,方法-代码对齐率在20%到80%之间。ScientistOne实现了零幻觉引用(0/337)、完美的分数验证(12/12)和最高的方法-代码对齐率(14/15),同时在所有五个任务上达到或超过人类专家表现。ScientistOne进一步泛化到涵盖医学影像、细粒度识别、3D感知和语言建模的六个额外任务,在Parameter Golf上取得最先进结果,并在基线完全失败的MLE-Bench任务上获得金牌。

英文摘要

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.

2605.26339 2026-05-27 cs.LG cs.CL 版本更新

QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

QAM-W: 通过哈达玛旋转和激活感知缩放实现LLM权重的联合2D码本量化

Preetam Sharma, Kacper Dobek

发表机构 * Independent Research(独立研究) Institute of Computing Science(计算科学研究所) Poznan University of Technology(波兹南技术大学)

AI总结 提出QAM-W方法,通过L2归一化、块哈达玛旋转和2D坐标配对量化,结合激活感知缩放,在约5.5 bpw下使困惑度接近BF16,优于极坐标编码,并在5-6 bpw范围内保持质量。

详情
AI中文摘要

标量后训练量化器丢弃了权重行内的成对坐标结构。我们引入QAM-W(权重正交幅度调制),一种恢复该结构的编解码器:每行经过L2归一化、块哈达玛旋转、配对为2D坐标,并针对在单位圆高斯上训练的单个Lloyd-Max码本进行量化,同时采用激活感知的每通道缩放。在跨越四个家族(1.1B--13B参数)的五种LLM和八种量化配置的跨模型研究中,激活感知变体在约5.5 bpw下,每个模型的WikiText-2困惑度保持在BF16的±0.4%以内,以少32%的权重比特匹配SmoothQuant W8A8质量包络。联合2D编码在相同比特率下,在ΔPPL上优于极坐标(幅度×相位)编码2--15个百分点,且与BF16的配对KL散度在37个(方法,模型)行上以Spearman ρ=0.99跟踪ΔPPL%,与从编解码器失真到KL散度的单调复合界一致。3.5 bpw变体在量化容忍架构上具有竞争力。在严格的4 bpw下,旋转码本前沿方法QTIP优于QAM-W;贡献在于质量保持的5--6 bpw波段。

英文摘要

Scalar post-training quantizers discard pairwise coordinate structure within weight rows. We introduce QAM-W (Quadrature Amplitude Modulation for Weights), a codec that recovers this structure: each row is L2-normalized, block-Hadamard rotated, paired into 2D coordinates, and quantized against a single Lloyd-Max codebook trained on the unit circular Gaussian, with activation-aware per-channel scaling. In a cross-model study spanning five LLMs from four families (1.1B--13B parameters) and eight quantized configurations, the activation-aware variant at $\approx 5.5$ bpw stays within $\pm 0.4\%$ of BF16 WikiText-2 perplexity on every model, matching the SmoothQuant W8A8 quality envelope at $32\%$ fewer weight bits. Joint 2D coding outperforms polar (amplitude $\times$ phase) coding by 2--15~pp $Δ$PPL at equal bitrate, and paired KL against BF16 tracks $Δ$PPL\% at Spearman $ρ= 0.99$ across 37 (method, model) rows, consistent with a monotone composite bound from codec distortion to KL divergence. A 3.5~bpw variant is competitive on quantization-tolerant architectures. At strict 4~bpw, the rotated-codebook frontier method QTIP outperforms QAM-W; the contribution is the quality-preserving 5--6~bpw band.

2605.26320 2026-05-27 cs.LG cs.CL 版本更新

MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding

MULTISEISMO: 面向跨模态地震理解的多模态地震数据集与模型

Sai Munikoti, Ian Stewart, Chengping Chai, Lisa Linville, Scott Vasquez, Sameera Horawalavithana, Karl Pazdernik

发表机构 * Pacific Northwest National Laboratory(太平洋西北国家实验室) Oak Ridge National Laboratory(橡树岭国家实验室) Sandia National Laboratory(桑迪亚国家实验室) North Carolina State University(北卡罗来纳州立大学)

AI总结 针对地震学中多模态数据整合的缺失,构建了包含超过1.6万次地震事件的结构化多模态数据集MultiSeismo,并开发了专用多模态模型SeisModal,在跨模态地震推理任务上取得了优越性能。

详情
AI中文摘要

通用多模态模型(GMMs)在专业科学领域的应用仍然有限,原因是缺乏整合文本和图像之外多种数据模态的综合性领域特定数据集。在地震学中,理解地震现象需要综合时间序列波形数据、地理图像和上下文元数据,而现有地震数据集缺乏这种多模态整合。我们提出了MultiSeismo,一个大规模结构化多模态地震数据集,包含跨越13年(2010年至2023年)来自不同地理区域的超过1.6万次地震事件。每个事件数据整合了全球台网波形记录、烈度图、人口暴露可视化以及标准JSON格式的全面文本描述。此外,我们开发了MISCE,一个基于原始数据的多模态指令集,用于对GMMs进行监督训练和评估,涵盖从基本信息检索到复杂跨模态分析的地震推理任务。我们利用MISCE微调了一个现有的多模态模型(Unified IO 2),并增强了专门的时间序列编码器,从而得到了SeisModal——首个用于综合地震分析的领域特定多模态模型。在MultiSeismo上对最先进的多模态模型进行评估,揭示了显著挑战,特别是通用模型在处理时间序列数据方面的困难,同时证明了SeisModal在地震多模态推理任务上的优越性能。这些结果证明,MultiSeismo为未来地震学多模态研究提供了严格的基准,并验证了我们领域特定架构调整的成功。

英文摘要

The application of generalist multimodal models (GMMs) to specialized scientific domains remains limited due to the scarcity of comprehensive domain-specific datasets that integrate multiple data modalities beyond text and images. In seismology, understanding earthquake phenomena requires the synthesis of timeseries waveform data, geographical imagery, and contextual metadata, a multimodal integration absent in existing seismic datasets. We present MultiSeismo, a large scale structured multimodal seismic dataset, comprising over 16K seismic events spanning 13 years (2010 to 2023) across diverse geographical regions. Each event data integrates waveform recordings from global station networks, intensity maps, population exposure visualizations, and a comprehensive textual description within a standardized JSON format. We additionally develop MISCE, a multimodal instruction set on top of raw data to enable supervised training and evaluation of GMMs on seismic reasoning tasks ranging from basic information retrieval to complex cross modal analysis. We leverage MISCE to finetune an existing multimodal model (Unified IO 2) enhanced with a specialized timeseries encoder, which yields SeisModal, the first domain specific multimodal model for comprehensive seismic analysis. Evaluation of state of the art multimodal models on MultiSeismo reveals significant challenges, particularly with time-series data processing for general purpose models, while demonstrating SeisModal's superior performance on seismic multimodal reasoning tasks. These results prove that MultiSeismo provides a rigorous benchmark for future multimodal research in seismology and validate the success of our domain specific architectural adaptations.

2605.26302 2026-05-27 cs.AI cs.CL cs.MA 版本更新

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

你的智能体也在老化:面向部署系统的智能体寿命工程

Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出 AgingBench 基准,通过四种老化机制和诊断工具评估部署后智能体的可靠性退化,并指出需要寿命评估、机制级诊断和阶段针对性修复。

详情
AI中文摘要

长寿命AI智能体越来越多地被部署为持久化运行系统,但它们仍然像刚初始化的模型一样被评估。第一天基准测试忽略了一个基本系统问题:智能体在部署后能保持可靠多久?即使模型权重被冻结,智能体的有效状态也在不断变化,因为它压缩交互历史、从不断增长的记忆库中检索、在更新后修正事实,并经历常规维护。因此,可靠性成为整个智能体框架的寿命属性,而不仅仅是基础模型的快照属性。我们引入了AgingBench,一个用于智能体寿命工程的纵向可靠性基准:不仅测量部署的智能体是否退化,还测量退化的形式以及修复应针对何处。AgingBench将智能体老化组织为四种机制:压缩老化、干扰老化、修订老化和维护老化。为了诊断这些故障,AgingBench使用时间依赖图和对偶反事实探针,为记忆管道的写入、检索和利用阶段生成诊断档案。在7个场景、14个模型、多种记忆策略以及运行者控制和自主智能体上,跨越约400次运行(涵盖8到200个会话)的结果表明,智能体老化不是一维的:行为测试可以保持干净,而事实精度下降;派生状态跟踪可能在单个模型内急剧崩溃;相同的错误答案可能需要不同的修复,具体取决于诊断档案指向的内容。这些结果表明,可靠的智能体部署需要寿命评估、机制级诊断和阶段针对性修复,而不仅仅是更强的第一天模型。

英文摘要

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8 - 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

2605.26293 2026-05-27 cs.CL cs.AI 版本更新

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

CroCo: 基于自生成结果的跨语言对比偏好调优

Mike Zhang, Ali Basirat, Desmond Elliott

发表机构 * Department of Computer Science (DIKU), University of Copenhagen(哥本哈根大学计算机科学系(DIKU)) Centre for Language Technology (CST), University of Copenhagen(哥本哈根大学语言技术中心) Pioneer Centre for Artificial Intelligence(先锋人工智能中心)

AI总结 本文提出CroCo方法,利用英语偏好训练的奖励模型对多语言自生成结果进行对比偏好调优,无需语言特定偏好标注,在14种高低资源语言上提升模型性能,并避免灾难性遗忘。

详情
AI中文摘要

先前工作证实,通过奖励分数设置的大语言模型自生成结果之间的受控对比性,可以改善英语中的下游偏好调优。我们将此方法扩展到多种语言,并在总共14种高资源和低资源语言上,对两个模型在一系列多样化任务上进行评估。我们的核心发现是,跨语言对比偏好调优(CroCo)无需语言特定的偏好标注即可迁移。基于英语偏好(在多语言基础模型之上)训练的奖励模型,在大多数语言中产生了有用的语言内排名,并且在单语或多语设置中进行配对,在大多数设置上改进了每个模型,同时防止了监督微调的灾难性遗忘。我们观察到,这些增益需要基于策略的数据。非策略响应减少了收益,而在线偏好优化未能优于离线变体。具体来说,在结构化任务上,我们的方法在EuroLLM-9B的6/7种语言和Aya-3B的4/7种设置中匹配或超过了基础模型。在开放式生成中,两个调优模型在11种评估语言中均优于各自的基础模型。总体而言,我们展示了多语言偏好调优的有前景的方向。

英文摘要

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.

2605.26275 2026-05-27 cs.CL 版本更新

SPEAR: Code-Augmented Agentic Prompt Optimization

SPEAR: 代码增强的智能体提示优化

Mengyin Lu, Cong Feng, Huimin Han, Guangming Lu, Yu Sun, Xiaonan Ding, Shihui Long, Fengyi Li, Tanvi Motwani

发表机构 * LinkedIn Corporation(领英公司)

AI总结 提出SPEAR方法,将代码执行作为智能体工具进行提示优化,通过Python沙箱实现结构错误分析,并在工业任务和基准测试中取得显著提升。

Comments 19 pages, 3 figures, EMNLP 2026 submission

详情
AI中文摘要

自动提示工程(APE)重写提示以改进下游任务性能,但现有的APE循环将优化器本身视为固定流水线。我们将CodeAct(Wang等人,2024a)的代码即行动范式移植到APE,并提出SPEAR(沙盒化主动回滚提示工程师),一个具有四个工具(评估、python、set_prompt、finish)的自由形式智能体优化器,自主决定如何使用这些工具。独特的工具是Python沙箱:优化器在当前评估DataFrame上编写并执行任意Python代码,进行智能体自身编写的结构错误分析(混淆矩阵、错误聚类、每组指标)。两个护栏将长时程智能体转变为单调改进的优化器:指标回归时的自动回滚,以及可选的防护指标下限。我们在三个工业级LLM作为评判者的套件(涵盖招聘人员面试、对话记忆和查询改进系统的13个评判任务)以及七个BBH任务和GSM8K上进行评估。SPEAR在每个工业任务的主要指标上获胜(工具选择上κ 0.857 vs 0.359;过滤相关性上F1-macro 0.815 vs 0.763;最难提取维度上κ 0.254 vs 0.218)。在BBH-7上,SPEAR平均准确率0.938,而GEPA为0.628,TextGrad为0.484。消融实验表明,Python工具是复杂评判任务上最大的单一杠杆(在5类工具选择评判任务上Δ≈+0.79κ,在移除时最难提取维度上Δ≈+0.35κ);其不可替代的贡献是类对混淆聚合,而长上下文LLM无法从原始评估DataFrame中可靠提取该信息。

英文摘要

Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) to APE and propose SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer with four tools -- evaluate, python, set_prompt, finish -- that decides autonomously how and when to use them. The distinctive tool is the Python sandbox: the optimizer writes and executes arbitrary Python on the current evaluation DataFrame, performing structural error analysis (confusion matrices, error clustering, per group metrics) the agent itself authors. Two guardrails turn the long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional guard metric floor. We evaluate on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems) plus seven BBH tasks and GSM8K. SPEAR wins every industrial task on the primary metric ($κ$ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763 on filter-relevance; $κ$ 0.254 vs 0.218 on the hardest extraction dimension). On BBH-7 SPEAR averages 0.938 accuracy vs GEPA 0.628 and TextGrad 0.484. Ablations show the Python tool is the largest single lever on complex judge tasks ($Δ\approx +0.79κ$ on the 5-class tool-selection judge, $Δ\approx +0.35κ$ on the hardest extraction dimension when removed); its irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame.

2605.26174 2026-05-27 cs.SE cs.AI cs.CL cs.MA 版本更新

A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration

一个通用悬崖与一个设计指纹:LLM编排下的跨段缺陷检测

Hiroki Fukui

发表机构 * Research Institute of Criminal Psychiatry(刑事精神病研究机构) Sex Offender Medical Center(性犯罪医疗中心) Department of Neuropsychiatry, Graduate School of Medicine, Kyoto University(京都大学医学研究生院神经精神病学部门)

AI总结 本研究揭示在LLM编排下,所有模型检测跨段矛盾缺陷的能力大幅下降(检测率降低三分之二以上),并发现不同对齐范式下的模型行为差异,其中一家开发商的模型随对齐增强呈现报告标准偏移。

Comments 24 pages, 2 figures. Data and code: doi:10.5281/zenodo.20372696

详情
AI中文摘要

生产级语言模型系统通过将请求分解为不可见的编排工作代理(这些代理重新组合成一份综合报告)来回答请求。我们探究这对一类单个代理无法察觉的缺陷——文档中两个远距离段落之间关系的矛盾——有何影响。保持文档、缺陷、机制、评分和种子不变,我们仅改变模型:来自同一开发商的五代的十个系统,以及来自不同对齐范式的五个提供商。 两个层次分离。首先,一个通用检测悬崖:每个在单代理下能发现这些跨段缺陷的模型,在编排下失去该能力,检测率在所有测试的范式中下降三分之二或更多。该悬崖源于机制,无法通过规模或扩展推理弥补。其次,模型跌落后的行为方式。信号检测分解显示,在六个高于随机水平的判别模型中,只有一家开发商的模型沿报告标准轴移动:随着对齐增强,模型漏检更少缺陷,但在干净文档上引发更多误报——这是同一标准偏移的两个方面,在该开发商内部随代际缩放(p < 0.001),而在其他地方几乎不存在。 在底层,漏检的缺陷往往并非不可见:模型的私有记录准确重构了结构故障,而综合报告却确认其完好,其关注点放在工件和缺失的合作者上。这难以量化——自动评判不稳定(精确率17-50%),关键词也无法将其与普通同意区分——我们将这种抵抗作为一项发现报告。我们发布所有运行、探针、缺陷密钥、评分提示和脚本。综合报告的置信度对跨分区缺陷无信息量,最对齐的系统并非最安全,而悬崖是结构性的。

英文摘要

Production language-model systems answer a request by partitioning it across an invisible orchestration of worker agents that recompose one integrated report. We ask what this does to a class of defect no single worker can see: a contradiction in the relation between two distant sections of a document. Holding the documents, defects, mechanism, scoring, and seed fixed, we vary only the model -- ten systems across five generations from one developer and five providers from distinct alignment paradigms. Two layers separate. First, a universal detection cliff: every model that finds these cross-section defects under a single agent loses that ability under orchestration, detection falling two-thirds or more across every paradigm tested. The cliff is mechanism-derived and not closed by scale or extended reasoning. Second, how models behave once fallen. A signal-detection decomposition shows that, among the six models discriminating above chance, only one developer's generations move along the reporting-criterion axis: as alignment is strengthened, the model misses fewer defects yet raises more false alarms on clean documents -- two faces of one criterion shift, scaling with generation within that developer (p < 0.001) and near-absent elsewhere. At the floor the missed defect is often not out of view: the model's private record reconstructs the structural fault accurately, while the integrated report signs off on its soundness, its concern spent on the artifact and an absent collaborator. This resists quantification -- an automated judge is unstable (precision 17-50%) and keywords cannot separate it from ordinary agreement -- a resistance we report as a finding. We release all runs, probes, defect keys, scorer prompts, and scripts. An integrated report's confidence is uninformative about partition-spanning defects, the most aligned systems are not the safest, and the cliff is structural.

2605.26165 2026-05-27 cs.SE cs.AI cs.CL 版本更新

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

工具模式压缩实现受限上下文预算下的智能体检索增强生成

Furkan Sakizli

发表机构 * Independent Researcher(独立研究者)

AI总结 针对智能体RAG系统中工具模式与检索上下文竞争资源的问题,提出工具模式压缩方法,在8K上下文预算下将平均精确匹配率提升20.5个百分点,并验证了压缩模式在超过800个工具时仍可运行。

Comments 12 pages (8 main + 4 appendix), 7 tables, 2 figures. Code and data: https://github.com/SKZL-AI/tscg

详情
AI中文摘要

配备数十到数百个工具定义的语言模型的智能体RAG系统面临关键资源冲突:工具模式消耗了检索增强生成所需的相同上下文窗口。我们首次系统研究了这种工具-上下文权衡,评估了14个模型(涵盖1.5B-32B本地模型和一个前沿API模型),在三个上下文预算(8K、16K、32K)下使用28个工具定义进行了6,566次受控API调用。应用TSCG保守配置文件压缩(节省44-50%的模式令牌),我们观察到二元启用效应:在8K令牌时,JSON模式工具定义完全溢出上下文窗口,导致接近零的EM(平均2.6%),而压缩模式恢复了RAG功能,所有八个模型平均精确匹配提升20.5个百分点(六个表现出完全启用的模型平均提升24.7个百分点)。在32K(两种格式都适合)时,五个测试模型中的四个显示delta <= 1个百分点,确认该效应纯粹由预算驱动。在HotpotQA(50个多跳问题)上的外部验证显示,在相同溢出场景下EM提升48个百分点。前沿扩展测试表明,JSON模式在大约494个工具时溢出,而压缩模式在超过800个工具时仍可运行。我们的结果确立了工具模式压缩作为受限上下文部署中智能体RAG的必要基础设施层。所有代码、数据和检查点均已公开。

英文摘要

Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta <= 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.

2605.26133 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

大型语言模型中的预训练数据暴露:成员推断、数据污染及安全影响综述

Ziyi Tong, Feifei Sun, Le Minh Nguyen

发表机构 * Japan Advanced Institute of Science and Technology(日本先进科学研究院)

AI总结 本文首次统一综述了大型语言模型中的预训练数据暴露问题,涵盖成员推断和数据污染,形式化定义了暴露级别,回顾了攻击与防御方法,并总结了实证发现及未来研究方向。

Comments accepted by NLDB 2025

详情
AI中文摘要

大型语言模型(LLMs)已成为NLP中的主导范式,推动了研究和工业的发展。随着模型规模和预训练数据的增长,由于训练数据集的规模和不可见性,对预训练数据暴露(PDE)的担忧也在增加。PDE指的是确定特定数据是否出现在LLM的预训练语料库中。它对于确保评估完整性和保护隐私至关重要,涉及两个关键领域:数据污染和成员推断。尽管概念上相关,但这些领域通常被孤立研究。本文首次在PDE框架下对两者进行了统一综述。我们形式化了跨暴露级别的PDE,回顾了攻击和防御方法,综合了实证发现,并强调了开放的挑战和未来的研究方向。

英文摘要

Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.

2605.26132 2026-05-27 cs.CL cs.LG 版本更新

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

自验证蒸馏:你的语言模型秘密地就是它自己的合成数据管道

Tony Lee, Percy Liang

发表机构 * Stanford University(斯坦福大学)

AI总结 提出自验证蒸馏算法,让大语言模型仅用无标注种子问题,通过自生成、自验证和自训练提升推理能力,在数学、科学和编程任务上取得显著提升。

详情
AI中文摘要

经过后训练的大语言模型能否仅使用无标注提示,在没有外部教师或工具反馈的情况下进一步提升自己?我们在三个推理领域(数学、科学和编程)中研究这一设置,仅从没有真实解的无标注种子问题开始。我们提出自验证蒸馏,一种简单的后训练精炼算法,其中模型生成这些种子问题的候选解,使用基于提示的自验证进行过滤,并在由此产生的自策展数据集上进行训练。受UQ基准使用多个验证器筛选困难未解问题候选答案的启发,我们将这种基于验证的过滤思想应用于自训练:模型通过三级级联的循环一致性、事实性和正确性检查来过滤自己生成的解,仅当解通过所有阶段且获得一致判断时才被接受。我们发现,在训练数据构建过程中采样更多候选生成并使用更大的验证预算,可以产生更高质量的自策展数据,进而得到更好的推理模型。然后,我们使用自验证蒸馏训练多个规模的Qwen3模型,并在所有三个领域获得收益。对于Qwen3-4B,我们的方法在数学(AIME26和HMMT)上将聚合保留pass@1提升了+16.7个百分点,在科学(GPQA Diamond和HLE)上提升了+11.1个百分点,在编程(LCBv5和LCBv6)上提升了+8.3个百分点,这些收益也扩展到0.6B和8B模型。与我们的仅测试时基线(UQ-TTC)相比,后者通过在推理时花费额外计算来提升性能,自验证蒸馏在大多数设置下实现了更好的性能,同时仅在测试时进行一次推理调用。

英文摘要

Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from tools? We study this setting starting only from unlabeled seed questions with no ground-truth solutions, across three reasoning domains: math, science, and coding. We propose Self-Verified Distillation, a simple post-training refinement algorithm in which the model generates candidate solutions to these seed questions, filters them using prompt-based self-verification, and trains on the resulting self-curated dataset. Inspired by the UQ benchmark's use of multiple validators to screen candidate answers to hard unsolved questions, we adapt this validation-based filtering idea to self-training: the model filters its own generated solutions through a three-stage cascade of cycle-consistency, factuality, and correctness checks, accepting a solution only if it passes all stages with unanimous judge votes. We find that sampling more candidate generations and using a larger verification budget during training data construction produces higher-quality self-curated data and, in turn, better reasoning models. We then train Qwen3 models at multiple scales with Self-Verified Distillation and obtain gains across all three domains. For Qwen3-4B, our method improves aggregate held-out pass@1 by +16.7 points in math (AIME26 and HMMT), +11.1 points in science (GPQA Diamond and HLE), and +8.3 points in coding (LCBv5 and LCBv6), with gains also extending to 0.6B and 8B models. Compared to our test-time-only baseline (UQ-TTC), which improves performance by spending extra compute at inference time, Self-Verified Distillation achieves better performance in most settings while requiring only a single inference call at test time.

2605.26079 2026-05-27 cs.CL 版本更新

Automated Benchmark Auditing for AI Agents and Large Language Models

AI智能体与大语言模型的自动化基准审计

Junlin Wang, Federico Bianchi, Shang Zhu, Fan Nie, Yongchan Kwon, Bhuwan Dhingra, James Zou

发表机构 * Duke University(杜克大学) Together AI Stanford University(斯坦福大学)

AI总结 提出自动化基准审计框架ABA,系统审计基准任务中的隐藏环境依赖、规范缺失和评分逻辑问题,在168个基准中发现25.7%的任务存在关键问题,过滤后模型排名变化且性能提升约10%。

详情
AI中文摘要

现代AI基准的复杂性超出了传统验证方法的能力。由领域专家编写的任务通常包含隐含假设、不完整的环境规范和脆弱的评估逻辑,人工标注无法可靠地捕捉这些问题。我们引入了自动化基准审计(ABA),一个系统审计单个基准任务的智能体框架,揭示隐藏的环境依赖、规范缺失和有限的评分逻辑等问题。我们在前沿LLM基准和之前的NeurIPS出版物上运行ABA,共涵盖九个领域的168个基准。在这些语料中,ABA识别出关键问题,包括模糊的任务设计、执行环境冲突和错误的地面真值,在超过25.7%的评估任务中。这些自动化审计的精确性通过专家评审和独立第三方报告(如上游PR)得到验证。关键的是,我们证明这些有问题的任务严重扭曲了对智能体和LLM的能力评估:过滤掉这些有问题的任务会改变模型排名,并在SWE-bench Verified和Terminal-Bench 2上分别将平均性能提高9.9%和9.6%。我们发布智能体工具和所有任务注释,以支持前沿基准的未来发展。

英文摘要

Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distorts capability assessments for agents and LLMs: filtering out these tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.

2605.25971 2026-05-27 cs.CL cs.IR cs.MA 版本更新

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

预见与学习:在主动智能体中释放空闲时间计算

Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu, Jianghao Lin, Zixuan Guo, Yan Xu, Yasheng Wang, Weinan Zhang, Yong Yu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Tencent(腾讯)

AI总结 提出ProAct主动智能体架构,利用空闲时间计算预测并满足用户未来需求,通过ProActEval基准测试验证其在任务加速、减少用户努力和降低幻觉率方面的显著优势。

Comments 26 pages, 4 figures; code available at https://github.com/AgentACE-AI/ProAct

详情
AI中文摘要

虽然AI智能体在推理和工具使用方面展现出显著能力,但它们本质上仍然是被动的:仅在用户明确提示后才计算响应。这种范式忽略了一个关键机会:交互之间的空闲时间很大程度上被浪费,使得智能体无法为未来的用户需求做准备。为弥补这一差距,我们引入了ProAct,一种主动智能体架构,利用空闲时间计算来预测并满足可能即将出现的用户需求。通过分析不断演变的对话历史以及持久记忆,ProAct预测即将到来的需求并迭代获取信息,使智能体能够在用户发起查询之前解决知识差距并准备证据。为严格评估主动能力,我们还引入了ProActEval,一个包含40个领域200个场景的综合基准,具有可预测的需求链和多样化的用户认知特征。实验结果表明,与被动基线相比具有显著优势。ProAct通过减少14.8%的必要交互轮次加速任务完成,减少11.7%的用户努力,并在ProActEval上将幻觉率降低28.1%。此外,MemBench评估证实ProAct达到了最先进的反思准确性,突显其持续且稳健的性能。

英文摘要

While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a query. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state-of-the-art reflective accuracy, underscoring its sustained and robust performance.

2605.25758 2026-05-27 cs.CL 版本更新

StreamProfileBench: A Benchmark for Fine-Grained User Profile Inference in Real-World Streaming Scenarios

StreamProfileBench:真实流式场景中细粒度用户画像推断的基准

Sizhe Wang, Feiyu Duan, Juelin Wang, Liwen Zhang, Zhongyu Wei

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院) Shanghai Innovation Institute(上海创新研究院) Shanghai University of Finance and Economics(上海财经大学) MOE laboratory for National Development and Intelligent Governance, Fudan University(复旦大学国家发展与智能治理实验室) Research Institute of Intelligent Complex Systems, Fudan University(复旦大学智能复杂系统研究院)

AI总结 提出StreamProfileBench基准,通过持续状态维护任务和免标注评估框架,研究大语言模型在流式用户画像更新中的保守偏差问题。

详情
AI中文摘要

大型语言模型(LLMs)重塑了用户画像,但当前的评估主要关注静态数据快照。这种范式忽视了个性化系统的现实,其中用户生成内容(UGC)持续到达,细粒度画像快速演变。为弥补这一差距,我们引入了StreamProfileBench,一个用于细粒度流式用户画像的大规模基准。我们将流式用户画像形式化为一个持续状态维护任务,并整理了一个高度真实的数据集,包含来自五个不同平台的7000多名真实用户的超过12万条UGC帖子。通过利用用户兴趣的时间相关性,我们进一步提出了一种新颖的、无需标注的评估框架。在14个领先的LLM上的大量实验表明,持续画像更新仍然是一个开放的挑战。模型表现出系统性的保守偏差,过度保留过去的兴趣而未能识别兴趣衰减。消融实验进一步验证了流式范式的实用性和必要性。

英文摘要

Large Language Models (LLMs) have reshaped user profiling, yet current evaluations mainly focus on static data snapshots. This paradigm overlooks the reality of personalized systems, where User-Generated Content (UGC) arrives continuously and fine-grained profile evolve rapidly. To bridge this gap, we introduce StreamProfileBench, a large-scale benchmark for fine-grained streaming user profiling. We formalize streaming user profiling as a continuous state maintenance task and curate a highly authentic dataset comprising over 120,000 UGC posts from 7,000+ real users across five diverse platforms. By leveraging the temporal correlation of user interests, we further propose a novel, annotation-free evaluation framework. Extensive experiments across 14 leading LLMs reveal that continuous profile updating remains an open challenge. Models exhibit a systemic conservative bias, over-retaining past interests while failing to recognize interest decay. Ablation experiments further validate the practical utility and necessity of the streaming paradigm.

2605.25629 2026-05-27 cs.CL cs.LG 版本更新

When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

当分布内增益失效:评估偏好转移下的弱到强奖励模型

Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen

发表机构 * National University of Singapore(国立新加坡大学) VinUniversity(文大学) Nanyang Technological University(南洋理工大学)

AI总结 研究弱到强偏好学习在零样本分布转移下的表现,发现弱监督微调会导致强模型偏向源域特征,提出表示锚定正则化方法以改善跨分布迁移。

Comments Code: https://anonymous.4open.science/r/w2s_reward_ood-682F

详情
AI中文摘要

弱到强(W2S)泛化是一种有前景的可扩展监督框架,然而现有评估通常在同分布训练-测试条件下进行。因此,我们研究零样本分布转移下的W2S偏好学习,发现基于弱偏好标签训练的强学生模型在分布内表现成功,但无法跨偏好数据集迁移。我们提供了证据表明存在一种表示失败模式:弱监督微调可能将强模型拉向源域特征,而不是保持广泛可迁移的偏好表示。为了缓解这一问题,我们提出表示锚定(Anchor),一种简单而有效的正则化方法,在微调过程中约束强模型预训练表示空间的过度漂移,同时允许任务相关的适应。在多个偏好领域、数据集和模型家族中,Anchor一致地改进了分布外迁移,同时保持了具有竞争力的分布内性能。综合来看,我们的评估协议、迁移感知指标和方法揭示了当前W2S奖励建模中隐藏的脆弱性,并为实现更稳健的偏好迁移提供了实用路径。

英文摘要

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. Therefore, we study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weak-supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.

2605.23651 2026-05-27 cs.CL 版本更新

How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

大型语言模型有多像人类?一个语域感知的语言评估框架

Björn Nieth, Marianna Gracheva, Michaela Mahlberg, Bjoern Eskofier, Emmanuelle Salin

发表机构 * Department Artificial Intelligence in Biomedical Engineering (AIBE)(人工智能生物医学工程系) Department of Digital Humanities and Social Studies (DHSS)(数字人文与社会科学系) University of Birmingham(伯明翰大学) Chair of AI-supported Therapy Decisions(人工智能支持治疗决策教授职位) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Institute of AI for Health, Helmholtz Zentrum München(健康人工智能研究所,海德堡中心慕尼黑)

AI总结 提出一个基于语域感知的评估框架,通过比较人类参考语料库与LLM生成文本的词汇语法特征分布(使用最大均值差异和Biber的67个特征),发现LLM偏离人类基线,且最接近人类的模型取决于语域而非模型大小。

Comments 8.5 pages (main) + 31 pages appendix, 29 figures, 10 tables. Code and data: https://github.com/BjoernNieth/Register_Aware_LLMs

详情
AI中文摘要

虽然事实正确性和任务性能长期以来一直是大型语言模型(LLM)研究的焦点,但生成文本在语言层面上与人类相似程度这一基本问题尚未得到充分探索。从语料库语言学的角度来看,语言生产本质上是依赖语境的,不同的交际语境会导致语言特征的频率和共现模式产生差异。未能遵循这些模式的文本可能在内容上是正确的,但仍然不受人类读者欢迎。在这项工作中,我们提出了一个上下文感知的评估框架,其中通过使用给定语域的人类参考语料库与相应的LLM生成语料库之间的语言特征分布的两样本问题来评估人类相似度。我们使用最大均值差异(MMD)和Biber引入的67个词汇语法特征来实现该框架,这些特征通常应用于语料库语言学。在我们的实验中,我们比较了七个经过指令微调的开源模型,跨越五个不同语域的英语数据集,并与人类基线进行对比。虽然在所有测试设置中,LLM都偏离了人类基线,但哪些模型最接近人类语言取决于语域,而不是由模型大小决定。

英文摘要

While factual correctness and task-performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human-like generated texts are on a linguistic level has been underexplored. From a corpus-linguistic perspective, language production is inherently context-dependent, with distinct communicative contexts giving rise to differences in frequencies and co-occurrence patterns of linguistic features. A text failing to adhere to these patterns can be content-wise correct, but still be unfavorable to human readers. In this work, we propose a context-aware evaluation framework in which human-likeness is assessed using a two-sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding LLM-generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico-grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction-tuned, open-source models across five English-language datasets spanning distinct registers against a human baseline. While across all tested setups, LLMs deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size.

2605.22834 2026-05-27 cs.CL cs.IR 版本更新

Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion

查询自适应语义分块用于检索增强生成:一种具有上下文窗口扩展的动态策略

Mudit Rastogi

发表机构 * Independent Researcher(独立研究者)

AI总结 提出查询自适应语义分块(QASC)方法,通过将查询融入分块过程,利用句子-查询嵌入余弦相似度、上下文窗口扩展和分块级分数聚合,动态构建相关且连贯的文档块,在F1分数上比固定分块提升18-27%,比语义和智能体分块提升8-12%。

详情
AI中文摘要

检索增强生成(RAG)系统关键依赖于文档分块质量以检索相关上下文。固定分块将文档分割成统一单元,不考虑语义或用户意图,导致精度-召回率权衡无法通过调整块大小解决。语义和智能体方法部分解决了这些限制,但未在分块阶段集成用户查询。我们提出查询自适应语义分块(QASC),通过三种机制将查询融入分割以动态构建块:句子与查询嵌入之间的余弦相似度评分以识别种子句子,围绕种子的上下文窗口扩展以保持连贯性,以及块级分数聚合以确保整体相关性。我们在100篇技术文档上评估QASC,涵盖四种类型的200个查询,并与五种粒度的固定分块、递归分割、语义分块和智能体分块进行比较。QASC实现了0.85的F1分数,相对于固定分块相对提升18-27%,相对于语义和智能体替代方法提升8-12%。消融研究证实每个组件都有意义贡献。三名标注者的人工评估(Cohen kappa = 0.82)证实QASC比现有方法产生更相关和连贯的块。

英文摘要

Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a precision-recall trade-off unresolvable by tuning chunk size alone. Semantic and agentic methods partially address these limitations but do not integrate user queries at the chunking stage. We present Query-Adaptive Semantic Chunking (QASC), which dynamically constructs chunks by integrating queries into segmentation through three mechanisms: cosine similarity scoring between sentence and query embeddings to identify seed sentences, contextual window expansion around seeds to preserve coherence, and chunk-level score aggregation to ensure holistic relevance. We evaluate QASC on 100 technical documents across 200 queries spanning four types, comparing against fixed chunking at five granularities, recursive splitting, semantic chunking, and agentic chunking. QASC achieves an F1-score of 0.85, a relative improvement of 18-27% over fixed chunking and 8-12% over semantic and agentic alternatives. Ablation studies confirm each component contributes meaningfully. Human evaluation by three annotators (Cohen kappa = 0.82) corroborates that QASC produces more relevant and coherent chunks than existing methods.

2605.21883 2026-05-27 cs.CL 版本更新

Token-weighted Direct Preference Optimization with Attention

基于注意力的令牌加权直接偏好优化

Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie

发表机构 * Cornell University(康奈尔大学) Vanderbilt University(范德比大学)

AI总结 提出Token-weighted DPO (TwDPO)方法,利用注意力机制估计令牌权重,在不增加额外训练成本的情况下提升大语言模型与人类偏好对齐的性能。

详情
AI中文摘要

直接偏好优化(DPO)无需单独奖励模型即可使大语言模型与人类偏好对齐。然而,DPO平等对待响应中的所有令牌,忽略了单个令牌的不同重要性。现有的令牌级PO方法要么使用基于令牌位置的启发式函数,要么使用单独训练模型给出的概率估计来计算令牌权重,这缺乏鲁棒性且增加了额外训练成本。相比之下,我们提出令牌加权DPO(TwDPO)——一种基于令牌加权RL的新型训练目标,以及AttentionPO——TwDPO的一个实例,它利用LLM自身的注意力来估计令牌权重。AttentionPO提示LLM作为成对评判者,并在比较响应时检查模型关注的位置。这种设计使AttentionPO具有内容感知能力,根据响应内容调整权重,并且高效,每个样本仅需额外两次前向传播。实验结果表明,AttentionPO在AlpacaEval、MT-Bench和ArenaHard上显著提升了性能,超越了现有的偏好优化方法。

英文摘要

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existing token-level PO methods compute the token weights using either token-position-based heuristic functions or probability estimates given by a separately trained model, which lacks robustness and incurs extra training cost. In contrast, we propose Token-weighted DPO (TwDPO) -- a novel training objective grounded on token-weighted RL -- and AttentionPO -- an instantiation of TwDPO that uses attention from the LLM itself to estimate token weights. AttentionPO prompts the LLM to serve as a pairwise judge and check where the model attends when comparing the responses. This design makes AttentionPO content-aware, adjusting weights based on response content, and efficient, incurring only two extra forward passes per example. Experiment results show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard, surpassing existing Preference Optimization methods.

2605.19908 2026-05-27 cs.CL 版本更新

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

作者身份信号在基于编码器的语言模型中出现在哪里?

Francis Kulumba, Guillaume Vimont, Laurent Romary, Florian Cafiero

发表机构 * Inria Paris(巴黎国家信息与自动化研究所) Sorbonne Université(索邦大学) IRIF(IRIF研究所) LRE, EPITA Ecole nationale des chartes – PSL(LRE,EPITA国立档案馆 – 法国社会科学研究院)

AI总结 通过机械可解释性工具,研究不同评分机制对基于编码器的作者身份归因模型性能的影响,发现评分机制决定了编码器在何处整合作者身份信号。

Comments 12 pages, 6 figures. Under review

详情
AI中文摘要

使用相同的预训练编码器、数据和损失进行微调的作者身份归因模型,其性能可能相差四倍,这仅取决于它们的评分机制。我们使用机械可解释性工具来解释这一差距。诸如词长、标点密度和功能词频率等风格特征在我们探测的每个模型的每一层中都是相似的,包括一个现成的控制编码器,这表明差距不是由它们的线性可读性解释的。相反,因果干预表明,评分器似乎决定了编码器在何处整合作者身份信号。平均池化迫使信号在早期到中期层整合,而后期交互则将其推迟到后期层。我们进一步从每个评分器的梯度结构中推导出这种差异,训练动态揭示了遵循该差异的不同学习轨迹。

英文摘要

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are similarly available at every layer in every model we probe, including an off-the-shelf control encoder, suggesting that the gap is not explained by their linear readability. Instead, causal intervention shows that the scorer appears to determine where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

2605.18592 2026-05-27 cs.LG cs.AI cs.CL 版本更新

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS: 一种用于基于评分标准的强化学习的记忆增强评分标准改进系统

Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校) Adobe Inc.(Adobe公司) Department of Computer Science, University of California, Santa Barbara(加州大学圣芭芭拉分校计算机科学系)

AI总结 提出AMARIS系统,通过持久化评估记忆存储纵向训练证据来改进评分标准,在科学、医学、指令遵循和创意写作任务上优于静态、局部自适应和无记忆基线方法。

Comments Preprint. Under review

详情
AI中文摘要

基于评分标准的奖励塑形为通过强化学习(RL)微调大语言模型(LLMs)提供了可解释且可编辑的奖励信号,但现有的自适应评分标准方法通常从局部证据(如当前批次或实例级比较)更新标准。这种局部视角丢弃了训练过程中产生的诊断信息,使得难以跟踪重复失败、评估之前的评分标准编辑或在早期标准饱和后提高标准。我们引入了AMARIS,一种记忆增强的评分标准改进系统,它将评分标准更新建立在纵向训练证据之上。AMARIS将轨迹分析、步骤级摘要和评分标准更新记录存储在持久化评估记忆中,然后检索最近和语义相关的历史来修订评分标准。我们在全局和实例特定评分标准设置下,在科学、医学、指令遵循和创意写作任务上评估了AMARIS。AMARIS在静态、局部自适应和无记忆基线上有所改进,例如在GPQA-Diamond上比最强基线高出+2.8分,在IFBench上高出+2.2分,同时分析表明记忆减少了振荡性的评分标准编辑,并支持从早期错误纠正到后期课程推进的进展。AMARIS与正常RL循环异步运行,相对于同步评分标准更新减少了阻塞延迟。

英文摘要

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS, A Memory-Augmented Rubric Improvement System that grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, such as +2.8 points on GPQA-Diamond and +2.2 points on IFBench over the strongest baselines, while analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.

2605.17774 2026-05-27 cs.CL 版本更新

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

通过QLoRA微调将工具知识内化到小型语言模型中

Yuval Shemla, Ayal Yakobe, Tanmay Agarwal, Dhaval Patel, Kaoutar El Maghraoui

发表机构 * Columbia School of General Studies, Columbia University, NY, USA(哥伦比亚大学泛研学院) Columbia Engineering, Columbia University, NY, USA(哥伦比亚大学工程学院) IBM Research, NY, USA(IBM研究院)

AI总结 本文研究通过QLoRA参数高效微调将工具知识内化到小型语言模型中,在AssetOpsBench基准上,微调后的Gemma 4 E4B和Qwen3-4B模型在无描述推理下优于有完整工具描述的未微调基线,输入长度减少82.6%,规划分数提升。

详情
AI中文摘要

大型语言模型越来越多地被用作代理系统中的规划组件,但当前的工具使用流程通常需要将完整的工具模式包含在每个提示中,这产生了大量的令牌开销,并限制了较小模型的实用性。本文研究了是否可以通过参数高效微调将工具使用知识内化到小型语言模型中,从而在推理时无需显式的工具描述即可进行结构化规划。使用AssetOpsBench作为主要基准,我们使用8位QLoRA在约1700个工具使用示例上微调了Gemma 4 E4B和Qwen3-4B,这些示例涵盖工具知识、问题到规划的映射以及执行风格的轨迹。我们在无描述推理下评估了生成的模型,其中提示完全省略了工具目录。微调后的模型优于接收完整工具描述的有信息未微调基线,输入长度减少了82.6%,同时提高了结构性和LLM评判的规划分数。在最佳的Gemma运行中,模型达到了0.65的AT-F1和3.88的整体评判分数,而信息基线的分数分别为0.47和2.88。Qwen3-4B达到了3.78的强劲整体评判分数,同时使用的内存比Gemma少62%,运行速度快2.5倍,尽管它在一般多项选择基准上也表现出更大的灾难性遗忘。额外的消融实验表明,LoRA秩控制着质量与保留之间的权衡,其中$r=32$最大化规划质量,而较小的秩保留了更多的一般知识。这些结果表明,对于固定的工具目录,QLoRA微调可以将工具知识从提示上下文转移到模型权重中,从而在保持或提高工具规划质量的同时,大幅减少推理开销。

英文摘要

Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6\% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62\% less memory and running 2.5$\times$ faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality--retention trade-off, with $r=32$ maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.

2605.17482 2026-05-27 cs.CL cs.LG 版本更新

RSD: A Local Triangulation Audit Primitive for Learned Vector Blocks

RSD:一种用于学习向量块的局部三角剖分审计原语

Seungmin Jin

发表机构 * HSE University(俄罗斯高等经济大学)

AI总结 提出RSD(关系语义分解)作为局部三角剖分审计方法,通过拟合单纯形成员关系和坐标极点,结合关系解码器和坐标残差,实现学习向量块的可解释性审计。

Comments 8 pages, 1 figure. Revised version with clarified scope, experiments, and limitations

详情
AI中文摘要

局部XAI审计将有限的学习向量块与弱侧信号进行比较。基线方法如最近邻查找、低秩坐标模型和关系分解揭示了审计的不同部分。我们引入关系语义分解(简称RSD),作为学习向量块的局部三角剖分审计。给定坐标X和一个声明的有界弱亲和代理A,RSD拟合单纯形成员关系S和坐标极点C。它在关系解码器中重用S来解码A,并报告坐标残差R=X-SC。这产生了一个范围限定的审计单元:所选块、代理、解码器类和损失预算的兼容性,以及组件质量和残差读数。合成控制检查单纯形重构、代理解码和固定S残差分解。定理陈述、月份和狗/狼块说明了为什么低代理损失应结合组件质量、残差读数和块大小来解读。

英文摘要

Local XAI audits compare a finite block of learned vectors with a weak side signal. Baselines such as nearest-neighbor lookup, low-rank coordinate models, and relation factorization expose different parts of this audit. We introduce Relational Semantic Decomposition, abbreviated as RSD, as a local triangulation audit for learned vector blocks. Given coordinates X and a declared bounded weak affinity proxy A, RSD fits simplex memberships S and coordinate poles C. It reuses S in a relation decoder for A and reports the coordinate residual R=X-SC. This yields a scoped audit unit: compatibility for the chosen block, proxy, decoder class, and loss budget, plus component mass and residual readouts. Synthetic controls check simplex reconstruction, proxy decoding, and fixed-S residual decomposition. The theorem-statement, month, and dog/wolf blocks illustrate why low proxy loss should be read with component mass, residual readouts, and block size.

2604.27019 2026-05-27 cs.LG cs.CL cs.CR 版本更新

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

动态对抗微调重组拒绝几何结构

Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu, Junbin Yang, Haihua Shen, Yijun Yang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Inner Mongolia University of Technology(内蒙古科技大学) Tsinghua University(清华大学) Shandong University(山东大学)

AI总结 研究动态对抗微调如何改变安全对齐语言模型中拒绝行为的因果控制载体(低维子空间),发现R2D2沿鲁棒性-效用前沿重组几何结构但未建立自适应鲁棒性。

详情
AI中文摘要

安全对齐的语言模型必须拒绝有害请求而不广泛过度拒绝,但尚不清楚动态对抗微调如何改变拒绝控制载体:Kullback--Leibler (KL)约束方向或因果调节拒绝而不引起大规模安全提示分布偏移的小子空间。我们研究了一个7B骨干模型在监督微调(SFT)和鲁棒拒绝动态防御(R2D2)下的表现,将HarmBench、StrongREJECT和XSTest评估与五点几何测量、因果干预和稀疏自适应压力测试对齐。R2D2在早期检查点将固定源HarmBench攻击成功率降至零;然而,这些检查点也表现出最大的XSTest拒绝率并未能通过良性效用审计。后期检查点部分恢复了面向效用的行为,同时重新打开了攻击成功率,自适应GCG攻击成功率在第250步升至0.415,第500步升至0.613。内部地,R2D2在第100步之前保留了一个后期层的可接受拒绝控制载体,然后将最佳可接受载体迁移到早期层;SFT迁移更早但鲁棒性较差。有效秩保持在1.24附近,SFT表现出更大的主角漂移,这反对将维度扩展和漂移幅度作为充分解释。因果干预支持一个低维但效用耦合的载体。这些结果支持R2D2沿鲁棒性-效用前沿的几何重组解释,但未建立自适应鲁棒性。

英文摘要

Safety-aligned language models must refuse harmful requests without broad over-refusal, but it remains unclear how dynamic adversarial fine-tuning changes refusal-control carriers: Kullback--Leibler (KL)-constrained directions or small subspaces that causally modulate refusal without large safe-prompt distribution shifts. We study a 7B backbone under supervised fine-tuning (SFT) and Robust Refusal Dynamic Defense (R2D2), aligning HarmBench, StrongREJECT, and XSTest evaluations with five-anchor geometry measurements, causal interventions, and sparse adaptive stress tests. R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints; however, these checkpoints also exhibit maximal XSTest refusal and fail a benign-utility audit. Later checkpoints partially recover utility-facing behavior while reopening attack success, with adaptive GCG attack success rate rising to 0.415 at step 250 and 0.613 at step 500. Internally, R2D2 preserves a late-layer admissible refusal-control carrier through step 100 and then relocates the best admissible carrier to an early layer; SFT relocates earlier yet remains less robust. Effective rank stays near 1.24, and SFT shows larger principal-angle drift, arguing against both dimensional expansion and drift magnitude as sufficient explanations. Causal interventions support a low-dimensional but utility-coupled carrier. These results support a geometry-reorganization account of R2D2 along a robustness--utility frontier, without establishing adaptive robustness.

2605.14473 2026-05-27 cs.CL cs.AI 版本更新

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

RAG 能知道检索错误吗?知识冲突下的上下文合规性诊断

Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, Xinpeng Wei

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Carnegie Mellon University(卡内基梅隆大学) University of California San Diego(加州大学圣地亚哥分校)

AI总结 提出上下文驱动分解(CDD)方法,在推理时探测并干预检索增强生成中的上下文与参数知识冲突,揭示上下文合规性模式并提升鲁棒性。

Comments 12 pages, 4 figures, 3 tables

详情
AI中文摘要

检索增强生成(RAG)中的上下文合规机制发生在检索到的上下文主导最终答案时,即使它与模型的参数化知识冲突。仅凭准确性并不能揭示在这种冲突下检索到的上下文如何因果性地塑造答案。我们引入了上下文驱动分解(CDD),这是一种在推理时运行的信念分解探针,并作为受控检索冲突的干预机制。通过跨Epi-Scale压力测试、TruthfulQA错误概念注入和跨模型重复实验,CDD揭示了三种模式。P1:上下文合规性在对抗性上界设置中是可测量的,标准RAG在TruthfulQA错误概念注入(N=500)上达到15.0%的准确率。P2:对抗性准确率提升跨模型家族迁移——CDD提高了Gemini-2.5-Flash以及Claude Haiku/Sonnet/Opus的准确率——但理由-答案因果耦合不迁移。CDD在Gemini-2.5-Flash上达到64.1%的错误注入因果敏感性,而所有三种Claude变体的敏感性落在[-3%, +7%]范围内,表明Claude侧的准确率提升通过一种与显式冲突解决轨迹不同的机制运作。P3:显式冲突分解提高了时间漂移和噪声干扰下的鲁棒性,CDD在完整Epi-Scale对抗性基准上对时间偏移达到71.3%,对干扰证据达到69.9%。这三种模式将上下文合规性识别为一个结构轴,沿此轴可以对标准RAG进行探测和干预,区别于检索质量或单一方法鲁棒性问题,并激励发布Epi-Scale以跨模型家族和检索管道进行系统研究。

英文摘要

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross-model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families -- CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus -- but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake-injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.

2605.11651 2026-05-27 cs.CV cs.AI cs.CL 版本更新

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Hide to See: 面向VLM蒸馏中视觉锚定思维的推理前缀掩码

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

发表机构 * KAIST(韩国科学技术院) NVIDIA(英伟达) POSTECH(POSTECH大学)

AI总结 提出一种推理前缀掩码蒸馏框架,通过掩码学生模型的显著推理前缀,迫使其在推理过程中更依赖视觉证据,从而缓解长推理轨迹中的视觉遗忘问题,提升多模态推理性能。

Comments Pre-print

详情
AI中文摘要

近期VLM中的思考-回答方法(如Qwen3-VL-Thinking)通过在最终答案前利用中间推理步骤来提升推理性能,但其计算成本显著增加,尤其是对于较大的VLM。为了将这种能力蒸馏到紧凑的思考-回答VLM中,一个主要目标是提高学生在整个推理轨迹中利用视觉证据的能力,因为长思考-回答轨迹存在视觉遗忘问题。为此,我们引入了一种新颖的思考-回答蒸馏框架,通过掩码学生模型的显著推理前缀,鼓励学生将思考锚定在视觉信息上。为了补偿这种被掩码的文本线索,学生在蒸馏过程中被鼓励更多地依赖视觉证据作为替代信息源。我们的掩码策略包括:1)逐token的显著推理前缀掩码,针对每个下一token预测选择性掩码高影响力的推理前缀;2)自调节掩码预算调度,根据教师-学生分布之间的差异(即蒸馏难度)逐渐增加掩码规模。在蒸馏阶段,学生模型由我们的显著推理前缀掩码引导,该掩码同时阻塞未来token和显著推理线索,替代了自回归语言建模中使用的标准因果掩码。实验结果表明,我们的方法在多模态推理基准上优于最近的开源VLM、VLM蒸馏和自蒸馏方法,进一步分析证实了学生思考过程中视觉利用的增强。

英文摘要

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting issues. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between teacher--student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyzes confirm enhanced visual utilization along the student thinking process.

2605.14480 2026-05-27 cs.CL 版本更新

Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

《会同馆华夷译语》中的跨语言转写与音系表征

Ji-eun Kim

发表机构 * Department of Korean language and literature, Duksung Women’s University(韩国语言文学系,杜克松女子大学)

AI总结 本研究将《会同馆华夷译语》视为一个连贯的多语言转写系统,通过数字化和音系分析,揭示了其主要转写和补充转写的跨语言规律,并论证了该系统作为历史音系证据的价值。

Comments 49 pages; 1 figure; 40 tables; SLE2019; under review

详情
AI中文摘要

目的:本研究调查《会同馆华夷译语》(HHY)的转写原则,该系列多语词汇集由明朝政府在15至16世纪间编纂,用于译员培训。本研究不将HHY视为孤立语言材料的集合,而是将其视为一个连贯的多语言转写系统,通过汉字表征非汉语语言的口语形式。方法:将HHY的绝大部分数字化,并与汉语音韵范畴对齐。对先前各语言部分的重建进行批判性审查,并整合到一个统一的比较数据库中。分析聚焦于八个语言部分中主要转写(MT)和补充转写(ST)的跨语言规律。结果:MT通常表征与当时汉语音节结构兼容的音,而ST主要编码与汉语音系兼容性较差的语音特征。分析进一步表明,汉语音韵范畴在外语转写中的使用比先前假设的更为灵活。因此,HHY作为一种相对系统的语音近似方法,而非汉语音系对非汉语语言的直接投射。结论:HHY可被分析为一个内部结构化的转写系统,而不仅仅是词汇集的集合。更广泛地说,该研究表明历史转写系统可为历史音系学提供宝贵证据,尤其对于历史记录有限的亚洲语言。

英文摘要

Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records.

2509.09544 2026-05-27 cs.CL 版本更新

MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022-2025)

MetaGraph:金融NLP中GenAI的大规模元分析(2022-2025)

Paolo Pedinotti, Peter Baumann, Nathan Jessurun, Leslie Barrett, Enrico Santus

发表机构 * Bloomberg(贝莱德)

AI总结 提出MetaGraph方法,利用本体引导的LLM从科学语料中提取类型化知识图谱,对681篇GenAI在金融领域的论文进行结构化趋势分析,揭示了三个阶段:早期LLM驱动的任务和数据集扩展、对局限性和风险的日益关注、以及向模块化系统导向方法的转变。

Comments 8 pages, appendices, GEM, ACL

详情
AI中文摘要

自2022年底以来,金融NLP迅速发展,超越了叙述性综述。我们引入了MetaGraph,一种使用本体引导的LLM从科学语料中提取类型化知识图谱的方法,以实现结构化的大规模趋势分析。应用于681篇关于金融领域GenAI的论文(2022-2025),MetaGraph揭示了三个阶段:早期LLM驱动的任务和数据集扩展、对局限性和风险的日益关注、以及向模块化系统导向方法(如检索增强设计)的转变。我们发布了生成的资源和工件,以支持可重复的元分析和未来对该领域的监测。

英文摘要

Financial NLP has evolved rapidly since late 2022, outpacing narrative surveys. We introduce MetaGraph, a methodology for extracting typed knowledge graphs from scientific corpora using ontology-guided LLM extraction to enable structured, large-scale trend analysis. Applied to 681 papers on GenAI in Finance (2022-2025), MetaGraph reveals three phases: early LLM-driven expansion of tasks and datasets, growing emphasis on limitations and risk, and a shift toward modular, system-oriented methods (e.g., retrieval-augmented designs). We release the resulting resource and artifacts to support reproducible meta-analysis and future monitoring of the field.

2605.06152 2026-05-27 cs.LG cs.CL math.OC stat.ML 版本更新

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

Grokking 还是 Glitching?低精度如何驱动 Slingshot 损失尖峰

Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou

发表机构 * Tsinghua University(清华大学) The University of Tokyo(东京大学)

AI总结 本文证明深度神经网络训练中的 Slingshot 损失尖峰现象是由浮点精度限制导致的数值特征膨胀(NFI)机制引起的,并解释了参数范数快速增长和梯度消失等现象。

Comments 28 pages, 13 figures; ICML 2026 Workshop on High-dimensional Learning Dynamics (Spotlight)

详情
AI中文摘要

深度神经网络在无正则化的长期训练中会出现周期性的损失尖峰,这种现象被称为“Slingshot 机制”。现有工作通常将其归因于内在的优化动力学,但其触发机制仍不清楚。本文证明这种现象是浮点算术精度限制的结果。当训练进入高置信度阶段时,正确类别的 logit 与其他 logit 之间的差异可能超过吸收误差阈值。然后在反向传播中,正确类别的梯度被精确舍入为零,而错误类别的梯度保持非零。这打破了跨类别的梯度零和约束,并在分类器层的参数更新中引入了系统性漂移。我们证明这种漂移与特征形成正反馈循环,导致全局分类器均值和全局特征均值呈指数增长。我们将这种机制称为数值特征膨胀(NFI)。该机制解释了 Slingshot 尖峰前的快速范数增长、随后梯度的重新出现以及由此产生的损失尖峰。我们进一步表明,NFI 并不等同于观察到的损失尖峰:在更实际的任务中,部分吸收可能不会产生可见的尖峰,但它仍然可以打破零和约束并驱动参数范数的快速增长。我们的结果将 Slingshot 重新解释为有限精度训练的一种数值动力学,并为训练后期异常参数增长和 logit 发散提供了可检验的解释。

英文摘要

Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.

2605.09156 2026-05-27 cs.CL cs.AI 版本更新

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

迷失在翻译中?探索从拉丁语到奥克语的语法性别转变

Ahan Chatterjee, Matthias Schöffel, Matthias Aßenmacher, Marinus Wiedner, Esteban Garces Arias

发表机构 * Bavarian Academy of Sciences (BAdW)(巴伐利亚科学学院) LMU Munich(慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) University of Freiburg(弗赖堡大学)

AI总结 本文提出一个可解释的深度学习框架,通过词法和上下文层面分析拉丁语到奥克语的语法性别系统从三分(阳性、阴性、中性)到二分(阳性、阴性)的演变,并展示了改进的分词策略和形态特征、词性对性别预测的贡献。

Comments Accepted at NLP4DH @ ACL 2026

详情
AI中文摘要

从拉丁语到罗曼语族的历时演变涉及语法性别系统的重组,在大多数罗曼语中从三分结构(阳性、阴性、中性)变为二分结构(阳性、阴性)。在这项工作中,我们引入了一个可解释的深度学习框架,在词法和上下文层面研究这一现象。首先,我们表明传统的分词策略对于这种低资源历史设置不够稳健,而我们提出的分词器在这些基线上提高了性能。在词法层面,我们评估了形态特征对性别预测的贡献。在上下文层面,我们量化了不同词性类别对语法性别预测的贡献。这些分析共同刻画了性别信息在词元及其句子上下文之间的分布。我们在 \href{https://github.com/ahan-2000/Lost-in-Translation-}{https://github.com/ahan-2000/Lost-in-Translation-} 公开了我们的代码库、数据集和结果。

英文摘要

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at \href{https://github.com/ahan-2000/Lost-in-Translation-}{https://github.com/ahan-2000/Lost-in-Translation-}.

2605.07990 2026-05-27 cs.CL cs.AI cs.LG cs.SE 版本更新

Tool Calling is Linearly Readable and Steerable in Language Models

语言模型中的工具调用是线性可读且可引导的

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

发表机构 * University College London(伦敦大学学院) Holistic AI Imperial College London(伦敦帝国学院)

AI总结 本文发现语言模型内部存在对应工具选择的线性方向,通过干预该方向可切换工具调用,并能提前检测潜在错误,在多个模型和基准上验证了有效性。

Comments 24 pages. ACL ARR May 2026 submission (EMNLP 2026 preferred venue); v2 reflects revised manuscript

详情
AI中文摘要

当工具调用代理选错工具时,失败在执行之前是不可见的:邮件被发送,会议被错过。随着代理承担重要行动,一次糟糕的工具调用可能造成实际损害。目前我们无法在模型内部查看并在错误发生前捕捉它;本文表明我们可以做到。在模型内部,工具的选择由激活空间中的单个方向承载,每对工具对应一个方向。在生成过程中添加该方向会切换模型选择的工具。在涵盖 Gemma 3、Qwen 3、Qwen 2.5 和 Llama 3.1(270M 到 27B)的 12 个指令微调模型和 6 个基础模型上,这在 4B+ 指令微调模型上对 15 个工具的合成基准达到 83-100% 的准确率,在真实 API 基准 τ-bench airline 上达到 77-94%。随后的 JSON 参数自动适应新工具的模式,因此仅翻转名称就足够了。相同的每工具方向还能在错误发生前标记潜在错误:模型在两个工具之间不确定的查询失败率比确定的高 21 倍(Gemma 3 27B)。这不仅仅是主题注入:相同幅度的随机向量给出 0% 的切换率,而在单个领域(共享一个主题的 14 个航空工具)内的探针仍然能在五个 4B-14B 模型上以 top-1 61-89% 的准确率读取模型将调用的工具。即使是基础模型在能够输出工具之前内部已经携带了正确的工具:从模型内部状态读取所选工具(余弦读出)在 BFCL 上恢复 61-82% 的准确率,而基础生成仅为 2-10%,这表明预训练形成了表示,而指令微调后来将其连接到输出。我们的结果涵盖单轮、固定菜单设置;在多轮代理循环中,相同的干预不太稳定(匹配基线的增益或损失高达 30 个百分点,没有一致的方向)。

英文摘要

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark $τ$-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough. The same per-tool directions also flag likely errors before they happen: queries where the model is unsure between two tools fail 21x more often than queries where it is not (Gemma 3 27B). This is not just topic injection: random vectors at the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call at top-1 61-89% across five 4B-14B models. Even base models already carry the right tool internally before they can emit it: reading the chosen tool off the model's internal state (cosine readout) recovers 61-82% accuracy on BFCL while base generation lands at 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; on multi-turn agent loops the same intervention is less stable (matched-baseline gain or loss of up to 30 percentage points with no consistent direction).

2605.07632 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Post-training makes large language models less human-like

后训练使大型语言模型更不像人类

Marcel Binz, Elif Akata, Abdullah Almaatouq, Mohammed Alsobay, Oleksii Ariasov, Franziska Brändle, David Broska, Jason W. Burton, Nuno Busch, Frederick Callaway, Vanessa Cheung, Brian Christian, Julian Coda-Forno, Can Demircan, Vittoria Dentella, Maria K. Eckstein, Noémi Éltető, Michael Franke, Thomas L. Griffiths, Fritz Günther, Susanne Haridi, Sebastian Hellmann, Stefan Herytash, Linus Hof, Eleanor Holton, Isabelle Hoxha, Zak Hussain, Akshay Jagadish, Elif Kara, Valentin Kriegmair, Evelina Leivada, Li Ji-An, Tobias Ludwig, Maximilian Maier, Marcelo G. Mattar, Marvin Mathony, Alireza Modirshanechi, Robin Na, Mariia Nadverniuk, Antonios Nasioulas, Surabhi S. Nath, Helen Niemeyer, Kate Nussenbaum, Sebastian Olschewski, Thorsten Pachur, Stefano Palminteri, Aliona Petrenco, Camille V. Phaneuf-Hadd, Angelo Pirrone, Manuel Rausch, Laura Raveling, Shashank Reddy, Milena Rmus, Evan M. Russek, Tankred Saanum, Kai Sandbrink, Louis Schiekiera, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Leah H. Somerville, Mikhail S. Spektor, Xin Sui, Christopher Summerfield, Mirko Thalmann, Anna I. Thoma, Taisiia Tikhomirova, Vuong Truong, Polina Tsvilodub, Konstantinos Voudouris, Kristin Witte, Shuchen Wu, Dirk U. Wulff, Hua-Dong Xiong, Songlin Xu, Lance Ying, Xinyu Zhang, Jian-Qiao Zhu, Eric Schulz

发表机构 * Helmholtz Munich(海德堡-慕尼黑亥姆霍兹中心) Massachusetts Institute of Technology(麻省理工学院) University of Tübingen(图宾根大学) University of Oxford(牛津大学) Stanford(斯坦福大学)

AI总结 通过引入Psych-201数据集,发现后训练(将基础模型转化为有用助手的过程)一致地降低了模型与人类行为的对齐度,且这种错位在新模型世代中加剧,而人物诱导技术无法改善个体层面的预测。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作人类参与者的替代品,但目前尚不清楚哪些模型最能捕捉人类行为及其原因。为了解决这个问题,我们引入了Psych-201,这是一个新颖的数据集,使我们能够大规模测量行为对齐。我们发现,后训练——将基础模型转化为有用助手的阶段——在模型家族、规模和目标上一致地降低了与人类行为的对齐度。此外,这种错位在新模型世代中扩大,即使基础模型继续改进。最后,我们发现人物诱导——一种通过将模型条件化为参与者特定信息来引发类人行为的流行技术——并不能改善个体层面的预测。综合来看,我们的结果表明,当前用于将LLMs转化为有用助手的那些过程也使得它们成为人类行为的不太准确的模型。

英文摘要

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

2605.07053 2026-05-27 cs.CL cs.AI 版本更新

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

GSM-SEM: 生成语义变体增强的基准与框架

Jyotika Singh, Fang Tu, Aziza Mirsaidova, Amit Agarwal, Hitesh Laxmichand Patel, Sandip Ghoshal, Miguel Ballesteros, Karan Dua, Yassine Benajiba, Weiyi Sun, Tao Sheng, Graham Horwood, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结 提出GSM-SEM框架,通过修改实体、属性和关系生成语义多样的数学问题变体,降低模型对固定测试集的记忆偏差,并在多个基准上验证性能下降。

详情
AI中文摘要

像GSM8K这样的基准测试是数学推理的流行度量,但由于对固定测试集的记忆,排行榜上的提升可能夸大真实能力。大多数鲁棒性变体应用表面级别的扰动(释义、重命名、数字交换、干扰项),这些扰动在很大程度上保留了底层事实,而静态发布本身可能随着时间的推移成为记忆目标。我们引入了GSM-SEM,一个可重用且随机的框架,用于生成语义多样化的基准变体,其语义方差显著高于先前方法。GSM-SEM通过修改实体、属性和/或关系来扰动问题陈述,经常改变底层事实,并要求模型在新条件下重新计算解决方案,同时约束生成以保留原始计算/答案和近似问题难度。GSM-SEM在每次运行时生成新的变体,无需重新标注,减少了对静态公共基准评估的依赖,从而降低了记忆偏差。我们将GSM-SEM应用于GSM8K和两个现有的变体系列(GSM-Symbolic和GSM-Plus),生成了GSM8K-SEM、GSM-Symbolic-SEM和GSM-Plus-SEM。评估14个SOTA LLM,我们观察到一致的性能下降,当语义扰动与符号/plus变体结合时下降更大(在GSM-SEM的最大严格配置中平均下降率为28%)。我们公开发布这三个SEM变体作为完全人工验证的数据集。最后,为了展示在GSM风格数学问题之外的适用性,我们将GSM-SEM应用于其他基准,包括BigBenchHard、LogicBench和NLR-BIRD。

英文摘要

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

2509.26619 2026-05-27 cs.CL cs.AI 版本更新

Searching the Internet for Challenging Benchmarks at Scale

在互联网上大规模搜索具有挑战性的基准测试

Wenda Xu, Vilém Zouhar, Parker Riley, Mara Finkelstein, Markus Freitag, Daniel Deutsch

发表机构 * Google(谷歌) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种自动框架,将互联网建模为多臂老虎机问题,通过epsilon-greedy策略高效搜索最具挑战性的主题,以构建无需人工筛选的基准测试。

详情
AI中文摘要

许多静态基准测试开始饱和:随着模型快速改进,它们在固定测试集上获得近乎完美的分数,几乎没有剩余空间来暴露模型的真正弱点——即使是专家策划的挑战集在爬山后也会迅速饱和。我们提出一个完全自动化的框架,在互联网上大规模搜索以构建具有挑战性的基准测试,无需人工筛选。关键洞察是将互联网建模为一个广阔的主题空间,并将搜索形式化为多臂老虎机问题,其中每个主题的难度仅通过昂贵的采样和评估查询来揭示。我们的epsilon-greedy策略在仅探索6%的搜索空间的情况下识别出最具挑战性的主题——相比穷举评估成本降低了100倍。我们在机器翻译和知识问答上进行了验证,确认发现的难度在独立指标(GEMBA-SQA和MetricX)、语言和模型上都是稳健的。

英文摘要

Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses -- and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and formalize the search as a multi-armed bandit problem, where each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space -- a 100 times cost reduction over exhaustive evaluation. We validate on machine translation and knowledge question answering, confirming that discovered difficulty is robust across independent metrics (GEMBA-SQA and MetricX), languages, and models.

2605.01489 2026-05-27 cs.AI cs.CL 版本更新

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

SciResearcher: 面向前沿科学推理的深度研究智能体规模化

Tianshi Zheng, Rui Wang, Xiyun Li, Kelvin Kiu Wai Tam, Newt Nguyen Kim Hue Nam, Wei Fan, Yangqiu Song, Tianqing Fang

发表机构 * HKUST(香港科技大学) CUHK(香港大学) Tencent AI Lab(腾讯AI实验室)

AI总结 提出SciResearcher框架,通过合成基于学术证据的概念与计算任务并训练智能体,在HLE-Bio/Chem-Gold等基准上达到最优性能。

Comments 23 pages, 6 figures, 15 tables

详情
AI中文摘要

前沿科学推理正迅速成为推动AI智能体在自动化科学发现中的关键基础。深度研究智能体为此挑战提供了有前景的方法。这些模型通过后训练处理信息寻求任务(通常通过知识图谱构建或迭代网页浏览来策划)来发展强大的问题解决能力。然而,这些策略在前沿科学中面临固有局限性,因为领域特定知识分散在稀疏且异构的学术来源中,而问题解决需要远超事实回忆的复杂计算和推理。为弥合这一差距,我们引入了SciResearcher,一个用于前沿科学数据构建的全自动智能体框架。SciResearcher综合了基于学术证据的多样化概念和计算任务,同时激发信息获取、工具集成推理和长程能力。利用策划的数据进行监督微调和智能体强化学习,我们开发了SciResearcher-8B,一个在HLE-Bio/Chem-Gold基准上达到19.46%的智能体基础模型,在其参数规模上建立了新的最先进水平,并超越了多个更大的专有智能体。它在SuperGPQA-Hard-Biology和TRQA-Literature基准上进一步取得了13-15%的绝对提升。总体而言,SciResearcher为前沿科学推理的自动数据构建引入了一种新范式,并为未来的科学智能体提供了一条可扩展的路径。

英文摘要

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

2605.01188 2026-05-27 cs.CL 版本更新

Compute Optimal Tokenization

计算最优分词

Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

发表机构 * FAIR at Meta(Meta 的 FAIR) University of Washington(华盛顿大学)

AI总结 通过训练988个BLT模型,研究压缩率(每token平均字节数)对缩放趋势的影响,发现计算最优配置下模型参数量与数据字节数成比例,且最优压缩率低于BPE并随计算量下降。

详情
AI中文摘要

缩放定律能够优化数据量和语言模型大小的选择,但数据单元——token——对此关系的影响尚未充分探索。本文系统研究了由压缩率(即每token平均文本字节数)控制的token信息粒度如何影响缩放趋势。我们训练了988个潜在分词模型(BLT),参数规模从50M到7B,这些模型可以设置所需的压缩率。这种灵活性使我们能够研究压缩率远超过使用流行BPE分词器得到的每token 4.57字节的作用。实验表明,在计算最优配置中,模型参数量与以字节为单位的数据大小成比例,而不是通常认为的以token为单位(Kaplan et al., 2020; Hoffmann et al., 2022)。此外,我们发现最优压缩率不同于BPE得到的压缩率,并且随计算量增加而降低。这些发现普遍适用于潜在分词和子词分词,以及英语以外的其他语言,指导语言模型开发者选择分词方案以实现最大计算效率。

英文摘要

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.

2604.26102 2026-05-27 cs.SE cs.CL 版本更新

SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent

SWE-Edit:重新思考高效SWE智能体的代码编辑

Yikai Zhang, Jiaxin Pei, Kenan Li, Qirui Jin, Maoquan Wang, Jin Pan, Yu Kang, Shengyu Fu, Elsie Nallipogu, Junjie Hu, Yufan Huang, Zijian Jin

发表机构 * Microsoft(微软) Department of Computer Sciences, University of Wisconsin–Madison(威斯康星大学麦迪逊分校计算机科学系) Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford University(斯坦福大学人机交互研究所)

AI总结 提出SWE-Edit框架,通过将代码编辑分解为查看器和编辑器两个子智能体,解决上下文耦合问题,在SWE-Bench Verified上提升解决率2.1个百分点并降低推理成本17.9%,且通过GRPO训练小模型实现高效编辑格式选择。

详情
AI中文摘要

大型语言模型智能体在软件工程方面取得了显著进展,但当前系统存在上下文耦合问题:标准代码编辑界面将代码检查、修改规划和编辑执行混杂在单个上下文窗口中,迫使智能体将探索性查看与严格格式化的编辑生成交错进行。无关上下文不断累积,编辑可靠性下降。我们提出SWE-Edit,将编辑界面分解为两个专门的子智能体:一个按需提取任务相关代码的查看器,以及一个根据高层自然语言计划执行修改的编辑器——让主智能体专注于推理,同时将上下文密集型操作委托给干净的上下文窗口。在SWE-Bench Verified上,这种分解将解决率提高了2.1个百分点,并将推理成本降低了17.9%,在多个推理模型家族(Kimi-K2、MiniMax-M2.1、GLM-4.7)上均有一致提升。我们进一步证明,有效的编辑格式选择可以训练到小模型中,而无需前沿规模的能力:在Qwen3-8B上使用自适应查找替换/全文件重写策略进行GRPO训练,将编辑成功率提高了12.5个百分点,并使8B开源编辑器在下游SWE-Bench解决率上与GPT-5-nano持平。为支持快速编辑器迭代,我们发布了PR-Edit,这是一个轻量级评估,其分数与SWE-Bench解决率强相关。我们在https://github.com/microsoft/SWE-Edit发布代码。

英文摘要

Large language model agents have made strong progress on software engineering, yet current systems suffer from a context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit execution within a single context window, forcing agents to interleave exploratory viewing with strictly formatted edit generation. Irrelevant context accumulates and edit reliability degrades. We propose SWE-Edit, which decomposes the editing interface into two specialized subagents: a Viewer that extracts task-relevant code on demand, and an Editor that executes modifications from high-level natural language plans -- letting the main agent focus on reasoning while delegating context-intensive operations to clean context windows. On SWE-Bench Verified, this decomposition raises resolve rate by 2.1 pp and cuts inference cost by 17.9%, with consistent gains across multiple reasoning-model families (Kimi-K2, MiniMax-M2.1, GLM-4.7). We further show that effective edit-format selection can be trained into a small model rather than requiring frontier-scale capacity: GRPO training on Qwen3-8B with an adaptive find-replace/whole-file-rewrite policy improves edit success by 12.5 pp and brings an 8B open-source editor to parity with GPT-5-nano on downstream SWE-Bench resolve rate. To enable rapid editor iteration, we release PR-Edit, a lightweight evaluation whose scores correlate strongly with SWE-Bench resolve rate. We release our code at https://github.com/microsoft/SWE-Edit.

2604.19499 2026-05-27 cs.CL 版本更新

Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

秩湍流Delta与可解释的文体测量Delta方法

Dmitry Pronin, Evgeny Kazartsev

发表机构 * HSE University(俄罗斯高等经济大学)

AI总结 本文引入两种新的作者归属度量——秩湍流Delta和Jensen-Shannon Delta,通过将距离函数应用于概率分布来推广Burrows经典Delta,并提出词级分解实现数值可解释性,在四个语料库上验证了方法的有效性。

Comments Published in Digital Scholarship in the Humanities. The version of record is available at https://academic.oup.com/dsh/advance-article-abstract/doi/10.1093/llc/fqag072/8692587 Code available at: https://github.com/DDPronin/Rank-Turbulence-Delta

详情
Journal ref
Digital Scholarship in the Humanities, 2026
AI中文摘要

本文介绍了两种新的作者归属度量——秩湍流Delta和Jensen-Shannon Delta,它们通过应用为概率分布设计的距离函数来推广Burrows经典Delta。我们首先阐述了这些度量的理论基础,对比了词频向量的中心化和非中心化z分数,并将非中心化向量重新解释为概率分布。基于这一表示,我们开发了一种词级分解,使每个Delta距离在数值上可解释,从而促进细读和结果验证。这些方法的有效性在英语、德语、法语和俄语的四个文学语料库上进行了评估。英语、德语和法语数据集来自Project Gutenberg,而俄语基准是SOCIOLIT语料库,包含89位作者的639部作品,时间跨度从18世纪到21世纪。秩湍流Delta达到了与余弦Delta相当的归属准确率;Jensen-Shannon Delta始终匹配或超越经典Burrows Delta的性能。最后,在扩展的SOCIOLIT语料库上重新评估了几种已有的归属算法,提供了它们在显著时间和风格变化下鲁棒性的现实估计。

英文摘要

This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows's classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 639 works by 89 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows's Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus, providing a realistic estimate of their robustness under pronounced temporal and stylistic variation.

2604.21454 2026-05-27 cs.CL cs.AI 版本更新

Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?

混合与非混合大语言模型中的推理原语:架构差异在状态追踪和召回中是否带来优势?

Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa

发表机构 * Lamarr Institute for Machine Learning and Artificial Intelligence(拉玛尔机器学习与人工智能研究所) Rheinische Friedrich-Wilhelms-Universität Bonn(波恩莱茵河弗里德里希-威廉大学)

AI总结 本研究通过五个受控任务族比较了Transformer和混合架构在状态召回任务上的表现,发现推理增强是主要优势因素,而混合架构的优势较窄且依赖于任务。

详情
AI中文摘要

大型语言模型中的推理通常被视为单一能力,但其部分收益可能源于更简单的底层操作。我们通过五个以状态召回为中心的控制任务族,研究了两种这样的原语——召回和状态追踪,并比较了匹配的Transformer和混合架构(有无推理增强)。在整个套件中,推理增强变体显著优于仅指令变体,通常差距很大。这一模式与“状态超越令牌”观点一致:外部化推理痕迹之所以有帮助,是因为它们在令牌空间中向前传递中间状态。相比之下,一旦推理令牌可用,混合归纳偏置在准确性上并不产生统一优势。当架构差异确实出现时,它们遵循任务结构:混合Think模型在严格顺序的链式更新上更稳健,而Transformer Think模型在平面多跳检索上更稳健。因此,我们将本研究的主要贡献视为对状态召回任务性能驱动因素的描述性说明:推理令牌增强似乎是主导因素,而混合优势更窄、依赖于任务,并且可能更多关乎推理效率而非整体能力。我们还发布了重现这些结果所需的代码库和数据。

英文摘要

Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operations. We examine two such primitives, recall and state-tracking, through five controlled task families centered on state-based recall, and compare matched transformer and hybrid architectures with and without reasoning augmentation. Across the suite, reasoning-augmented variants substantially outperform instruction-only variants, often by large margins. This pattern is consistent with the State over Tokens view: externalized reasoning traces help because they carry the intermediate state forward in token space. By contrast, hybrid inductive bias does not yield a uniform advantage in accuracy once reasoning tokens are available. When architectural differences do appear, they follow task structure: the hybrid Think model is more robust on strictly sequential chained updates, whereas the transformer Think model is more robust on flat multi-hop retrieval. We therefore cast the main contribution of this study as a descriptive account of what drives performance on state-based recall tasks: reasoning-token augmentation appears to be the dominant factor, while hybrid advantages are narrower, task-dependent, and potentially more about inference efficiency than overall capability. We also release the codebase and data required to reproduce these results.

2604.19667 2026-05-27 cs.CL cs.AI cs.CV cs.LG cs.MA 版本更新

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Chat2Workflow: 用自然语言生成可执行可视化工作流的基准

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Tencent(腾讯)

AI总结 提出Chat2Workflow基准,用于评估大语言模型从自然语言生成可执行可视化工作流的能力,并设计了一个智能体基线以提升性能。

Comments Work in progress

详情
AI中文摘要

目前,可执行的可视化工作流已成为实际工业部署中的主流范式,提供了强大的可靠性和可控性。然而,在当前实践中,此类工作流几乎完全通过手动工程构建:开发人员必须仔细设计工作流,为每个步骤编写提示,并随着需求的变化反复修改逻辑——这使得开发成本高昂、耗时且容易出错。为了研究大语言模型能否自动化这一多轮交互过程,我们引入了Chat2Workflow,一个直接从自然语言生成可执行可视化工作流的基准,并提出了一个稳健的智能体基线以提高性能。该基准基于大量真实业务工作流构建,每个实例的设计使得生成的工作流可以转换并直接部署到实际工作流平台(如Dify和Coze)上。实验结果表明,尽管最先进的语言模型通常能捕捉高层次意图,但在生成正确、稳定且可执行的工作流方面仍存在困难,尤其是在面对复杂且不断变化的需求时。尽管我们的智能体基线带来了高达6.05%的解决率提升,但剩余的现实差距使Chat2Workflow成为推进工业级自动化的基础。代码可在https://github.com/zjunlp/Chat2Workflow获取。

英文摘要

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve -- making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially given complex and evolving requirements. Although our agentic baseline yields up to 6.05% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

2510.06133 2026-05-27 cs.CL cs.AI 版本更新

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

CreditDecoding: 利用轨迹信用加速扩散大语言模型中的并行解码

Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团) Westlake University(西湖大学)

AI总结 针对扩散大语言模型并行解码中正确令牌被反复重掩导致冗余迭代的问题,提出基于轨迹信用的无训练并行解码方法CreditDecoding,融合历史证据与当前logits提升低置信度正确令牌的置信度,实现高达5.48倍加速并提升准确性。

Comments 19 pages, 13 figures, 9 tables, Accepted to ACL 2026 main conference

详情
AI中文摘要

扩散大语言模型(dLLMs)通过迭代去噪生成文本。在普遍采用的并行解码方案中,每一步仅确认高置信度位置,而重掩其他位置。通过分析dLLM去噪轨迹,我们发现一个关键的低效问题:模型通常在目标令牌的置信度足够高以被解码之前的几个步骤就预测出正确令牌。这种早期预测与后期解码之间的差距导致已正确的令牌被反复重掩,造成冗余迭代并限制加速。为利用这种时间冗余,我们引入轨迹信用(Trace Credit),通过累积历史证据来量化令牌的解码潜力。基于此,我们提出CreditDecoding,一种无训练的并行解码方法,将轨迹信用与当前logits融合,以提升正确但低置信度令牌的置信度,从而加速去噪并提高鲁棒性。在八个基准测试上,CreditDecoding在LLaDA-8B上实现了高达5.48倍的加速和+0.48的准确率提升,并在多种dLLM架构和参数规模上持续改进性能。它还能扩展到长上下文,并与主流推理优化方法正交,使其成为一种实用且广泛适用的解决方案。

英文摘要

Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the others. By analyzing dLLM denoising traces, we uncover a key inefficiency: models often predict the correct target token several steps before its confidence becomes high enough to be decoded. This gap between early prediction and late decoding forces repeated remasking of already-correct tokens, causing redundant iterations and limiting acceleration. To exploit this temporal redundancy, we introduce Trace Credit to quantify a token's decoding potential by accumulating historical evidence. Building on this, we propose CreditDecoding, a training-free parallel decoding method that fuses Trace Credit with current logits to boost the confidence of correct but underconfident tokens, thereby accelerating denoising and improving robustness. On eight benchmarks, CreditDecoding achieves up to 5.48 times speedup with +0.48 accuracy on LLaDA-8B and consistently improves performance across diverse dLLM architectures and parameter scales. It further scales to long contexts and remains orthogonal to mainstream inference optimizations, making it a practical and widely applicable solution.

2604.14640 2026-05-27 cs.CL cs.AI 版本更新

Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models

Fact4ac在金融虚假信息检测挑战赛中的方法:通过微调和少样本提示的大语言模型实现无参考金融虚假信息检测

Cuong Hoang, Le-Minh Nguyen

发表机构 * KaiNKaiho

AI总结 本文提出一种结合零样本/少样本提示和LoRA参数高效微调的大语言模型框架,用于无外部证据的金融虚假信息检测,在公开和私有测试集上分别达到95.4%和96.3%的准确率,获得竞赛第一名。

详情
Journal ref
Proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD 2026), 20th International AAAI Conference on Web and Social Media
AI中文摘要

金融虚假信息的泛滥对市场稳定和投资者信任构成严重威胁,误导市场行为并造成关键信息不对称。检测此类误导性叙述本身具有挑战性,尤其是在现实场景中,外部证据或用于交叉验证的补充参考资料严格不可用。本文介绍了我们在“无参考金融虚假信息检测”共享任务中的获胜方法。该任务基于最近提出的RFC-BENCH框架(Jiang等人,2026),挑战模型仅依赖内部语义理解和上下文一致性而非外部事实核查来判断金融声明的真实性。为应对这一艰巨的评估设置,我们提出了一个综合框架,利用最先进的大语言模型(LLM)的推理能力。我们的方法系统地集成了上下文学习(特别是零样本和少样本提示策略)以及通过低秩适应(LoRA)的参数高效微调(PEFT),以最优方式使模型与金融操纵的微妙语言线索对齐。我们提出的系统表现出卓越效果,成功在两个官方排行榜上均获得第一名。具体来说,我们在公开测试集上达到95.4%的准确率,在私有测试集上达到96.3%的准确率,突显了我们方法的鲁棒性,并有助于加速金融自然语言处理中上下文感知的虚假信息检测。我们的模型(14B和32B)可在https://huggingface.co/KaiNKaiho获取。

英文摘要

The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.

2604.13502 2026-05-27 cs.CL 版本更新

Using reasoning LLMs to extract SDOH events from clinical notes

使用推理型大语言模型从临床笔记中提取社会健康决定因素事件

Ertan Dogan, Kunyu Yu, Yifan Peng

发表机构 * Department of Population Health Sciences, Weill Cornell Medicine(流行病学与公共卫生系,韦尔·科恩医学中心)

AI总结 本研究提出一种基于推理型大语言模型的提示工程方法,通过四个模块(简洁提示、少样本学习、自一致性机制和后处理)从临床笔记中提取结构化SDOH事件,取得0.866的微平均F1分数,展示了简单实现与强性能的平衡。

详情
AI中文摘要

社会健康决定因素(SDOH)指影响个人生活、工作和衰老的环境、行为和社会条件。SDOH对个人健康结果有显著影响,其系统识别和管理可大幅改善患者护理。然而,SDOH信息主要记录在电子健康记录的非结构化临床笔记中,限制了其作为机器可读实体的直接使用。为解决此问题,研究人员采用基于预训练BERT模型的自然语言处理(NLP)技术,展示了有前景的性能,但需要复杂的实现和大量计算资源。在本研究中,我们探索了利用具有高级推理能力的大语言模型(LLM)提取结构化SDOH事件的提示工程策略。我们的方法包含四个模块:1)开发结合既定指南的简洁描述性提示,2)应用精心策划示例的少样本学习,3)使用自一致性机制确保稳健输出,4)后处理进行质量控制。我们的方法达到了0.866的微平均F1分数,展示了与领先模型相比具有竞争力的性能。结果表明,具有推理能力的LLM是SDOH事件提取的有效解决方案,兼具实现简单性和强性能。

英文摘要

Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.

2603.12564 2026-05-27 cs.CL cs.AI 版本更新

Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents

卖给我这支股票:LLM智能体中的不安全推荐漂移

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

发表机构 * Centre for Artificial Intelligence, University College London(人工智能研究中心,伦敦大学学院)

AI总结 研究LLM智能体在多轮金融推荐中因工具输出被操纵而产生风险不匹配推荐的问题,通过实验揭示评估盲区并分析机制。

详情
AI中文摘要

人们越来越多地使用LLM智能体进行多轮金融推荐,智能体通过工具获取市场数据并跨轮次跟踪用户偏好。当工具输出被操纵时,推荐不再匹配用户声明的风险偏好,但由于NDCG等标准指标仅衡量一般相关性,风险股票和安全股票的得分相同,因此指标显示一切正常。我们将这种差距称为评估盲区。我们在八个语言模型上回放23轮金融咨询对话,每段对话分别使用干净和被操纵的工具数据运行两次。质量得分与干净会话几乎相同,而智能体在65-99%的轮次中产生风险不匹配的推荐,所有八个模型一致。该机制在逐轮中可见:在1,840轮中,80%的风险评分引用逐字复现了被操纵的值,没有一轮提出质疑,高风险股票的安全语言框架比例从14%(Qwen2.5-7B)到69%(Claude Sonnet 4.6)不等。使前沿模型成为优秀智能体的特性——忠实地将其推理基于工具输出——也使其跟随被操纵的输出。损害并非由记忆驱动:仅污染当前轮次仍会产生95%的违规。模型内部能区分操纵(稀疏自编码器特征将对抗性扰动与随机扰动分开),但这并未转化为更安全的输出。激活层干预仅恢复不到6%的安全差距,提示级自我验证失败,因为自我检查读取了相同的被操纵数据,而参数化交叉检查在前沿模型上每轮以99-100%的比率标记污染,但整体适宜性仍未改变:智能体识别出篡改,但仍然推荐它。

英文摘要

People increasingly use LLM agents for multi-turn financial recommendations, where the agent pulls market data through tools and tracks user preferences across turns. When tool outputs are manipulated, the recommendations stop matching the user's stated risk profile, but because standard metrics like NDCG only score general relevance, risky and safe stocks score alike, so the metric says nothing went wrong. We call this gap evaluation blindness. We replay 23-turn financial advisory conversations across eight language models, running each dialogue twice with clean and manipulated tool data. Quality scores stay nearly identical to clean sessions while the agents produce risk-mismatched recommendations in 65-99% of turns, unanimous across all eight models. The mechanism is visible turn-by-turn: 80% of risk-score citations across 1,840 turns reproduce the manipulated value verbatim, not a single turn pushes back, and safe-language framing of high-risk stocks ranges from 14% (Qwen2.5-7B) to 69% (Claude Sonnet 4.6). The property that makes frontier models good agents, faithfully grounding their reasoning in tool outputs, also makes them follow manipulated ones. The damage is not memory-driven: contaminating only the current turn still produces 95% of the violations. The model internally distinguishes the manipulation (sparse autoencoder features separate adversarial from random perturbations), but this does not translate into safer output. Activation-level interventions recover under 6% of the safety gap, prompt-level self-verification fails because the self-check reads the same manipulated data, and a parametric cross-check that flags contamination at 99-100% per turn on a frontier model still leaves aggregate suitability unchanged: the agent identifies the tampering and recommends it anyway.

2604.13018 2026-05-27 cs.CL 版本更新

Toward Autonomous Long-Horizon Engineering for ML Research

面向机器学习研究的自主长周期工程

Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia

发表机构 * GitHub

AI总结 提出AiScientist多智能体系统,通过轻量级层级研究团队和File-as-Bus工作空间解决长周期ML研究工程中的累积进度维持问题,在PaperBench和MLE-Bench Lite上取得显著提升。

Comments Repo: https://github.com/AweAI-Team/AiScientist

详情
AI中文摘要

智能体系统日益自动化AI研究的各个环节。然而,将未明确的研究目标转化为可运行、经实验验证的ML系统仍是一个核心瓶颈。我们将这一操作环境研究为“长周期ML研究工程”:通过反复实现、实验和改进,将研究规范转化为可运行的ML系统。核心挑战是在延迟、混杂反馈下,跨异构阶段维持累积的项目进展。我们引入了AiScientist,一个围绕“薄控制厚状态”构建的多智能体系统:轻量级层级研究团队通过File-as-Bus工作空间进行协调,该工作空间跨角色和调用保留决策相关工件。在PaperBench上,AiScientist使用Gemini-3-Flash和GLM-5分别比最强匹配基线提高9.92和11.15分。在MLE-Bench Lite上,它在两个骨干网络下均达到81.82 Any Medal%,比最强匹配基线提高4.55和16.67分,并超过Codex/GPT-5.5 xhigh前沿参考基准13.64 Any Medal分。消融和过程分析表明,持久的项目状态对后期轮次改进至关重要:移除File-as-Bus使PaperBench分数降低6.41分,MLE-Bench Lite Any Medal%降低31.82分。这些结果表明,长周期AI研究不仅是一个更强的局部推理问题,更是一个维持累积、可检查项目进展的系统问题。

英文摘要

Agentic systems increasingly automate pieces of AI research. Yet turning underspecified research objectives into runnable, experimentally validated ML systems remains a central bottleneck. We study this operational setting as \emph{long-horizon ML research engineering}: converting a research specification into a runnable ML system through repeated implementation, experimentation, and refinement. The central challenge is to sustain cumulative project progress across heterogeneous stages under delayed, confounded feedback. We introduce AiScientist, a multi-agent system built around thin control over thick state: a lightweight hierarchical research team coordinates through a File-as-Bus workspace that preserves decision-relevant artifacts across roles and invocations. On PaperBench, AiScientist improves over the strongest matched baselines by 9.92 and 11.15 points with Gemini-3-Flash and GLM-5, respectively. On MLE-Bench Lite, it reaches 81.82 Any Medal\% under both backbones, improving over the strongest matched baselines by 4.55 and 16.67 points, and exceeding a Codex/GPT-5.5 xhigh frontier harness reference by 13.64 Any Medal points. Ablations and process analyses show that durable project state is central to later-round refinement: removing File-as-Bus lowers PaperBench score by 6.41 points and MLE-Bench Lite Any Medal\% by 31.82 points. These results suggest that long-horizon AI research is not only a problem of stronger local reasoning, but a systems problem of maintaining cumulative, inspectable project progress.

2604.08999 2026-05-27 cs.CL cs.AI cs.LG 版本更新

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

ASTRA: 面向复杂表格问答的自适应语义树推理架构

Xiaoke Guo, Songze Li, Zhiqiang Liu, Zhaoyan Gong, Yuanxiang Liu, Huajun Chen, Wen Zhang

发表机构 * Zhejiang University(浙江大学)

AI总结 提出ASTRA架构,通过AdaSTR将表格重构为逻辑语义树,并利用DuTR双模式推理框架结合树搜索文本导航与符号代码执行,在复杂表格问答中达到最优性能。

Comments ACL 2026 Main

详情
AI中文摘要

表格序列化仍然是大型语言模型(LLMs)在复杂表格问答中的关键瓶颈,受到结构忽视、表示差距和推理不透明等挑战的阻碍。现有的序列化方法无法捕获显式层次结构且缺乏模式灵活性,而当前的基于树的方法则存在语义适应性有限的问题。为了解决这些限制,我们提出了ASTRA(自适应语义树推理架构),包括两个主要模块:AdaSTR和DuTR。首先,我们引入AdaSTR,它利用LLMs的全局语义意识将表格重构为逻辑语义树。这种序列化显式建模了层次依赖关系,并采用自适应机制根据表格规模优化构建策略。其次,基于此结构,我们提出了DuTR,一种双模式推理框架,集成了基于树搜索的文本导航以实现语言对齐,以及符号代码执行以实现精确验证。在复杂表格基准上的实验表明,我们的方法达到了最先进的性能。

英文摘要

Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.

2603.11394 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

别听我的!多轮对话如何降低LLM的可靠性

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

发表机构 * Vanderbilt University(范德比尔大学) Vanderbilt University Medical Center(范德比尔大学医学中心) Intuit AI Research(Intuit人工智能研究)

AI总结 提出“坚持或切换”(SoS)框架,通过将问答空间分割为多个顺序呈现来评估LLM在多轮对话中的可靠性,发现对话税导致准确性和拒绝错误建议的能力平均下降30%,并观察到盲目切换现象。

详情
AI中文摘要

大型语言模型(LLM)在静态基准测试中表现出色,但它们在更能反映实际使用的多轮对话中的性能仍未得到充分研究。解决这一差距在医疗保健等高风险环境中至关重要,因为患者和临床医生正在转向LLM聊天机器人来处理他们的医疗咨询。在这里,我们引入了“坚持或切换”(SoS)框架,该框架将问答空间划分为多个顺序呈现,以模拟两种以安全为中心的行为:坚持(即坚持正确的答案选择或拒绝错误的建议)和灵活性(即在引入正确建议时切换到该建议)。在三个临床基准测试中评估了17个LLM,我们观察到普遍存在的对话税,其中将答案空间分割为顺序呈现使端到端准确性和对错误建议的拒绝率平均下降高达30%,在某些模型中达到65%。我们还观察到盲目切换,即模型从初始拒绝转向错误和正确建议的比率几乎相同,达到50%。最后,我们表明,增加模型规模可以缓解其中一些对话效率低下的问题,但会加剧其他问题,例如从初始拒绝中采纳错误建议的倾向更高。我们的研究结果共同表明,静态基准测试所捕获的一般能力并不能推广到多轮对话中。

英文摘要

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the "stick-or-switch" (SoS) framework, which partitions a question-answer space into multiple sequential presentations to model two safety-centric behaviors: conviction (i.e., sticking to a correct answer selection or abstention against incorrect suggestions) and flexibility (i.e., switching to a correct suggestion when it is introduced). Evaluating 17 LLMs across three clinical benchmarks, we observe a pervasive conversation tax, where partitioning an answer-space into sequential presentations reduces end-to-end accuracy and abstention against incorrect suggestions by an average of up to 30%, reaching 65% in certain models. We also observe blind switching, where models transition an initial abstention to incorrect and correct suggestions at near-identical rates reaching 50%. Finally, we show that increasing model scale mitigates some of these conversational inefficacies while exacerbating others, such as a higher propensity to adopt an incorrect suggestion from an initial abstention. Together our findings demonstrate that the general proficiency captured by static benchmarks do not translate over multi-turn dialogues.

2604.07028 2026-05-27 cs.MA cs.AI cs.CL 版本更新

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

基于特质条件的多智能体系统在迭代法律论证中的战略说服

Philipp D. Siedler

发表机构 * Aleph Alpha Research(Aleph Alpha研究)

AI总结 提出战略法庭框架,通过特质条件化的大语言模型智能体模拟多轮法律辩论,发现异质团队表现更优,并引入强化学习特质编排器动态优化辩护策略。

详情
AI中文摘要

在诸如法律、外交和谈判等对抗性领域中的战略互动是通过语言中介的,然而大多数博弈论模型忽略了通过话语运作的说服机制。我们提出了战略法庭框架,这是一个多智能体模拟环境,其中由特质条件化的大语言模型(LLM)智能体组成的控方和辩方团队参与迭代的、基于轮次的法律论证。智能体使用九种可解释的特质进行实例化,这些特质被组织成四种原型,从而能够系统控制修辞风格和战略取向。我们在10个合成法律案例和84个三特质团队配置上评估该框架,使用DeepSeek-R1和Gemini 2.5 Pro进行了超过7,000次模拟试验。我们的结果表明,具有互补特质的异质团队始终优于同质配置,适度的交互深度产生更稳定的判决,并且某些特质(特别是量化和魅力型)对说服成功贡献不成比例。我们进一步引入了一个基于强化学习的特质编排器,该编排器根据案件和对手团队动态生成辩护特质,发现优于静态、人类设计的特质组合的策略。这些发现共同证明了语言可以被视为第一类战略行动空间,并为构建能够在多智能体环境中进行自适应说服的自主智能体提供了基础。

英文摘要

Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7{,}000 simulated trials using DeepSeek-R1 and Gemini~2.5~Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

2603.27146 2026-05-27 cs.CL 版本更新

Learning to Predict Future-Aligned Research Proposals with Language Models

学习用语言模型预测未来对齐的研究提案

Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出将研究提案生成重构为时间切片科学预测问题,通过未来对齐分数(FAS)评估模型能否预测截止时间后发表的论文方向,并构建时间一致数据集和推理轨迹进行训练,实验表明未来对齐微调显著提升提案质量。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于辅助研究中的构思,但评估LLM生成的研究提案的质量仍然困难:新颖性和合理性难以自动衡量,而大规模人工评估成本高昂。我们通过将提案生成重构为时间切片科学预测问题,提出了一种可验证的替代方案。给定一个研究问题和截止时间前可用的启发论文,模型生成一个结构化提案,并通过其是否预测到截止时间后发表的论文中出现的研究方向来评估。我们通过检索和基于LLM的语义评分,针对保留的未来语料库计算未来对齐分数(FAS)来操作化这一目标。为了训练模型,我们构建了一个时间一致的数据集,包含来自目标及其截止前引用的3,642个实例中的21,835篇论文出现次数,并合成推理轨迹,教授差距识别和灵感借鉴。在Llama-3.1和Qwen2.5模型上,未来对齐微调相比未对齐基线提高了未来对齐(总体FAS最高提升+10.6%),领域专家的人工评估证实了提案质量的改进。最后,我们通过使用代码代理实现两个模型生成的提案来展示实际影响,从新的提示策略中获得MATH 4.17%的准确率提升,并对一种新颖的模型合并方法实现了一致的改进。我们的代码和数据公开在https://github.com/Arthur-Heng/future-aligned-proposals。

英文摘要

Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 21,835 paper occurrences across 3,642 instances from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method. Our code and data are publicly available at https://github.com/Arthur-Heng/future-aligned-proposals.

2603.28730 2026-05-27 cs.RO cs.CL cs.CV 版本更新

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

SOLE-R1:视频语言推理作为机器人强化学习的唯一奖励

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza

发表机构 * MIT(麻省理工学院) RAI Institute(机器人智能研究所)

AI总结 提出SOLE-R1模型,通过视频语言时空推理生成密集任务进度估计作为唯一奖励信号,实现在无真实奖励、演示或任务特定调优下的零样本在线强化学习。

详情
AI中文摘要

视觉语言模型(VLM)在各种任务中展现出令人印象深刻的能力,这促使人们努力利用这些模型来监督机器人学习。然而,当在强化学习(RL)中用作评估器时,当今最强的模型在部分可观测性和分布偏移下常常失败,使得策略能够利用感知错误而非解决任务。我们提出SOLE-R1(自观察学习器),一种专门设计用于为在线RL提供唯一奖励信号的视频语言推理模型。仅给定原始视频观测和自然语言目标,SOLE-R1执行每时间步的时空思维链(CoT)推理,并生成可直接用作奖励的密集任务进度估计。为了训练SOLE-R1,我们开发了一个大规模视频轨迹和推理合成流水线,生成与连续进度监督对齐的时间基础CoT轨迹。这些数据与基础的空间和多帧时间推理相结合,并使用混合框架训练模型,该框架将监督微调与可验证奖励的RL相结合。在四个不同的仿真环境和真实机器人设置中,SOLE-R1实现了从随机初始化的零样本在线RL:机器人学习之前未见过的操作任务,无需真实奖励、成功指标、演示或任务特定调优。SOLE-R1在24个未见过的任务上成功,并显著优于强视觉语言奖励器,包括Robometer、RoboReward、ReWiND、GPT-5和Gemini-3-Pro,同时对奖励破解表现出明显更强的鲁棒性。我们在匿名页面发布所有模型、数据、代码和演示:https://philip-mit.github.io/sole-r1/

英文摘要

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. We introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including Robometer, RoboReward, ReWiND, GPT-5, and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking. We release all models, data, code, and demos at the anonymous page: https://philip-mit.github.io/sole-r1/

2601.18987 2026-05-27 cs.CL cs.AI cs.PL 版本更新

LLMs versus the Halting Problem: Characterizing Program Termination Reasoning

LLMs 与停机问题:程序终止推理的特征化

Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O'Hearn

发表机构 * FAIR Team, Meta AI(Meta AI FAIR 团队) The Hebrew University of Jerusalem, Israel(耶路撒冷希伯来大学) Bloomberg, New York, USA(彭博社,纽约,美国) Imperial College London, UK(伦敦帝国理工学院,英国) University College London, UK(伦敦大学学院,英国)

AI总结 本文评估了前沿LLMs在程序终止推理上的能力,发现GPT-5和Claude Sonnet 4.5在C程序终止判断上达到顶级验证工具水平,但无法生成形式化证明,并引入分歧前置条件形式化描述非终止条件。

详情
AI中文摘要

判断程序是否终止是计算机科学中的一个核心问题。图灵的停机问题确立了终止的不可判定性,表明没有算法能普遍确定所有程序和输入的终止性。因此,验证工具近似地处理终止问题,有时无法证明或反驳;这些工具依赖于特定问题的架构,并且通常与特定的编程语言绑定。LLMs的最新进展提出了一个自然的问题:它们在多大程度上能够推理程序终止?我们在2025年国际软件验证竞赛(SV Comp)的一组多样化C程序上评估了前沿LLMs。我们的结果表明,GPT-5和Claude Sonnet 4.5(通过测试时缩放)达到了与顶级验证工具相当的分数。然而,尽管模型通常能正确推断程序是否终止,但它们经常无法构造一个见证作为形式化证明,揭示了语义识别与符号证明生成之间的差距。随着代码长度的增加,性能进一步下降。为了分析这一差距,我们引入了一个分歧前置条件形式化方法,将非终止条件描述为逻辑约束。我们希望这些发现能激励未来在现实世界终止基准测试、结合LLMs与符号验证方法的神经符号方法,以及更广泛地关于LLMs在其他不可判定问题上推理的研究。

英文摘要

Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Hence, verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem specific architectures, and are usually tied to particular programming languages. Recent advances in LLMs raise a natural question: To what extent can they reason about program termination? We evaluate frontier LLMs on a diverse set of C programs from the International Competition on Software Verification (SV Comp) 2025. Our results show that GPT-5 and Claude Sonnet 4.5 achieve scores comparable to top ranked verification tools (with test time scaling). However, while models often correctly infer whether programs terminate, they frequently fail to construct a witness as formal proof, revealing a gap between semantic recognition and symbolic proof generation. Performance further degrades as code length increases. To analyze this gap, we introduce a divergence precondition formulation that characterizes non termination conditions as logical constraints. We hope these findings motivate future research on real-world termination benchmarks, neuro-symbolic approaches that combine LLMs with symbolic verification methods, and, more broadly LLM reasoning on other undecidable problems.

2603.17218 2026-05-27 cs.CL cs.AI cs.GT 版本更新

Alignment Makes Language Models Normative, Not Descriptive

对齐使语言模型变得规范,而非描述性

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

发表机构 * Technion – Israel Institute of Technology(技术离子-以色列理工学院)

AI总结 通过对比120个基础-对齐模型对在超过10,000个真实人类决策中的表现,发现对齐诱导了规范性偏差:在单轮教科书式博弈中提升预测,但在多轮战略博弈中因忽略互惠、报复等描述性动态而损害预测。

详情
AI中文摘要

后训练对齐优化语言模型以匹配人类偏好信号,但这一目标并不等同于对观察到的人类行为进行建模。我们在多轮战略博弈——讨价还价、说服、谈判和重复矩阵博弈中,比较了120个基础-对齐模型对在超过10,000个真实人类决策上的表现。在这些设置中,基础模型在预测人类选择方面以近10:1的比例优于其对齐版本,这一结果在模型家族、提示表述和博弈配置中均稳健成立。然而,在人类行为更可能遵循规范预测的设置中,这一模式发生了逆转:对齐模型在所有12种测试的单轮教科书式博弈以及非战略彩票选择中占据主导地位——甚至在多轮博弈本身中,在交互历史发展之前的第一轮也是如此。这种边界条件模式表明,对齐诱导了规范性偏差:当人类行为相对较好地由规范性解决方案捕捉时,它改善了预测;但在多轮战略设置中,当行为由互惠、报复和依赖于历史的适应等描述性动态塑造时,它损害了预测。这些结果揭示了在优化模型以供人类使用和将其用作人类行为代理之间的根本权衡。

英文摘要

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

2601.10566 2026-05-27 cs.CL cs.LG 版本更新

Representation-Aware Unlearning via Activation Signatures: From Suppression to Entity-Signature Erasure

基于激活签名的表示感知遗忘:从抑制到实体签名擦除

Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, K. M. Shadman Wadith, Nazia Tasnim, Farig Sadeque

发表机构 * Computer Science and Engineering, BRAC University, Dhaka, Bangladesh(布拉格大学计算机科学与工程系,达卡,孟加拉国) Boston University, Boston, MA, USA(波士顿大学,波士顿,马萨诸塞州,美国)

AI总结 提出ERUF框架,通过挖掘实体特异性激活签名并抑制对应方向,实现表示层面的遗忘,同时保持表面抑制、内部衰减和效用保留。

Comments 16 pages, 4 figures

详情
AI中文摘要

实体级遗忘通常通过模型输出评估:是否停止命名目标、拒绝查询或改变真值比分布。然而,这些输出级测试无法显示主体的内部表示是否被衰减。我们引入实体表示遗忘框架(ERUF),这是一个表示感知框架,挖掘主体特定的激活签名,抑制相应的激活方向,并将行为蒸馏到LoRA参数中。在评估的基线中,ERUF是唯一同时实现表面级抑制、内部衰减和效用保留的方法。在TOFU forget10上,ERUF达到FQ=0.99和MU=0.62,匹配报告的神谕效用,同时接近神谕遗忘质量。在大多数标准基础模型设置中,ERUF保持低泄漏和低内部目标激活,SMR在0.00%至1.10%之间,EL10低于0.06,效用漂移低于3%。在Llama-3.1-8B上,对抗性实体恢复从63.89%降至20.15%,而名称无关恢复减少72.7%至77.4%。联合表面/内部诊断进一步揭示了推理优先模型中仅靠表面指标无法发现的尺度依赖行为。我们将这些结果解释为表示层面衰减的操作性证据,而非不可逆删除的正式保证。

英文摘要

Entity-level unlearning is usually evaluated by what a model says: whether it stops naming the target, refuses a query, or shifts a Truth Ratio distribution. These output-level tests, however, do not show whether a subject's internal representation has been attenuated. We introduce the Entity Representation Unlearning Framework (ERUF), a representation-aware framework that mines subject-specific activation signatures, suppresses the corresponding activation direction, and distills the behavior into LoRA parameters. Among evaluated baselines, ERUF is the only method that jointly achieves surface-level suppression, internal attenuation, and utility preservation. On TOFU forget10, ERUF achieves FQ = 0.99 and MU = 0.62, matching reported oracle utility while approaching oracle forget quality. Across most standard foundation-model settings, ERUF maintains low leakage and low internal target activation, with SMR between 0.00% and 1.10%, EL10 below 0.06, and utility drift below 3%. On Llama-3.1-8B, adversarial entity recovery falls from 63.89% to 20.15%, while name-agnostic recovery decreases by 72.7% to 77.4%. Joint surface/internal diagnostics further reveal scale-dependent behavior in reasoning-prior models that surface metrics alone would miss. We interpret these results as operational evidence of representation-level attenuation, not as a formal guarantee of irreversible deletion.

2601.03471 2026-05-27 cs.CL cs.AI 版本更新

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning

EpiQAL:大型语言模型在流行病学问答与推理中的基准测试

Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Ziyang Zhang, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin

发表机构 * Emory University(埃默里大学) University of Illinois Chicago(伊利诺伊大学香槟分校) Microsoft(微软公司)

AI总结 提出EpiQAL基准,通过三个子集(事实回忆、多步推理、不完整信息下结论重建)评估LLM在流行病学推理中的表现,发现当前模型在多步推理上表现有限。

Comments 31 pages, 7 figures, 25 tables

详情
AI中文摘要

可靠的流行病学推理需要综合研究证据来推断疾病负担、传播动态和人群层面的干预效果。现有的医学问答基准主要强调临床知识或患者层面的推理,但很少有系统评估基于证据的流行病学推理。我们提出了EpiQAL,这是首个针对多种疾病的流行病学问答诊断基准,包含三个从开放获取文献构建的子集。这三个子集逐步测试事实回忆、多步推理以及在不完整信息下的结论重建,并通过结合分类学指导、多模型验证和难度筛选的质量控制流程构建。对涵盖开源和专有系统的15个模型的实验表明,当前LLM在流行病学推理上表现有限,其中多步推理构成最大挑战。模型排名在不同子集间发生变化,仅靠规模并不能预测成功。思维链提示有利于多步推理,但在其他情况下效果不一。EpiQAL为证据基础、推理推理和结论重建提供了细粒度的诊断信号。

英文摘要

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The three subsets progressively test factual recall, multi-step inference, and conclusion reconstruction under incomplete information, and are constructed through a quality-controlled pipeline combining taxonomy guidance, multi-model verification, and difficulty screening. Experiments on fifteen models spanning open-source and proprietary systems reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence-grounding, inferential reasoning, and conclusion reconstruction.

2601.03079 2026-05-27 cs.CL 版本更新

Learning to Diagnose and Correct Errors: Towards Moral Sensitivity Acquisition in Large Language Models

学习诊断和纠正错误:大型语言模型中的道德敏感性获取

Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi, Xitong Zhang, Kristen Johnson, Guangliang Liu

发表机构 * University of Mississippi(密西根州立大学) Nanyang Technological University(南洋理工大学) Northeastern University(东北大学) AWS AI Labs(AWS AI实验室) Johns Hopkins University(约翰霍普金斯大学) Qualcomm(高通) Michigan State University(密西根州立大学) Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校)

AI总结 提出一种实用推理方法,通过让大型语言模型诊断和纠正道德错误来获取道德敏感性,并在多个任务上验证了其有效性。

详情
AI中文摘要

道德敏感性是人类道德能力最基础的能力。尽管许多方法旨在使大型语言模型(LLMs)与人类道德价值观对齐,但它们主要关注拟合道德适当文本的分布,而忽视了如何使LLMs获得道德敏感性。在本文中,我们朝着解决以下问题迈出了一步:LLMs如何获得道德敏感性?具体来说,我们提出了一种实用推理方法,通过使LLMs能够诊断和纠正道德错误来促进其道德敏感性的获取。我们的实用推理方法的一个核心优势在于其统一的视角:它不是对语义多样且复杂的表面形式进行道德话语建模,而是提供了一个基于推理负荷设计实用推理过程的原则性框架。实验证据表明,我们的实用方法能够使LLMs获得道德敏感性,并在多个任务中有效泛化。

英文摘要

Moral sensitivity is the most fundamental capability underlying human moral competence. Although many approaches aim to align large language models (LLMs) with human moral values, they primarily focus on fitting the distributions of morally appropriate texts while overlooking how to enable moral sensitivity acquisition in LLMs. In this paper, we take a step toward addressing the question: How can moral sensitivity be acquired in LLMs? Specifically, we propose a pragmatic inference approach that facilitates moral sensitivity acquisition in LLMs by enabling them to diagnose and correct moral errors. A central strength of our pragmatic inference approach lies in its unified perspective: rather than modeling moral discourses across semantically diverse and complex surface forms, it provides a principled framework for designing pragmatic inference procedures grounded in their inferential load. Empirical evidence demonstrates that our pragmatic approach can enable moral sensitivity acquisition in LLMs and generalizes effectively across tasks.

2603.16654 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Omanic:迈向大语言模型多跳推理的逐步评估

Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

发表机构 * The University of Tokyo(东京大学) Yale University(耶鲁大学) Stanford University(斯坦福大学) Xiaomi EV(小米EV) Soongsil University(顺天大学)

AI总结 针对大语言模型在多跳问答中中间步骤推理失败难以诊断的问题,提出Omanic基准,通过分解为单跳子问题并分析步骤级错误,揭示后期跳数瓶颈、事实知识下限和错误传播,微调后提升多个推理基准性能。

详情
AI中文摘要

仅从最终答案评估大语言模型(LLM)的推理能力可能会掩盖中间步骤的失败,尤其是在没有步骤级标注的多跳问答基准中。为解决这一问题,我们引入了Omanic,一个开放域4跳问答基准,它不仅用于衡量最终答案的准确性,还用于诊断推理在何处中断。Omanic包含10,296个机器生成的训练示例(OmanicSynth)和967个经专家审核的人工标注评估示例(OmanicBench),每个评估问题被分解为单跳子问题、中间答案和结构化图拓扑。对专有和开源LLM的实验表明,Omanic具有挑战性,而逐步分析揭示了后期跳数瓶颈、事实知识下限以及沿推理链的错误传播。在OmanicSynth上微调可迁移到六个推理和数学基准,平均提升7.41分,验证了其作为推理能力迁移监督的有效性。我们在https://huggingface.co/datasets/li-lab/Omanic 发布数据,在https://github.com/XiaojieGu/Omanic 发布代码。

英文摘要

Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, especially in multi-hop QA benchmarks without step-level annotations. To address this gap, we introduce Omanic, an open-domain 4-hop QA benchmark designed not only to measure final-answer accuracy but also to diagnose where reasoning breaks down. Omanic contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench), with each evaluation question decomposed into single-hop sub-questions, intermediate answers, and structured graph topologies. Experiments with proprietary and open-source LLMs show that Omanic is challenging, while step-wise analysis reveals a later-hop bottleneck, factual knowledge floor, and error propagation along reasoning chains. Fine-tuning on OmanicSynth transfers to six reasoning and mathematics benchmarks, yielding a 7.41-point average gain and validating its effectiveness as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

2603.13853 2026-05-27 cs.CL cs.AI 版本更新

APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation

APEX-Searcher: 通过子目标细化信用分配以增强智能体检索增强生成

Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) MAIS, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所MAIS部) Wenge Technology Co., Ltd(Wenger科技有限公司)

AI总结 针对复杂多跳问答中检索路径模糊和端到端强化学习奖励稀疏的问题,提出APEX-Searcher,通过分离规划与执行的信用分配(规划用RL优化、执行用SFT学习),在多个基准上取得一致提升。

详情
AI中文摘要

检索增强生成(RAG)将大型语言模型(LLMs)与外部知识连接起来,但单轮检索通常不足以应对复杂的多跳问题。为了增强复杂任务的搜索能力,大多数现有工作通过端到端训练将多轮迭代检索与推理过程相结合。虽然这些方法提高了问题解决性能,但它们仍然面临任务推理和模型训练方面的挑战,尤其是模糊的检索执行路径和端到端强化学习(RL)中的稀疏奖励,这可能导致不准确的检索结果和较低的性能。我们将这些失败归因于层次化的信用纠缠:单一的最终奖励同时更新规划和执行,因此模型无法清晰地区分规划错误和检索错误。我们提出APEX-Searcher,它采用了一种细化信用分配的范式:规划通过带有规划级奖励的RL进行优化,而执行则通过SFT学习。大量实验表明,在多跳RAG和任务规划基准上均取得了一致的提升。

英文摘要

Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insufficient for complex multi-hop questions. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL), which can lead to inaccurate retrieval results and lower performance. We attribute these failures to hierarchical credit entanglement: a single final reward updates planning and execution together, so the model cannot clearly separate plan errors from retrieval errors. We propose APEX-Searcher, which uses a Refining Credit Assignment paradigm: planning is optimized by RL with a plan-level reward, while execution is learned by SFT. Extensive experiments show consistent gains in both multi-hop RAG and task planning across benchmarks.

2603.12754 2026-05-27 cs.CL 版本更新

A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

一种从语义标注语料库学习大规模计算构式语法的方法

Paul Van Eecke, Katrien Beuls

发表机构 * Artificial Intelligence Laboratory(人工智能实验室) Vrije Universiteit Brussel(布鲁塞尔自由大学) Faculté d’informatique(计算机科学系) Université de Namur(纳慕尔大学)

AI总结 提出一种从语义标注语料库自动学习大规模、广覆盖计算构式语法的方法,生成包含数万构式的网络,支持开放域文本的框架语义分析并揭示句法-语义使用模式。

Comments Accepted for oral presentation at CoNLL 2026

详情
AI中文摘要

我们提出了一种从语言使用语料库中学习大规模、广覆盖构式语法的方法。该方法从带有成分结构和语义框架标注的话语出发,促进学习可解释的计算构式语法,捕捉句法结构与所表达语义关系之间的复杂关联。生成的语法由在流体构式语法框架内形式化的数万个构式网络组成。这些语法不仅支持开放域文本的框架语义分析,还蕴含了关于学习数据中句法-语义使用模式的大量信息。该方法及学习到的语法有助于基于使用的构式主义语言方法的规模化,因为它们证实了若干基本构式语法猜想的可扩展性,同时为广覆盖语料库中英语论元结构的构式主义研究提供了实用工具。

英文摘要

We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.

2603.03585 2026-05-27 cs.CL cs.AI 版本更新

Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Belief-Sim:迈向信念驱动的人口统计错误信息易感性模拟

Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas

发表机构 * University of Michigan - Ann Arbor(密歇根大学安娜堡分校) Texas State University(德克萨斯州立大学)

AI总结 提出BeliefSim框架,利用心理学分类和调查先验构建人口信念档案,通过提示条件化和后训练适应,实现基于信念模拟人口统计错误信息易感性,对齐度达92%。

Comments Paper Under Review

详情
AI中文摘要

错误信息是一种日益严重的社会威胁,由于潜在信念的差异,不同人口群体对错误信息的易感性各不相同。随着大型语言模型(LLM)越来越多地被用于模拟人类行为,我们研究它们是否能够模拟人口统计错误信息易感性,将信念视为主要驱动因素。我们引入BeliefSim,一个模拟框架,利用心理学信息错误信息分类法和调查先验构建人口信念档案。我们研究了基于提示的条件化和后训练适应,并使用以下方法进行了多方面的评估:(i)易感性对齐和(ii)反事实人口敏感性。在两个数据集和建模策略中,我们表明信念为模拟错误信息易感性提供了强大的先验,对齐度高达92%。

英文摘要

Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to differences in underlying beliefs. As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor. We introduce BeliefSim, a simulation framework that constructs demographic belief profiles using psychology-informed misinformation taxonomies and survey priors. We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility alignment and (ii) counterfactual demographic sensitivity. Across both datasets and modeling strategies, we show that beliefs provide a strong prior for simulating misinformation susceptibility, with alignment up to 92%.

2603.03194 2026-05-27 cs.CL cs.SE 版本更新

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

BeyondSWE:当前代码代理能否超越单仓库错误修复?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

发表机构 * GitHub

AI总结 提出BeyondSWE基准测试,评估代码代理在跨仓库、领域特定、依赖迁移和文档生成等复杂软件工程任务上的表现,发现现有代理在利用外部信息进行精确代码修改方面仍存在显著不足。

Comments Benchmark: https://huggingface.co/datasets/AweAI-Team/BeyondSWE. Repo: https://github.com/AweAI-Team/BeyondSWE. Scaffold: https://github.com/AweAI-Team/AweAgent

详情
AI中文摘要

当前的代码代理基准主要评估单个目标仓库内的局部问题解决能力,而许多需要外部知识或更广泛仓库级变更的软件工程任务仍未得到充分测试。我们引入了BeyondSWE,这是一个包含500个实例的基准测试,来自246个真实世界的GitHub仓库,用于评估超越单仓库错误修复的代码代理。BeyondSWE涵盖了四种代表性场景:跨仓库问题解决、领域特定问题解决、依赖驱动的迁移以及文档到仓库的生成,涵盖了更广泛的知识范围和解决范围。我们的评估显示,BeyondSWE远未饱和:基于OpenHands的最佳代理达到了46.12的平均分数,而使用GPT-5.4(xhigh)的最强Codex harness在搜索感知提示下达到了56.65。为了研究外部信息访问是否能缩小这一差距,我们使用SearchSWE作为搜索增强编码的受控诊断基线。搜索访问改善了大多数模型,并对某些任务有显著帮助,但收益仍然有限且不均衡,表明当前代理仍然难以将检索到的信息转化为精确、版本兼容且局部可操作的代码更改。这些结果表明,深度编码搜索仍然是一个开放问题:进展需要代理能够可靠地将外部证据与仓库局部推理和基于执行的验证结合起来。

英文摘要

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, leaving under-tested many software engineering tasks that require external knowledge or broader repository-level changes. We introduce BeyondSWE, a 500-instance benchmark drawn from 246 real-world GitHub repositories to evaluate code agents beyond single-repository bug fixing. BeyondSWE covers four representative settings: cross-repository issue resolution, domain-specific issue resolution, dependency-driven migration, and document-to-repository generation, spanning both broader knowledge scope and broader resolution scope. Our evaluation shows that BeyondSWE remains far from saturated: the best OpenHands-based agent reaches 46.12 average score, while the strongest Codex harness with GPT-5.4 (xhigh) reaches 56.65 under a search-aware prompt. To study whether external information access closes this gap, we use SearchSWE as a controlled diagnostic baseline for search-augmented coding. Search access improves most models and substantially helps some tasks, but the gains remain limited and uneven, showing that current agents still struggle to convert retrieved information into precise, version-compatible, and locally actionable code changes. These results suggest that deep search for coding remains an open problem: progress requires agents that can reliably combine external evidence with repository-local reasoning and execution-based verification.

2601.09001 2026-05-27 cs.CL 版本更新

Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

熵哨兵:基于STEM解码熵迹的连续LLM准确性监控

Pedro Memoli Buffa, Luciano Del Corro

发表机构 * Departamento de Matematica, FCEyN Universidad de Buenos Aires(数学系,布宜诺斯艾利斯大学) ELIAS Lab, Departamento de Ingeniería Universidad de San Andres(ELIAS实验室,圣安德烈斯大学)

AI总结 提出利用输出熵迹作为推理时信号,通过轻量分类器预测实例正确性并聚合为领域级准确性估计,在STEM推理基准上验证了其用于监控和数据采集的有效性。

详情
AI中文摘要

部署LLM引发两个耦合挑战:(1)监控——在流量和领域漂移时估计模型表现不佳的位置;(2)改进——优先获取数据以缩小最大的性能差距。我们测试推理时信号能否在领域偏移下估计切片级准确性。对于每个响应,我们从最终层下一个词元概率(来自top-$k$ logprobs)计算输出熵迹,并用不同统计量汇总。一个轻量分类器预测实例正确性,平均预测概率得到领域级准确性估计。我们在十个STEM推理基准上进行了详尽的训练/测试组合($k\in\{1,2,3,4\}$;所有$inom{10}{k}$组合),在来自六个系列(3B--20B)的九个LLM上评估不同分类器模型和特征。估计值通常跟踪保留的基准准确性,并且多个模型显示领域近乎单调的排序,为输出熵迹作为可扩展监控和针对性数据采集的可访问信号提供了证据。

英文摘要

Deploying LLMs raises two coupled challenges: (1) monitoring--estimating where a model underperforms as traffic and domains drift--and (2) improvement--prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-$k$ logprobs) and summarize it with different statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions ($k\in\{1,2,3,4\}$; all $\binom{10}{k}$ combinations), on different classifier models and features across nine LLMs from six families (3B--20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains, providing evidence for output-entropy profiles being an accessible signal for scalable monitoring and for targeted data acquisition.

2603.01327 2026-05-27 cs.SE cs.CL cs.LG 版本更新

SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution

SWE-Adept:基于LLM的深度代码库分析与结构化问题解决代理框架

Kang He, Kaushik Roy

发表机构 * Electrical and Computer Engineering, Purdue University(电子工程与计算机工程系,普渡大学)

AI总结 提出SWE-Adept双代理框架,通过代理引导的深度优先搜索进行代码定位,结合自适应规划和结构化问题解决,在SWE-Bench上提升端到端解决率最多4.3%。

详情
AI中文摘要

大型语言模型(LLM)在独立编程任务上表现出色,但在仓库级软件工程(SWE)中仍面临挑战,这需要(1)深度代码库导航与有效上下文管理以实现准确定位,以及(2)系统化的迭代、测试驱动代码修改以解决问题。为应对这些挑战,我们提出SWE-Adept,一个基于LLM的双代理框架,其中定位代理识别与问题相关的代码位置,解决代理实施相应的修复。对于问题定位,我们引入代理引导的深度优先搜索,选择性遍历代码依赖关系。这最小化了代理上下文窗口中的问题无关内容,提高了定位准确性。对于问题解决,我们采用自适应规划和结构化问题求解。我们为代理配备了用于进度跟踪和基于Git的版本控制的专用工具。这些工具与共享工作记忆交互,该记忆存储按执行步骤索引的代码状态检查点,便于精确的检查点检索。这种设计实现了可靠的代理驱动版本控制操作,用于系统化问题解决,包括分支以探索替代方案和回滚失败的编辑。在SWE-Bench Lite和SWE-Bench Pro上的实验表明,SWE-Adept在问题定位和解决方面均持续优于先前方法,端到端解决率提升最多4.3%。

英文摘要

Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context management for accurate localization, and (2) systematic approaches for iterative, test-driven code modification to resolve issues. To address these challenges, we propose SWE-Adept, an LLM-based two-agent framework where a localization agent identifies issue-relevant code locations and a resolution agent implements the corresponding fixes. For issue localization, we introduce agent-directed depth-first search that selectively traverses code dependencies. This minimizes issue-irrelevant content in the agent's context window and improves localization accuracy. For issue resolution, we employ adaptive planning and structured problem solving. We equip the agent with specialized tools for progress tracking and Git-based version control. These tools interface with a shared working memory that stores code-state checkpoints indexed by execution steps, facilitating precise checkpoint retrieval. This design enables reliable agent-driven version-control operations for systematic issue resolution, including branching to explore alternative solutions and reverting failed edits. Experiments on SWE-Bench Lite and SWE-Bench Pro demonstrate that SWE-Adept consistently outperforms prior approaches in both issue localization and resolution, improving the end-to-end resolve rate by up to 4.3%.

2602.22190 2026-05-27 cs.LG cs.AI cs.CL 版本更新

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

GUI-Libra:训练原生GUI代理进行推理与行动——基于动作感知监督和部分可验证强化学习

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, Tong Zhang

发表机构 * UIUC(伊利诺伊大学香槟分校) Microsoft(微软) UNC-Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出GUI-Libra训练方案,通过动作感知SFT和部分可验证RL中的KL正则化,解决GUI代理在长程导航任务中推理与定位冲突及部分可验证性问题,显著提升步骤准确率和任务完成率。

Comments 57 pages, 17 figures

详情
AI中文摘要

开源原生GUI代理在长程导航任务上仍落后于闭源系统。这一差距源于两个限制:缺乏高质量、动作对齐的推理数据,以及直接采用忽视GUI代理独特挑战的通用后训练流程。我们识别出这些流程中的两个基本问题:(i) 带有CoT推理的标准SFT常损害定位能力,(ii) 逐步RLVR式训练面临部分可验证性,即多个动作可能正确但仅有一个示范动作用于验证。这使得离线逐步指标成为在线任务成功的弱预测器。在本工作中,我们提出GUI-Libra,一种定制化训练方案以应对这些挑战。首先,为缓解动作对齐推理数据的稀缺性,我们引入数据构建和过滤流程,并发布精心整理的81K GUI推理数据集。其次,为调和推理与定位,我们提出动作感知SFT,混合推理后动作和直接动作数据,并重新加权token以强调动作和定位。第三,为在部分可验证性下稳定RL,我们识别出RLVR中KL正则化被忽视的重要性,并证明KL信任域对改善离线到在线可预测性至关重要;我们进一步引入成功自适应缩放以降低不可靠负梯度的权重。在多种Web和移动基准测试中,GUI-Libra一致地提升了步骤准确率和端到端任务完成率。我们的结果表明,精心设计的后训练和数据整理可以在无需昂贵在线数据收集的情况下,释放显著更强的任务解决能力。我们发布数据集、代码和模型,以促进对具备推理能力的GUI代理的数据高效后训练的进一步研究。

英文摘要

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-tyle training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.

2510.07231 2026-05-27 cs.CL cs.AI 版本更新

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

EconCausal: 面向大语言模型的上下文感知经济推理基准

Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim

发表机构 * Graduate School of Data Science, KAIST(韩国科学技术院数据科学研究生院) College of Business, KAIST(韩国科学技术院商学院) Data Science for Humanity Group, MPI-SP(马克斯·普朗克所际数据科学为人类集团) School of Computing, KAIST(韩国科学技术院计算学院) Division of Social Science, HKUST(香港科技大学社会科学系)

AI总结 提出EconCausal基准,包含从顶级经济金融期刊提取的10,490个上下文标注因果三元组,评估大语言模型在指定上下文中推断因果方向及随上下文变化调整判断的能力。

详情
AI中文摘要

社会经济因果效应高度依赖于制度和环境背景。相同的干预措施在不同监管制度、市场条件、时间段或人群中可能产生不同甚至相反的效果。这对大语言模型(LLM)在决策支持角色中提出了挑战:它们能否在指定上下文中推断因果效应的方向,并在上下文变化时修正该判断?为此,我们引入了EconCausal,这是一个大规模基准,包含从顶级经济和金融期刊的2,595项高质量实证研究中提取的10,490个上下文标注因果三元组,通过严格的四阶段流程构建,包括多轮共识、上下文细化和多批评者过滤。跨模型实验表明,LLM往往无法根据上下文调整其预测。虽然顶级模型在固定、显式上下文中达到88%的准确率,但在需要跨上下文修正符号的情况下,准确率下降32.6个百分点(从73.9%降至41.3%),一旦引入误导性的符号证据,准确率降至50%以下。模型还过度承诺于方向性(+/-)符号,仅在13.8%的情况下识别出零效应,且在这些类别上校准不良。数据集和基准公开于 https://anonymous.4open.science/r/econcausal-benchmark-6F12。

英文摘要

Socio-economic causal effects depend heavily on their institutional and environmental contexts. The same intervention can produce different, even opposite, effects across regulatory regimes, market conditions, time periods, or populations. This poses a challenge for large language models (LLMs) in decision-support roles: can they infer the direction of a causal effect under a specified context, and revise that judgment when the context changes? To address this, we introduce EconCausal, a large-scale benchmark of 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies in top-tier economics and finance journals, constructed through a rigorous four-stage pipeline with multi-run consensus, context refinement, and multi-critic filtering. Across models, LLMs often fail to condition their predictions on context. While top models reach 88% accuracy in fixed, explicit contexts, accuracy falls by 32.6~pp on cases that require revising the sign across contexts (73.9% to 41.3%), and drops below 50% once misleading signed evidence is introduced. Models also over-commit to directional (+/-) signs, recognizing null effects only 13.8% of the time while remaining poorly calibrated on these categories. The dataset and benchmark are publicly available at https://anonymous.4open.science/r/econcausal-benchmark-6F12.

2602.17443 2026-05-27 cs.CL 版本更新

AIDG: A Formal Decomposition of Information Extraction and Containment Asymmetries in Multi-Turn LLM Dialogue

AIDG:多轮LLM对话中信息提取与包含不对称性的形式化分解

Adib Sakhawat, Fardeen Sadab, Rakin Shahriar

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 提出AIDG框架,将多轮对抗对话形式化为部分可观察随机博弈,分解提取与包含角色,揭示防御性能聚类而攻击性能分散等不对称性。

Comments 20 pages, 5 figures, 13 tables. Includes appendix and supplementary materials

详情
AI中文摘要

多轮LLM评估通常报告为单一胜率标量,混淆了不同能力。我们引入AIDG(对抗信息推断游戏),将多轮对抗对话形式化为双人部分可观察随机博弈(POSG),并沿着搜索者(提取)和持有者(包含)角色分解性能。该分解隔离了三种失败模式:合作先验泄漏、约束推理干扰和低效假设空间遍历。在六个前沿LLM的439场游戏中,防御性能紧密聚类(sigma = 1.9 ELO),而攻击性能差异显著(sigma = 53.3 ELO);确认框架使提取几率比无信息推断高7.75倍(p < 0.00001);约束违规占推断失败的41.3%,与规模无关(rho = 0.0)。我们将包含优于提取的差距定位为局部可解的防御决策与全局耦合的攻击规划的可测量结果,而非令人惊讶的发现,并使用该分解将差距归因于每个模型。所有设计选择,包括轮次衰减加权和Bradley-Terry评级模型,均源自明确假设。

英文摘要

Multi-turn LLM evaluation is typically reported as a single win-rate scalar, conflating distinct capabilities. We introduce AIDG (Adversarial Information Deduction Game), formalizing multi-turn adversarial dialogue as a two-player partially observable stochastic game (POSG) and decomposing performance along Seeker (extraction) and Holder (containment) roles. The decomposition isolates three failure modes: cooperative-prior leakage, constraint-reasoning interference, and inefficient hypothesis-space traversal. Across 439 games over six frontier LLMs, defensive performance is tightly clustered (sigma = 1.9 ELO) while offensive performance varies substantially (sigma = 53.3 ELO); confirmation framing increases extraction odds 7.75x over uninformed deduction (p < 0.00001); and constraint violations account for 41.3% of deductive failures, uncorrelated with scale (rho = 0.0). We position the containment-over-extraction gap not as a surprising finding but as a measurable consequence of locally resolvable defensive decisions versus globally coupled offensive planning, and use the decomposition to attribute the gap per model. All design choices, including turn-decay weighting and the Bradley-Terry rating model, are derived from explicit assumptions.

2503.18288 2026-05-27 cs.CL 版本更新

TFD: A Comprehensive Structured Tibetan Foundation Dataset for Low-Resource Language Processing and Large-Scale Modeling

TFD:面向低资源语言处理和大规模建模的综合结构化藏语基础数据集

Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Yadi Liu, Wenbin Wei, Xiangxiang Wang, Yongbin Yu

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Xizang University(西藏大学) ZenWeave AI The State Key Laboratory of Tibetan Intelligence(藏语智能国家重点实验室) Nanyang Technological University(南洋理工大学)

AI总结 为解决藏语大语言模型开发中缺乏覆盖预训练、指令微调、安全对齐、偏好优化和推理监督等完整流程的数据集问题,提出首个结构化、大规模、专家精选的藏语基础数据集TFD,包含超过110亿词元的统一语料库及链式推理数据集,通过训练Sun-Shine系列藏语模型在理解、安全、推理和生成基准上取得显著提升。

详情
AI中文摘要

大型语言模型(LLMs)在高资源语言中取得了显著成功,但藏语的进展仍然严重受限。尽管最近的工作开始解决藏语的预训练数据稀缺问题,但一个更根本的差距仍然存在:现有资源不支持完整的LLM开发流程,涵盖预训练、指令微调、安全对齐、偏好优化和推理监督。我们引入了藏语基础数据集(TFD),这是第一个结构化、大规模、专家精选的数据集,覆盖藏语大语言建模的所有关键阶段。TFD包含TIBSTC,一个超过110亿词元的统一语料库,带有用于指令微调、安全对齐和偏好优化的精选子数据集,以及TIBSTC-CoT,第一个大规模藏语链式推理数据集。我们通过训练Sun-Shine系列藏语LLM来展示其效用,在理解、安全、推理和生成基准上相比强基线取得了显著改进。这些结果强调,推进低资源语言建模不仅需要规模,还需要结构完整的数据生态系统。我们发布TFD以促进可重复研究和开发稳健、文化对齐的藏语LLM。代码和数据可在https://github.com/Vicentvankor/sun-shine获取。

英文摘要

Large Language Models (LLMs) have achieved remarkable success in high-resource languages, yet progress in Tibetan remains severely constrained. While recent efforts have begun to address pre-training data scarcity for Tibetan, a more fundamental gap persists: no existing resource supports the complete LLM development pipeline, spanning pre-training, instruction tuning, safety alignment, preference optimization, and reasoning supervision. We introduce the Tibetan Foundation Dataset (TFD), the first structured, large-scale, and expert-curated dataset covering all key stages of Tibetan large language modeling. TFD comprises TIBSTC, a unified corpus of over 11 billion tokens with curated sub-datasets for instruction tuning, safety alignment, and preference optimization, and TIBSTC-CoT, the first large-scale Tibetan chain-of-thought dataset. We demonstrate its utility by training the Sun-Shine family of Tibetan LLMs, achieving substantial improvements over strong baselines on understanding, safety, reasoning, and generation benchmarks. These results underscore that advancing low-resource language modeling requires not only scale, but a structurally complete data ecosystem. We release TFD to facilitate reproducible research and the development of robust, culturally aligned Tibetan LLMs. Code and data are available at https://github.com/Vicentvankor/sun-shine.

2602.11460 2026-05-27 cs.CL 版本更新

ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias

ADRD-Bench:阿尔茨海默病及相关痴呆症的初步大语言模型基准

Guangxin Zhao, Jiahao Zheng, Malaz Boustani, Jarek Nabrzyski, Yiyu Shi, Meng Jiang, Zhi Zheng

发表机构 * Electrical Engineering, University of Notre Dame(诺丁汉大学电气工程系) Computer Science and Engineering, University of Notre Dame(诺丁汉大学计算机科学与工程系) School of Medicine, Indiana University(印第安纳大学医学院)

AI总结 针对现有基准对阿尔茨海默病及相关痴呆症覆盖不足的问题,提出ADRD-Bench,包含统一问答和照护问答两部分,评估了36个LLM,发现顶级模型准确率高但推理质量不稳定。

Comments Update article

详情
AI中文摘要

大语言模型(LLM)在医疗应用中显示出巨大潜力。然而,现有评估基准对阿尔茨海默病及相关痴呆症(ADRD)的覆盖极少。为解决这一差距,我们引入了ADRD-Bench,一个初步的ADRD专用LLM基准。ADRD-Bench包含两个部分:1) ADRD统一问答,整合了七个已有医学基准的1,438个问题,提供临床知识的统一评估;2) ADRD照护问答,一组新颖的149个问题,源自一个全国采用、大型临床试验支持的脑健康管理项目,弥补了现有基准缺乏实际照护背景的不足。我们在提出的ADRD-Bench上评估了36个最先进的LLM。结果显示,开源通用模型、开源医学模型和前沿闭源通用模型的准确率范围分别为0.63至0.93(均值:0.77;标准差:0.09)、0.47至0.93(均值:0.81;标准差:0.14)和0.83至0.93(均值:0.90;标准差:0.03)。虽然顶级模型达到了高准确率(>0.9),但案例研究揭示了不一致的推理质量和稳定性,凸显了需要领域特定改进以增强LLM基于日常照护数据的知识和推理。整个数据集可在https://github.com/IIRL-ND/ADRD-Bench获取。

英文摘要

Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, a preliminary ADRD-specific LLM benchmark. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,438 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from a nationally adopted, large clinical trials supported brain health management program, mitigating the lack of practical caregiving context in existing benchmarks. We evaluated 36 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models, open-weight medical models, and frontier closed-source general models ranged from 0.63 to 0.93 (mean: 0.77; std: 0.09), 0.47 to 0.93 (mean: 0.81; std: 0.14), and 0.83 to 0.93 (mean: 0.90; std: 0.03), respectively. While top-tier models achieved high accuracies (>0.9), case studies revealed inconsistent reasoning quality and stability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.

2602.07120 2026-05-27 cs.CL 版本更新

Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model

锚定解码:可证明降低任何语言模型的版权风险

Jacqueline He, Jonathan Hayase, Wen-tau Yih, Sewoong Oh, Luke Zettlemoyer, Pang Wei Koh

发表机构 * University of Washington(华盛顿大学) Allen Institute for Artificial Intelligence(人工智能研究院)

AI总结 提出锚定解码,一种即插即用的推理时方法,通过将生成内容约束在许可训练的安全模型附近,可证明地抑制语言模型逐字复制受版权保护的内容,实现可调的风险-效用权衡。

Comments Accepted to ICML 2026. 53 pages, 14 figures, 22 tables. Code is publicly available at https://github.com/jacqueline-he/anchored-decoding

详情
AI中文摘要

语言模型倾向于记忆其训练数据的部分内容并逐字生成。当底层来源敏感或受版权保护时,这种复现会引发创作者同意和补偿问题以及开发者合规风险。我们提出锚定解码,一种即插即用的推理时方法,用于抑制逐字复制:它通过将生成内容保持在许可训练的安全模型的有界邻近范围内,使得任何在混合许可数据上训练的有风险语言模型都能进行解码。锚定解码在生成轨迹上自适应地分配用户选择的信息预算,并强制执行每步约束,从而提供序列级别的保证,实现可调的风险-效用权衡。为使锚定解码实用化,我们引入了一个新的许可训练的安全模型(TinyComma 1.8B),以及锚定字节解码,这是我们方法的字节级变体,通过ByteSampler框架(Hayase等人,2025)实现跨词汇融合。在六个模型对上,针对复制风险和效用的长文本指标,锚定解码和锚定字节解码定义了新的帕累托前沿,在保持接近原始流畅性和事实性的同时,将有风险基线与安全参考之间的可测量复制差距缩小了高达75%,且推理开销适中。

英文摘要

Language models (LMs) tend to memorize portions of their training data and emit verbatim spans. When the underlying sources are sensitive or copyright-protected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Anchored Decoding, a plug-and-play inference-time method for suppressing verbatim copying: it enables decoding from any risky LM trained on mixed-license data by keeping generation in bounded proximity to a permissively trained safe LM. Anchored Decoding adaptively allocates a user-chosen information budget over the generation trajectory and enforces per-step constraints that yield a sequence-level guarantee, enabling a tunable risk-utility trade-off. To make Anchored Decoding practically useful, we introduce a new permissively trained safe model (TinyComma 1.8B), as well as Anchored$_{\mathrm{Byte}}$ Decoding, a byte-level variant of our method that enables cross-vocabulary fusion via the ByteSampler framework (Hayase et al., 2025). Across six model pairs on long-form metrics for copying risk and utility, Anchored and Anchored$_{\mathrm{Byte}}$ Decoding define a new Pareto frontier, preserving near-original fluency and factuality while closing up to 75% of the measurable copying gap between the risky baseline and a safe reference, at a modest inference overhead.

2602.02518 2026-05-27 cs.LG cs.AI cs.CL 版本更新

GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

GraphDancer: 通过两阶段课程后训练训练LLMs在图上的探索与推理

Yuyang Bai, Zhuofeng Li, Ping Nie, Jianwen Xie, Yu Zhang

发表机构 * Texas A&M University(德克萨斯大学A&M分校) University of Waterloo(滑铁卢大学) Lambda(Lambda公司) University of Oregon(俄勒冈大学)

AI总结 提出GraphDancer两阶段后训练框架,通过图感知课程逐步增加任务难度,使LLMs学会在异构图上进行自然语言推理与函数调用交织的探索与推理,仅用3B骨干模型即在跨域基准上超越更强基线。

Comments 15 pages, Project website: https://yuyangbai.com/graphdancer/

详情
AI中文摘要

大型语言模型(LLMs)越来越依赖外部知识来提高事实性,然而许多真实世界的知识源被组织为异构图而非纯文本。在此类图上进行推理要求模型通过精确的函数调用遵循模式定义的关系,并在多轮交互中聚合证据。我们提出GraphDancer,一个两阶段后训练框架,通过将自然语言推理与图函数执行交织来教导LLMs在图上的推理。第一阶段教导模型在基于规则的奖励下如何与图交互,而第二阶段进一步教导其偏好更基于事实且高效的交互轨迹。GraphDancer的关键创新在于一个图感知课程,该课程根据信息寻求轨迹的结构复杂性组织两个阶段,在训练期间逐步增加任务难度。我们在一个多领域基准上评估GraphDancer,仅在一个领域上训练,并在未见过的领域和分布外问题类型上进行测试。尽管仅使用3B骨干模型,GraphDancer仍优于配备更大/更强骨干的基线,展示了图探索和推理技能的强大跨域泛化能力。我们的代码可在https://github.com/leopoldwhite/GraphDancer找到。

英文摘要

Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graphs requires models to follow schema-defined relations through precise function calls and to aggregate evidence across multiple rounds of interaction. We propose GraphDancer, a two-stage post-training framework that teaches LLMs to reason over graphs by interleaving natural-language reasoning with graph function execution. The first stage teaches the model how to interact with the graph under rule-based rewards, while the second stage further teaches it to prefer more grounded and efficient interaction trajectories. The key novelty of GraphDancer is a graph-aware curriculum that organizes both stages by the structural complexity of information-seeking trajectories, progressively increasing task difficulty during training. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with larger/stronger backbones, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code can be found at https://github.com/leopoldwhite/GraphDancer.

2602.00959 2026-05-27 cs.LG cs.CL 版本更新

Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction

探测知识边界:一种用于深度知识提取的交互式智能体框架

Yuheng Yang, Siqi Zhu, Tao Feng, Ge Liu, Jiaxuan You

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Westlake University(西交利物浦大学)

AI总结 提出一种交互式智能体框架,通过四种自适应探索策略和三级知识处理流水线,系统性地提取和量化大语言模型的知识,发现递归分类法最有效,并揭示了知识缩放定律、Pass@1与Pass@k的权衡以及训练数据对知识轮廓的影响。

Comments Homepage: https://ulab-uiuc.github.io/KnowledgeExtraction/

详情
AI中文摘要

大型语言模型(LLMs)可被视为压缩的知识库,但尚不清楚它们真正包含哪些知识以及其知识边界延伸多远。现有基准大多是静态的,对系统性知识探测的支持有限。本文提出一种交互式智能体框架,用于系统性地提取和量化LLMs的知识。我们的方法包括四种自适应探索策略,以不同粒度探测知识。为确保提取知识的质量,我们引入了一个三级知识处理流水线,结合基于向量的过滤以去除严格重复、基于LLM的裁决以解决模糊语义重叠,以及领域相关性审计以保留有效的知识单元。通过大量实验,我们发现递归分类法是最有效的探索策略。我们还观察到清晰的知识缩放定律,即更大的模型始终能恢复更多知识。此外,我们识别出Pass@1与Pass@k之间的权衡:领域专用模型初始准确率更高但退化迅速,而通用模型在长时间提取中保持稳定性能。最后,我们的结果表明,训练数据组成的差异导致不同模型家族具有独特且可测量的知识轮廓,反映了预训练如何塑造每个模型的参数化知识。

英文摘要

Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundary extends. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularity. To ensure the quality of extracted knowledge, we introduce a three-stage knowledge processing pipeline that combines vector-based filtering to remove strict duplicates, LLM-based adjudication to resolve ambiguous semantic overlap, and domain relevance auditing to retain valid knowledge units. Through extensive experiments, we find that Recursive Taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently recover more knowledge. In addition, we identify a Pass@1 versus Pass@k trade-off: domain-specialized models achieve higher initial accuracy but experience rapid degradation, while general-purpose models maintain stable performance over extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families, reflecting how pretraining shapes each model's parametric knowledge.

2602.00491 2026-05-27 cs.CL 版本更新

From Knowledge to Inference: Formalizing Specialized Public Health Reasoning on GlobalHealthAtlas

从知识到推理:形式化GlobalHealthAtlas上的专业公共卫生推理

Zhaokun Yan, Shan Xu, Wuzheng Dong, Zhaohan Liu, Lijie Feng, Chengxiao Dai, Chen Tianqi, Binfan Liu, Yunpu Ma, Wenting Wei, Yingting Li, Yi Zhang, Tongning Wu

发表机构 * China Academy of Information and Communications Technology(中国信息通信技术研究院) CRRC Industrial Academy Co., Ltd.(CRRC工业学院有限公司) The University of Sydney(悉尼大学) Ludwig Maximilian University of Munich(慕尼黑路德维希-马克西米利安大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Institute of Infectious Disease and Biosecurity(上海传染病与生物安全研究院) School of Public Health, Fudan University(复旦大学公共卫生学院)

AI总结 为解决公共卫生推理缺乏结构化监督信号和基准的问题,提出大规模多语言数据集GlobalHealthAtlas(280,210实例,15领域,17语言),并构建LLM辅助的构建与质量控制流水线及领域对齐评估器,支持安全关键型公共卫生推理的LLM训练与评估。

详情
Journal ref
ICML 2026 regular
AI中文摘要

公共卫生推理需要基于科学证据、专家共识和安全约束的群体层面推理。然而,作为一个结构化的机器学习问题,它仍然未被充分探索,且缺乏监督信号和基准。我们引入了GlobalHealthAtlas,一个大规模多语言数据集,包含280,210个实例,涵盖15个公共卫生领域和17种语言。我们进一步提出了一种大语言模型(LLM)辅助的构建和质量控制流水线,包括检索、去重、证据基础检查和标签验证,以提高大规模数据的一致性。最后,我们提出了一个从不同LLM的高置信度判断中提炼的领域对齐评估器,用于评估输出在六个维度上的表现:准确性、推理、完整性、共识一致性、术语规范性和洞察力。这些贡献共同使得LLM在安全关键的公共卫生推理中的可重复训练和评估成为可能,超越了传统的问答基准。我们公开发布了项目代码库、评估器和模型,网址为:https://github.com/Jan8217/GlobalHealthAtlas, https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Evaluator 和 https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Model。

英文摘要

Public health reasoning requires population level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages. We further propose a large language model (LLM) assisted construction and quality control pipeline with retrieval, deduplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain aligned evaluator distilled from high confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety critical public health reasoning beyond conventional QA benchmarks. We publicly release project codebase, evaluator, and model at:: https://github.com/Jan8217/GlobalHealthAtlas, https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Evaluator and https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Model

2601.20796 2026-05-27 cs.CL cs.LG 版本更新

Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

解析多模态上下文学习:现代Transformer中的模态不对称性与电路动力学

Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata

发表机构 * Technical University of Munich(慕尼黑技术大学) Munich Center for Machine Learning(慕尼黑机器学习中心) DeepMind(深Mind) Beijing University of Posts(北京邮电大学)

AI总结 通过可控实验,研究现代Transformer中多模态上下文学习的基本机制,发现模态间学习不对称性,并揭示其背后的归纳式电路机制。

Comments ICML 2026 Spotlight

详情
AI中文摘要

基于Transformer的多模态大语言模型通常展现出上下文学习(ICL)能力。受此现象启发,我们提出疑问:Transformer如何从上下文示例中跨模态关联信息?我们通过在合成分类任务上训练的小型Transformer进行可控实验来研究这一问题,从而能够精确操控数据统计和模型架构。我们首先重新审视现代Transformer中单模态ICL的核心原理。虽然多个先前发现得以复现,但我们发现旋转位置编码(RoPE)提高了ICL的数据复杂度阈值。扩展到多模态设置揭示了一个基本的学习不对称性:当在来自主要模态的高多样性数据上预训练时,次要模态中令人惊讶的低数据复杂度就足以使多模态ICL出现。机制分析表明,两种设置都依赖于一种归纳式机制,该机制从匹配的上下文示例中复制标签;多模态训练则跨模态细化和扩展这些电路。我们的发现为理解现代Transformer中的多模态ICL提供了机制基础,并为未来研究引入了一个可控的测试平台。代码可在 https://github.com/YiranHuangIrene/multimodal-icl 获取。

英文摘要

Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increases the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation. Code is available at: https://github.com/YiranHuangIrene/multimodal-icl

2601.18904 2026-05-27 cs.SD cs.AI cs.CL 版本更新

MetaSICL: Adapting Audiroty LLM via Meta Speech In-Context Learning

MetaSICL: 通过元语音上下文学习适应听觉大语言模型

Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson

发表机构 * University of Illinois Urbana Champaign(伊利诺伊大学厄巴纳-香槟分校) Tsinghua University(清华大学)

AI总结 提出MetaSICL方法,利用高资源语音数据通过元学习增强听觉大语言模型的上下文学习能力,在低资源场景下优于直接微调。

详情
AI中文摘要

听觉大语言模型在广泛的语音和音频理解任务中表现出强大的性能。然而,当应用于低资源任务时,它们常常遇到困难。如果域内标注数据稀缺或与真实测试分布不匹配,直接微调可能不稳定。上下文学习通过基于少量域内示例的条件化来适应听觉大语言模型,提供了一种无需训练、推理时的解决方案。在这项工作中,我们首先表明,$ extit{Vanilla ICL}$ 在选定的模型上提高了跨多种语音和音频任务的零样本性能,这表明这种ICL适应能力可以推广到多模态设置。在此基础上,我们提出了$ extbf{Meta Speech In-Context Learning (MetaSICL)}$,这是一种后训练方法,仅利用来自各种任务的高资源语音数据,旨在增强模型的上下文学习能力。实验表明,我们提出的方法在低资源场景下优于直接微调。

英文摘要

Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource tasks. In case in-domain labeled data are scarce or mismatched with the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that $\textit{Vanilla ICL}$, improves zero-shot performance across diverse speech and audio tasks for selected models which suggest that this ICL adaptation capability can be generalized to multimodal setting. Building on this, we propose $\textbf{Meta Speech In-Context Learning (MetaSICL)}$, a post-training recipe utilizes only high resource speech data from various tasks intending to strengthen model's in-context learning capability. Experiments indicate our proposed method outperforms direct fine-tuning in low-resource scenario.

2512.01556 2026-05-27 cs.AI cs.CL cs.LG 版本更新

LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems

LEC: 选择性预测与路由系统中基于选择条件风险控制的线性期望约束

Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Munich Center for Machine Learning(慕尼黑机器学习中心) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Shandong University(山东大学) Tongji University(同济大学) City University of Hong Kong(香港城市大学)

AI总结 提出LEC框架,通过线性期望约束将选择性预测转化为决策问题,在可交换性假设下利用校准集计算风险约束下的保留最大化阈值,并扩展到双模型路由系统,实现选择条件误差控制。

Comments Accepted by ICML 2026 Regular

详情
AI中文摘要

基础模型常常生成不可靠的答案,而启发式不确定性估计器无法完全区分正确与错误输出,导致用户在没有统计保证的情况下接受错误答案。我们通过选择条件风险控制来解决这个问题,旨在确保接受的预测的错误概率不超过用户指定的风险水平。为此,我们提出了LEC,一个原则性框架,将选择性预测重新定义为由选择和错误指标上的线性期望约束控制的决策问题。该公式直接控制接受错误期望数与接受预测期望数之间的比率,这对应于选择条件下的边际错误概率。在可交换性下,我们推导出一个仅依赖于保留校准集的有限样本充分条件,从而能够计算风险约束下的保留最大化阈值。此外,我们将LEC扩展到双模型路由系统:如果主模型的不确定性超过其校准阈值,则输入被委托给后续模型,同时保持系统级的选择条件误差控制。在封闭式和开放式问答(QA)以及视觉问答(VQA)上的实验表明,LEC在接受的预测中维持了规定的风险水平,并且与基线相比显著提高了样本保留率。

英文摘要

Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrect outputs, causing users to accept erroneous answers without any statistical guarantee. We address this problem through selection-conditioned risk control, aiming to ensure that an accepted prediction has an error probability no larger than a user-specified risk level. To this end, we propose LEC, a principled framework that reframes selective prediction as a decision problem governed by a linear expectation constraint over selection and error indicators. This formulation directly controls the ratio between the expected number of accepted errors and the expected number of accepted predictions, which corresponds to the marginal error probability conditioned on selection. Under exchangeability, we derive a finite-sample sufficient condition that relies only on a held-out calibration set, enabling the computation of a risk-constrained, retention-maximizing threshold. Furthermore, we extend LEC to two-model routing systems: if the primary model's uncertainty exceeds its calibrated threshold, the input is delegated to a subsequent model, while maintaining system-level selection-conditioned error control. Experiments on both closed-ended and open-ended question answering (QA) and vision question answering (VQA) demonstrate that LEC maintains the prescribed risk level in accepted predictions and substantially improves sample retention compared to baselines.

2601.08267 2026-05-27 cs.CL 版本更新

Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning

Med-CoReasoner: 通过语言感知的协同推理减少医学推理中的语言差异

Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang, Ding Xia, Piyalitt Ittichaiwong, Kanyakorn Veerakanjana, Hyunjae Kim, Qingyu Chen, Edison Marrese Taylor, Kazuma Kobayashi, Akiko Aizawa, Irene Li

发表机构 * The University of Tokyo(东京大学) ETH Zürich(苏黎世联邦理工学院) National Institute of Informatics(日本信息处理学会) Siriraj Informatics and Data Innovation Center(Siriraj信息与数据创新中心) Yale University(耶鲁大学)

AI总结 提出Med-CoReasoner框架,通过并行英语和本地语言推理、结构化概念抽象及概念级对齐与检索,将本地临床知识整合到英语逻辑框架中,以缩小医学推理中的多语言差距,在MultiMed-X基准上平均提升5%的多语言推理性能。

详情
AI中文摘要

尽管推理增强的大语言模型在英语医学任务上表现强劲,但多语言差距仍然存在,本地语言的推理能力明显较弱,限制了全球医疗部署的公平性。为弥合这一差距,我们引入了Med-CoReasoner,一种语言感知的协同推理框架,它引出平行的英语和本地语言推理,将其抽象为结构化概念,并通过概念级对齐和检索将本地临床知识整合到英语逻辑框架中。这种设计结合了英语推理的结构稳健性和本地语言编码的实践基础专业知识。为评估超越选择题设置的多语言医学推理,我们构建了MultiMed-X基准,涵盖七种语言,包含专家标注的长文本问答和自然语言推理任务,每种语言350个实例。在三个基准上的实验表明,Med-CoReasoner平均提高了5%的多语言推理性能,在低资源语言上提升尤为显著。此外,模型蒸馏和专家评估分析进一步证实,Med-CoReasoner产生了临床合理且文化扎根的推理轨迹。

英文摘要

While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.

2601.08146 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

超越迁移准确率:用于受控低资源适应的忠实电路

Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya

发表机构 * Monash University Indonesia(印度尼西亚墨尔本大学) Institute Teknologi Bandung(Bandung理工大学) MBZUAI(MBZUAI研究所) Boston University(波士顿大学)

AI总结 提出基于上下文分解的电路发现方法(CD-T),通过标签平衡激活均值和任务方向相关性评分实现无反事实电路发现,并利用电路目标监督微调(CT-SFT)在低资源跨语言情感迁移中最小化灾难性遗忘,优于全局微调。

详情
AI中文摘要

现有的电路发现方法依赖于具有干净反事实的模板化任务,限制了它们在多样化自然文本上的使用。我们通过标签平衡激活均值和任务方向相关性评分,将上下文分解方法适配到非结构化设置(CD-T),实现了无反事实的电路发现。我们利用这些电路进行电路目标监督微调(CT-SFT),将参数更新限制在任务相关的注意力头和层归一化上。在NusaX跨语言情感迁移上的实验表明,CT-SFT在低资源适应中极具竞争力。虽然非电路稀疏更新和全微调有时通过能力招募达到目标准确率,但CT-SFT独特地最小化灾难性遗忘,保留了源语言和相关任务的性能。在XNLI上的扩展证实了这些发现在更广泛的任务和模型家族中成立,表明电路目标适应提供了一种更安全、基于因果关系的全局微调替代方案。

英文摘要

Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt Contextual Decomposition for Transformers (CD-T) for unstructured settings via label-balanced activation means and task-directional relevance scoring, enabling counterfactual-free circuit discovery. We leverage these circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), restricting parameter updates to task-relevant heads and LayerNorm. Experiments on NusaX cross-lingual sentiment transfer show that CT-SFT is highly competitive for low-resource adaptation. While non-circuit sparse updates and full fine-tuning sometimes match target accuracy through capacity recruitment, CT-SFT uniquely minimizes catastrophic forgetting, preserving source-language and related-task performance. Extensions to XNLI confirm these findings hold across broader tasks and model families, demonstrating that circuit-targeted adaptation provides a safer, causally grounded alternative to global fine-tuning.

2511.02360 2026-05-27 cs.CV cs.CL 版本更新

LaRe: Latent Refocusing for Multimodal Reasoning

LaRe: 用于多模态推理的潜在重聚焦

Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络与信息安全学院)

AI总结 提出LaRe范式,在潜在空间内进行视觉重聚焦,结合语义增强训练,在提升推理准确率的同时大幅减少推理所需token数。

详情
AI中文摘要

思维链推理通过分解复杂任务提升逻辑性能,但其多模态扩展面临权衡。主流的“用图像思考”范式通过显式裁剪图像区域实现视觉重聚焦,但导致计算开销快速增长。新兴的潜在空间推理范式减少了token消耗,但缺乏动态重聚焦能力。我们认为这种权衡源于一个默认前提:有效的视觉重聚焦必须以显式token的形式发生。基于此,我们提出潜在重聚焦(LaRe),一种新的多模态推理范式,其中视觉重聚焦完全在潜在空间内进行。我们进一步设计了一种语义增强训练策略,通过视觉重建目标确保潜在空间的语义结构。实验评估表明,与现有基线相比,LaRe将平均准确率提高了7.6%,同时将推理所需的token数量减少了59.7%。当扩展到8B参数的视觉语言模型骨干时,LaRe实现了与最先进方法相当的性能,证明了我们提出的潜在重聚焦范式在多模态推理中的有效性。

英文摘要

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, yet incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to a 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of our proposed latent refocusing paradigm for multimodal reasoning.

2601.09886 2026-05-27 cs.CL 版本更新

Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

缩小差距:探究为何语言模型惊奇度优于完形填空惊奇度

Sathvik Nair, Byung-Doh Oh

发表机构 * University of Maryland(马里兰大学) Nanyang Technological University(南洋理工大学)

AI总结 本研究通过三个假设(低分辨率、语义相似词区分、低频词概率准确性)解释了语言模型概率在预测处理努力上优于完形填空数据的原因。

Comments 18 pages, 10 figures, accepted to ACL 2026 Main Conference

详情
AI中文摘要

一个词的可预测性可以通过两种方式量化:使用人类对完形填空任务的响应或使用语言模型(LM)的概率。当用作处理努力的预测因子时,LM概率优于从完形填空数据得出的概率。然而,重要的是要确定LM概率之所以如此是出于正确的原因,因为不同的预测因子可能导致关于预测在语言理解中作用的科学结论不同。我们提供了关于LM概率优势的三个假设的证据:不受低分辨率影响、区分语义相似的词、以及准确分配低频词的概率。这些结果呼吁努力提高完形填空研究的分辨率,同时进行实验以确定类似人类的预测是否也对LM概率所做的细粒度区分同样敏感。

英文摘要

How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs).When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.

2601.06580 2026-05-27 cs.CL 版本更新

Stylistic Evolution and LLM Neutrality in Singlish Language

新加坡英语中的文体演变与LLM中立性

Linus Tze En Foo, Weihan Angela Ng, Wenkai Li, Lynnette Hui Xian Ng

发表机构 * Independent Researcher(独立研究者) ETH Zürich(苏黎世联邦理工学院) Carnegie Mellon University(卡内基梅隆大学)

AI总结 通过分析十年间非正式数字信息的文体变化,研究大型语言模型(LLM)能否生成时间中立的输出,发现文体可分离性随时间距离增加,且LLM在真实性和时间中立性之间存在结构性权衡。

详情
AI中文摘要

新加坡英语是一种根植于新加坡多语言环境的克里奥尔语,随着社会和技术变革持续演变。我们考察了十年间非正式数字信息的历时文体变化,并探究大型语言模型(LLM)能否生成时间中立的输出,以近似该变体的稳定本质。使用词汇、语用、心理语言学和基于编码器的特征,我们发现文体可分离性随时间距离增加而增强,这主要由长度和复杂度等结构特征驱动。与零分布基线相比,大多数LLM未能同时实现真实性和时间中立性,揭示了一种结构性权衡:生成真实新加坡英语的模型继承了其时间偏差,而时间中立的模型则产生不真实的输出。这些发现将时间中立性定位为评估LLM社会方言基础的诊断指标。

英文摘要

Singlish is a creole rooted in Singapore's multilingual environment that continues to evolve alongside social and technological change. We examine diachronic stylistic change across a decade of informal digital messages and ask whether Large Language Models (LLMs) can generate temporally neutral outputs approximating the stable essence of the variety. Using lexical, pragmatic, psycholinguistic, and encoder-based features, we find that stylistic separability increases with temporal distance, driven primarily by structural features such as length and complexity. Evaluated against a null distribution baseline, most LLMs fail to achieve both authenticity and temporal neutrality simultaneously, revealing a structural trade-off: models generating realistic Singlish inherit its temporal biases, while temporally neutral models produce inauthentic outputs. These findings position temporal neutrality as a diagnostic metric for assessing sociolectal grounding in LLMs.

2601.04275 2026-05-27 cs.CR cs.AI cs.CL 版本更新

Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs

影子遗忘:一种面向LLM保真保留的无面孔遗忘的神经语义方法

Dinesh Srivasthav P, Ashok Urlana, Rahul Mishra, Bala Mallikarjunarao Garlapati, Ponnurangam Kumaraguru

发表机构 * TCS Research, Hyderabad(TCS研究院,海得拉巴) IIIT Hyderabad(IIIT海得拉巴)

AI总结 提出影子遗忘范式,通过神经语义投影器遗忘(NSPU)框架在匿名化遗忘数据上实现机器遗忘,保护隐私的同时保持模型效用,计算效率提升至少10倍。

详情
AI中文摘要

机器遗忘旨在选择性地移除特定训练样本的影响,以满足GDPR等隐私法规的“被遗忘权”。然而,许多现有方法需要访问被移除的数据,使其暴露于成员推断攻击和个人身份信息(PII)的潜在滥用。我们通过提出影子遗忘(Shadow Unlearning)来解决这一关键挑战,这是一种新的近似遗忘范式,在不暴露PII的情况下对匿名化遗忘数据进行机器遗忘。我们进一步提出了一种新颖的隐私保护框架——神经语义投影器遗忘(NSPU),以实现影子遗忘。为了评估我们的方法,我们跨五个不同领域构建了多领域虚构遗忘(MuFU)遗忘集,并引入了一个评估栈来量化知识保留与遗忘效果之间的权衡。在各种大型语言模型上的实验表明,NSPU实现了优越的遗忘性能,保持了模型效用,并增强了用户隐私。此外,所提出的方法在计算效率上比标准遗忘方法至少高出10倍。我们的研究为隐私感知的机器遗忘开辟了新方向,平衡了数据保护与模型保真度。

英文摘要

Machine unlearning aims to selectively remove the influence of specific training samples to satisfy privacy regulations such as the GDPR's 'Right to be Forgotten'. However, many existing methods require access to the data being removed, exposing it to membership inference attacks and potential misuse of Personally Identifiable Information (PII). We address this critical challenge by proposing Shadow Unlearning, a novel paradigm of approximate unlearning, that performs machine unlearning on anonymized forget data without exposing PII. We further propose a novel privacy-preserving framework, Neuro-Semantic Projector Unlearning (NSPU) to achieve Shadow unlearning. To evaluate our method, we compile Multi-domain Fictitious Unlearning (MuFU) forget set across five diverse domains and introduce an evaluation stack to quantify the trade-off between knowledge retention and unlearning effectiveness. Experimental results on various LLMs show that NSPU achieves superior unlearning performance, preserves model utility, and enhances user privacy. Additionally, the proposed approach is at least 10x more computationally efficient than standard unlearning approaches. Our findings foster a new direction for privacy-aware machine unlearning that balances data protection and model fidelity.

2601.03089 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

基于受控保留信息的仅解码器LLM归因忠实性评估

Xin Huang, Antoni B. Chan

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 针对现有软扰动忠实性指标因保留词数不同导致评估偏差的问题,提出π-Soft-NC和π-Soft-NS框架,通过控制期望保留概率公平比较归因方法,并引入专用于自回归解码器LLM的梯度归因方法Grad-ELLM。

详情
AI中文摘要

大型语言模型(LLM)越来越多地使用输入归因方法进行评估,但比较这些解释仍然具有挑战性。现有的软扰动忠实性指标,如Soft-NC和Soft-NS,可能将归因质量与扰动期间保留的词数混为一谈:平均得分较高的归因方法可能保留更多词,从而获得膨胀的分数。为解决此问题,我们提出π-Soft-NC和π-Soft-NS,这是一个在相同期望保留概率下比较归因方法的评估框架,从而控制保留词数。我们进一步引入Grad-ELLM,一种针对自回归仅解码器LLM定制的基于梯度的归因方法,该方法在每个解码步骤将梯度导出的通道重要性与注意力导出的标记重要性相结合。在Llama和Mistral上的分类和开放生成任务实验表明,Grad-ELLM在π-Soft-NC下实现了强全面性导向的忠实性,而在π-Soft-NS下没有主导方法。我们的评估指标为比较LLM的可解释人工智能方法提供了一个严格的框架,将支持该领域的进展。

英文摘要

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore obtain inflated scores. To address this issue, we propose $π$-Soft-NC and $π$-Soft-NS, an evaluation framework that compares attribution methods under the same expected retaining probability, thus controlling the number of retained words. We further introduce Grad-ELLM, a gradient-based attribution method tailored to autoregressive decoder-only LLMs, which combines gradient-derived channel importance with attention-derived token importance at each decoding step. Experiments on classification and open-generation tasks with Llama and Mistral show that Grad-ELLM achieves strong comprehensiveness-oriented faithfulness under $π$-Soft-NC, while there is no dominant method under $π$-Soft-NS. Our evaluation metric serves as a rigorous framework to compare XAI methods for LLMs, which will support progress in the field.

2601.01668 2026-05-27 cs.CL cs.AI 版本更新

EHRSummarizer: A Privacy-Aware, FHIR-Native Reference Architecture for Source-Grounded EHR Summarization

EHRSummarizer:一种隐私感知、FHIR原生的源接地EHR摘要参考架构

Houman Kazemzadeh, Nima Minaifar, Kamyar Naderi, Sho Tabibzadeh

发表机构 * MedLedger365 MedConnect365 Xylemed Kypath Associates Inc.

AI总结 提出一种隐私感知、FHIR原生的参考架构EHRSummarizer,通过检索HL7 FHIR R4资源并约束生成源接地摘要,以支持临床病历审查。

Comments 15 pages, 2 figures, 2 tables. Version 2 clarifies missing-data status handling, medication-status ambiguity, controlled narrative-document handling, source-grounded resource grouping, and future source-to-summary traceability

详情
AI中文摘要

临床医生通常需要浏览碎片化的电子健康记录(EHR)界面,以整合患者问题、用药、近期就诊和纵向趋势的连贯图像。本文描述了EHRSummarizer,一种用于结构化EHR摘要的隐私感知、FHIR原生参考架构。该架构检索一组目标性的高收益HL7 FHIR R4资源,将其标准化为临床上下文包,并使用受约束的摘要阶段生成源接地摘要,旨在支持病历审查。该架构进一步阐明了缺失数据状态处理、用药状态模糊性、在可用时对叙述性临床文档的受控使用,以及未来的源到摘要可追溯性。本文描述的是参考架构和原型行为,而非经过验证的临床干预、自主临床决策支持系统或临床获益证据。在合成和测试FHIR环境上的原型演示展示了端到端行为和输出格式;然而,本文未报告临床结果、受控工作流研究或基准结果。我们概述了一个评估计划,重点关注忠实性、遗漏风险、时间正确性、可用性、隐私和操作监控,以指导未来的机构评估。

英文摘要

Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems, medications, recent encounters, and longitudinal trends. This manuscript describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture for structured EHR summarization. The architecture retrieves a targeted set of high-yield HL7 FHIR R4 resources, normalizes them into a clinical context package, and uses a constrained summarization stage to produce source-grounded summaries intended to support chart review. The architecture further clarifies missing-data status handling, medication-status ambiguity, controlled use of narrative clinical documents when available, and future source-to-summary traceability. The manuscript describes a reference architecture and prototype behavior rather than a validated clinical intervention, autonomous clinical decision-support system, or evidence of clinical benefit. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes, controlled workflow studies, or benchmark results. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, privacy, and operational monitoring to guide future institutional assessment.

2601.00575 2026-05-27 cs.CL 版本更新

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

InfoSynth: 信息引导的大语言模型基准合成

Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

发表机构 * UC Berkeley(加州大学伯克利分校)

AI总结 提出基于信息论(KL散度和熵)的InfoSynth框架,自动生成高难度、多样化的Python编程基准,97%的测试用例和解决方案准确。

详情
AI中文摘要

大型语言模型(LLM)在推理和代码生成方面取得了显著进展,但高效创建新基准来评估这些能力仍然是一个挑战。传统的基准创建依赖人工,成本高且耗时。此外,现有基准常常污染LLM训练数据,因此需要新颖多样的基准来准确评估其真实能力。本文介绍了InfoSynth,一个基于信息论原理自动生成和评估推理基准的新框架。我们提出了基于KL散度和熵的度量标准,无需昂贵的模型评估即可量化基准的新颖性和多样性。在此框架基础上,我们开发了一个端到端的流水线,使用遗传算法和迭代代码反馈从种子数据集中合成稳健的Python编程问题。我们的方法在97%的情况下生成准确的新问题测试用例和解决方案,并且合成的基准在难度上始终高于先前的工作。此外,我们的算法提供了控制生成问题的新颖性/多样性和难度的方法。InfoSynth为LLM构建高质量、具有挑战性的编程基准提供了一个可扩展、自验证的流水线。项目页面:https://ishirgarg.github.io/infosynth_web/

英文摘要

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation, but efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, which is expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher difficulty compared to prior works. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, challenging coding benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/

2512.14561 2026-05-27 cs.CL 版本更新

Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

大型语言模型与人类评分者在论文评分中的一致性:一项研究综合

Hongli Li, Che Han Chen, Kevin Fan, Chiho Young-Johnson, Soyoung Lim, Yali Feng

发表机构 * Department of Educational Policy Studies, Georgia State University(乔治亚州立大学教育政策研究系)

AI总结 通过综合65项研究,发现大型语言模型与人类评分者在论文评分中的一致性高度依赖上下文,且跨研究及研究内部差异显著。

详情
AI中文摘要

尽管大型语言模型(LLMs)在自动论文评分(AES)中展现出越来越大的潜力,但关于其与人类评分者相比可靠性的实证结果仍然不一。遵循PRISMA 2020指南,我们综合了2022年1月至2025年8月间65项已发表和未发表的研究,这些研究考察了LLM生成的分数与人类评分之间的一致性。一致性水平在研究之间和研究内部均有显著差异,报告值范围广泛。总体而言,研究结果表明LLM-人类一致性高度依赖上下文。讨论了未来研究的启示、挑战和方向。

英文摘要

Despite the growing promise of large language models (LLMs) in automated essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLM-generated scores and human ratings. Agreement levels varied substantially both across and within studies, with reported values spanning a wide range. Overall, the findings suggest that LLM-human agreement is highly context-dependent. Implications, challenges, and directions for future research are discussed.

2512.11280 2026-05-27 cs.CL 版本更新

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

AdaSD:面向高效语言模型推理的自适应推测解码

Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu

发表机构 * Institute of Information Science, Academia Sinica(学术院信息研究所) Department of Computer Science and Information Engineering, National Taiwan University(台湾大学计算机科学与信息工程系)

AI总结 提出一种无超参数的自适应推测解码方法AdaSD,通过动态调整生成长度和接受标准,在保持准确率下降低于1.8%的同时实现最高1.46倍加速。

详情
AI中文摘要

大型语言模型(LLM)在广泛任务中取得了显著性能,但其不断增长的参数规模显著拖慢了推理速度。推测解码通过利用较小的草稿模型预测候选令牌,再由较大的目标模型验证,从而缓解这一问题。然而,现有方法通常需要额外训练、大量超参数调整或在部署前对模型和任务进行预先分析。在本文中,我们提出自适应推测解码(AdaSD),一种无超参数的解码方案,在推理过程中动态调整生成长度和接受标准。AdaSD引入两个自适应组件:一个决定何时停止候选令牌生成,另一个决定令牌接受,两者均基于令牌熵和Jensen-Shannon距离实时更新。该方法无需预先分析或微调,且兼容现有模型。在基准数据集上的实验表明,AdaSD相比普通推测解码实现最高1.46倍加速,同时将准确率下降限制在1.8%以内,使其成为高效且自适应的LLM推理的实用解决方案。

英文摘要

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive components: one to determine when to stop candidate token generation and the other to decide token acceptance, updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 1.46x speedup over vanilla speculative decoding while limiting accuracy degradation to under 1.8%, making it a practical solution for efficient and adaptive LLM inference.

2412.20505 2026-05-27 cs.AI cs.CL cs.LG 版本更新

LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

LiPUP-MA:一种以居住体验为中心的循环参与式城市多智能体规划框架

Hang Ni, Yuzhi Wang, Yizhi Song, Hao Liu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出LiPUP-MA多智能体框架,通过模拟居住生活与体验驱动的计划修订循环,利用基于图的经验库和空间约束技能增强规划器,解决参与式城市规划中经验落地与反馈空间化问题。

详情
AI中文摘要

参与式城市规划(PUP)日益得到基于LLM的智能体的支持,但现有方法主要依赖于静态偏好 elicitation 和一次性利益相关者讨论,忽视了现实世界规划的周期性——居住生活、经验收集和计划调整持续互动。我们提出循环参与式城市规划(LiPUP),一种在模拟居住生活和经验驱动的计划修订之间交替的闭环范式,同时面临两个关键挑战:将分散的居住经验锚定到具体的城市背景中,以及将主观反馈转化为空间连贯的规划行动。为实例化LiPUP,我们引入LiPUP-MA,一个基于LLM的多智能体框架,它构建了一个以计划为中心的基于图的经验库,用于组织来自生活模拟的基于城市的居住反馈,并配备了一个空间约束的技能增强规划器智能体,通过协调经验、视觉和地理空间证据来修订计划。实验表明,LiPUP-MA在传统的静态规划指标和基于生活的指标上均持续优于基线,而迭代的LiPUP循环进一步提高了计划质量。

英文摘要

Participatory Urban Planning (PUP) is increasingly supported by LLM-based agents, yet existing methods largely rely on static preference elicitation and one-shot stakeholder discussions, overlooking the cyclical nature of real-world planning, where residential life, experience collection, and plan adjustment continually interact. We propose Living-in-the-loop Participatory Urban Planning (LiPUP), a closed-loop paradigm that alternates between simulated residential living and experience-driven plan revision, while posing two key challenges: grounding scattered living experience in concrete urban contexts and translating subjective feedback into spatially coherent planning actions. To instantiate LiPUP, we introduce LiPUP-MA, an LLM-based multi-agent framework that constructs a Plan-centric Graph-based Experience Bank to organize urban-grounded residential feedback from living simulation and equips a Spatially-constrained Skill-augmented Planner agent to revise plans by harmonizing experiential, visual, and geospatial evidence. Experiments show that LiPUP-MA consistently outperforms baselines on both conventional static planning metrics and living-based metrics, while iterative LiPUP cycles further improve plan quality.

2510.17790 2026-05-27 cs.CV cs.CL 版本更新

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

UltraCUA: 一种具有混合动作的计算机使用智能体基础模型

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

发表机构 * Apple(苹果公司) The University of Hong Kong(香港大学)

AI总结 提出UltraCUA基础模型,通过混合动作(融合原始GUI操作与高级工具执行)克服计算机使用智能体仅依赖原始GUI动作的局限性,采用自动化管道、合成数据引擎、混合动作轨迹收集和两阶段训练方法,在OSWorld和WindowsAgentArena上分别实现22%的相对性能提升和21.7%的成功率。

详情
AI中文摘要

计算机使用智能体面临一个根本限制:它们仅依赖原始GUI动作(点击、键入、滚动),导致脆弱的执行链容易发生级联故障。虽然API驱动的智能体通过结构化接口和工具利用丰富的能力,但计算机使用智能体仍然局限于低层视觉交互。我们提出UltraCUA,一种通过混合动作(无缝统一原始GUI操作与高层工具执行)超越这一限制的基础模型。我们的创新基于四个关键进展。首先,一个自动化管道从软件文档和代码仓库中提取并扩展工具能力。其次,一个合成数据引擎生成超过17,000个可验证任务,捕捉真实世界的计算机使用复杂性。第三,全面的混合动作轨迹收集融合了GUI原语和策略性工具调用。第四,一种两阶段训练方法结合了监督微调和在线强化学习,实现了GUI与API之间的智能动作选择。对我们的7B和32B UltraCUA模型的评估揭示了变革性的性能提升。在OSWorld上,UltraCUA平均实现了22%的相对改进,同时执行速度比现有方法快11%。在WindowsAgentArena上的跨域验证展示了鲁棒的泛化能力,成功率达到21.7%,超过了在Windows上训练的基线。混合动作范式被证明至关重要,在减少错误传播的同时提高了执行效率。这项工作建立了一个可扩展的范式,桥接了原始GUI交互与高层工具智能,为多样环境和复杂现实任务提供了更具弹性和适应性的计算机使用智能体。

英文摘要

Computer-use agents face a fundamental limitation. They rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action-seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation with our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves 22% relative improvement while executing 11% faster than existing approaches, averagely. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer use agents for diverse environments and complex real-world tasks.

2512.04868 2026-05-27 cs.CL cs.AI 版本更新

SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

SEAL: 面向知识图谱对话问答的自我演进智能体学习

Hao Wang, Jialun Zhong, Changcheng Wang, Zhujun Nie, Zheng Li, Shunyu Yao, Yanzeng Li, Xinchi Li

发表机构 * Institute of Big Data and Artificial Intelligence, China Telecom Research Institute(大数据与人工智能研究院,中国电信研究院) Wangxuan Institute of Computer Technology, Peking University(王宣计算机技术研究所,北京大学) School of Artificial Intelligence, China University of Geosciences (Beijing)(人工智能学院,中国地质大学(北京)) Center for Cognition and Neuroergonomics, State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University(认知与神经工效学中心,认知神经科学与学习国家重点实验室,北京师范大学) Institute of Artificial Intelligence and Future Networks, Beijing Normal University(人工智能与未来网络研究院,北京师范大学)

AI总结 提出SEAL两阶段语义解析框架,通过自我演进智能体学习解决知识图谱对话问答中的指代消解、上下文依赖和复杂逻辑推理问题,在SPICE基准上达到最先进性能。

Comments Accept by NeuroComputing

详情
AI中文摘要

基于知识的对话问答(KBCQA)在解决指代消解、上下文依赖建模和执行复杂逻辑推理方面面临持续挑战。现有方法通常存在不准确性和高昂的计算成本,尤其是在处理大规模知识图谱上的复杂查询时。具体而言,大型语言模型(LLM)倾向于为复杂的多跳或聚合查询生成语法无效或语义错位的逻辑形式,而传统的实体-关系链接方法则面临候选空间指数级增长的问题。为了解决这些限制,我们引入了SEAL,一种基于自我演进智能体学习的新型两阶段语义解析框架。在第一阶段,LLM提取一个捕获核心语义的最小S表达式核心,然后通过智能体校准模块进行修正,以纠正语法不一致性并将实体和关系与知识图谱对齐。第二阶段采用基于问题类型预测的模板补全来构建完全可执行的S表达式。关键的是,SEAL包含一种自我演进机制,将局部和全局记忆与反射模块相结合,能够从对话历史和执行反馈中持续适应,而无需显式重新训练。在SPICE基准上的大量实验表明,SEAL在多跳推理、比较和聚合任务中实现了最先进的性能,验证了在结构准确性和计算效率方面的显著提升。

英文摘要

Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches often suffer from inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. Specifically, large language models (LLMs) tend to generate syntactically invalid or semantically misaligned logical forms for complex multi-hop or aggregation queries, while conventional entity-relation linking methods face an exponentially growing candidate space. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, an LLM extracts a minimal S-expression core capturing the essential semantics, which is then refined by an agentic calibration module to correct syntactic inconsistencies and align entities and relations with the knowledge graph. The second stage employs template-based completion guided by question-type prediction to construct a fully executable S-expression. Crucially, SEAL incorporates a self-evolving mechanism integrating local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance in multi-hop reasoning, comparison, and aggregation tasks, validating notable gains in both structural accuracy and computational efficiency.

2506.09532 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Athena: 利用数据高效的过程奖励模型增强多模态推理

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

发表机构 * Advanced Micro Devices Inc.(先进微器件公司) The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州))

AI总结 提出 Athena-PRM,一种多模态过程奖励模型,通过利用弱和强完成者之间的预测一致性高效生成高质量过程标签,在仅5000样本下显著提升复杂推理问题的逐步评估性能。

Comments TMLR 2026, https://openreview.net/forum?id=unWmplHccF

详情
AI中文摘要

我们提出了 Athena-PRM,一种多模态过程奖励模型(PRM),旨在评估解决复杂推理问题中每一步的奖励分数。开发高性能的PRM通常需要大量的时间和资金投入,主要因为需要推理步骤的逐步标注。传统的自动标注方法,如蒙特卡洛估计,通常会产生噪声标签并带来巨大的计算成本。为了高效生成高质量的过程标注数据,我们提出利用弱和强完成者之间的预测一致性作为识别可靠过程标签的标准。值得注意的是,Athena-PRM 在仅5000个样本的情况下,在各种场景和基准测试中展现出卓越的效果。此外,我们还开发了两种有效策略来提升PRM的性能:ORM初始化和负数据上采样。我们在三个具体场景中验证了我们的方法:测试时扩展的验证、推理步骤正确性的直接评估以及奖励排序微调。我们的 Athena-PRM 在多个基准测试和场景中持续取得优越性能。值得注意的是,当使用 Qwen2.5-VL-7B 作为策略模型时,Athena-PRM 在 WeMath 上提升了10.2个百分点,在 MathVista 上提升了7.1个百分点(测试时扩展)。此外,Athena-PRM 在 VisualProcessBench 上取得了最先进(SoTA)结果,比之前的 SoTA 高出3.9个F1分数,展示了其准确评估推理步骤正确性的强大能力。另外,利用 Athena-PRM 作为奖励模型,我们通过奖励排序微调开发了 Athena-7B,在五个基准测试上以显著优势超越了基线。

英文摘要

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling. Furthermore, Athena-PRM sets the state-of-the-art (SoTA) results in VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score, showcasing its robust capability to accurately assess the correctness of the reasoning step. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning and outperforms baseline with a significant margin on five benchmarks.

2511.14683 2026-05-27 cs.CL 版本更新

Quadratic Term Correction on Heaps' Law

Heap定律的二次项修正

Oscar Fontanelli, Wentian Li

AI总结 针对Heap定律在双对数坐标下仍呈轻微凹形的问题,提出二次函数拟合方法,并通过二十部英文小说验证,发现线性系数略大于1、二次系数约为-0.02,且曲率与“伪方差”相关。

Comments 3 figures

详情
AI中文摘要

Heap或Herdan定律通过幂律函数表征词类型与词例之间的关系,该函数在线性-线性尺度上是凹的,但在双对数尺度上是直线。然而,即使在双对数尺度上,类型-词例曲线仍轻微凹形,使幂律关系失效。作为下一阶近似,我们通过二十部英文小说(部分从其他语言翻译成英文)证明,双对数尺度下的二次函数能完美拟合类型-词例数据。对log(类型)-log(词例)数据同时包含线性和二次项的回归分析一致地得出线性系数略大于1,二次系数约为-0.02。利用“从袋中有放回地随机抽取彩色球”模型,我们证明双对数尺度的曲率等于一个负的“伪方差”。尽管当词例数量较大时,由于伪权重值较大,伪方差计算可能遇到数值不稳定性,但该形式为词例数量较小时提供了曲率的粗略估计。

英文摘要

Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we have shown, by twenty English novels or writings (some are translated from another language to English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and quadratic term consistently lead to a linear coefficient of slightly larger than 1, and a quadratic coefficient around -0.02. Using the ``random drawing colored ball from the bag with replacement" model, we have shown that the curvature of the log-log scale is identical to a ``pseudo-variance" which is negative. Although a pseudo-variance calculation may encounter numeric instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimation of the curvature when the number of tokens is small.

2510.17759 2026-05-27 cs.CR cs.CL cs.CV cs.LG stat.ML 版本更新

VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

VERA-V:用于破解视觉语言模型的变分推断框架

Qilin Liao, Anamika Lochab, Ruqi Zhang

发表机构 * Department of Computer Science, Purdue University, USA(美国普渡大学计算机科学系)

AI总结 提出VERA-V变分推断框架,通过联合后验分布生成隐蔽的文本-图像对抗输入,以系统性地发现视觉语言模型的多模态漏洞,在多个基准上攻击成功率最高提升53.75%。

Comments 18 pages, 7 Figures,

详情
AI中文摘要

视觉语言模型(VLM)通过视觉推理扩展了大语言模型,但其多模态设计也引入了新的、未被充分探索的漏洞。现有的多模态红队方法主要依赖脆弱的模板,专注于单一攻击设置,并且仅暴露了漏洞的一小部分。为了解决这些限制,我们引入了VERA-V,一个变分推断框架,将多模态越狱发现重新表述为学习配对文本-图像提示的联合后验分布。这种概率视角使得能够生成绕过模型防护的隐蔽、耦合的对抗输入。我们训练一个轻量级攻击者来近似后验分布,从而能够高效采样多样化的越狱方法,并提供对漏洞的分布性洞察。VERA-V进一步整合了三种互补策略:(i)基于排版的文本提示,嵌入有害线索;(ii)基于扩散的图像合成,引入对抗信号;(iii)结构化干扰物,分散VLM的注意力。在HarmBench和HADES基准上的实验表明,VERA-V在开源和前沿VLM上均持续优于最先进的基线方法,在GPT-4o上相比最佳基线实现了高达53.75%的攻击成功率(ASR)提升。我们在项目页面提供了代码,地址为:https://github.com/kxwhiowo/VERA-V

英文摘要

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o. We include the code on the project page available here: https://github.com/kxwhiowo/VERA-V

2510.06843 2026-05-27 cs.CL cs.AI 版本更新

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

自信号驱动的多LLM辩论以实现高效准确的推理

Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu

发表机构 * University of Cambridge(剑桥大学) Sorbonne Université(索邦大学) University of Science and Technology of China(中国科学技术大学) Beihang University(北航大学) Nanyang Technological University(南洋理工大学)

AI总结 提出一种利用模型级置信度和token级语义焦点两种自信号来自适应引导多LLM辩论过程的方法,在提高准确性的同时减少token消耗。

详情
AI中文摘要

大型语言模型(LLMs)在 diverse 应用领域展现了令人印象深刻的能力。最近的工作探索了多LLM智能体辩论(MAD),通过使多个LLM迭代讨论和细化响应来增强性能。然而,现有的MAD方法主要关注利用外部结构(如辩论图)和LLM作为评判者,而忽略了生成过程中出现的自信号(如token logits和注意力)。这种遗漏导致了冗余计算和潜在的性能下降。在本文中,我们将重点转移到多LLM辩论的自信号上,并引入了一种自信号驱动的多LLM辩论(SID),它利用两种类型的自信号:模型级置信度和token级语义焦点,来自适应地引导辩论过程。我们的方法使高置信度智能体能够在模型级别提前退出,并基于注意力机制压缩冗余辩论内容。我们在多个具有挑战性的基准测试上,对各种LLMs和多模态LLMs评估了我们的方法。实验结果表明,我们的方法不仅在准确性上优于现有的MAD技术,而且还减少了token消耗,突显了利用自信号在提高多智能体辩论系统的性能和效率方面的有效性。我们的代码将在~\href{https://github.com/xuhang2019/SID}{ exttt{https://github.com/xuhang2019/SID}} 上提供。

英文摘要

Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at~\href{https://github.com/xuhang2019/SID}{\texttt{https://github.com/xuhang2019/SID}}.

2510.05864 2026-05-27 cs.CL cs.CY 版本更新

On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs

关于指令微调大语言模型对长输入中有害句子的敏感性

Faeze Ghorbanpour, Alexander Fraser

发表机构 * School of Computation, Information and Technology, TU Munich(计算、信息与技术学院,慕尼黑工业大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

AI总结 通过构建长输入并系统变化长度、有害比例、显隐性和位置,研究LLM对稀疏嵌入有害句子的敏感性,发现敏感性非单调、随长度下降、早期位置优先、显性危害更易识别。

详情
AI中文摘要

大型语言模型(LLM)越来越多地处理长输入,但当有害句子稀疏地嵌入其中时,其行为仍知之甚少。我们提出了一种敏感性分析,探究LLM如何提取嵌入在长输入中的有害句子。我们通过组合中性和有害句子构建长输入,并系统变化四个因素:输入长度(600–30,000个token)、有害句子比例(0.01–0.50)、危害实现方式(显性与隐性)以及有害句子在输入中的位置(开头、中间、结尾),从而进行受控压力测试评估。针对有毒、冒犯和仇恨内容,以及LLaMA-3.1、Qwen-2.5和Mistral的实验揭示了一致模式:敏感性相对于有害流行率是非单调的,在中等水平达到峰值;敏感性随输入长度增加而下降;位于输入较早位置的有害句子被更强烈地优先处理;显性危害比隐性危害更可靠地被识别。这些发现提供了在受控压力条件下LLM如何优先处理长输入中有害句子的系统视角,突出了安全相关应用中新兴的优势和持续的挑战。

英文摘要

Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct long inputs by combining neutral and harmful sentences, and systematically vary four factors: input length (600--30,000 tokens), the proportion of harmful sentences (0.01--0.50), harm realization (explicit vs. implicit), and the position of harmful sentences within the input (beginning, middle, end), enabling a controlled stress-test evaluation. Experiments across toxic, offensive, and hate content, and across LLaMA-3.1, Qwen-2.5, and Mistral, reveal consistent patterns: sensitivity is non-monotonic with respect to harmful prevalence, peaking at moderate levels; sensitivity degrades as input length increases; harmful sentences placed earlier in the input are more strongly prioritized; and explicit harm is more reliably identified than implicit harm. These findings provide a systematic view of how LLMs prioritize harmful sentences in long input under controlled stress conditions, highlighting both emerging strengths and remaining challenges for safety-related use.

2510.05141 2026-05-27 cs.CL 版本更新

To model human linguistic prediction, make LLMs less superhuman

为了模拟人类语言预测,让大语言模型不那么超人类

Byung-Doh Oh, Tal Linzen

发表机构 * Division of Linguistics and Multilingual Studies, Nanyang Technological University, Singapore(南洋理工大学语言学与多语言研究系,新加坡) Department of Linguistics, New York University, New York, USA(纽约大学语言学系,纽约,美国) Center for Data Science, New York University, New York, USA(纽约大学数据科学中心,纽约,美国)

AI总结 本文指出大语言模型因超人类预测能力而无法解释人类阅读行为,主张通过模拟人类记忆来改进模型,并提出新实验方向。

Comments Accepted to Trends in Cognitive Sciences

详情
AI中文摘要

当我们阅读时,我们会预测即将出现的单词;这些预测会影响我们的阅读行为。大语言模型(LLMs)与人类一样,会对即将出现的单词进行预测,它们的成功促使它们被用作人类语言预测的模型。令人惊讶的是,在过去几年中,随着LLMs预测下一个单词的能力提高,它们解释阅读行为的能力却下降了。我们认为这是因为当前的LLMs预测即将出现的单词的能力远高于人类读者。这种“超人类性”是由LLMs大量的训练数据、对训练示例更强的长期记忆以及更强的短期记忆驱动的。我们主张开发具有类人记忆的LLMs,并进行新的实验来衡量人类与LLMs之间的一致性,并概述了实现这些目标的方向。

英文摘要

When we read, we make predictions about upcoming words; these predictions influence our reading behavior. The success of large language models (LLMs), which, like humans, make predictions about upcoming words, has motivated their use as models of human linguistic prediction. Surprisingly, in the last few years, as LLMs' ability to predict the next word has improved, their ability to explain reading behavior has declined. We argue this is because current LLMs can predict upcoming words much better than human readers can. This 'superhumanness' is driven by LLMs' extensive training data, stronger long-term memory of training examples, and stronger short-term memory. We advocate for LLMs with human-like memory and for new experiments to measure the alignment between humans and LLMs, and outline directions towards achieving these goals.

2510.01833 2026-05-27 cs.AI cs.CL 版本更新

Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

先规划后行动:面向LLM推理的高层规划引导强化学习

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas

发表机构 * Case Western Reserve University, Cleveland, OH, USA(凯斯西储大学) Kean University, Union, NJ, USA(凯恩大学) The Ohio State University, Columbus, OH, USA(俄亥俄州立大学) Fudan University, Shanghai, China(复旦大学) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室) The University of Hong Kong, Hong Kong, China(香港大学) North Carolina State University, Raleigh, NC, USA(北卡罗来纳州立大学)

AI总结 提出PTA-GRPO两阶段框架,通过高层规划引导与强化学习联合优化,提升LLM在数学和自然科学推理任务中的准确性和泛化能力。

Comments 19 pages and 5 figures

详情
AI中文摘要

大型语言模型(LLMs)通过思维链(CoT)展现出强大的推理能力,但其token级别的生成倾向于局部决策,缺乏全局规划,常常导致冗余或不准确的推理。现有方法(如基于树的搜索和强化学习)试图解决这一问题,但计算成本高,且仍难以产生可靠的推理轨迹。为应对这些挑战,我们提出先规划后行动增强推理与组相对策略优化(PTA-GRPO),这是一个两阶段框架,旨在联合改进高层规划和细粒度CoT推理。具体而言,在第一阶段,给定LLM负责将CoT推理总结为紧凑的高层指导,然后用于监督微调。接着,我们引入一种指导感知的强化学习方法,联合优化最终输出和指导质量,提升推理效果。我们在数学和自然科学的十个推理基准上,使用五个覆盖多种数据模态的多样化基础模型进行评估。结果表明,PTA-GRPO在模型和任务上持续带来显著改进,展现出强大的有效性和泛化能力。

英文摘要

Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reasoning. Existing methods, such as tree-based search and reinforcement learning (RL), attempt to address this issue but incur high computational costs and still struggle to produce reliable reasoning trajectories. To address these challenges, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to jointly improve high-level planning and fine-grained CoT reasoning. Specifically, in the first stage, a given LLM is responsible for summarizing CoT reasoning into compact high-level guidance, which is then leveraged for supervised fine-tuning. Then, we introduce a guidance-aware reinforcement learning method that jointly optimizes the final output and the quality of guidance, enhancing reasoning effectiveness. We evaluate PTA-GRPO on ten reasoning benchmarks across mathematics and natural sciences, using five diverse base models spanning multiple data modalities. The results show that PTA-GRPO consistently delivers significant improvements across models and tasks, demonstrating strong effectiveness and generalization.

2510.01336 2026-05-27 cs.CL cs.AI cs.LG 版本更新

HiSpec: Hierarchical Speculative Decoding for LLMs

HiSpec: 分层推测解码用于大语言模型

Avinash Kumar, Sujay Sanghavi, Poulami Das

发表机构 * Department of Electrical and Computer Engineering, The University of Texas at Austin(德克萨斯大学奥斯汀分校电子与计算机工程系)

AI总结 提出HiSpec框架,利用早期退出模型进行低开销中间验证,通过重用键值缓存和隐藏状态提高吞吐量,平均加速1.28倍,最高2.01倍,且不损失准确性。

详情
AI中文摘要

推测解码通过使用较小的草稿模型推测令牌,再由较大的目标模型验证,从而加速LLM推理。验证通常是瓶颈(例如,当3B模型为70B目标模型推测时,验证速度比令牌生成慢4倍),但大多数先前工作只关注加速草稿生成。“中间”验证通过早期丢弃不准确的草稿令牌来减少验证时间,但现有方法在引入中间验证器时会产生大量训练开销,增加内存占用以协调中间验证步骤,并依赖近似启发式方法损害准确性。我们提出$\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$,一种高吞吐量推测解码框架,利用早期退出模型进行低开销中间验证。早期退出模型允许令牌通过跳过层遍历提前退出,并经过显式训练,使得选定层的隐藏状态可解释,从而在不显著增加计算和内存开销的情况下,非常适合中间验证。为了进一步提高资源效率,我们设计了一种方法,使HiSpec能够在草稿模型、中间验证器和目标模型之间重用键值缓存和隐藏状态。为了保持准确性,HiSpec定期针对目标模型验证中间验证器接受的草稿令牌。我们在各种代表性基准和模型上的评估表明,与基线单层推测相比,HiSpec平均提高吞吐量1.28倍,最高达2.01倍,且不损失准确性。

英文摘要

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. $\textit{``Intermediate"}$ verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose $\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$, a framework for high-throughput speculative decoding that exploits $\textit{early-exit (EE) models}$ for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.

2509.26600 2026-05-27 cs.CL cs.AI 版本更新

When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation

当LLM自我基准测试:解构自动评估中的自我偏见

Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch

发表机构 * Google(谷歌) ETH Zurich(苏黎世联邦理工学院)

AI总结 研究LLM自动创建基准测试时存在的自我偏见问题,发现测试集生成和评估两个环节均产生偏见,导致模型偏爱自身输出,并提出了多样性指标以部分缓解该偏见。

详情
AI中文摘要

随着LLM迅速饱和现有基准测试,使用LLM自动创建基准测试(LLM-as-a-benchmark)——即模型生成测试输入(LLM-as-a-testset)并评估输出(LLM-as-an-evaluator)——已成为人工策划的廉价替代方案。我们表明,这种范式存在一个根本问题:LLM生成的基准测试系统性地偏爱创建它们的模型。以机器翻译为主要测试平台,我们发现自我偏见源于两个叠加来源:LLM-as-a-testset和LLM-as-an-evaluator,它们的组合放大了这种效应。关键的是,即使测试数据在显式多样性控制下生成,每个模型的隐式风格倾向也会产生同质的、模型特定的输出,从而抬高其自身分数。使用我们提出的多样性度量增加源文本多样性,可以部分缓解这种偏见。自我偏见足够强,以至于每个模型都将自己排在首位,覆盖了同行共识排序。我们确认该现象扩展到Chatbot Arena任务上的开放式生成。

英文摘要

As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a cheap alternative to human curation. We show that this paradigm has a fundamental problem: LLM-generated benchmarks systematically favor the model that created them. Using machine translation as our primary testbed, we find that self-bias arises from two additive sources, LLM-as-a-testset and LLM-as-an-evaluator, and their combination amplifies the effect. Crucially, even when test data is generated with explicit diversity controls, each model's implicit stylistic tendencies produce homogeneous, model-specific outputs that inflate its own scores. Increasing source text diversity, using our proposed diversity metric, partially mitigates this bias. Self-bias is strong enough to cause each model to rank itself first, overriding the peer-consensus ordering. We confirm that the phenomenon extends to open-ended generation on the Chatbot Arena task.

2509.21552 2026-05-27 cs.CV cs.CL 版本更新

Learning GUI Grounding with Spatial Reasoning from Visual Feedback

通过视觉反馈学习具有空间推理能力的 GUI 定位

Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim

发表机构 * University of Edinburgh(爱丁堡大学) Microsoft(微软)

AI总结 本文提出将 GUI 定位重构为交互式搜索任务,利用多步在线强化学习训练 GUI-Cursor 模型,通过光标视觉反馈提升空间推理能力,在 GUI 定位和代理任务上超越强基线。

Comments Accepted at ICML 2026

详情
AI中文摘要

图形用户界面(GUI)定位通常被构建为坐标预测任务——给定自然语言指令,生成屏幕上用于点击和按键等操作的坐标。然而,最近的视觉语言模型(VLM)在处理高分辨率和复杂布局的 GUI 图像时,往往无法预测准确的数字坐标。为了解决这个问题,我们将 GUI 定位重构为交互式搜索任务,其中 VLM 生成动作以移动 GUI 中的光标来定位 UI 元素。在每一步,模型确定目标对象,评估光标与目标之间的空间关系,并根据移动历史将光标移近目标。在这个交互过程中,渲染的光标提供视觉反馈,帮助模型将其预测与相应的屏幕位置对齐。我们使用基于密集轨迹奖励函数的多步在线强化学习来训练我们的 GUI 定位模型 GUI-Cursor。实验结果表明,GUI-Cursor 在 GUI 定位和代理任务上超越了强基线,在相同基础模型下实现了更优性能,同时需要更少的训练数据。进一步分析表明,GUI-Cursor 学会在更困难的示例上自适应地执行更多步骤,并在分布外领域获得更好的空间推理能力。

英文摘要

Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing GUI images with high resolutions and complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Experimental results demonstrate that GUI-Cursor surpasses strong baselines in GUI grounding and agentic tasks, achieving superior performance with the same base models while requiring less training data. Further analysis shows that GUI-Cursor learns to adaptively conduct more steps on more difficult examples, and it obtains better spatial reasoning capability on out-of-distribution domains.

2509.21106 2026-05-27 cs.CL cs.IR 版本更新

BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

BESPOKE:通过诊断反馈进行搜索增强型大语言模型个性化的基准测试

Hyunseo Kim, Sangam Lee, Kwangwook Seo, Dongha Lee

发表机构 * Department of Artificial Intelligence, Yonsei University, Seoul, Republic of Korea(人工智能系,延世大学,首尔,大韩民国)

AI总结 提出BESPOKE基准,通过收集真实用户历史并配对细粒度偏好分数与反馈,系统评估搜索增强型大语言模型在个性化信息检索任务中的表现。

Comments Accepted to ICML 2026

详情
AI中文摘要

搜索增强型大语言模型(LLMs)通过将检索集成到生成中,推进了信息寻求任务,相比传统搜索系统减少了用户的认知负担。然而,它们在充分满足多样化用户需求方面仍显不足,这需要识别同一查询如何反映不同用户的意图,并以偏好的形式传递信息。尽管最近如ChatGPT和Gemini等系统通过利用用户历史尝试个性化,但对这种个性化的系统评估尚不充分。为填补这一空白,我们提出了BESPOKE,一个用于评估搜索增强型LLMs个性化的现实基准。BESPOKE的设计既现实(通过直接从人类收集真实的聊天和搜索历史)又具有诊断性(通过将响应与细粒度的偏好分数和反馈配对)。该基准通过长期、深度参与的人工标注构建,标注者贡献自己的历史、撰写带有详细信息需求的查询,并用分数和诊断反馈评估响应。利用BESPOKE,我们进行了系统分析,揭示了信息寻求任务中有效个性化的关键要求,为个性化搜索增强型LLMs的细粒度评估提供了基础。我们的代码和数据可在https://augustinlib.github.io/BESPOKE/获取。

英文摘要

Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization is under-explored. To address this gap, we propose BESPOKE, the realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, where human annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.

2508.18444 2026-05-27 cs.CL cs.AI 版本更新

How Reliable are LLMs for Reasoning on the Re-ranking task?

LLMs在重排序任务上的推理有多可靠?

Nafis Tanveer Islam, Zhiming Zhao

发表机构 * Multiscale Networked Systems (MNS) Group, University of Amsterdam(多尺度网络系统(MNS)组,阿姆斯特丹大学)

AI总结 本研究分析不同训练方法对LLMs在重排序任务中语义理解的影响,并探究模型能否生成更知情的文本推理以克服透明度和数据有限的挑战。

Comments This chapter has been published in Advancements in AI From Foundations to Cross-Disciplinary Applications, Springer, 2026

详情
AI中文摘要

随着大型语言模型(LLMs)语义理解能力的提升,它们表现出对人类更高的认知和一致性,但这以牺牲透明度为代价。尽管通过实验分析取得了有希望的结果,但深入理解LLM的内部工作机制对于理解重排序背后的推理是不可避免的,这为最终用户提供了解释,使他们能够做出明智的决定。此外,在新开发的系统中,用户参与有限且排序数据不足,准确地对内容进行重排序仍然是一个重大挑战。虽然各种训练方法影响LLMs的训练并生成推理,但我们的分析发现,一些训练方法比其他方法表现出更好的可解释性,这意味着并非所有训练方法都学到了准确的语义理解;相反,获得了抽象知识以优化评估,这引发了对LLMs真正可靠性的质疑。因此,在这项工作中,我们分析了不同训练方法如何影响LLMs在重排序任务中的语义理解,并调查这些模型是否能够生成更知情的文本推理,以克服透明度或LLMs以及有限训练数据的挑战。为了分析用于重排序任务的LLMs,我们利用来自环境和地球科学领域的相对较小的排序数据集来对检索到的内容进行重排序。此外,我们还分析了可解释信息,以查看是否可以使用可解释性对重排序进行推理。

英文摘要

With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM's internal workings is unavoidable to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of transparency or LLMs and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environment and the Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned using explainability.

2506.00250 2026-05-27 cs.CL cs.IT math.IT 版本更新

PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark

PersianMedQA: 在波斯语-英语双语医学问答基准上评估大型语言模型

Mohammad Javad Ranjbar Kalahroodi, Amirhossein Sheikholselami, Sepehr Karimi, Sepideh Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

发表机构 * School of Electrical and Computer Engineering, University of Tehran(德黑兰大学电气与计算机工程学院) Shahid Beheshti University of Medical Sciences(沙希德·贝赫什提医学科学大学) Institute for Research in Fundamental Sciences (IPM)(基础科学研究所(IPM))

AI总结 本文提出PersianMedQA数据集,包含20,785道波斯语医学多选题,用于评估41个LLM在零样本和思维链设置下的双语医学推理能力,发现闭源通用模型表现最佳,而波斯语模型性能较差,且翻译会丢失文化临床线索。

Comments Accepted at LREC 2026 (The Fifteenth Language Resources and Evaluation Conference), Palma, Mallorca, Spain, May 2026

详情
AI中文摘要

大型语言模型(LLM)在广泛的自然语言处理(NLP)基准测试中取得了显著性能,通常超越人类水平。然而,它们在医学等高风险领域中的可靠性,尤其是在低资源语言中,仍未得到充分探索。在这项工作中,我们引入了PersianMedQA,这是一个大规模数据集,包含来自14年伊朗国家医学考试的20,785道经过专家验证的波斯语医学多选题,涵盖23个医学专业,旨在评估LLM在波斯语和英语中的表现。我们对41个最先进的模型进行了基准测试,包括通用型、波斯语和医学LLM,在零样本和思维链(CoT)设置下。我们的结果表明,闭权重通用模型(例如GPT-4.1)持续优于所有其他类别,在波斯语中达到83.09%的准确率,在英语中达到80.7%,而波斯语LLM(例如Dorna)表现显著较差(例如波斯语中34.9%),通常在指令遵循和领域推理方面存在困难。我们还分析了翻译的影响,表明虽然英语性能通常更高,但3-10%的问题只能通过波斯语正确回答,因为翻译丢失了文化和临床上下文线索。最后,我们证明,如果没有强大的领域或语言适应,仅凭模型大小不足以获得稳健的性能。PersianMedQA为评估LLM中的双语和文化基础医学推理提供了基础。该数据集以及双语医学词典可在以下网址获取:https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA 。

英文摘要

Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale dataset of 20,785 expert-validated multiple-choice Persian medical questions from 14 years of Iranian national medical exams, spanning 23 medical specialties and designed to evaluate LLMs in both Persian and English. We benchmark 41 state-of-the-art models, including general-purpose, Persian, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-weight general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.09% accuracy in Persian and 80.7% in English, while Persian LLMs such as Dorna underperform significantly (e.g., 34.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, 3-10% of questions can only be answered correctly in Persian due to cultural and clinical contextual cues that are lost in translation. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating bilingual and culturally grounded medical reasoning in LLMs. The dataset, along with a bilingual medical dictionary, is available: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA .

2508.05305 2026-05-27 cs.CL 版本更新

SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

SONAR-LLM:在句子嵌入中思考、以令牌说话的自回归Transformer

Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Nikita Kurdiukov, Aysel Mirzoeva, Anna Borisiuk, Andrey Kuznetsov, Anton Razzhigaev

发表机构 * FusionBrain Lab, AXXX(融合大脑实验室,AXXX) T-Tech MSU(莫斯科大学) DS-NLP Group, AXXX(自然语言处理组,AXXX)

AI总结 提出SONAR-LLM,一种在连续SONAR嵌入空间“思考”但通过令牌级交叉熵监督的解码器-only Transformer,融合语义抽象与似然训练信号,在39M至1.3B参数规模上取得有竞争力的生成质量。

详情
AI中文摘要

最近提出的大型概念模型(LCM)通过预测句子级嵌入序列并使用均方误差或扩散目标进行训练来生成文本。我们提出了SONAR-LLM,一种仅在连续SONAR嵌入空间中“思考”的解码器-only Transformer,但通过冻结的SONAR解码器传播的令牌级交叉熵进行监督。这种混合目标保留了LCM的语义抽象,同时消除了其扩散采样器并恢复了基于似然的训练信号。在从39M到1.3B参数的模型规模上,SONAR-LLM取得了有竞争力的生成质量。我们报告了扩展趋势、消融实验、基准测试结果,并发布了完整的训练代码和所有预训练检查点,以促进可重复性和未来研究。

英文摘要

The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.

2506.23149 2026-05-27 cs.CL 版本更新

AlignEvoSkill: Towards Knowledge-Aware and Task-Aligned Agent Skill Evolution

AlignEvoSkill: 迈向知识感知与任务对齐的智能体技能进化

Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

发表机构 * Singapore Management University(新加坡国立大学)

AI总结 提出AlignEvoSkill框架,通过联合建模知识覆盖和任务对齐,从失败轨迹中识别知识标签、检索并适配候选技能,再基于知识覆盖和任务对齐分数筛选高质量技能,在3个基准和4个LLM骨干上相对提升34.7%,实现技能进化新SOTA且成本更低。

详情
AI中文摘要

可重用技能在提升基于LLM的智能体中扮演关键角色,但现有技能进化方法往往无法确保进化后的技能既覆盖任务所需的知识,又与目标任务保持对齐。结果,进化后的技能可能不完整或无关。为解决这一局限,我们提出AlignEvoSkill,一个联合建模知识覆盖和任务对齐的技能进化框架。给定失败的任务轨迹,AlignEvoSkill首先识别与任务相关的知识标签,检索互补的先前技能,并将它们适配为弥补缺失知识的候选技能。然后,它使用基于知识覆盖和任务对齐分数的联合过滤标准选择高质量候选技能。在3个基准和4个LLM骨干上的实验表明,AlignEvoSkill相对于非进化基线实现了34.7%的相对增益,并以更低的成本实现了技能进化的新SOTA。

英文摘要

Reusable skills play a key role in improving LLM-based agents, but existing skill-evolution methods often fail to ensure that evolved skills both cover the knowledge required by the task and remain aligned with the target task. As a result, evolved skills could be incomplete or irrelevant. To address this limitation, we propose AlignEvoSkill, a skill-evolution framework that jointly models knowledge coverage and task alignment. Given failed task trajectories, AlignEvoSkill first identifies task-relevant knowledge tags, retrieves complementary prior skills, and adapts them into candidate skills that address missing knowledge. It then selects high-quality candidates using a joint filtering criterion based on knowledge-coverage and task-alignment scores. Experiments on 3 benchmarks with4 LLM backbones show a 34.7% relative gain of AlignEvoSkill over the non-evolution baseline and achieves a new SOTA in skill evolution with lower cost.

2506.21443 2026-05-27 cs.CL cs.AI 版本更新

Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

领域知识增强的大语言模型用于欺诈和概念漂移检测

Ali Şenol, Garima Agrawal, Huan Liu

发表机构 * School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU)(计算与增强智能学院(SCAI),亚利桑那州立大学) Department of Computer Engineering, Tarsus University(计算机工程系,塔鲁斯大学) Minerva CQ and HumaConn AI Consulting(Minerva CQ和HumaConn人工智能咨询) School of Computing and Augmented Intelligence (SCAI), Arizona State University(计算与增强智能学院(SCAI),亚利桑那州立大学)

AI总结 提出一种领域知识增强的大语言模型框架,通过集成结构化领域知识和漂移检测单元,实现高准确率的欺诈对话检测和概念漂移分类。

详情
AI中文摘要

在动态平台上检测欺骗性对话变得越来越困难,原因是语言模式的演变和概念漂移(CD)——即随着时间推移,语义或主题的转变会改变交互的上下文或意图。这些转变可能掩盖恶意意图或模仿正常对话,使得准确分类具有挑战性。尽管大语言模型(LLMs)在自然语言任务中表现出色,但在风险敏感场景中,它们常常面临上下文模糊和幻觉问题。为了解决这些挑战,我们提出了一个领域知识(DK)增强的LLM框架,该框架将预训练的LLM与结构化的、任务特定的见解相结合,以执行欺诈和概念漂移检测。所提出的架构由三个主要组件组成:(1)一个DK-LLM模块,用于检测虚假或欺骗性对话;(2)一个漂移检测单元(OCDD),用于判断是否发生了语义转变;(3)第二个DK-LLM模块,用于将漂移分类为良性或欺诈性。我们首先使用虚假评论数据集验证领域知识的价值,然后将我们的完整框架应用于SEConvo,一个包含多种欺诈和垃圾攻击的多轮对话数据集。结果表明,我们的系统能够高精度地检测虚假对话,并有效分类漂移的性质。在结构化提示的引导下,基于LLaMA的实现达到了98%的分类准确率。与零样本基线的对比研究表明,在高风险NLP应用中,融入领域知识和漂移意识显著提高了性能、可解释性和鲁棒性。

英文摘要

Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD)-i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

2502.14321 2026-05-27 cs.MA cs.CL 版本更新

Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

超越自我对话:基于LLM的多智能体系统的通信中心综述

Bingyu Yan, Zhibo Zhou, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, Zhoujun Li, Chaozhuo Li, Xiaoming Zhang

发表机构 * School of Cyber Science and Technology, Beihang University, Beijing 100191, China(信息科学与技术学院,北京航空航天大学,北京100191,中国) The College of Cyber Security/College of Information Science and Technology, Jinan University, Guangzhou, China(网络安全学院/信息科学与技术学院,暨南大学,广州,中国) School of Computer Science and Engineering, Beihang University, Beijing 100083, China(计算机科学与工程学院,北京航空航天大学,北京100083,中国) School of Cyber Science and Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China(信息科学与技术学院,北京邮电大学,北京100876,中国)

AI总结 本文从通信视角综述基于大语言模型的多智能体系统,提出一个整合系统级和系统内部通信的结构化框架,分析智能体交互机制、挑战及未来方向。

Comments The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: {10.1007/s11704-026-50857-y}

详情
AI中文摘要

基于大语言模型的多智能体系统因其在复杂、协作和智能问题解决方面的潜力,近期获得了显著关注。现有综述通常根据应用领域或架构对基于LLM的多智能体系统(LLM-MAS)进行分类,忽略了通信在协调智能体行为和交互中的核心作用。为弥补这一空白,本文从通信中心的视角对LLM-MAS进行了全面综述。具体而言,我们提出了一个结构化框架,将系统级通信(架构、目标和协议)与系统内部通信(策略、范式、对象和内容)相结合,从而详细探索智能体如何交互、协商并实现集体智能。通过对近期文献的广泛分析,我们识别了多个维度的关键组件,并总结了其优势与局限性。此外,我们强调了当前挑战,包括通信效率、安全漏洞、基准测试不足和可扩展性问题,并指出了有前景的未来研究方向。本综述旨在帮助研究人员和实践者清晰理解LLM-MAS中的通信机制,从而促进鲁棒、可扩展且安全的多智能体系统的设计与部署。

英文摘要

Large language model-based multi-agent systems have recently gained significant attention due to their potential for complex, collaborative, and intelligent problem-solving capabilities. Existing surveys typically categorize LLM-based multi-agent systems (LLM-MAS) according to their application domains or architectures, overlooking the central role of communication in coordinating agent behaviors and interactions. To address this gap, this paper presents a comprehensive survey of LLM-MAS from a communication-centric perspective. Specifically, we propose a structured framework that integrates system-level communication (architecture, goals, and protocols) with system internal communication (strategies, paradigms, objects, and content), enabling a detailed exploration of how agents interact, negotiate, and achieve collective intelligence. Through an extensive analysis of recent literature, we identify key components in multiple dimensions and summarize their strengths and limitations. In addition, we highlight current challenges, including communication efficiency, security vulnerabilities, inadequate benchmarking, and scalability issues, and outline promising future research directions. This review aims to help researchers and practitioners gain a clear understanding of the communication mechanisms in LLM-MAS, thereby facilitating the design and deployment of robust, scalable, and secure multi-agent systems.

2506.03627 2026-05-27 cs.CL cs.AI 版本更新

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

提示的鲁棒性:增强大型语言模型对抗提示攻击的鲁棒性

Lin Mu, Guowei Chu, Li Ni, Lei Sang, Yiwen Zhang

发表机构 * School of Computer Science and Technology, Anhui University(安徽大学计算机科学与技术学院)

AI总结 提出RoP(提示鲁棒性)策略,通过错误校正和引导两个阶段,增强LLM对输入扰动的鲁棒性,在算术、常识和逻辑推理任务上显著提升性能。

Comments Accepted by IEEE Transactions on Artificial Intelligence

详情
AI中文摘要

大型语言模型(LLM)通过有效利用提示策略在各种任务中展现了卓越的性能。然而,它们对输入扰动高度敏感,例如拼写错误或轻微字符顺序错误,这些扰动会显著损害其性能。尽管在提示技术方面取得了进展,如思维链和自动提示生成,但开发一种明确减轻此类扰动负面影响的提示策略仍然是一个开放的挑战。为弥补这一差距,我们提出了提示鲁棒性(RoP),一种旨在增强LLM鲁棒性的新型提示策略。RoP包括两个阶段:错误校正和引导。在错误校正阶段,RoP应用多种扰动方法生成对抗样本,用于生成自动纠正输入错误的提示。在引导阶段,RoP基于校正后的输入生成最优引导提示,引导模型生成更鲁棒和准确的推理。通过在算术、常识和逻辑推理任务上的全面实验,我们证明RoP显著提高了LLM对抗对抗扰动的鲁棒性。至关重要的是,与干净输入场景相比,它仅以最小的精度下降保持了模型准确性,从而将RoP确立为在实际应用中增强LLM鲁棒性的实用且有效的方法。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can significantly impair their performance. Despite advances in prompting techniques such as Chain-of-Thought and automatic prompt generation, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy aimed at enhancing the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are used to generate prompts that correct input errors automatically. In the Guidance stage, RoP generates an optimal guidance prompt based on the corrected input, guiding the model to generate more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Crucially, it preserves model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.

2505.17163 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

OCR-Reasoning基准:揭示MLLMs在复杂文本丰富图像推理中的真实能力

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

发表机构 * South China University of Technology(华南理工大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 提出OCR-Reasoning基准,包含1069个人工标注样本,覆盖6种核心推理能力和18个实际推理任务,通过双标注(最终答案和逐步推理过程)评估多模态大语言模型在文本丰富图像推理中的能力,发现最先进模型准确率均低于50%。

Comments ICLR 2026

详情
AI中文摘要

近期多模态慢思考系统在各种视觉推理任务中表现出色。然而,由于缺乏专门且系统的基准,它们在文本丰富图像推理任务中的能力仍未得到充分研究。为填补这一空白,我们提出了OCR-Reasoning,一个新颖的基准,旨在系统评估多模态大语言模型在文本丰富图像推理任务上的表现。具体而言,OCR-Reasoning包含1069个人工标注的示例,涵盖文本丰富视觉场景中的6种核心推理能力和18个实际推理任务。与仅提供最终答案的现有文本丰富图像理解基准不同,本基准额外提供了详细的逐步推理过程。这种双标注使得能够同时评估模型的最终答案和推理过程,从而全面评估文本丰富推理能力。利用该基准,我们对最新的多模态大语言模型进行了全面评估。结果表明,即使是最先进的多模态大语言模型在文本丰富图像推理任务中也面临巨大困难,在我们的基准上没有一个模型的准确率超过50%,这表明文本丰富图像推理的挑战是一个亟待解决的问题。基准和评估脚本可在https://github.com/SCUT-DLVCLab/OCR-Reasoning获取。

英文摘要

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50\% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

2503.08600 2026-05-27 cs.CL 版本更新

NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

NSF-SciFy:从NSF资助数据库中挖掘科学主张

Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出NSF-SciFy数据集,从40万篇NSF摘要中提取280万条科学主张,通过零样本提示联合提取主张和研究提案,并在三个下游任务中微调语言模型取得显著提升。

Comments ACL 2026. 19 pages, 7 figures, 11 tables

详情
AI中文摘要

我们介绍了NSF-SciFy,这是一个从国家科学基金会(NSF)获奖摘要中提取的科学主张和研究提案的综合数据集。虽然以往的科学主张验证数据集在规模和范围上有限,但NSF-SciFy取得了重大进展,包含来自40万篇摘要的280万条主张,涵盖所有科学和数学学科。我们提出了两个重点子集:NSF-SciFy-MatSci,包含来自材料科学奖项的11.4万条主张;以及NSF-SciFy-20K,包含来自五个NSF理事会的13.5万条主张。使用零样本提示,我们开发了一种可扩展的方法来联合提取科学主张和研究提案。我们通过三个下游任务展示了该数据集的实用性:非技术性摘要生成、主张提取和研究提案提取。在我们的数据集上微调语言模型带来了显著改进,相对增益通常超过100%,特别是在主张和提案提取任务中。我们的错误分析表明,提取的主张具有高精度但召回率较低,这表明有进一步改进方法的机会。NSF-SciFy为大规模主张验证、科学发现追踪和元科学分析开辟了新的研究方向。代码和数据可在 https://github.com/darpa-scify/NSFSciFy 获取。

英文摘要

We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. While previous scientific claim verification datasets have been limited in size and scope, NSF-SciFy represents a significant advance with 2.8 million claims from 400,000 abstracts spanning all science and mathematics disciplines. We present two focused subsets: NSF-SciFy-MatSci with 114,000 claims from materials science awards, and NSF-SciFy-20K with 135,000 claims across five NSF directorates. Using zero-shot prompting, we develop a scalable approach for joint extraction of scientific claims and investigation proposals. We demonstrate the dataset's utility through three downstream tasks: non-technical abstract generation, claim extraction, and investigation proposal extraction. Fine-tuning language models on our dataset yields substantial improvements, with relative gains often exceeding 100%, particularly for claim and proposal extraction tasks. Our error analysis reveals that extracted claims exhibit high precision but lower recall, suggesting opportunities for further methodological refinement. NSF-SciFy enables new research directions in large-scale claim verification, scientific discovery tracking, and meta-scientific analysis. Code and data are available at https://github.com/darpa-scify/NSFSciFy.

2407.15073 2026-05-27 cs.AI cs.CL 版本更新

Multi-Agent Causal Discovery Using Large Language Models

多智能体因果发现使用大型语言模型

Hao Duong Le, Xin Xia, Haijie Xu, Chen Zhang

发表机构 * Department of Industrial Engineering, Tsinghua University(清华大学工业工程系)

AI总结 提出多智能体因果发现框架MAC,通过元融合机制结合自主选择SCD算法的辩论编码模块和基于元数据的对抗性辩论模块,在多个基准上取得最优性能。

详情
AI中文摘要

因果发现旨在识别变量之间的因果关系,是各科学领域的基本问题。传统的统计因果发现(SCD)方法仅依赖观测数据,忽略元数据中可用的上下文信息,而近期基于LLM的方法利用元数据但将大型语言模型(LLM)视为单一智能体,使其判断易受记忆或偏见关联影响。为解决这一差距,我们引入MAC(多智能体因果发现框架),将因果发现转化为多智能体辩论与自主选择SCD算法相结合。MAC通过元融合机制桥接两个互补模块:辩论编码模块(DCM)通过自主选择并执行最合适的SCD算法将初始图基于数据,以及元辩论模块(MDM)通过对抗性的肯定-否定-裁判辩论基于元数据精炼图。在五个基准数据集和三个指标(F1、SHD、NHD)上,MAC在五个统计基线和四个基于LLM的基线中取得了最佳综合性能,在使用Gemini-2.0-Flash时在15个评估点中排名第一10次——包括完美重建地震图——并在三个骨干LLM上保持稳健。

英文摘要

Causal discovery aims to identify causal relationships between variables and is a fundamental problem across the sciences. Traditional statistical causal discovery (SCD) methods rely solely on observational data and ignore the contextual information available in metadata, whereas recent LLM-based methods exploit metadata but treat the large language model (LLM) as a single agent, leaving its judgments vulnerable to memorized or biased associations. To address this gap, we introduce MAC (Multi-Agent Causal Discovery Framework), which casts causal discovery as a multi-agent debate coupled with the autonomous selection of an SCD algorithm. MAC combines two complementary modules, bridged by a Meta Fusion mechanism: a Debate-Coding Module (DCM) that grounds an initial graph in data by autonomously selecting and executing the best-suited SCD algorithm, and a Meta-Debate Module (MDM) that refines the graph through an adversarial Affirmative-Negative-Judge debate over the metadata. Across five benchmark datasets and three metrics (F1, SHD, NHD), MAC achieves the best aggregate performance among five statistical and four LLM-based baselines, ranking first on 10 of 15 evaluation points with Gemini-2.0-Flash -- including a perfect reconstruction of the Earthquake graph -- and remains robust across three backbone LLMs.

2408.15787 2026-05-27 cs.CL cs.IR 版本更新

Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions

交互式智能体:通过角色扮演的LLM-LLM交互模拟咨询师-来访者心理咨询

Huachuan Qiu, Zhenzhong Lan

发表机构 * School of Engineering, Westlake University(西lake大学工程学院)

AI总结 提出Interactive Agents框架,通过LLM-LLM角色扮演生成高质量心理咨询对话数据,解决隐私、成本和扩展性问题,并验证其治疗有效性和模型性能。

Comments Accepted to *SEM2026

详情
AI中文摘要

创建有效的心理健康支持对话系统需要高质量的多轮咨询对话数据,然而收集真实的咨询师-来访者对话面临重大挑战,包括隐私问题、高成本和有限的扩展性。我们提出了 extbf{Interactive Agents},一个新颖的框架,通过受控的LLM-LLM交互模拟自然的咨询对话。该框架引入了两个关键创新:(1) 一个个性化的来访者智能体,在整个会话过程中保持一致的心理学特征;(2) 一个咨询师智能体,实现了一个基于理论的三阶段治疗模型,包括探索、洞察和行动阶段。通过使用自动评估指标和基于工作联盟清单的专业咨询师评估进行严格评估,我们证明我们的框架生成了具有治疗有效性的对话,其质量与人类生成的会话相当。在我们的合成数据集(SimPsyDial)上微调的模型,在基于LLM的咨询师的标准成对聊天机器人竞技场评估中达到了最先进的性能。我们的框架提供了一种可扩展、保护隐私的方法,用于生成高质量的咨询对话数据,同时保持专业的治疗标准。

英文摘要

Creating effective dialogue systems for mental health support requires high-quality multi-turn counseling dialogue data, yet collecting real counselor-client conversations presents significant challenges, including privacy concerns, high costs, and limited scalability. We present \textbf{Interactive Agents}, a novel framework that simulates naturalistic counseling dialogues through controlled LLM-to-LLM interactions. The framework introduces two key innovations: (1) a personalized client agent that maintains consistent psychological characteristics throughout a session, and (2) a counselor agent that implements a theoretically grounded three-stage therapeutic model comprising the exploration, insight, and action phases. Through rigorous evaluation using both automatic metrics and professional-counselor assessments based on the Working Alliance Inventory, we demonstrate that our framework generates therapeutically valid dialogues that are comparable in quality to human-generated sessions. Models fine-tuned on our proposed synthetic dataset (SimPsyDial) achieve state-of-the-art performance in a standard pairwise chatbot-arena evaluation of LLM-based counselors. Our framework provides a scalable, privacy-preserving method for generating high-quality counseling dialogue data while maintaining professional therapeutic standards.