arXivDaily: arXiv daily academic digest, updated Monday through Friday
cs.CL Natural Language Processing (197)

1. Large Language Models and Foundation Models (40 papers)

2606.11206 2026-06-11 cs.CL cs.LG New submission

Compatibility-Aware Dynamic Fine-Tuning for Large Language Models

Yucheng Zhou, Junwei Sheng, Qianning Wang, Jianbing Shen

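Affiliations: SKL-IOTSC, CIS, University of Macau; Auckland University of Technology

AI summary: Proposes Compatibility-Aware Dynamic Fine-Tuning (CADFT), which uses model likelihoods to dynamically modulate supervised updates, suppressing high-variance gradients from incompatible samples and improving training stability and generalization.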

Comments
ACL 2026

English abstract

Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.
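
As a rough illustration of the idea in this abstract, the sketch below scales a DFT-style token loss by a sample-level compatibility weight. It assumes that DFT rescales each token's negative log-likelihood by the (detached) token probability, and it uses the geometric-mean token probability as the compatibility signal. The abstract gives no formula, so the signal and the names `compatibility` and `cadft_like_sample_loss` are assumptions for illustration, not the authors' method.

```python
import math

def dft_token_losses(token_logprobs):
    """Assumed DFT-style token loss: -p_t * log p_t, with p_t treated as a constant."""
    return [-math.exp(lp) * lp for lp in token_logprobs]

def compatibility(token_logprobs):
    """Hypothetical sample-level signal: geometric-mean token probability in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def cadft_like_sample_loss(token_logprobs):
    """Mean token loss scaled by the sample's compatibility (also treated as a constant)."""
    losses = dft_token_losses(token_logprobs)
    return compatibility(token_logprobs) * sum(losses) / len(losses)

# Toy log-probabilities the current policy assigns to two demonstrations.
compatible = [math.log(0.9), math.log(0.8), math.log(0.85)]
incompatible = [math.log(0.05), math.log(0.02), math.log(0.1)]

for name, sample in [("compatible", compatible), ("incompatible", incompatible)]:
    print(name, round(compatibility(sample), 4), round(cadft_like_sample_loss(sample), 4))
```

Under this assumed weighting, a demonstration the policy finds unlikely contributes a much smaller update, which is the qualitative behavior the abstract describes.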

2606.11209 2026-06-11 cs.CL cs.AI cs.LG New submission

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

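Affiliations: LMU Munich; Harvard University; University of Cambridge; Mina AI; Konrad Zuse School of Excellence in Reliable AI (relAI)

AI summary: Proposes ProcessThinker, a post-training method that needs no explicit process reward model; using a step-tagged format and a rollout-based process reward, it provides dense step-level rewards for multi-step reasoning and improves the consistency of multimodal reasoning.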

Comments
Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 7 pages, 1 figure

English abstract

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.
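
The rollout-based step reward described above can be sketched as follows. This is a minimal illustration, assuming a caller-supplied sampler and answer checker; `sample_continuation`, `is_correct`, and `num_rollouts` are hypothetical names, and the toy sampler stands in for a real model.

```python
def rollout_step_rewards(steps, sample_continuation, is_correct, num_rollouts=8):
    """For each step prefix, sample continuations and use the success rate as the step reward."""
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        successes = sum(
            1 for _ in range(num_rollouts) if is_correct(sample_continuation(prefix))
        )
        rewards.append(successes / num_rollouts)
    return rewards

# Toy stand-ins: a continuation reaches the right answer only once the prefix
# contains the key step. A real setup would sample from the policy model instead.
def toy_sampler(prefix):
    return "42" if any("key step" in step for step in prefix) else "0"

def toy_checker(final_answer):
    return final_answer == "42"

trace = ["restate the question", "key step: read the value off the chart", "state the answer"]
print(rollout_step_rewards(trace, toy_sampler, toy_checker, num_rollouts=4))
```

Each reward only needs final-answer verification, so no separate process reward model is trained; the cost is the extra rollouts per step.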

2606.11211 2026-06-11 cs.CL cs.AI cs.LG New submission

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

Prakul Sunil Hiremath, Harshit R. Hiremath

Affiliations: Department of Computer Science and Engineering, Visvesvaraya Technological University, Belagavi; Department of Computer Science and Business System, SG Balekundri Institute of Technology, Belagavi

AI summary: Finds that increasing the chain-of-thought reasoning budget beyond a task-specific threshold can make models overconfident in incorrect answers; names the phenomenon calibration drift and introduces the CABStop stopping rule.

Comments
31 pages, 4 figures, 3 tables. Introduces Calibration Drift Under Reasoning (CDUR) with theoretical analysis and preliminary experiments; includes CABStop; code and data available

English abstract

The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reasoning is widely used to improve accuracy and reliability, but its effect on calibration is not fully understood. We show that this picture is incomplete: in some settings, increasing the reasoning budget beyond a task-specific threshold can cause models to become systematically overconfident, assigning high confidence to incorrect answers. We call this phenomenon Calibration Drift Under Reasoning (CDUR) and study it both theoretically and empirically. We define reasoning budget B and analyze conditions under which Expected Calibration Error ECE(B) follows a non-monotonic pattern: it first decreases as reasoning corrects errors, then increases as longer reasoning produces internally consistent but incorrect explanations. We propose a Hypothesis Lock-In model based on autoregressive generation to explain this behavior. We evaluate Llama-3.1-8B and Llama-3.3-70B on 47 reasoning-trap questions across four reasoning budgets and three seeds (1,368 API calls; 574 valid responses). The 8B model shows non-monotonic calibration behavior, while results for the 70B model are limited to baseline evaluation and are inconclusive for budget-dependent effects. We introduce CABStop, a calibration-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate. These results suggest that increasing reasoning depth does not always improve reliability and should be monitored carefully.
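
For readers unfamiliar with the metric, Expected Calibration Error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin. The sketch below uses that standard equal-width binning; the bin count and the toy data are assumptions, and the paper's exact evaluation setup may differ.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece

# Toy data: four answers all given 0.95 confidence, only half of them correct.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
print(round(overconfident, 3))
```

Computing this value separately for each reasoning budget B gives the ECE(B) curve whose shape the abstract discusses.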

2606.11375 2026-06-11 cs.CL cs.AI cs.LG New submission

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

Orion Reblitz-Richardson

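Affiliations: Distiller Labs

AI summary: To address the rapid saturation of linear-probe accuracy during pre-training, proposes a fragility metric that measures probe robustness via the activation-noise level, revealing evolution in representational structure that accuracy cannot capture.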

Comments
22 pages, 5 figures. Code and datasets at this https URL

English abstract

Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical → compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.
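
To make the definition concrete, the sketch below sweeps Gaussian noise over toy one-dimensional "activations" and reports the first noise level at which a fixed sign probe drops below an accuracy threshold. The threshold of 0.9, the noise grid, and the name `collapse_noise_level` are assumptions; the abstract does not specify the paper's collapse criterion.

```python
import random

def probe_accuracy(activations, labels, sigma, rng):
    """Accuracy of a fixed sign probe after adding Gaussian noise of scale sigma."""
    hits = 0
    for x, y in zip(activations, labels):
        noisy = x + rng.gauss(0.0, sigma)
        hits += (noisy > 0) == y
    return hits / len(labels)

def collapse_noise_level(activations, labels, sigmas, threshold=0.9, seed=0):
    """Smallest noise level in `sigmas` at which probe accuracy falls below `threshold`."""
    rng = random.Random(seed)
    for sigma in sigmas:
        if probe_accuracy(activations, labels, sigma, rng) < threshold:
            return sigma
    return float("inf")

labels = [i % 2 == 0 for i in range(2000)]
wide_margin = [4.0 if y else -4.0 for y in labels]    # well-separated classes
narrow_margin = [0.5 if y else -0.5 for y in labels]  # barely separated classes
noise_grid = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]

print(collapse_noise_level(wide_margin, labels, noise_grid))
print(collapse_noise_level(narrow_margin, labels, noise_grid))
```

Both toy datasets are perfectly separable without noise, so plain probe accuracy cannot tell them apart, while the collapse level does.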

2606.11459 2026-06-11 cs.CL cs.AI cs.LG New submission

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

Fei Wang, Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon

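Affiliations: Google; UCLA

AI summary: Proposes the APEX framework, which dynamically stratifies data into Easy, Hard, and Mixed tiers to prioritize high-leverage subsets, improving prompt-optimization efficiency under a fixed budget, with average gains over the initial prompt of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B across three benchmarks.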

English abstract

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.
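
One simple way to picture the Easy/Hard/Mixed stratification is to look at how each development example fared across the prompts already tried. The sketch below assumes a pass/fail record per example and uses "all passed", "all failed", and "otherwise" as the tier rules; these rules and the name `stratify` are assumptions, since the abstract does not give APEX's exact criteria.

```python
def stratify(results):
    """Split examples into tiers from their pass/fail history across tried prompts.

    results maps an example id to a list of booleans, one per prompt in the lineage.
    """
    tiers = {"easy": [], "hard": [], "mixed": []}
    for example_id, outcomes in results.items():
        if all(outcomes):
            tiers["easy"].append(example_id)    # every prompt so far solves it
        elif not any(outcomes):
            tiers["hard"].append(example_id)    # no prompt so far solves it
        else:
            tiers["mixed"].append(example_id)   # outcome depends on the prompt
    return tiers

history = {
    "ex1": [True, True, True],
    "ex2": [False, False, False],
    "ex3": [True, False, True],
}
print(stratify(history))
```

Only the mixed tier separates candidate prompts from one another, which is why spending the evaluation budget there is more informative than re-scoring examples every prompt passes or fails.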

2606.11470 2026-06-11 cs.CL New submission

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

Avinash Anand, Mahisha Ramesh, Avni Mittal, Ashutosh Kumar, Erik Cambria, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah

Affiliations: Singapore Institute of Technology; Nvidia AI Center (SNAIC); MIDAS Lab, IIIT Delhi; MIDAS Lab, IIT Mandi; Owl Autonomous Imaging, Inc.; College of Computing & Data Science, NTU Singapore; NVIDIA AI Technology Centre, Singapore; Department of Computer Science and Engineering, IIT Kanpur

AI summary: Systematically surveys more than 300 papers and proposes a structured taxonomy of LLM reasoning research that covers multiple reasoning paradigms, analyzes methodological trends, and summarizes common limitations and failure modes, aiming to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems.

English abstract

Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem solving, and contextual understanding, their reasoning behavior is often inconsistent and sensitive to prompting strategies, task design, and model scale. This survey provides a systematic analysis of more than 300 recent papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology to examine how reasoning capabilities emerge in LLMs and where they fail. We make three main contributions. First, we introduce a structured taxonomy of LLM reasoning research, covering Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Second, we analyze methodological trends across these paradigms, including prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. Third, we synthesize recurring limitations and failure modes, such as reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. By organizing a rapidly expanding literature, this survey offers a unified view of the current capabilities and limitations of reasoning in LLMs. We also identify emerging research directions, including meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning. Overall, this work aims to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems in future language models.

2606.11512 2026-06-11 cs.CL New submission

SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment

Kaiwen Shi, Zheyuan Zhang, Yanfang Ye

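Affiliations: University of Notre Dame

AI summary: Proposes SAGE, a target that builds group-level uncertainty targets from the model's sampled responses via an answer-conditioned uncertainty geometry, combined with the GUPO training framework to optimize verbal uncertainty expressions; across several reasoning tasks it improves uncertainty ranking and reduces calibration error and overconfidence.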

English abstract

Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional calibration problem: the appropriate uncertainty target for a prompt should be estimated from repeated model outputs rather than from an isolated response. However, group rollouts alone are insufficient, since the resulting target must provide a useful training signal. Existing targets only partially satisfy this requirement. We propose SAGE, Semantic-Answer Guided Entropy, a group-level uncertainty target that constructs an answer-conditioned uncertainty geometry over sampled responses. SAGE preserves categorical, numeric, and symbolic answer distinctions while maintaining a smooth and scale-preserving calibration signal. We further apply this target through Group-Uncertainty Preference Optimization, or GUPO, an uncertainty-channel training framework that supervises verbal uncertainty expressions rather than the full response. Experiments across factual, mathematical, and multiple-choice reasoning tasks show improved uncertainty ranking, lower calibration error, and reduced overconfidence.
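
The premise that an uncertainty target should come from repeated outputs can be illustrated with a much simpler baseline than SAGE: a normalized Shannon entropy over exact-match answer groups. The sketch below is that baseline only; it does not reproduce SAGE's answer-conditioned geometry, and the function name and normalization are assumptions.

```python
import math
from collections import Counter

def answer_entropy_target(answers):
    """Normalized entropy of sampled final answers: 0 if all agree, 1 if all differ."""
    counts = Counter(answers)
    total = len(answers)
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(total)  # log(total) is the entropy when every answer differs

print(answer_entropy_target(["12", "12", "12", "12"]))
print(round(answer_entropy_target(["12", "12", "13", "7"]), 3))
```

A baseline like this treats "12" versus "13" the same as "12" versus "7000"; the abstract's point about preserving numeric and symbolic distinctions is aimed at exactly that limitation.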

2606.11552 2026-06-11 cs.CL cs.LG New submission

Teaching Diffusion to Speculate Left-to-Right

Lexington Whalen, Yuki Ito, Ryo Sakamoto

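AI summary: Targeting the inference bottleneck of autoregressive decoding, proposes three training-time interventions (positional weighting, a first-error focal loss, and a chain loss) that bridge the asymmetry between bidirectional generation in block-diffusion draft models and left-to-right verification by autoregressive target models, substantially increasing accepted draft length.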

Comments
13 pages, technical report

English abstract

Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward. In this work, we offer an empirical analysis of three training-time interventions that narrow this gap: token positional weighting, a first-error focal loss that targets the position that breaks the accepted prefix within each block, and a chain loss term that substitutes a differentiable surrogate for the expected accepted length. The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined. Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
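
The left-to-right asymmetry is easy to see in a small example. The sketch below counts the accepted prefix under a simplified exact-match verifier and computes an expected accepted length as a sum of prefix products of per-position acceptance probabilities. The exact-match check stands in for real rejection sampling, and treating positions as independent is an assumption; the paper's chain loss surrogate may be defined differently.

```python
def accepted_length(draft_tokens, target_tokens):
    """Length of the longest draft prefix that matches the target token by token."""
    length = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # everything after the first mismatch is discarded
        length += 1
    return length

def expected_accepted_length(accept_probs):
    """Sum over k of the probability that positions 1..k are all accepted."""
    total, prefix_prob = 0.0, 1.0
    for p in accept_probs:
        prefix_prob *= p
        total += prefix_prob
    return total

print(accepted_length([5, 8, 3, 9], [5, 8, 7, 9]))
# The same per-position quality, placed early versus late in the block:
print(expected_accepted_length([0.9, 0.5, 0.5]))
print(expected_accepted_length([0.5, 0.5, 0.9]))
```

Because every later position is gated by the earlier ones, accuracy at the start of a block is worth more than accuracy at the end, which motivates weighting positions non-uniformly during drafter training.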

2606.11599 2026-06-11 cs.CL cs.LG New submission

When is Your LLM Steerable?

Chenrui Fan, Yize Cheng, Ming Li, Soheil Feizi, Tianyi Zhou

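Affiliations: University of Maryland, College Park; MBZUAI, UAE

AI summary: Proposes predicting whether activation steering will succeed from the model's internal states early in generation, and uses the predictor to guide the search over steering strength, reducing decoding cost.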

English abstract

Activation steering offers a lightweight approach to control language models' behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed including 1.4M steered generations, spanning 150 concepts with each steering success/failure labeled. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering's effects propagate along layers and token positions, which provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering strength searching, achieving near-optimal performance with a small fraction of decoding cost.
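
One plausible way a three-way predictor could guide strength search is bisection: raise the strength after an "under" verdict and lower it after an "over" verdict. The sketch below assumes the verdict is monotonic in strength and uses a toy predictor in place of the trained classifier; the abstract does not describe the authors' actual search procedure, and all names here are hypothetical.

```python
def search_strength(predict, low=0.0, high=16.0, max_steps=20):
    """Bisection over steering strength, guided by an early-state predictor.

    predict(strength) must return "under", "success", or "over".
    """
    for _ in range(max_steps):
        mid = (low + high) / 2
        verdict = predict(mid)
        if verdict == "success":
            return mid
        if verdict == "under":
            low = mid    # too weak: search the upper half
        else:
            high = mid   # too strong: search the lower half
    return None          # no acceptable strength found within the step budget

# Toy predictor: steering succeeds only for strengths between 5 and 7.
def toy_predictor(strength):
    if strength < 5:
        return "under"
    if strength > 7:
        return "over"
    return "success"

print(search_strength(toy_predictor))
```

Each probe would need only the first few decoded tokens rather than a full rollout, which is where the decoding savings mentioned in the abstract would come from.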

2606.11643 2026-06-11 cs.CL New submission

Improving Cross-Format Robustness in Language Models with Multi-Format Training

June M. Liu, Shaomian Zheng, He Cao, Dingnan Jin, Qing Cui, Jun Zhou

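Affiliations: Ant Group; International Digital Economy Academy (IDEA)

AI summary: Proposes FormatMix, which expands part of the training data into multiple equivalent formats, substantially improving LLM consistency across answer formats; expanding only about 30% of the data comes close to full-format training.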

English abstract

Large language models often remain sensitive to answer format: a question solved correctly in one form may fail in another semantically equivalent form. To study this gap, we define cross-format robustness as the extent to which a model answers the same underlying question consistently across formats. We then compare full-format training with FormatMix, which expands only a subset of training items into multiple equivalent formats using either random or targeted selection. Across GLM4 and Llama-3.1, multi-format supervision consistently improves both task performance and cross-format robustness, whereas multiple-choice question (MCQ)-only supervision brings little benefit and can even reduce robustness. We further find that expanding only about 30% of the training set into multiple formats often recovers most of the gain from full-format training, and this effect appears across the model families and sizes we study. These results suggest that format diversity, rather than additional supervision alone, is the key driver of robustness. Lightweight multi-format augmentation is thus a practical way to make LLMs less sensitive to answer format without changing the base model.
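
The random-selection variant of this augmentation can be sketched in a few lines. The three formatters, the item schema, and the choice of an open-ended default format are assumptions for illustration; the abstract does not list the formats FormatMix uses, and targeted selection is not shown.

```python
import random

def as_open_ended(item):
    return {"prompt": item["question"], "target": item["answer"]}

def as_multiple_choice(item):
    options = sorted([item["answer"]] + item["distractors"])
    letters = "ABCD"
    listing = " ".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return {"prompt": f'{item["question"]} {listing}',
            "target": letters[options.index(item["answer"])]}

def as_true_false(item):
    return {"prompt": f'True or false: the answer to "{item["question"]}" is {item["answer"]}.',
            "target": "True"}

FORMATTERS = [as_open_ended, as_multiple_choice, as_true_false]

def format_mix(items, formatters=FORMATTERS, fraction=0.3, seed=0):
    """Expand a random `fraction` of items into every format; keep the rest in the first format."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(items)), round(len(items) * fraction)))
    mixed = []
    for index, item in enumerate(items):
        if index in chosen:
            mixed.extend(fmt(item) for fmt in formatters)
        else:
            mixed.append(formatters[0](item))
    return mixed

items = [{"question": f"Question {i}?", "answer": "yes", "distractors": ["no", "maybe"]}
         for i in range(10)]
print(len(format_mix(items)))
```

With 10 items, 3 formats, and a 30% fraction, the training set grows from 10 to 16 examples, compared with 30 for full-format expansion.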

2606.11712 2026-06-11 cs.CL cs.AI cs.LG New submission

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

Youwang Deng

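Affiliations: EpistemicaLab, Independent Research

AI summary: Proposes a diagnostic framework that decomposes LLM user-side memory into three orthogonal axes (behavioral consistency, factual presence, and factual absence), finds that parametric and retrieval memory are asymmetric across these axes, and finds that heavier RLHF tuning strengthens the asymmetry.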

Comments
Preprint. Code: this https URL

English abstract

User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes -- behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) -- and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user's history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence -- and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory's behavioral advantage collapses while its absence-calibration deficit against retrieval widens -- an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time {1..5} logit mask drives main_acc to >=0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.

2606.11806 2026-06-11 cs.CL New submission

External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs

Lin Sun, Heming Zhang, Xiangzheng Zhang

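Affiliations: Qiyuan Tech

AI summary: Studies the quality-cost trade-off of injecting external experience into production LLM systems; finds that selective retrieval beats global injection, that retrieval quality matters more than increasing Top-K, and that cost-benefit varies with task output length.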

English abstract

Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study external experience serving as a deployment-oriented quality-cost trade-off problem. We evaluate this question in a real production moderation setting, with tool-use and GPQA as supporting contrast tasks that expose different output-cost regimes. We compare no-experience baselines, random experience controls, global prompt injection, and retrieval-based selective injection, and analyze both task quality and serving cost. The results show that, once experience becomes case-dependent, selective retrieval provides a stronger operating point than unconditional global injection. They further show that retrieval quality matters more than simply increasing Top-K, and that the same serving policy can exhibit substantially different cost-benefit profiles across short-output and decode-heavy regimes. These findings suggest that external experience is best treated as a selective, cost-aware serving decision rather than as a universal add-on. Overall, in the settings studied here, external experience pays off only when both the serving interface and the task-specific cost structure make its quality gains worth the online cost.

2606.11898 2026-06-11 cs.CL cs.LG New submission

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang

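Affiliations: Peking University; National University of Singapore; University of California, Berkeley

AI summary: Proposes the GraspLLM framework, which fuses graph structural understanding with the semantic capabilities of LLMs and uses motif-aware contrastive learning and optimal contextual subgraph alignment to achieve zero-shot generalization across datasets and tasks.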

English abstract

Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce GraspLLM, a framework that combines Graph structural comprehension with the semantic understanding prowess of LLMs to enhance cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of the LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at this https URL.
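
As an example of what a motif-induced adjacency matrix can look like, the sketch below builds the triangle-motif matrix of an undirected graph, where each edge is weighted by the number of triangles it belongs to (the elementwise product of A·A with A). The triangle is chosen for illustration only; the abstract does not say which motifs GraspLLM uses.

```python
def triangle_motif_adjacency(adj):
    """Weight each existing edge (i, j) by the number of common neighbors of i and j."""
    n = len(adj)
    motif = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if adj[i][j]:
                motif[i][j] = sum(1 for k in range(n) if adj[i][k] and adj[k][j])
    return motif

# Undirected toy graph: a triangle 0-1-2 plus a pendant edge 2-3.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
for row in triangle_motif_adjacency(adjacency):
    print(row)
```

The pendant edge gets weight zero because it sits in no triangle, so the motif matrix keeps only the tightly connected part of the graph. Structure of this kind does not depend on node text, which fits the abstract's goal of dataset-agnostic structural information.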

2606.12191 2026-06-11 cs.CL cs.AI 新提交

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

面向大语言模型的智能体环境工程:环境建模、合成、评估与应用综述

Jiachun Li, Zhuoran Jin, Tianyi Men, Yupu Hao, Kejian Zhu, Lingshuai Wang, Dongqi Huang, Longxiang Wang, Shengjia Hua, Lu Wang, Jinshan Gao, Hongbang Yuan, Ruilin Xu, Kang Liu, Jun Zhao

AI总结 本文从环境工程生命周期出发,系统综述了智能体环境的建模、合成、评估与应用,涵盖八种属性与领域、两种合成范式、四种智能体演化路径及三种环境演化范式。

详情
Comments
63 pages, 10 figures
AI中文摘要

环境作为基于大语言模型(LLM)的智能体在不同场景下的交互系统,在推动模型能力持续演进中扮演关键角色。尽管重要性显著,现有工作缺乏系统分类与深入分析。本文从环境工程生命周期的视角系统研究了当前关于智能体环境的研究,涵盖其建模、合成、评估与应用。具体而言,本文首先从八个属性和八个领域引入代表性环境,详细分析其发展路径并突出核心能力。其次,针对自动化环境合成,介绍了两种范式,如符号合成和神经合成。本文还展示了每种范式下的不同环境评估方法。第三,从智能体-环境协同演化的角度讨论了相应的环境应用。具体来说,本文从四个互补视角描述了动态环境中智能体演化的主要路径:以记忆为中心的经验演化、以编排为中心的工作流演化、以轨迹为中心的离线演化和以探索为中心的在线演化。并识别了三种环境演化范式,即神经驱动、难度驱动和规模驱动方法。最后,讨论了几个有前景的未来方向,包括环境即服务、多智能体环境和神经符号环境。

英文摘要

Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and in-depth analysis. This paper systematically studies current research on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation, and application. Specifically, the paper first introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, two paradigms are introduced, namely symbolic synthesis and neural synthesis. The paper also presents the environment evaluation methods used in each paradigm. Third, the corresponding environment applications are discussed from the perspective of agent-environment co-evolution. Specifically, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. It also identifies three paradigms of environment evolution, namely neural-driven, difficulty-driven, and scaling-driven approaches. Finally, several promising future directions are discussed, including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

2606.12203 2026-06-11 cs.CL 新提交

Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models

自适应多分辨率程序性知识压缩用于大型语言模型

Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Runzhong Qiao, Xuancheng Li, Min Zhang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 提出SKIM框架,通过自适应多分辨率软令牌压缩程序性技能,在保持任务性能的同时将技能令牌长度压缩至30%-60%。

详情
AI中文摘要

大型语言模型(LLM)被广泛用于处理具有自主工作流的复杂任务。最近,可重用的自然语言技能作为一种流行的范式出现,用于向LLM应用程序注入程序性知识。由于流行的技能经常被重复调用,将它们的完整文本放在每个上下文中会显著增加预填充成本和延迟。虽然文本压缩技术有潜力解决这个问题,但大多数现有方法旨在压缩文档中的事实性知识而非程序性知识,这使得它们不足以用于技能压缩。在本文中,我们认为有效的技能压缩方法应该:1)保留工作流和工具协议之间的逻辑依赖关系;2)支持对频繁更新的社区技能进行轻量级、离线压缩;3)能够适应不同技能之间的复杂性变化。为了解决这个问题,我们提出了SKIM(SKIll coMpression),一个用于程序性技能的自适应多分辨率软令牌压缩框架。根据每个技能的复杂性,SKIM创建不同数量的软令牌,这不仅提高了LLM推理的效率,而且保留了技能使用的有效性。实验表明,SKIM将技能压缩到其原始令牌长度的30%到60%,同时比现有的压缩方法更好地保持了任务性能。我们的代码已发布于此 https URL。

英文摘要

Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm to inject procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents instead of procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To address this, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates different numbers of soft tokens that not only improve the efficiency of LLM inference, but also preserve the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods. We have released our code at this https URL.

2606.12234 2026-06-11 cs.CL 新提交

On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

论LLM条件控制中的效果-流畅性权衡:一项系统性研究

Iuri Macocco, Pau Rodríguez, Arno Blaas, Luca Zappella, Marco Baroni, Xavier Suau

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) Apple(苹果公司) ICREA(加泰罗尼亚高等研究院)

AI总结 系统研究LLM条件控制方法在注入和移除目标概念时的效果与流畅性权衡,发现高效引导方法常以牺牲流畅性为代价,且激活引导方法在指令调优模型上效果较差。

详情
Comments
8 pages, 2 figures
AI中文摘要

控制大型语言模型(LLM)的输出是其可靠部署的核心挑战,然而对所涉及权衡的清晰理解仍然难以捉摸。当前的条件控制方法通常在评估时狭隘地关注其注入或移除目标概念的有效性,而忽略了生成质量。我们系统性地研究了注入和移除场景中的一系列条件控制方法。我们发现,高效的引导方法通常以流畅性的大幅损失为代价来实现条件控制。此外,我们识别出一个关键但先前被忽视的与训练范式的交互:激活引导方法在指令调优模型上的效果远不如在基础模型上。另一方面,简单的提示和全面的监督微调是概念注入的可行选择,但在概念移除方面效果不佳。最后,廉价计算的文本指标与昂贵的LLM作为评判者的评分高度相关,并为条件控制方法的行为提供了见解。

英文摘要

Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the trade-offs involved remains elusive. Current approaches to conditioning are often evaluated with a narrow focus on their effectiveness at injecting or removing a target concept, neglecting generation quality. We systematically investigate a range of conditioning methods in both injection and removal scenarios. We find that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full-fledged supervised fine-tuning, on the other hand, are viable options for concept injection, but are not as good at concept removal. Finally, cheaply computed textual metrics correlate strongly with costly LLM-as-judge scores and provide insight into the behavior of conditioning methods.

2606.12243 2026-06-11 cs.CL cs.AI 新提交

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

VIA-SD:通过模型内路由进行推测解码的验证

Yuchen Xian, Yang He, Yunqiu Xu, Yi Yang

AI总结 提出VIA-SD多级验证框架,利用从完整验证器派生的精简验证器处理中等置信度令牌,减少大模型调用,在多个任务上实现10-20%加速。

详情
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

推测解码(SD)通过让轻量级草稿模型生成候选,由大型验证器并行验证,解决了LLM的高推理成本问题。现有的草稿-验证方法使用二元决策:接受或完全重新计算。然而,我们发现许多被拒绝的令牌可以通过从完整验证器通过模型内路由派生的精简子模型正确验证,而不是完整验证器。这促使我们使用精简验证器来处理需要中等验证资源的令牌,减少昂贵的大模型调用。我们提出了VIA-SD(通过模型内路由进行推测解码的验证),一种使用路由精简验证器的多级框架。草稿令牌分层处理:高置信度情况直接接受,中等置信度情况由精简验证器重新生成,不确定情况由完整模型验证。在四个代表性任务和多个模型家族中,VIA-SD将拒绝率降低了0.10-0.22,并在强SD基线基础上实现了10-20%的加速,同时相比非草稿解码实现了2.5-3倍的加速。此外,VIA-SD与现有SD框架兼容,无需修改其训练过程。我们的结果表明,多级SD是一种可扩展且高效的LLM推理通用范式。项目页面:此https URL

英文摘要

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: this https URL
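
The three-tier handling of draft tokens described above can be sketched as a simple routing rule. This is a minimal illustration, not the VIA-SD implementation: the confidence score, both thresholds, and the tier names are hypothetical, since the abstract does not state how confidence is measured or where the cut-offs lie.

```python
from collections import Counter


def route_draft_token(confidence, accept_threshold=0.9, slim_threshold=0.5):
    """Pick a verification tier for one draft token (thresholds are hypothetical)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= accept_threshold:
        return "accept"          # high confidence: keep the draft token as is
    if confidence >= slim_threshold:
        return "slim_verifier"   # medium confidence: regenerate with the slim sub-model
    return "full_verifier"       # uncertain: fall back to the full model


# Hypothetical drafter confidences for five consecutive draft tokens.
draft_confidences = [0.97, 0.91, 0.62, 0.55, 0.20]
tier_counts = Counter(route_draft_token(c) for c in draft_confidences)
print(tier_counts)
```

In this toy run only one of the five tokens reaches the full verifier, which is the kind of saving in large-model calls the abstract describes.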

2606.12373 2026-06-11 cs.CL 新提交

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

可验证环境是乐高积木:递归组合实现推理泛化

Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu

AI总结 提出RACES框架,将可验证环境视为可递归组合的构建块,通过定义四种组合算子自动生成复合环境,在六个未见基准上平均提升DeepSeek-R1-Distill-Qwen-14B 3.1分,且仅用50个基础环境即可达到300个环境的性能。

详情
AI中文摘要

基于可验证环境的强化学习已成为增强大语言模型推理能力的有效方法。虽然先前研究表明扩展环境数量可提升强化学习性能,但现有手动或单独构建方法受限于线性扩展瓶颈,阻碍了可扩展的推理泛化。本文提出RACES(递归自动组合环境扩展)框架,将可验证环境视为可递归组装的可组合构建块。关键洞察是:当一个环境的余域(输出类型)与另一个环境的定义域(输入类型)匹配时,它们可以自动融合为新的可验证环境,从而实现递归组合。RACES使用300个独立环境实现,并定义了四种组合算子(SEQUENTIAL、PARALLEL、SORT和SELECT),诱导出多样化的推理模式。大量实验表明,在这些复合环境上进行强化学习训练持续提升了推理泛化能力。具体而言,RACES在六个未见基准上平均提升DeepSeek-R1-Distill-Qwen-14B 3.1分(从48.2到51.3),并将Qwen3-14B的性能从58.8提升至61.1。此外,RACES仅使用50个基础环境即可达到与使用300个独立环境训练相当的性能,展现了显著的环境利用效率。

英文摘要

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks, which are unseen during the construction of training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.
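
The type-matching rule behind recursive composition can be illustrated with a small sketch. It covers only a SEQUENTIAL-style operator; the `Environment` class, its fields, and the three toy environments are assumptions made for illustration and are not taken from the RACES codebase.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Environment:
    """A verifiable task: a reference solver plus input and output types."""
    name: str
    domain: type            # input type
    codomain: type          # output type
    solve: Callable[[Any], Any]

    def verify(self, task_input, answer):
        # An answer is correct when it matches the reference solution.
        return answer == self.solve(task_input)


def sequential(first, second):
    """Fuse two environments when first's output type is second's input type."""
    if first.codomain is not second.domain:
        raise TypeError(f"cannot compose {first.name} with {second.name}")
    return Environment(
        name=f"{first.name}>>{second.name}",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


sort_list = Environment("sort", list, list, sorted)
sum_list = Environment("sum", list, int, sum)
is_even = Environment("is_even", int, bool, lambda n: n % 2 == 0)

# The composite is itself an Environment, so it can be composed again.
pipeline = sequential(sequential(sort_list, sum_list), is_even)
print(pipeline.name, pipeline.solve([3, 1, 2]))
```

Because the result of `sequential` is again an `Environment`, three base environments already yield several valid composites, which is the intuition behind covering the performance of many hand-built environments with fewer base ones.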

2606.11201 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

干预还是不干预:通过概率模型混合指导推理时对齐

Jin Gan, Xin Li, Jun Luo

发表机构 * College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院)

AI总结 提出BlendIn框架,通过质量感知对齐和按可靠性加权混合模型知识,解决推理时对齐中指导有效性差异大的问题,在困难模型对上实现最高50%的性能提升。

详情
Comments
Accepted by ACL 2026
AI中文摘要

LLM的广泛部署使得模型对齐成为必要,以确保新训练的模型能够安全有效地响应用户指令。在不同方法中,推理时对齐通常更便宜,因为它仅在输出生成期间进行干预(即提供指导)。现有提案从某些对齐模型中提取指导,但没有适当评估其可靠性。然而,我们的系统评估显示,指导有效性在不同模型间差异很大;由于无效指导会导致进一步混乱和更多干预,由此产生的过度干预通常表明性能较差。为了使干预更有效且更高效,我们引入了BlendIn,一个推理时对齐框架,从二元决策转向创建整合两个模型知识的混合分布。BlendIn通过执行质量感知对齐并根据可靠性按比例加权每个模型的贡献来稳定推理时对齐。与现有工作相比,它保留了有益的指导,同时降低了不可靠建议的权重。BlendIn为未对齐的指导提供了诊断信号和缓解策略,在困难模型对上实现了一致且高达50%的性能提升。我们的代码可在以下网址获取:this https URL。

英文摘要

The wide deployment of LLMs has made model alignment necessary to ensure that newly trained models respond safely and effectively to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes (i.e., offers guidance) only during output generation. Existing proposals apply guidance extracted from certain aligned models without properly assessing its reliability. However, our systematic evaluation reveals that guidance effectiveness varies drastically across models; since ineffective guidance leads to further confusion and thus further interventions, the resulting excessive interventions typically indicate poor performance. To make interventions more effective and thus more efficient, we introduce BlendIn, an inference-time alignment framework that shifts from binary decisions to creating hybrid distributions integrating both models' knowledge. BlendIn stabilizes inference-time alignment by performing quality-aware alignment and proportionally weighting each model's contribution based on reliability. Compared with existing work, it preserves beneficial guidance while downweighting unreliable suggestions. BlendIn provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent performance improvements of up to 50% on challenging model pairs. Our code is available at: this https URL.
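
One way to read "hybrid distributions" weighted by reliability is a convex mixture of the two models' next-token distributions. The sketch below assumes exactly that; the linear form, the function name, and the reliability value are illustrative assumptions, and the abstract does not say how BlendIn estimates reliability.

```python
def blend_distributions(base_probs, guide_probs, reliability):
    """Mix two next-token distributions; `reliability` in [0, 1] weights the guide.

    Assumption: a linear mixture. reliability=0 ignores the guide entirely,
    reliability=1 follows it fully (the binary extremes).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    vocab = set(base_probs) | set(guide_probs)
    return {
        tok: (1.0 - reliability) * base_probs.get(tok, 0.0)
        + reliability * guide_probs.get(tok, 0.0)
        for tok in vocab
    }


base = {"yes": 0.6, "no": 0.3, "maybe": 0.1}   # unaligned model
guide = {"yes": 0.1, "no": 0.8, "maybe": 0.1}  # aligned guide model
blended = blend_distributions(base, guide, reliability=0.25)
print(sorted(blended.items()))
```

A low reliability weight keeps the blended distribution close to the base model, which matches the stated goal of downweighting unreliable suggestions instead of discarding or fully adopting them.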

2606.11585 2026-06-11 cs.LG cs.CL nlin.AO 交叉投稿

Kuramoto Attention: Synchronizing Self-Attention on the Torus

Kuramoto注意力:在环面上同步自注意力

Joshua Nunley

发表机构 * Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Cognitive Science Program, Indiana University Bloomington(印第安纳大学伯明顿分校信息学系,卢迪信息学、计算与工程学院,认知科学项目)

AI总结 提出Kuramoto注意力层,将隐藏坐标视为角度,通过门控余弦相似度和环形均值更新实现自注意力,等价于Kuramoto耦合项,在字符级语言建模中达到与强基线相近的性能。

详情
Comments
13 pages, 2 figures, 3 tables
AI中文摘要

我们引入了Kuramoto注意力,一种自注意力层,其中每个隐藏坐标是一个角度。该层通过门控余弦相似度对令牌进行评分,关注先前的相位状态,并通过注意力加权的环形均值的切线分量更新每个令牌。由于值是原始相位状态,该更新恰好是Kuramoto耦合项$\sum_u A_{t,u}\sin(\theta_u-\theta_t)$,其中注意力矩阵充当自适应、内容相关的耦合核。等价地,门控分数是环面上的学习度量,用于选择哪些令牌耦合,更新将每个令牌拉向其选择的令牌的环形均值,从而收紧它们的相位一致性。相同的两个成分,即不变相似度分数和流形上的均值,定义了任何紧致群上的此类层;环面是阿贝尔情形,两者都有闭式解。softmax权重解决了一个熵正则化的相位检索问题,旋转位置编码作为分数中与位置相关的相位漂移进入。在enwiki8字符级语言建模中,该层作为功能语言模型训练,其每字符比特数接近强匹配的RoPE+SwiGLU Transformer:在100万参数时相差0.02 BPC(1.637±0.010对比1.616±0.004),在500万参数时中位数持平(五个种子下1.448对比1.452),Transformer在均值上领先(1.468对比1.456)。这些实验表明,受约束的几何结构在此规模下是可行的语言模型;结构本身及其同步解释是贡献。消融实验隔离了承重组件,结果给出了自注意力和相位同步之间的紧凑桥梁。

英文摘要

We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters ($1.637\pm0.010$ versus $1.616\pm0.004$) and level on the median at five million ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.
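
The coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$ quoted above can be evaluated directly. The sketch below uses one angle per token and a hand-written causal attention matrix; the step size and the fixed attention weights are assumptions, and the learned gated cosine score that would produce those weights is not modeled.

```python
import math


def kuramoto_update(thetas, attention, step=1.0):
    """Move each token's phase along sum_u A[t][u] * sin(theta_u - theta_t).

    `thetas` holds one angle per token; `attention` is a row-stochastic,
    causal matrix (row t only weights tokens u <= t). `step` is hypothetical.
    """
    updated = []
    for t, theta_t in enumerate(thetas):
        coupling = sum(
            attention[t][u] * math.sin(thetas[u] - theta_t)
            for u in range(len(thetas))
        )
        updated.append((theta_t + step * coupling) % (2.0 * math.pi))
    return updated


thetas = [0.0, 1.0]
attention = [
    [1.0, 0.0],  # token 0 can only attend to itself
    [0.5, 0.5],  # token 1 splits attention between token 0 and itself
]
new_thetas = kuramoto_update(thetas, attention)
print(new_thetas)
```

Token 0 has nothing to couple to and stays put, while token 1 is pulled toward token 0, so the phase gap shrinks. That is the synchronization reading of attention the abstract describes.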

2606.11709 2026-06-11 cs.LG cs.CL 交叉投稿

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

RLCSD: 基于对比策略自蒸馏的强化学习

Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen

发表机构 * Tsinghua University(清华大学) Tongyi Lab, Alibaba Group(阿里巴巴集团通义实验室) Peking University(北京大学)

AI总结 针对策略自蒸馏中特权诱导的风格漂移问题,提出RLCSD方法,通过对比正确与错误提示下的师生差距来抑制风格偏移,提升推理模型在数学和逻辑推理任务上的性能。

详情
Comments
20 pages, 9 figures, 9 tables
AI中文摘要

策略自蒸馏(OPSD)通过将模型自身的分布与在特权上下文(通常是已验证的解决方案)下产生的分布对齐,为推理模型提供密集的令牌级监督。然而,我们表明从这种分布差距中提取的学习信号集中在风格令牌而非任务承载令牌上,因为提示模型倾向于产生更直接、更短的输出。我们将这种病理现象称为“特权诱导的风格漂移”,它会破坏训练稳定性或导致响应长度缩短。为了解决这个问题,我们提出RLCSD(基于对比策略自蒸馏的强化学习),通过对比正确提示下的师生差距与错误提示下的师生差距来缓解这种漂移,抑制以提示为条件时无论提示正确与否都容易诱发的风格转变,并产生更集中于任务承载令牌的信号。在Qwen3(1.7B/4B/8B)和Olmo-3-7B-Think上的数学和逻辑推理实验表明,RLCSD始终优于GRPO和先前的OPSD方法。我们进一步表明,对比原则是通用的:它可以嵌入现有的OPSD方法中以改进这些方法,并且其潜在见解可扩展到更广泛的跨模型策略蒸馏设置。

英文摘要

On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink. To address this, we propose RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.
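
The contrast between the two teacher-student gaps can be made concrete with per-token log-probabilities. This sketch assumes the contrast is a plain difference of log-probability gaps, which the abstract does not confirm; the function name and all numbers are hypothetical.

```python
def contrastive_signal(student_logp, correct_hint_logp, wrong_hint_logp):
    """Per-token signal: gap under a correct hint minus gap under a wrong hint.

    Each argument is a list of log-probabilities for the same sampled tokens.
    Assumption: "contrasting" means subtraction of the two gaps.
    """
    signal = []
    for s, c, w in zip(student_logp, correct_hint_logp, wrong_hint_logp):
        gap_correct = c - s   # teacher-student gap with the verified solution
        gap_wrong = w - s     # teacher-student gap with a wrong solution
        signal.append(gap_correct - gap_wrong)
    return signal


# Token 0 is a style token: any hint raises its probability equally.
# Token 1 is task-bearing: only the correct hint raises its probability.
student = [-2.0, -2.0]
with_correct_hint = [-0.5, -0.5]
with_wrong_hint = [-0.5, -3.0]
signal = contrastive_signal(student, with_correct_hint, with_wrong_hint)
print(signal)
```

The style token receives no signal because both hints shift it the same way, while the task-bearing token keeps a positive signal. This is the concentration effect the abstract attributes to the contrastive formulation.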

2606.11722 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

ICA Lens: Interpreting Language Models Without Training Another Dictionary

ICA Lens: 无需训练另一本词典即可解释语言模型

Sida Liu, Feijiang Han

发表机构 * Independent Researcher(独立研究员) University of Maryland(马里兰大学)

AI总结 提出ICALens,基于独立成分分析(ICA)高效提取语言模型表示中可解释方向,无需训练稀疏自编码器,在SAEBench上表现竞争力。

详情
Comments
Ongoing Project
AI中文摘要

在语言模型表示中找到可解释方向对于理解和控制模型行为至关重要。稀疏自编码器(SAE)已成为此目的的标准工具,但将其作为默认的第一透镜通常需要训练、存储和评估大型过完备字典。这一瓶颈限制了快速探索,并提出了一个基本问题:在训练另一个神经字典之前,从激活几何中已经可以看到多少可解释结构?我们的直觉很简单:许多可解释方向对令牌具有选择性,这些方向看起来比随机方向更不服从高斯分布。因此,我们重新审视独立成分分析(ICA),这是一种寻找非高斯方向的经典方法,作为语言模型可解释性的紧凑透镜。我们发现ICA在LLM可解释性中被低估了,因为先前的使用通常依赖于现成的ICA实现,这些实现在LLM激活上不稳定,并且缺乏用于检查和评估恢复方向的系统工具。为弥补这些差距,我们引入了ICALens,这是第一个用于LLM表示的稳定、高效和可审计ICA分析的实用工作流。它结合了优化的GPU并行FastICA流水线、LLM特定的稳定性配方和更好的拟合诊断,实现了高效可靠的逐层分析。在GPT-2 Small、Gemma 2 2B和Qwen 3.5 2B Base上,ICALens高效地恢复了紧凑、人类可解释的方向,无需逐层基于梯度的字典训练。在SAEBench上,ICA在稀疏探测中与公共SAE竞争,并在中小预算下的目标探测扰动中优于它们。这些结果表明,ICA不应被视为弱基线,而应被视为探索语言模型表示的高效且互补的第一透镜。

英文摘要

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
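
The intuition that token-selective directions look less Gaussian than random ones can be checked with excess kurtosis, a classical non-Gaussianity measure. The sketch below is only an illustration of that intuition on synthetic activations. It is not the ICALens pipeline, and the 2% firing rate is an arbitrary assumption.

```python
import random


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; close to 0 for Gaussian data."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    fourth_moment = sum((v - mean) ** 4 for v in values) / n
    return fourth_moment / variance ** 2 - 3.0


random.seed(0)
n_tokens = 20000
# Projection onto a random direction: dense, roughly Gaussian activations.
dense = [random.gauss(0.0, 1.0) for _ in range(n_tokens)]
# Projection onto a selective direction: fires on about 2% of tokens only.
selective = [
    random.gauss(0.0, 1.0) if random.random() < 0.02 else 0.0
    for _ in range(n_tokens)
]
dense_kurtosis = excess_kurtosis(dense)
selective_kurtosis = excess_kurtosis(selective)
print(dense_kurtosis, selective_kurtosis)
```

The selective direction has a much heavier-tailed distribution than the dense one, which is the property ICA-style methods search for when they maximize non-Gaussianity.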

2606.11854 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

使用ART微调多模态大语言模型:基于艺术的强化训练

Michal Chudoba, Sergey Alyaev, Petra Galuscakova, Tomasz Wiktorski

发表机构 * University of Stavanger(斯塔万格大学) NORCE Research(NORCE研究机构)

AI总结 提出ART方法,通过优化原始视觉输入将信息注入冻结的多模态大语言模型,实现软提示微调,无需修改计算图,在数学和工具使用基准上达到与LoRA相当的精度。

详情
AI中文摘要

大语言模型有两种主要的参数高效微调技术。低秩适应在LLM层之间引入额外权重,而软提示则向LLM输入引入额外的微调特定原始token。然而,两者都需要修改预编译、预优化LLM的计算图。因此,两者在vLLM等高吞吐引擎中均未得到完全支持。我们提出使用ART(基于艺术的强化训练)进行微调。该方法通过仅优化冻结的多模态大语言模型的原始视觉输入来注入信息,从而在预编译计算图上实现软token方法。它依赖于将梯度反向传播到普通像素阵列,因此支持任何微调目标。此外,优化的视觉输入可以风格化为与任务相关的计算艺术品。该方法在流行的开源Qwen架构的不同规模以及多个文本基准上的有效性得到确认。具体而言,ART在数学和结构化工具使用基准上达到了与LoRA竞争的精度。

英文摘要

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.

2606.11893 2026-06-11 cs.LG cs.AI cs.CL q-bio.NC 交叉投稿

Beyond representational alignment with brain-guided language models for robust reasoning

超越表征对齐:基于大脑引导的语言模型实现稳健推理

Mingqing Xiao, Kai Du, Zhouchen Lin

发表机构 * State Key Lab of General AI, School of Intelligence Science and Technology, Peking University(北京大学通用人工智能国家重点实验室、智能科学与技术学院) Department of Psychological and Cognitive Sciences, Tsinghua University(清华大学心理与认知科学系) Microsoft Research Asia(微软亚洲研究院)

AI总结 研究通过fMRI信号增强大型语言模型推理能力,提出脑引导框架,在10个模型上实现最高13%的准确率提升。

详情
AI中文摘要

大型语言模型(LLMs)与人类高阶认知背后的神经机制之间的对应关系仍未得到充分表征。鉴于人脑中语言和推理似乎是可分离的,一个开放的问题是LLMs是否与来自推理相关区域的神经信号对齐,以及这些信号是否能够改进它们。在此,我们聚焦于演绎推理,表明LLM内部表征不仅与任务fMRI活动部分对齐,而且可以直接通过这些信号增强。使用神经预测性度量,我们发现LLMs在聚合水平上解释了推理相关区域中可解释方差的很大一部分,而在特定推理类型内的预测性较低,表明对齐和分歧并存。基于此,我们提出一个脑引导框架:我们沿着由模型和大脑表征的联合结构诱导的方向引导模型表征,在推理时进行干预,在训练时进行微调。我们证明任务诱发的脑信号可以直接增强LLM推理,在10个LLM(1.5B-72B)上产生与仅语言监督正交的增益,具有跨推理类型的迁移,以及高达13%的绝对准确率提升。我们的结果将LLM-大脑对应关系从相关性推进到引导,建立了一条由脑信号驱动的路径,通向更稳健和认知对齐的AI。

英文摘要

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

2606.12138 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

不稳定特征,可复现子空间:理解稀疏自编码器中的种子依赖性

Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov

发表机构 * T-Tech

AI总结 研究稀疏自编码器特征的可复现性,发现稳定特征承载主要信号,不稳定特征集中于可复现的低秩子空间,反映基歧义而非纯噪声。

详情
AI中文摘要

稀疏自编码器(SAE)被广泛用于解释神经网络表示,但其效用取决于学习到的特征是否在不同训练运行间可复现。我们通过“特征稳定性”研究这一问题:对于每个SAE特征,我们估计相似特征在独立训练的SAE中再次出现的概率。这产生了一个可扩展的每特征信号,将稳定特征与不稳定特征区分开来。在一项跨种子、模型、层、字典大小和SAE变体的大规模研究中,我们发现显著的功能不对称性:稳定特征承载了大部分重建和预测相关信号,而不稳定特征的边际影响较弱,并且在激活统计和自动解释中主要由低频表面形式触发主导。在几何上,不稳定特征个体不可复现,但集中在可复现的低秩子空间中,这表明种子依赖性通常反映了共享激活空间区域内的基歧义,而非纯噪声。一个受控的合成模型使这一机制明确,表明低秩真实特征可以在子空间级别被恢复,而作为个体SAE潜在变量跨种子仍不可识别。最后,通过汇集独特的跨种子特征,我们构建了更稳定的SAE,同时在此设置中保留了解释方差。这些结果共同表明,不稳定特征不仅仅是失败或噪声潜在变量:它们个体功能影响较弱,但反映了标准SAE跨种子不同解析的可复现低维结构。

英文摘要

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
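
A per-feature stability estimate of the kind described above can be sketched as the fraction of other training seeds that contain a matching feature. The matching rule used here (best cosine similarity between decoder directions above a fixed threshold) and the threshold value are assumptions; the abstract does not specify how similarity is defined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def feature_stability(feature, other_dictionaries, threshold=0.8):
    """Fraction of other seeds that contain a feature similar to `feature`.

    `other_dictionaries` holds one list of feature directions per seed.
    The 0.8 threshold is hypothetical.
    """
    hits = sum(
        1
        for dictionary in other_dictionaries
        if any(cosine(feature, other) >= threshold for other in dictionary)
    )
    return hits / len(other_dictionaries)


feature = [1.0, 0.0]
seed_b = [[0.99, 0.10], [0.0, 1.0]]   # contains a near-copy of the feature
seed_c = [[0.5, 0.5], [0.0, 1.0]]     # only a rotated mixture is present
stability = feature_stability(feature, [seed_b, seed_c])
print(stability)
```

In seed C the direction survives only as a mixture inside the same two-dimensional subspace, so it fails the per-feature match. This mirrors the paper's point that individually unstable features can still lie in a reproducible subspace.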

2606.12370 2026-06-11 cs.LG cs.CL 交叉投稿

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

打破熵界:通过带拒绝采样的多令牌预测加速强化学习训练

Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou

发表机构 * Qwen Team, Alibaba Inc(阿里巴巴集团 Qwen 团队)

AI总结 针对强化学习训练中多令牌预测接受率因熵波动而下降的问题,提出Bebop方法,采用概率拒绝采样和端到端TV损失优化,实现高达95%接受率和1.8倍加速。

详情
AI中文摘要

强化学习(RL)已成为现代大型语言模型的关键组成部分,但展开阶段仍是RL训练流程中的主要瓶颈。尽管多令牌预测(MTP)通过推测解码提供了一种自然的加速方案,但许多研究观察到MTP接受率在RL训练期间显著下降,导致加速效果有限。为解决这一瓶颈,我们提出Bebop,对LLM后训练中的MTP进行系统研究,并提供将MTP集成到大规模RL流水线中的实用方案。首先,我们揭示MTP接受率根本上受模型熵波动的限制,其与RL阶段熵的上升呈现清晰的负线性关系。其次,我们证明与贪婪草稿采样相比,概率拒绝采样在很大程度上减轻了RL中熵引入的干扰。我们进一步发现,传统的MTP训练目标(交叉熵或KL)在此类设置中次优,因此我们提出一种新颖的端到端TV损失,直接优化多步拒绝采样接受率,带来约10%的接受率提升,在数学推理、代码生成和智能体任务中实现高达95%的接受率和高达25%的额外推理吞吐量增益。第三,我们测试了RL期间的各种在线MTP训练策略,并表明使用端到端TV损失和拒绝采样的预RL MTP训练在整个RL过程中保持一致的接受率和加速,消除了昂贵的在线MTP更新需求。我们提供了大量实验和分析来验证我们的发现。实验结果表明,我们的方法在Qwen3.5、Qwen3.6和Qwen3.7模型的异步RL训练中实现了高达1.8倍的端到端加速。

英文摘要

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
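
The link between rejection sampling and a total-variation (TV) objective can be seen in the single-token case. Under standard speculative sampling, a draft token x drawn from q is accepted with probability min(1, p(x)/q(x)), so the expected acceptance rate is the sum of min(p, q) over the vocabulary, which equals 1 minus the TV distance. The sketch below checks this identity on toy distributions; it does not reproduce the paper's multi-step end-to-end loss.

```python
def total_variation(p, q):
    """TV distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def expected_acceptance(p, q):
    """Expected acceptance of one draft token under rejection sampling.

    A token x drawn from the draft q is accepted with probability
    min(1, p[x] / q[x]); averaging over q gives sum_x min(p[x], q[x]).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


target = [0.7, 0.2, 0.1]   # toy target-model distribution p
draft = [0.5, 0.3, 0.2]    # toy MTP draft distribution q
acceptance = expected_acceptance(target, draft)
tv = total_variation(target, draft)
print(acceptance, 1.0 - tv)
```

Because acceptance equals 1 minus TV, reducing the TV distance between draft and target raises the acceptance rate directly, which gives one motivation for preferring a TV-based loss over cross-entropy or KL in this setting.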

2606.12397 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

重新设计混合专家模型的路由器:基于流形幂迭代

Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Large Language Model Department, Tencent(腾讯大型语言模型部门)

AI总结 提出将路由器行与专家矩阵主奇异方向对齐,并基于流形幂迭代(MPI)重新设计路由器,通过“幂迭代-收缩”范式实现对齐,理论证明收敛性,实验验证1B至11B参数规模下模型效果提升。

详情
Comments
Preprint
AI中文摘要

路由器是混合专家模型的核心组件。作为专家代理,路由器矩阵的行计算与MoE输入的相似度,以确定激活哪些专家子集。理想情况下,每个路由器行被设计为将专家矩阵编码到该代表性向量中,使得其与token的点积能更好地反映token-专家亲和性。然而,目前没有设计原则来强制这种压缩。在本文中,我们提出将每个路由器行与相关专家的主奇异方向对齐,因为该方向提供了矩阵最具表现力的数学描述。基于这一原则,我们提出了一种基于流形幂迭代(MPI)的路由器重新设计。具体来说,它引入了一种“幂迭代-收缩”范式,其中对路由器权重执行幂迭代步骤,然后进行收缩以施加范数约束,确保效率和稳定性。理论上,我们证明MPI驱动路由器行收敛到相关专家的主奇异方向。实验上,我们在1B到11B参数规模的MoE模型上进行预训练,证实这种对齐有助于更有效的MoE模型。

英文摘要

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row should condense its expert matrix into a representative vector, such that its dot product with a token better reflects token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models across scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
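
A minimal sketch of the "Power-then-Retract" idea, under stated assumptions: the expert is represented by a single matrix W (rows are outputs, columns are inputs), the power step multiplies the router row by W^T W, and the retraction is plain renormalization onto the unit sphere. The abstract does not specify the exact update or how it interleaves with training, so the helper names and the choice of retraction are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def power_then_retract(expert, router_row, steps=1):
    """Run `steps` rounds of: power step on W^T W, then unit-norm retraction."""
    expert_t = transpose(expert)
    for _ in range(steps):
        # Power step: amplifies the component along the top right singular vector.
        router_row = matvec(expert_t, matvec(expert, router_row))
        # Retraction (assumed here to be renormalization to unit length).
        norm = math.sqrt(sum(x * x for x in router_row))
        router_row = [x / norm for x in router_row]
    return router_row


# Toy expert with singular values 3 and 1; its principal direction is [1, 0].
toy_expert = [[3.0, 0.0], [0.0, 1.0]]
aligned_row = power_then_retract(toy_expert, [1.0, 1.0], steps=50)
```

With repeated steps the row converges to the principal right singular direction, which lives in the same space as the token representations the router scores.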

2410.12327 2026-06-11 cs.CL 版本更新

Neuron-based Personality Trait Induction in Large Language Models

基于神经元的大语言模型人格特质诱导

Jia Deng, Tianyi Tang, Yanbin Yin, Wenhao Yang, Wayne Xin Zhao, Ji-Rong Wen

AI总结 提出基于神经元的大语言模型人格特质诱导方法,通过构建PersonalityBench数据集、识别人格相关神经元并调整其值,实现无需训练的参数级控制,效果媲美微调模型。

详情
Comments
25 pages. Published at ICLR 2025
AI中文摘要

大型语言模型(LLMs)在模拟各种人格特质方面变得越来越熟练,这是支持相关应用(例如角色扮演)的重要能力。为了进一步提高这种能力,在本文中,我们提出了一种基于神经元的方法,用于LLMs的人格特质诱导,包含三个主要技术贡献。首先,我们构建了PersonalityBench,一个用于识别和评估LLMs人格特质的大规模数据集。该数据集基于心理学中的大五人格特质,旨在评估LLMs对特定人格特质的生成能力。其次,通过利用PersonalityBench,我们提出了一种高效的方法,通过检查给定特质的对立面来识别LLMs中与人格相关的神经元。第三,我们开发了一种简单而有效的诱导方法,通过操纵这些已识别的人格相关神经元的值。该方法无需训练和修改模型参数即可实现对LLMs所表现特质的细粒度控制。大量实验验证了我们神经元识别和特质诱导方法的有效性。值得注意的是,我们的方法实现了与微调模型相当的性能,为LLMs的人格特质诱导提供了更高效、更灵活的解决方案。我们在以下网址提供所有提到的资源:此 https URL。

英文摘要

Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for supporting related applications (e.g., role-playing). To further improve this capacity, in this paper, we present a neuron-based approach for personality trait induction in LLMs, with three major technical contributions. First, we construct PersonalityBench, a large-scale dataset for identifying and evaluating personality traits in LLMs. This dataset is grounded in the Big Five personality traits from psychology and is designed to assess the generative capabilities of LLMs towards specific personality traits. Second, by leveraging PersonalityBench, we propose an efficient method for identifying personality-related neurons within LLMs by examining the opposite aspects of a given trait. Third, we develop a simple yet effective induction method that manipulates the values of these identified personality-related neurons. This method enables fine-grained control over the traits exhibited by LLMs without training or modifying model parameters. Extensive experiments validate the efficacy of our neuron identification and trait induction methods. Notably, our approach achieves performance comparable to that of fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs. We provide access to all the mentioned resources at this https URL.

2509.23982 2026-06-11 cs.CL cs.AI cs.CY cs.LG cs.NE 版本更新

Toward Preference-aligned Large Language Models via Residual-based Model Steering

基于残差模型引导的偏好对齐大型语言模型

Lucio La Cava, Andrea Tagarelli

AI总结 提出PaLRS方法,利用残差流中的偏好信号提取轻量级引导向量,无需训练即可在推理时对齐模型偏好,在数学推理和代码生成任务上取得一致提升,同时节省大量时间。

详情
Comments
Accepted at IJCAI 2026
AI中文摘要

偏好对齐是使大型语言模型(LLMs)有用且与(人类)偏好一致的关键步骤。现有方法如基于人类反馈的强化学习或直接偏好优化通常需要精心策划的数据和对数十亿参数进行昂贵的优化,最终导致持久性的任务特定模型。在这项工作中,我们引入了基于残差引导的LLM偏好对齐(PaLRS),这是一种无需训练的方法,利用LLM残差流中编码的偏好信号。从仅一百个偏好对中,PaLRS提取出轻量级、即插即用的引导向量,可在推理时应用以将模型推向偏好行为。我们在各种中小型开源LLM上评估了PaLRS,显示PaLRS对齐的模型在数学推理和代码生成基准上取得了一致的提升,同时保持了基线通用性能。此外,与使用DPO和SimPO对齐的模型相比,它们表现更好且节省大量时间。我们的发现强调,PaLRS为标准偏好优化流程提供了一种有效、更高效且灵活的替代方案,提供了一种无需训练、即插即用的对齐机制,且数据需求极少。

英文摘要

Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to models aligned with DPO and SimPO, they perform better while requiring substantially less time. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, providing a training-free, plug-and-play mechanism for alignment with minimal data.
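
The abstract does not say how the steering vectors are extracted. One common recipe for residual-stream steering, used here purely as an assumption, is the difference between the mean activation on preferred responses and the mean activation on dispreferred ones, added to the hidden state at inference with a scale factor. The names `steering_vector`, `apply_steering`, and `alpha` are hypothetical and are not taken from PaLRS.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(preferred_acts, dispreferred_acts):
    """Difference of means between preferred and dispreferred activations."""
    mean_pref = mean_vector(preferred_acts)
    mean_disp = mean_vector(dispreferred_acts)
    return [p - d for p, d in zip(mean_pref, mean_disp)]


def apply_steering(hidden_state, vector, alpha=1.0):
    """Shift a residual-stream activation along the steering vector."""
    return [h + alpha * v for h, v in zip(hidden_state, vector)]


# Toy 2-dimensional "activations" for two preference pairs (illustrative only).
preferred = [[1.0, 2.0], [3.0, 4.0]]
dispreferred = [[0.0, 1.0], [2.0, 1.0]]

vector = steering_vector(preferred, dispreferred)
steered = apply_steering([0.0, 0.0], vector, alpha=0.5)
```

Because the vector is computed once and only added at inference, no model weights change, which matches the training-free, plug-and-play framing in the abstract.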

2602.00945 2026-06-11 cs.CL cs.AI 版本更新

Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs

Neural FOXP2——面向大型语言模型目标语言改进的语言特定神经元引导

Anusa Saha, Tanmay Joshi, Vinija Jain, Aman Chadha, Amitava Das

AI总结 提出Neural FOXP2方法,通过定位语言神经元、计算引导方向和施加稀疏激活偏移,将模型默认语言从英语切换为印地语或西班牙语,实现可控的语言主导性。

详情
AI中文摘要

LLMs通过训练成为多语言模型,但其通用语言通常是英语,反映了英语在预训练中的主导地位。其他语言保留在参数记忆中,但被系统性抑制。我们认为语言默认性由稀疏、低秩的控制电路(语言神经元)支配,可以机械地隔离并安全引导。我们引入Neural FOXP2,通过引导语言特定神经元,使模型以选定语言(印地语或西班牙语)为主。Neural FOXP2分三个阶段进行:(i) 定位:我们训练每层的SAE,使每个激活分解为一小组活跃特征组件。对于每个特征,我们量化英语与印地语/西班牙语的选择性,基于整体logit质量向目标语言令牌集的提升。将排名靠前的特征追溯回其最强贡献单元,得到紧凑的语言神经元集。(ii) 引导方向:我们通过谱低秩分析定位可控的语言转换几何。对于每层,我们构建英语到目标激活差异矩阵,并执行逐层SVD以提取主导语言变化的奇异方向。特征间隙和有效秩谱识别出紧凑的引导子空间和经验选择的干预窗口(这些方向最强且最稳定)。(iii) 引导:我们对语言神经元应用有符号的稀疏激活偏移。具体地,在低到中层,我们沿目标语言主导方向添加正向引导,并对英语神经元在零空间施加补偿性负偏移,实现可控的目标语言默认性。

英文摘要

LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, which makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity based on the overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English-to-target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add a positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.

2602.03141 2026-06-11 cs.CL 版本更新

Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

短链深思:通过拆分-合并优化平衡推理效率与段内能力

Runquan Gui, Jie Wang, Zhihai Wang, Chi Ma, Jianye Hao, Feng Wu

AI总结 提出CoSMo框架,通过拆分-合并算法动态优化推理链,结合段级预算的结构对齐强化学习,在保持准确率的同时显著减少冗余段,平均提升准确率3.3点并减少28.7%段使用。

详情
Comments
Camera-ready version
AI中文摘要

尽管大型推理模型(LRMs)通过生成长推理链在解决复杂任务方面展示了令人印象深刻的能力,但这种依赖冗长生成的方式导致了显著的延迟和计算开销。为了解决这些挑战,我们提出了CoSMo(Consistency-Guided Split-Merge Optimization),一个旨在消除结构冗余而非不加区分地限制令牌数量的框架。具体来说,CoSMo利用一种拆分-合并算法,通过合并冗余段和拆分逻辑缺口来动态优化推理链,以确保连贯性。然后,我们采用结构对齐的强化学习,配合一种新颖的段级预算,在整个训练过程中监督模型保持高效的推理结构。跨多个基准和骨干网络的广泛实验表明,CoSMo实现了优越的性能,与推理效率基线相比,平均准确率提高了3.3个百分点,同时段使用量减少了28.7%。

英文摘要

While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose CoSMo (Consistency-Guided Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by 3.3 points while reducing segment usage by 28.7% on average compared to reasoning efficiency baselines.

2602.09591 2026-06-11 cs.CL cs.AI cs.LG 版本更新

On the Optimal Reasoning Length for RL-Trained Language Models

关于RL训练的语言模型的最优推理长度

Daisuke Nohara, Taishi Nakamura, Rio Yokota

AI总结 研究强化学习训练的语言模型中推理长度与准确率的非单调关系,发现存在最优中间长度,并通过模式准确率分析揭示其成因。

详情
Comments
18 pages, 12 figures
AI中文摘要

强化学习显著提高了大型语言模型的推理能力,但也倾向于延长思维链输出并增加计算成本。尽管已经提出了长度控制方法,但它们所引发的长度-准确率关系仍不清楚。我们在受控设置下,在多个基础模型上使用几种长度控制方法训练策略,发现在数学推理和代码生成中,准确率随输出长度呈非单调变化,在中间值达到峰值。然而,即使在样本准确率趋于平稳或下降的情况下,模式准确率仍随长度持续提高,这表明非单调的长度-准确率关系是由围绕越来越正确的中心的分散性驱动的。

英文摘要

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost. Although length-control methods have been proposed, the length-accuracy relationship they induce remains unclear. We train policies with several length-control methods on multiple base models in a controlled setup and find that, across both mathematical reasoning and code generation, accuracy is non-monotonic in output length, peaking at an intermediate value. Mode accuracy, however, continues to improve with length even in settings where sample accuracy plateaus or declines, indicating that the non-monotonic length-accuracy relationship is driven by dispersion around an increasingly correct center.
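
The distinction between sample accuracy and mode accuracy can be made concrete with a small sketch. It assumes that mode accuracy means "the most frequent sampled answer is correct" and that sample accuracy is the fraction of individual samples that are correct. The abstract does not give formal definitions, so both helper functions and the toy answers are illustrative assumptions.

```python
from collections import Counter


def sample_accuracy(answers, gold):
    """Fraction of individual sampled answers that match the gold answer."""
    return sum(answer == gold for answer in answers) / len(answers)


def mode_is_correct(answers, gold):
    """Whether the most frequent sampled answer matches the gold answer."""
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer == gold


# Six hypothetical samples for one problem whose gold answer is "42".
sampled_answers = ["42", "42", "41", "40", "42", "39"]

per_sample = sample_accuracy(sampled_answers, "42")
mode_ok = mode_is_correct(sampled_answers, "42")
```

In this toy case only half of the samples are correct, yet the mode is correct. That is the pattern the abstract describes as dispersion around an increasingly correct center.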

2605.12288 2026-06-11 cs.CL cs.AI 版本更新

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

TokenRatio: 通过比率匹配实现原理化的token级偏好优化

Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa Doan, Trung Le

AI总结 本文提出TBPO方法,通过比率匹配恢复token级偏好最优性,改进对齐质量和训练稳定性,并增加输出多样性。

详情
AI中文摘要

直接偏好优化(DPO)是一种广泛使用的无强化学习方法,用于对齐语言模型,但其在完整序列上建模偏好,尽管生成过程由逐token决策驱动。现有token级扩展通常将序列级Bradley-Terry目标分解到时间步,使前缀(状态级)最优性隐含。我们研究如何仅使用标准序列级成对比较恢复token级偏好最优性。我们引入token级Bregman偏好优化(TBPO),提出一个基于前缀的token级Bradley-Terry偏好模型,推导出Bregman散度密度比率匹配目标,该目标扩展了logistic/DPO损失,同时保持由token级模型诱导的最佳策略,并维持DPO-like的简洁性。我们引入两个实例:TBPO-Q,显式学习轻量级状态基线;TBPO-A,通过优势归一化移除基线。在指令跟随、有用性/无害性以及摘要基准上,TBPO相比强序列级和token级基线提高了对齐质量和训练稳定性,并增加了输出多样性。

英文摘要

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
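
For reference, the sequence-level logistic loss that TBPO is said to generalize can be sketched as below. This is the standard DPO objective for one preference pair, not the token-level Bregman objective introduced in the paper. The argument names and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given full-sequence log-probabilities.

    loss = -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy that raises the chosen response and lowers the rejected one.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Note that the loss sees only whole-sequence log-probabilities, which is the limitation the abstract points to: per-prefix optimality is never expressed explicitly.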

2505.15201 2026-06-11 cs.LG cs.AI cs.CL stat.ML 版本更新

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Pass@K 策略优化:解决更困难的强化学习问题

Christian Walder, Deep Karkhanis

AI总结 提出 Pass-at-k 策略优化 (PKPO),通过变换奖励直接优化 pass@k 性能,利用低方差无偏估计器,在训练中退火 k 可同时提升 pass@1 和 pass@k,解决更难问题。

详情
AI中文摘要

强化学习算法对每个问题采样多个 n>1 的解决方案尝试并独立奖励它们。这优化了 pass@1 性能,优先考虑孤立样本的强度,而牺牲了样本集的多样性和集体效用。这未充分利用采样能力,限制了探索和在更难示例上的最终改进。作为修复,我们提出 Pass-at-k 策略优化 (PKPO),一种对最终奖励的变换,导致直接优化 pass@k 性能,从而优化联合考虑时最大化奖励的样本集。我们的贡献是推导出 pass@k 及其梯度在二元和连续奖励设置中的新型低方差无偏估计器。我们展示了使用我们的估计器进行优化简化为标准强化学习,其中奖励经过稳定高效的变换函数联合变换。虽然先前的工作仅限于 k=n,但我们是第一个能够对任意 k ≤ n 实现 pass@k 鲁棒优化的。此外,我们的方法不是以 pass@1 性能换取 pass@k 增益,而是允许在训练中退火 k,同时优化两个指标,通常能在显著 pass@k 增益的同时获得强大的 pass@1 数值。我们在玩具实验上验证了我们的奖励变换,揭示了我们的公式的方差减少特性。我们还使用开源 LLM GEMMA-2 包含了真实世界的例子。我们发现我们的变换有效地优化了目标 k。此外,更高的 k 值能够解决更多和更难的问题,而退火 k 则同时提升了 pass@1 和 pass@k。关键的是,在传统 pass@1 优化停滞的具有挑战性的任务集上,我们的 pass@k 方法解锁了学习,这可能是由于通过优先考虑联合效用而非单个样本的效用实现了更好的探索。

英文摘要

Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k = n, ours is the first to enable robust optimization of pass@k for arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
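
As background for the binary-reward case, the widely used combinatorial estimator of pass@k from n samples with c correct ones is 1 - C(n-c, k) / C(n, k), which is unbiased for any k <= n. The sketch below implements only that estimator. The paper's gradient estimators, its continuous-reward variants, and its reward transformation are not reproduced, and the function name is an assumption.

```python
from math import comb


def pass_at_k(num_samples, num_correct, k):
    """Unbiased estimate of pass@k from n samples with c correct (binary rewards).

    Equals one minus the probability that a uniformly chosen size-k subset
    of the n samples contains no correct sample.
    """
    if not 0 < k <= num_samples:
        raise ValueError("k must satisfy 0 < k <= num_samples")
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie in [0, num_samples]")
    # comb returns 0 when k exceeds the number of incorrect samples.
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# Four attempts, one of them correct (toy numbers).
estimate_k1 = pass_at_k(4, 1, 1)
estimate_k2 = pass_at_k(4, 1, 2)
estimate_k4 = pass_at_k(4, 1, 4)
```

The estimate grows with k for a fixed set of samples, which illustrates why optimizing pass@k rewards a set for containing at least one success rather than rewarding every sample independently.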

2512.22088 2026-06-11 cs.LG cs.AI cs.CL 版本更新

Unifying Learning Dynamics and Generalization in Transformers Scaling Law

统一Transformer缩放定律中的学习动力学与泛化

Chiwun Yang

AI总结 本文通过将Transformer学习动力学形式化为ODE系统并近似为核行为,严格分析了随机梯度下降训练下的泛化误差,揭示了计算资源缩放时泛化误差的指数衰减与幂律衰减的两阶段相变,并建立了紧的上下界。

详情
Comments
87 pages, 10 figures, 3 tables
AI中文摘要

缩放定律是大语言模型(LLM)发展的基石,预测了模型性能随计算资源增加而提升。然而,尽管经验上得到验证,其理论基础仍不清晰。本文形式化了基于Transformer的语言模型的学习动力学为一个常微分方程(ODE)系统,然后将该过程近似为核行为。与之前的玩具模型分析不同,我们严格分析了在序列到序列数据上具有任意数据分布的多层Transformer的随机梯度下降(SGD)训练,紧密反映了真实世界条件。我们的分析刻画了随着计算资源随数据缩放时,泛化误差收敛到不可约风险的过程,特别是在优化过程中。我们建立了过剩风险的匹配上下界,其特征是明显的相变。在初始优化阶段,过剩风险相对于计算成本${\sf C}$呈指数衰减。然而,一旦超过特定的资源分配阈值,系统进入统计阶段,泛化误差遵循$\Theta(\mathsf{C}^{-1/7})$的幂律衰减。这些速率通过互补的下界得到证实——统计方面通过信息论的两点约简,优化方面通过一阶预言机论证——使得两阶段定律在常数、对数因子和条件数差距内是紧的。除了这个统一框架,我们的理论还推导了模型大小、训练时间和数据集大小的独立缩放定律,阐明了每个变量如何独立地控制泛化的边界。

英文摘要

The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process by kernel behavior. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish matching upper and lower bounds on the excess risk, characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost $\mathsf{C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/7})$. These rates are certified by complementary lower bounds -- statistical, via an information-theoretic two-point reduction, and optimization-side, via a first-order oracle argument -- rendering the two-stage law tight up to constants, logarithmic factors, and a condition-number gap. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the bounds of generalization.

2601.23278 2026-06-11 cs.LG cs.AR cs.CL 版本更新

FOCUS: DLLMs Know How to Tame Their Compute Bound

FOCUS: DLLMs 知道如何驯服它们的计算瓶颈

Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini

AI总结 针对扩散大语言模型解码中大部分计算浪费在不可解码令牌上的问题,提出 FOCUS 推理系统,通过动态聚焦可解码令牌并驱逐不可解码令牌,提升有效批大小,实现高达 3.52 倍的吞吐量提升。

详情
Comments
ICML 2026 camera-ready version
AI中文摘要

扩散大语言模型(DLLMs)为自回归模型提供了一种引人注目的替代方案,但其部署受到高解码成本的制约。在这项工作中,我们识别出 DLLM 解码中的一个关键低效问题:虽然计算在令牌块上并行化,但每个扩散步骤中只有一小部分令牌是可解码的,导致大部分计算浪费在不可解码的令牌上。我们进一步观察到注意力导出的令牌重要性与逐令牌解码概率之间存在强相关性。基于这一洞察,我们提出了 FOCUS,一个专为 DLLMs 设计的推理系统。通过动态地将计算聚焦于可解码令牌并实时驱逐不可解码令牌,FOCUS 增加了有效批大小,缓解了计算限制并实现了可扩展的吞吐量。实验评估表明,在大批量设置下,FOCUS 相比生产级引擎 LMDeploy 实现了高达 3.52 倍的吞吐量提升,同时在多个基准测试中保持或提升了生成质量。

英文摘要

Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52x throughput improvement over the production-grade engine LMDeploy in large-batch settings, while preserving or improving generation quality across multiple benchmarks.

2602.02726 2026-06-11 cs.LG cs.CL 版本更新

Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

向量量化潜在概念:聚类式概念发现的可扩展替代方案

Xuemin Yu, Ankur Garg, Samira Ebrahimi Kahou, Hassan Sajjad

AI总结 提出VQLC框架,通过向量量化学习离散潜在概念,在保持可解释性的同时,实现与K-Means相当的计算效率,并优于层次聚类在大规模数据上的扩展性。

详情
AI中文摘要

大型语言模型(LLMs)在其隐藏状态中编码了丰富的语义信息,但理解这些内部表示捕获了哪些信息仍然困难。从隐藏状态中提取的潜在概念为解释LLMs提供了有希望的方向,但现有的基于聚类的方法面临权衡:层次聚类产生连贯的概念,但由于其二次内存成本而仅限于小数据集,而K-Means高效扩展但可能产生语义连贯性较差的概念。我们提出向量量化潜在概念(VQLC),一种离散概念学习框架,在冻结的隐藏状态上学习潜在概念的码本。在12个数据集-模型设置中,VQLC在计算成本上接近K-Means,扩展性优于层次聚类,并在忠实度上保持竞争力,在仅解码器模型上增益最明显。基于LLMs的评估、定性分析和稀疏自编码器(SAE)比较表明,学习到的概念是可解释且任务相关的。

英文摘要

Large language models (LLMs) encode rich semantic information in their hidden states, yet it remains difficult to understand what information these internal representations capture. Latent concepts extracted from hidden states offer a promising direction for interpreting LLMs, but existing clustering-based methods face a trade-off: hierarchical clustering produces coherent concepts but is limited to small datasets due to its quadratic memory cost, while K-Means scales efficiently but may yield less semantically coherent concepts. We propose Vector Quantized Latent Concept (VQLC), a discrete concept learning framework that learns a codebook of latent concepts on frozen hidden states. Across 12 dataset-model settings, VQLC stays close to K-Means in computational cost, scales better than hierarchical clustering, and remains competitive in faithfulness, with the clearest gains on decoder-only models. LLM-based evaluation, qualitative analysis, and a Sparse Autoencoder (SAE) comparison demonstrate that the learned concepts are interpretable and task-relevant.
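
The core lookup in any vector-quantization scheme is assigning each hidden state to its nearest codebook entry, and that entry's index then acts as the discrete concept ID. The sketch below shows only this assignment step with squared Euclidean distance. How VQLC trains the codebook is not described in the abstract, so the function names and the toy codebook are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(hidden_state, codebook):
    """Index of the codebook vector closest to the hidden state."""
    return min(
        range(len(codebook)),
        key=lambda index: squared_distance(hidden_state, codebook[index]),
    )


def quantize(hidden_states, codebook):
    """Map each frozen hidden state to a discrete concept index."""
    return [nearest_code(state, codebook) for state in hidden_states]


# Toy codebook with three "concepts" in a 2-dimensional space.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
concept_ids = quantize([[0.9, 1.2], [3.5, 3.0], [0.1, -0.2]], toy_codebook)
```

Each assignment costs time linear in the codebook size and needs no pairwise distance matrix over the dataset, which is consistent with the abstract's contrast against the quadratic memory cost of hierarchical clustering.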

2605.04893 2026-06-11 cs.LG cs.CL stat.ML 版本更新

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

自注意力作为传输:对称谱诊断的极限

Dominik Dahlem, Diego Maniloff, Mac Misiura

AI总结 研究语言模型注意力路由的两种失效形状(过度集中或过度分散),证明对称谱诊断对方向不敏感,并揭示因果注意力中传输容量的理论下限,提出基于容量和方向的双轴诊断方法。

详情
Comments
48 pages, 6 figures, 7 tables; 81-page online supplement (proofs, additional experiments, dataset statistics) as an ancillary file
AI中文摘要

当语言模型处理幻觉响应时,其注意力路由往往以两种形状之一失效:过度集中在狭窄的位置集合上,或者分散得如此广泛以至于相关性被稀释,而失效的形状携带诊断信号。我们研究这些形状作为诊断特征,从在基准标记响应的强制评分下计算的注意力矩阵中得出,而不是在实时生成期间。一类广泛使用的谱方法分析度归一化注意力算子的对称分量,该算子控制传输容量;我们证明该算子的每个转置不变谱诊断在结构上是方向盲的(它无法区分算子与其转置,因此无法检测信息流方向),并且盲定理的逆定理将任何Lipschitz诊断的转置敏感性限制为不对称系数$G$。将其与规范因果架构的闭式二分-Cheeger景观配对,我们证明均匀因果注意力满足一个与$n$无关的下界$\phi \ge 1/5$,而窗口注意力以$O(w/n)$穿透下界;失效模式在形状上不同,而不仅仅在数值上不同。这个下界是一个理想化架构的基准,而不是经验吸引子:穿透它的真实注意力头的比例本身就是一个架构特征。由此产生的双轴诊断($\phi$表示容量,$G$表示方向)产生一个可证伪的极性预测:瓶颈主导和分散主导的基准应表现出相反的极性。在长度控制评估下,传输特征在测试的仅解码器、仅编码器和编码器-解码器模型中保持可解释的信号(0.62-0.84 LC-AUROC),极性在HaluEval和MedHallu之间如预测般反转。

英文摘要

When a language model processes a hallucinated response, its attention routing tends to fail in one of two shapes: over-concentrating on a narrow set of positions, or spreading so diffusely that relevance is diluted, and the shape of the failure carries diagnostic signal. We study these shapes as a diagnostic characterization, computed from attention matrices under forced scoring of benchmark-labeled responses rather than during live generation. A widely used family of spectral methods analyzes the symmetric component of the degree-normalized attention operator, which governs transport capacity; we prove that every transpose-invariant spectral diagnostic of this operator is structurally orientation-blind (it cannot distinguish an operator from its transpose, and therefore cannot detect information-flow direction), with a converse to the blindness theorem bounding any Lipschitz diagnostic's transpose sensitivity by the asymmetry coefficient $G$. Pairing this with a closed-form bipartite-Cheeger landscape for canonical causal architectures, we show that uniform causal attention satisfies an $n$-independent floor $\phi \ge 1/5$, while window attention pierces the floor as $O(w/n)$; failure modes are shape-different, not just value-different. This floor is an idealized-architecture benchmark, not an empirical attractor: the fraction of real attention heads that pierce it is itself an architectural signature. The resulting two-axis diagnostic ($\phi$ for capacity, $G$ for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity. Under length-controlled evaluation, transport features retain interpretable signal (0.62-0.84 LC-AUROC) across the tested decoder-only, encoder-only, and encoder-decoder models, with polarity reversing as predicted between HaluEval and MedHallu.

2606.10820 2026-06-11 cs.LG cs.AI cs.CL 版本更新

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

K-Forcing:通过前推语言建模进行联合下一K词解码

Zhiwei Tang, Yuanyu He, Yizheng Han, Wangbo Zhao, Jiasheng Tang, Fan Wang, Bohan Zhuang

发表机构 * DAMO Academy, Alibaba Group(阿里巴巴达摩院) Hupan Lab(湖畔实验室) Zhejiang University(浙江大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出K-Forcing范式,通过前推映射将自回归模型蒸馏为单次前向传播生成多个未来词,实现2.4-3.5倍加速,质量损失小。

详情
Comments
Code: this https URL
AI中文摘要

自回归语言建模是文本生成的主导范式,但其逐词顺序解码使得推理受限于内存且效率低下。现有的加速方法(如推测解码和扩散语言模型)在特定条件下可提升速度,但并未直接解决高负载批量服务——这一对工业级部署最为关键的场景。我们提出K-Forcing,一种用于联合下一k词解码的前推语言建模范式。K-Forcing将现有自回归模型蒸馏为条件前推映射——该映射在单次前向传播中将独立均匀噪声变量转换为多个未来词的联合样本。该设计保留了固定长度输出,复用了自回归教师模型的主干,并与标准自回归服务基础设施兼容。我们通过渐进式自强迫蒸馏训练该映射,逐步扩展预测窗口,同时使学生模型紧密匹配自回归教师模型的序列分布。我们在LM1B和OpenWebText上使用标准因果Transformer主干评估K-Forcing。当激进配置为每次前向传播生成k=4个词时,K-Forcing在不同批量大小下实现约2.4-3.5倍加速,同时相对于自回归教师模型仅带来轻微的质量下降。随着推理在现代LLM的生命周期计算成本中占据主导地位,K-Forcing为在现实高负载部署下加速自回归生成提供了一条有前景的途径。

英文摘要

Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving--the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping--one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.
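
The phrase "push-forward mapping" can be illustrated in its simplest form: a deterministic function that turns uniform noise into a sample from a categorical distribution through the inverse CDF. In K-Forcing the mapping is a learned network that emits several tokens jointly in one forward pass. The sketch below is only the one-token, known-probabilities analogue, and the function name is an assumption.

```python
def push_forward_token(uniform_noise, token_probs):
    """Deterministically map u in [0, 1) to a token id via the inverse CDF.

    If u is drawn uniformly, the returned id is distributed according to
    `token_probs`, so all randomness lives in the noise variable.
    """
    cumulative = 0.0
    for token_id, prob in enumerate(token_probs):
        cumulative += prob
        if uniform_noise < cumulative:
            return token_id
    # Guard against floating-point round-off in the final cumulative sum.
    return len(token_probs) - 1


# Toy next-token distribution over a three-token vocabulary.
toy_probs = [0.2, 0.5, 0.3]
tokens = [push_forward_token(u, toy_probs) for u in (0.1, 0.25, 0.95)]
```

Because the output is a deterministic function of the noise, a model of this kind can take one independent noise variable per future position and produce a fixed-length block of tokens without a sequential sampling loop.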

2606.07537 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

从架构到输出:大语言模型中幻觉的结构性起源及数据的放大作用

Md. Rejaul Korim Sadi, Toufiqur Rahman Tasin, Golam Mostofa Naeem

AI总结 本文分析大语言模型幻觉的结构性根源,指出自注意力、最大似然估计训练目标和自回归解码三个架构决策构成复合失效系统,并揭示数据病理如何放大这些脆弱性。

详情
Comments
11 pages, 7 figures, 15 references
AI中文摘要

大语言模型会产生幻觉——生成流畅、自信但事实错误的输出——这种一致性跨越代际和规模。现有分类法按输出类型对幻觉进行分类,区分内在与外在失败以及忠实性与事实性偏差。这些框架在描述上严谨,但未能识别产生特定实例的内部机制。本文将幻觉分析为三个架构决策的结构性后果,这些决策共同构成一个复合失效系统。自注意力的共现学习用统计邻近性替代语义含义,导致实体混淆、事实错误归因和语义漂移。最大似然估计训练目标在无事实约束下优化下一个词元概率,奖励统计上合理的输出,无论其真值如何。自回归解码在暴露偏差下的永久从左到右承诺确保单个错误词元级联向前传递整个输出序列而无法修正。数据集病理——长尾缺陷、训练偏差和合成污染——放大了这些脆弱性,但并非独立导致它们。我们做出三项贡献。首先,我们将每个机制映射到Alansari和Luqman分类法中的特定输出类别,将内在幻觉定位于自注意力,外在幻觉定位于MLE,逻辑不一致定位于自回归解码。其次,我们表明每个常被引用的数据集病理利用这些机制之一,而非独立产生幻觉。第三,我们识别出仅基于输出类型分类的诊断局限性,并将其与推理层缓解方法进行对比。

英文摘要

Large language models hallucinate--producing fluent, confident, factually wrong outputs--with a consistency that persists across generations and scales. Existing taxonomies classify hallucination by output type, distinguishing intrinsic from extrinsic failures and faithfulness from factuality divergence. These frameworks are descriptively rigorous but do not identify which internal mechanism produced a given instance. This paper analyses hallucination as a structural consequence of three architectural decisions that together form a compound failure system. Self-attention's co-occurrence learning substitutes statistical proximity for semantic meaning and produces entity confusion, fact misattribution, and semantic drift. The maximum likelihood estimation training objective optimises next-token probability without factual constraint, rewarding statistically plausible outputs regardless of their truth value. Autoregressive decoding's permanent left-to-right commitment under exposure bias ensures that a single wrong token cascades forward through the entire output sequence without revision. Dataset pathologies--long-tail deficiencies, training bias, and synthetic pollution--amplify these vulnerabilities but do not independently cause them. We make three contributions. First, we map each mechanism to a specific output category in the Alansari and Luqman taxonomy, locating intrinsic hallucination in self-attention, extrinsic hallucination in MLE, and logical inconsistency in autoregressive decoding. Second, we show that each commonly cited dataset pathology exploits one of these mechanisms rather than originating hallucination independently. Third, we identify the diagnostic limitation of output-type-only classification and contrast it with inference-layer mitigation approaches.

2. 机器翻译与跨语言处理 3 篇

2606.11786 2026-06-11 cs.CL 新提交

Lius: Translation Model Based Instructional Lingustic Using Continual Instruction Tuning In Kupang Malay

Lius:基于持续指令调优的库邦马来语教学语言学翻译模型

Joanito Agili Lopo, Yunita Sari, Guntur Budi Herwanto

发表机构 * Universitas Gadjah Mada(加札马达大学)

AI总结 针对低资源语言库邦马来语,提出利用双语词典的词汇和语义特征设计指令,并采用持续指令调优(CIT)范式微调大语言模型,在多个指标上超越基线4-6分,优于NMT和多语言LLM 10-13分。

详情
Comments
This paper is the result of a master's thesis in the Master of Artificial Intelligence program at Universitas Gadjah Mada
AI中文摘要

大语言模型(LLM)为翻译任务提供了新的潜力,但在处理低资源语言时常常出现性能下降。为了解决这一限制,我们提出了一种针对低资源语言库邦马来语微调LLM的方法。我们的方法涉及利用双语词典的显式词汇和语义特征设计一组指令,并引入持续指令调优(CIT),一种支持基于迭代指令训练的训练范式。实验结果表明,我们名为Lius的模型在多个评估指标上比标准指令调优模型提高了4-6分,并超越了神经机器翻译(NMT)和多语言LLM模型10-13分。这些发现突显了我们的方法在减轻低资源语言翻译中对大规模并行数据依赖的潜力。

英文摘要

Large Language Models (LLMs) offer new potential for translation tasks but often experience performance degradation when handling low-resource languages. To address this limitation, we propose an approach for fine-tuning LLMs on a low-resource language, Kupang Malay. Our approach involves designing a set of instructions by leveraging explicit lexical and semantic features from a bilingual dictionary, and introducing Continual Instruction Tuning (CIT), a training paradigm that enables iterative instruction-based training. Experimental results demonstrate that our model, named Lius, outperforms standard instruction-tuned models by 4-6 points and surpasses both Neural Machine Translation (NMT) and multilingual LLM baselines by 10-13 points on several evaluation metrics. These findings highlight the potential of our approach to mitigate the reliance on large-scale parallel data in low-resource language translation.

2606.04694 2026-06-11 cs.CL 版本更新

DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

DuDi: 跨语言动词化的双信号蒸馏

Patomporn Payoungkhamdee, Tinnakit Udsa, Jian Gang Ngui, Sarana Nutanong, Alham Fikri Aji, Peerat Limkonchotiwat

AI总结 提出DuDi框架,通过结合序列级和词元级信号以及跨语言动词化器,提升小语言模型在多语言(尤其是东南亚语言)上的性能。

详情
AI中文摘要

小语言模型(SLM)高效且可扩展,但其多语言能力在十亿以下规模时严重下降,尤其是对于东南亚(SEA)语言。我们引入了DuDi,一个双信号多语言蒸馏框架,它结合了在线序列级信号与离策略和在线词元级信号。DuDi进一步使用跨语言动词化器来优化教师反馈并提高多语言设置下教师-学生的可迁移性。在SEA-HELM上跨多个模型家族、规模和教师-学生设置的实验表明,DuDi始终优于竞争性的蒸馏基线。消融和分析证实,序列级优化、词元级监督和跨语言动词化为多语言SLM提供了互补且可迁移的学习信号。

英文摘要

Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framework that combines an online sequence-level signal with off-policy and on-policy token-level signals. DuDi further uses a cross-lingual verbalizer to refine teacher feedback and improve teacher-student transferability in multilingual settings. Experiments on SEA-HELM across multiple model families, scales, and teacher-student settings show that DuDi consistently outperforms competitive distillation baselines. Ablations and analyses confirm that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable learning signals for multilingual SLMs.

2606.08011 2026-06-11 cs.CL cs.AI 版本更新

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

改写以翻译,翻译以奖励:机器翻译中源端改写的强化学习

Boxuan Lyu, Haiyue Song, Zhi Qu, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura

发表机构 * Institute of Science Tokyo(东京科学大学) Preferred Networks Inc(Preferred Networks 公司) Nara Institute of Science and Technology(奈良先端科学技术大学院大学)

AI总结 提出RLSR框架,通过强化学习训练源端改写模型,以翻译质量提升为奖励,无需为每个MT模型调提示,在6个MT模型和16个语言对上超越无改写和同规模提示基线,与235B LLM提示基线性能相当。

详情
AI中文摘要

尽管直接提示现成的大语言模型(LLM)生成保留意义的源端改写可以有效提升机器翻译(MT)质量,但这样做需要为不同的MT模型手动调整提示。在这项工作中,我们提出了RLSR(用于源端改写的强化学习),一种新颖的基于强化学习的框架,用于训练源端改写模型,而无需为每个MT模型调整提示。RLSR通过直接使用每个改写源端所带来的下游翻译质量的提升作为奖励来优化改写模型。跨六个MT模型和16个语言对的广泛实验表明,我们通过RLSR训练的4B改写模型显著优于无改写基线和现有的同规模基于提示的改写基线,同时与基于235B LLM的提示基线相比取得了具有竞争力的性能。

英文摘要

Rewriting source text with large language models (LLMs) before translation has been shown to improve machine translation (MT) quality. However, we find that prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs, such as 4B-parameter models, are used. We argue that this limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone: a rewrite is useful only if it improves downstream translation, yet existing prompt-based methods do not explicitly optimize for this signal. To address this issue, we propose RLSR (Reinforcement Learning for Source Rewriting), a reinforcement learning framework that trains the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite. Experiments across six MT systems and 16 language pairs show that our 4B RLSR-trained rewriting models significantly outperform both the no-rewriting baseline and prompt-based rewriting baselines at the same model scale, while remaining competitive with baselines that use a 235B LLM.
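
The reward signal described above (the translation-quality improvement a rewrite produces downstream) can be sketched as a simple difference of scores. This is a minimal illustration under stated assumptions, not the authors' implementation: `rewrite_reward`, the `translate` callable, and the toy `token_f1` metric are all hypothetical names, and the abstract does not say which quality metric the paper uses.

```python
from typing import Callable


def token_f1(hypothesis: str, reference: str) -> float:
    """Toy quality metric (assumption): F1 over unique lowercase tokens.

    It stands in for a real MT quality metric, which the abstract does not name.
    """
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def rewrite_reward(
    source: str,
    rewrite: str,
    reference: str,
    translate: Callable[[str], str],
    quality: Callable[[str, str], float] = token_f1,
) -> float:
    """Reward = quality of the translated rewrite minus quality of the translated source.

    A positive value means the rewrite helped the (hypothetical) MT system;
    a negative value means it hurt.
    """
    baseline = quality(translate(source), reference)
    rewritten = quality(translate(rewrite), reference)
    return rewritten - baseline
```

Because the reward is a difference against the no-rewrite baseline, a rewrite that leaves translation quality unchanged earns zero, and one that degrades it is penalised.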

3. Information Extraction, Retrieval, and Question Answering, 14 papers

2606.11198 2026-06-11 cs.CL cs.AI 新提交

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

结构注意力税:检索格式如何劫持上下文学习而与内容无关

Yuqi Zhang, Di Zhang

发表机构 * Xi’an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 研究发现知识图谱三元组因其格式结构比自然语言吸引2-3倍注意力,压缩演示注意力达42%,并提出了分解注意力为语义与结构成分的框架及缓解策略。

详情
Comments
10 pages, 5 figures
AI中文摘要

检索增强生成(RAG)系统注入外部知识以改进大语言模型输出,然而注入内容的格式——区别于其语义相关性——可以独立地扭曲模型的注意力分布。我们识别并形式化了一种称为结构注意力税的现象:知识图谱(KG)三元组,由于其关系分隔符和重复的槽位模式,每个token捕获的注意力是语义等价的自然语言文本的2-3倍($\hat{o}$(KG) ≈ 0.70 对比 $\hat{o}$(中性) ≈ 0.25),将演示注意力压缩高达42%——无论三元组是相关还是噪声。我们开发了一个形式化框架,将注意力分数分解为语义和结构成分(公式2),推导了一个压缩界(命题1),将token级别的格式偏差与演示注意力损失联系起来,并表明结构项控制着注意力被转移多少,而语义项控制着这是有益还是有害。这种解耦揭示了改进检索增强ICL的两个正交轴:优化检索质量(语义轴)和减少格式驱动的注意力捕获(结构轴)。实验上,在两个模型家族(Mistral-7B, LLaMA-3-8B)和三个QA基准上,我们观察到源任务对齐占主导地位:任务匹配的BM25检索在HotpotQA上达到58-62%,而ConceptNet为25-27%,超过30个百分点的差距远远超过所有门控策略(≤2个百分点)。我们从该框架推导出五种结构感知缓解策略,从零成本提示修改到训练时正则化;格式展平(S3)通过来自口头化三元组控制的准确性和注意力级证据得到验证,而结构分散(S1)产生了混合结果,揭示了格式级别干预的挑战。

英文摘要

Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distinct from its semantic relevance -- can independently distort the model's attention distribution. We identify and formalise a phenomenon we term the structural attention tax: knowledge graph (KG) triples, due to their relational delimiters and repeated slot patterns, capture 2-3x more attention per token than semantically equivalent natural-language text ($\hat{o}$(KG) $\approx$ 0.70 vs. $\hat{o}$(neutral) $\approx$ 0.25), compressing demonstration attention by up to 42% -- regardless of whether the triples are relevant or noise. We develop a formal framework decomposing attention scores into semantic and structural components (Eq. 2), derive a compression bound (Proposition 1) connecting token-level format bias to demonstration attention loss, and show that the structural term governs how much attention is diverted while the semantic term governs whether this helps or hurts. This decoupling reveals two orthogonal axes for improving retrieval-augmented ICL: optimising retrieval quality (semantic axis) and reducing format-driven attention capture (structural axis). Empirically, across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, we observe that source-task alignment dominates: task-matched BM25 retrieval achieves 58-62% on HotpotQA vs. ConceptNet's 25-27%, a >30 pp gap that dwarfs all gating strategies ($\leq$2 pp). We derive five structure-aware mitigation strategies from the framework, ranging from zero-cost prompt modifications to training-time regularisation; format flattening (S3) is validated by both accuracy and attention-level evidence from a verbalized-triple control, while structural dispersal (S1) yields mixed results that illuminate the challenges of format-level intervention.

2606.11350 2026-06-11 cs.CL cs.IR 新提交

When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

当更多文档损害RAG:利用领域限定、模型无关的检索缓解向量搜索稀释

Nabaraj Subedi, Ahmed Abdelaty, Shivanand Venkanna Sheshappanavar

发表机构 * Dept. of Electrical Engineering & Computer Science, University of Wyoming(怀俄明大学电气工程与计算机科学系) Dept. of Civil, Architectural Engineering & Construction Management, University of Wyoming(怀俄明大学土木、建筑工程与施工管理系)

AI总结 针对检索增强生成在异构文档集合中因向量搜索稀释导致性能下降的问题,提出基于组织元数据的领域限定方法MASDR-RAG,显著提升P@10至0.86,并揭示多智能体编排的精度-忠实度悖论。

详情
Comments
24 pages, 8 figures, 30 tables. Preprint under review
AI中文摘要

当检索增强生成扩展到大规模、异构的文档集合时,其性能会下降,因为密集相似性失去了区分能力,top-k检索越来越多地返回语义相似但上下文不正确的块。我们将这种失败模式称为向量搜索稀释。即使使用混合密集+稀疏检索,我们在部署的怀俄明州交通部语料库中直接观察到了这一点:当文档从54篇扩展到1128篇(88907个块)时,准确率从75%下降到40%以下。为了解决这种稀释问题,我们提出了MASDR-RAG(用于RAG的多智能体领域限定检索),并在200个专家验证的查询上进行了评估,涉及五个LLM骨干、六个语料库和两个索引栈。我们的结果表明,使用组织元数据进行领域限定是关键修复,显著将P@10从0.77提高到0.86(p < 0.05)。此外,我们对多智能体编排的研究揭示,高度配置依赖会导致我们所谓的精度-忠实度悖论。基于这些不同的结果,我们的实用建议很简单:先限定领域,然后执行一次合成调用,将完整的多智能体编排保留给真正多领域的语料库,并配合原生工具调用骨干。代码和数据将在接收后公开。

英文摘要

Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG (Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 ($p < 0.05$). Furthermore, our investigation of multi-agent orchestration revealed a high degree of configuration dependence, creating what we call the precision-faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and data will be made public upon acceptance.
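
The "scope first" recommendation and the P@10 metric can be illustrated with a small sketch. Everything here is an assumption made for illustration: the chunk dictionaries, the `domain` metadata field, and the function names are hypothetical and do not come from MASDR-RAG. The idea shown is only that filtering the candidate pool by metadata before similarity ranking keeps look-alike chunks from other domains out of the top-k.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunks, k, domain=None):
    """Top-k chunks by cosine similarity; if `domain` is given, rank only that domain's chunks."""
    pool = [c for c in chunks if domain is None or c["domain"] == domain]
    ranked = sorted(pool, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]


def precision_at_k(retrieved, relevant_ids, k):
    """Fraction of the first k retrieved chunks whose id is in the relevant set."""
    hits = sum(1 for c in retrieved[:k] if c["id"] in relevant_ids)
    return hits / k


# Hypothetical toy corpus: one chunk from another domain sits close to the query.
CHUNKS = [
    {"id": "bridge-1", "domain": "bridges", "vec": (1.0, 0.0)},
    {"id": "bridge-2", "domain": "bridges", "vec": (0.8, 0.6)},
    {"id": "paving-1", "domain": "paving", "vec": (0.95, 0.31)},
]
```

In the toy corpus, the `paving-1` chunk is close to the query vector but belongs to the wrong domain, which is the dilution effect in miniature; passing `domain="bridges"` removes it from the candidate pool before ranking.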

2606.11424 2026-06-11 cs.CL 新提交

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

SOMA-SQL: 通过合成日志和执行探测解决NL-to-SQL中的多源歧义

Sai Ashish Somayajula, Marianne Menglin Liu, Chuan Lei, Fjona Parllaku, Daniel Garcia, Rongguang Wang, Syed Fahad Allam Shah, Ankan Bansal, Sujeeth Bharadwaj, Tao Sheng, Sujith Ravi, Dan Roth

发表机构 * Oracle AI(甲骨文人工智能实验室)

AI总结 提出SOMA-SQL框架,通过合成查询日志和歧义驱动探测自动解决自然语言到SQL中的多源歧义,在6个基准上平均执行准确率提升13.0%。

详情
Comments
34 pages, 1 figure, 7 tables. Preprint
AI中文摘要

自然语言数据库接口旨在将用户问题转换为可执行的SQL,但在现实环境中,问题表述不明确且模式庞大且模糊时仍然脆弱。用户问题、数据库模式和模型解释之间的歧义是NL2SQL中的主要失败模式,导致意图不匹配、模式接地错误和SQL生成错误。现有方法依赖人工澄清或将歧义视为模式表示问题,但这些方法无法扩展也无法自主解决歧义。我们提出SOMA-SQL,通过目标合成查询日志和歧义驱动探测自动解决歧义。SOMA-SQL构建合成查询日志以接地模式解释并指导候选SQL生成;然后执行目标探测查询,由结构化歧义分类和候选不一致驱动,为最终SQL选择和修复生成消歧证据。这种主动的歧义发现和解决方法无需人工参与即可泛化到未见过的模式和查询分布。在六个公开基准上的实验表明,SOMA-SQL相比最先进的基线平均执行准确率提升13.0%,在歧义问题上提升高达16.7%。

英文摘要

Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguity across user questions, database schemas, and model interpretations is a central failure mode in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but they neither scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via targeted synthetic query logs and ambiguity-driven probing. SOMA-SQL constructs a synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without a human in the loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

2606.11609 2026-06-11 cs.CL 新提交

Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection

基于自适应工人分配的多智能体推理用于立场检测

Meysam Sabbaghan, Arman Zareian Jahromi, Doina Caragea

发表机构 * Kansas State University(堪萨斯州立大学)

AI总结 提出一种Manager-Worker多智能体框架,通过自适应分配工人智能体进行推理级合成,而非标签级投票,在隐式和上下文依赖的立场检测上显著提升性能。

详情
AI中文摘要

立场检测需要识别作者对目标的态度,通常来自简短文本,其中立场是隐含的、间接的或修辞性的。尽管大型语言模型(LLM)在此任务上表现强劲,但当多种解释可能成立时,单次提示可能脆弱。现有的聚合策略,如多数投票或自一致性,通过组合标签来提高鲁棒性,但丢弃了解决冲突解释所需的中间推理。我们提出了一种用于立场检测的自适应工人分配多智能体推理框架,将聚合从标签级投票转变为推理级合成。该框架采用Manager-Worker架构,其中Manager根据输入复杂度自适应地分配可变数量的Worker智能体。每个Worker从不同角度分析输入,并生成仅推理的解释而不输出立场标签;然后Manager综合这些解释以产生最终预测。我们在SemEval-2016、P-Stance和COVID-19 Stance上使用Llama、Mistral和Gemini评估了所提出的框架。结果表明,该框架在隐式和上下文依赖的立场案例上取得了最大增益,在COVID-19上达到86.07 Macro-F1,在SemEval-2016上达到82.90,同时在更显式的立场数据集(如P-Stance)上保持竞争力。这些发现表明,当仅凭表面线索无法可靠推断立场时,自适应推理级聚合最为有益。

英文摘要

Stance detection requires identifying an author's position toward a target, often from short-form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on this task, single-pass prompting can be brittle when multiple interpretations are plausible. Existing aggregation strategies, such as majority voting or self-consistency, improve robustness by combining labels, but they discard the intermediate reasoning needed to resolve conflicting interpretations. We introduce a multi-agent reasoning framework with adaptive worker allocation for stance detection that shifts aggregation from label-level voting to reasoning-level synthesis. The framework employs a Manager-Worker architecture in which a Manager adaptively allocates a variable number of Worker agents based on input complexity. Each Worker analyzes the input from a distinct perspective and produces a reasoning-only explanation without emitting a stance label; the Manager then synthesizes these explanations to produce the final prediction. We evaluate the proposed framework on SemEval-2016, P-Stance, and COVID-19 Stance using Llama, Mistral, and Gemini. Results show that the framework yields the largest gains on implicit and context-dependent stance cases, achieving 86.07 Macro-F1 on COVID-19 and 82.90 on SemEval-2016, while remaining competitive on more explicit stance datasets such as P-Stance. These findings suggest that adaptive reasoning-level aggregation is most beneficial when stance cannot be reliably inferred from surface cues alone.

2606.11910 2026-06-11 cs.CL 新提交

An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination

一种本体引导的多锚点图检索框架用于交通事故法律责任判定

Xu Li, Shuqi Tian, Xun Han, Kuncheng Zhao, Xinyi Li

发表机构 * Southwest Petroleum University(西南石油大学) Sichuan Police College(四川警察学院)

AI总结 提出OMAGR框架,通过本体引导将查询分解为锚点并执行并行图检索,解决多维度检索瓶颈,在TrafficLaw-QA数据集上提升上下文精度和忠实度。

详情
Comments
Submitted to ICONIP. 15 pages, 3 figures
AI中文摘要

交通事故法律责任判定对于分配法律处罚至关重要,需要同时识别跨多个法律维度的相互依赖的法定条款。然而,现有的检索增强生成方法存在多维度检索瓶颈:单轴架构将复杂的法律查询压缩为单一通路,导致相互依赖的法定维度被忽视。为了解决这个问题,我们提出了OMAGR,一个本体引导的框架,将查询分解为与本体对齐的锚点,并在每个维度上执行并行图检索,确保在融合前各维度独立检索。为了评估所提出的方法,我们创建了TrafficLaw-QA数据集,这是一个经过专家验证的基准数据集,包含200个问题和527条法律条款。结果表明,TrafficOmni-RAG在上下文精度和忠实度指标上优于基线。研究结果表明,并行多锚点检索有效解决了多维度检索瓶颈,为交通事故法律责任判定研究提供了有前景的方向。

英文摘要

Traffic law liability determination is critical for assigning legal penalties, requiring the simultaneous identification of interdependent statutory provisions across multiple legal dimensions. However, existing retrieval-augmented generation methods suffer from a multi-dimensional retrieval bottleneck: single axis architectures compress complex legal queries into a single pathway, causing interdependent statutory dimensions to be overlooked. To address this, we propose OMAGR, an ontology-guided framework that decomposes queries into ontology-aligned anchors and executes parallel graph retrieval across each dimension, ensuring independent retrieval across dimensions before fusion. To evaluate the proposed method, we created the TrafficLaw-QA dataset, an expert-validated benchmark dataset containing 200 questions and 527 legal provisions. Results show that TrafficOmni-RAG outperforms baselines on Context Precision and Faithfulness metrics. The findings demonstrate that parallel multi-anchor retrieval effectively resolves the multi-dimensional retrieval bottleneck, offering a promising direction for traffic law liability determination research.

2606.11945 2026-06-11 cs.CL cs.IR 新提交

uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking

uva-irlab-conv 在 SemEval-2026 任务 8:基于学习型稀疏检索和列表式重排序的多轮 RAG

Simon Lupart, Kidist Amde Mekonnen, Zahra Abbasiantaeb, Mohammad Aliannejadi

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 提出结合学习型稀疏检索与基于 LLM 的重排序和生成的多轮检索增强生成流水线,用于跨四个领域的对话系统,有效处理不可回答查询。

详情
Comments
SemEval-2026, The 20th International Workshop on Semantic Evaluation, collocated with ACL 2026, 9 pages, 5 figures, 6 tables
AI中文摘要

本报告描述了我们在 SemEval-2026 任务 8(多轮检索与问答)中的参与情况。该任务评估跨四个领域(金融、云文档、政府、维基百科)的对话系统,并包括不可回答的查询,即可用集合中没有足够证据来生成完整回答。我们提出了一种多轮检索增强生成流水线,将学习型稀疏检索与基于 LLM 的重排序和生成相结合。使用稀疏检索作为主要检索方法,我们利用了其跨领域的强泛化能力。此外,我们利用 LLM 的长上下文能力进行对话查询重写、逐点和列表式重排序以及生成最终回答,每一步都基于完整的对话历史。这种多步骤设计使得在整个检索和生成过程中有效整合对话上下文,提高了跨领域的鲁棒性。

英文摘要

This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.

2606.12068 2026-06-11 cs.CL 新提交

StanceNakba Shared Task: Actor and Topic-Aware Stance Detection in Public Discourse

StanceNakba 共享任务:公共话语中基于行动者和主题的立场检测

Kholoud K. Aldous, Md Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, Kais Attia, Wajdi Zaghouani

AI总结 提出 StanceNakba 2026 共享任务,通过两个子任务(行动者级和跨主题立场检测)利用微调 Transformer 模型(如 MARBERT、AraBERT)在巴以冲突相关社交媒体数据上实现高 Macro F1 分数。

详情
Comments
11 Pages, 6 Tables
AI中文摘要

我们提出 StanceNakba 2026,这是一个关于巴以冲突相关极化社交媒体话语中立场检测的共享任务,作为 LREC-COLING 2026 上 Nakba-NLP 2026 的一部分组织。该任务引入两个子任务:子任务 A(行动者级立场检测),将英语社交媒体帖子分类为亲巴勒斯坦、亲以色列或中立;子任务 B(跨主题立场检测),识别阿拉伯语帖子中关于两个冲突相关主题(与以色列正常化以及约旦难民存在)的赞成、反对或中立立场。该任务基于一个包含 2,606 条社交媒体帖子的标注数据集。共有 7 个团队参加了子任务 A,6 个团队参加了子任务 B。参与系统主要微调了阿拉伯语和多语言基于 Transformer 的模型,包括 MARBERT、AraBERT 和 DeBERTa-v3 变体,多个团队采用了交叉验证、集成方法和主题条件架构。表现最佳的系统在子任务 A 上达到了 0.9620 的 Macro F1,在子任务 B 上达到了 0.8724,表明基于 Transformer 的方法对于冲突领域立场检测非常有效,同时突显了跨主题泛化和中立类别预测方面的持续挑战。

英文摘要

We present StanceNakba 2026, a shared task on stance detection in polarized social media discourse related to the Palestinian-Israeli conflict, organized as part of Nakba-NLP 2026 at LREC-COLING 2026. The task introduces two subtasks: Subtask A (Actor-Level Stance Detection), which classifies English social media posts as Pro-Palestine, Pro-Israel, or Neutral; and Subtask B (Cross-Topic Stance Detection), which identifies Favor, Against, or Neither stances in Arabic posts toward two conflict-related topics, normalization with Israel and refugee presence in Jordan. The task is grounded in an annotated dataset of 2,606 social media posts. A total of 7 teams participated in Subtask A and 6 teams in Subtask B. Participating systems primarily fine-tuned Arabic and multilingual transformer-based models, including MARBERT, AraBERT, and DeBERTa-v3 variants, with several teams employing cross-validation, ensemble methods, and topic-conditioned architectures. The best-performing systems achieved a Macro F1 of 0.9620 on Subtask A and 0.8724 on Subtask B, demonstrating that transformer-based approaches are highly effective for conflict-domain stance detection while highlighting persistent challenges in cross-topic generalization and neutral class prediction.

2606.12210 2026-06-11 cs.CL 新提交

Can News Predict the Market? Limits of Zero-Shot Financial NLP and the Role of Explainable AI

新闻能否预测市场?零样本金融自然语言处理的局限性与可解释人工智能的作用

Ali M Karaoglu, Shreyank N Gowda

发表机构 * University of Nottingham(诺丁汉大学)

AI总结 本研究通过零样本自然语言处理框架,结合时间聚合与多层次可解释性,发现零样本方法无法超越简单基线,但可解释性信号能区分可靠与不可靠预测,强调透明性和不确定性感知在决策支持中的价值。

详情
AI中文摘要

金融新闻能否可靠地预测短期股票波动?尽管大型语言模型取得了进展,但这一问题仍未解决。我们使用零样本自然语言处理框架重新审视该问题,研究模型能否在无需领域特定训练的情况下从金融新闻中提取可操作信号。我们设计了一个结构化流程,将零样本自然语言推理与时间聚合相结合,在整合跨文章信息时明确建模时效性和事件依赖的影响范围。为了解决高风险场景中对透明度的需求,我们引入了一个多层次可解释性框架,将预测与词元级、文章级和聚合证据联系起来,并生成基于文本的自然语言理由。在多个模型和预测时间跨度上,我们发现零样本方法始终无法超越简单基线,在负向波动上表现尤其薄弱,这表明将新闻情绪映射到短期价格动态存在更深层次的结构性限制。然而,可解释性信号能够可靠地区分可信和不可信的预测,即使在准确性有限的情况下也具有实用价值。这些发现凸显了零样本金融自然语言处理的局限性,并促使我们转向优先考虑透明性和不确定性感知的决策支持系统。代码:此 https URL

英文摘要

Can financial news reliably predict short-term stock movements? Despite advances in large language models, this question remains unresolved. We revisit this problem using a zero-shot natural language processing framework, investigating whether models can extract actionable signals from financial news without domain-specific training. We design a structured pipeline that combines zero-shot natural language inference with temporal aggregation, explicitly modelling recency and event-dependent impact horizons when integrating information across articles. To address the need for transparency in high-stakes settings, we introduce a multi-layered explainability framework that links predictions to token-level, article-level, and aggregate evidence, and produces grounded natural language rationales. Across multiple models and prediction horizons, we find that zero-shot approaches consistently fail to outperform simple baselines, with particularly weak performance on negative movements, suggesting deeper structural limitations in mapping news sentiment to short-term price dynamics. However, explainability signals reliably distinguish between trustworthy and unreliable predictions, offering practical value even when accuracy is limited. These findings highlight the limits of zero-shot financial NLP and motivate a shift toward decision-support systems that prioritise transparency and uncertainty awareness. Code: this https URL

2606.12400 2026-06-11 cs.CL cs.IR 新提交

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

Doc-to-Atom:学习编译和组合记忆原子

Xingjian Diao, Wenbo Li, Yashas Malur Saidutta, Avinash Amballa, Lazar Valkov, Srinivas Chappidi

发表机构 * AI Center-Mountain View, Samsung Electronics(三星电子AI中心-山景城) Dartmouth College(达特茅斯学院)

AI总结 提出Doc2Atom框架,将文档分解为语义类型化的知识原子并编译为微LoRA适配器,通过轻量查询路由器选择相关原子组装成查询特定适配器,以解决文档压缩中的干扰和扩展性问题,在六个QA基准上优于Doc-to-LoRA。

详情
Comments
20 pages
AI中文摘要

长输入序列是大语言模型文档理解和多步推理的核心,但注意力的二次成本使得推理既内存密集又缓慢。上下文蒸馏通过将上下文信息压缩到模型参数中来缓解这一问题,最近的工作如Doc-to-LoRA将上下文蒸馏摊销为一次前向传播,为每个文档生成一个LoRA适配器。然而,为所有查询生成单个整体适配器会导致无关查询干扰、有限的组合回忆以及长文档推理的可扩展性差。为了解决这些挑战,我们提出了Doc-to-Atom(Doc2Atom),一种组合参数化记忆框架,将每个文档分解为语义类型化的知识原子。每个原子被编译成一个独立的微LoRA适配器和一个来源检索键。在推理时,一个轻量查询路由器选择并仅组装相关原子到一个查询特定适配器中,然后将其注入冻结的基础模型。整个系统通过多目标蒸馏框架进行端到端训练。在六个不同的QA基准上的实验表明,Doc2Atom优于Doc-to-LoRA基线,同时降低了文档内部化的内存成本。

英文摘要

Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.

2605.04221 2026-06-11 cs.CL cs.AI 版本更新

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

面向隐私敏感的临床信息抽取的自提示小型语言模型

Yao-Shun Chuang, Tushti Mody, Uday Pratap Singh, Shirindokht Shiraz, Chun-Teh Lee, Ryan Brandon, Muhammad F Walji, Xiaoqian Jiang, Bunmi Tokede

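AI summary: For named entity recognition in dental notes, which are unstructured, domain-specific, and privacy-sensitive, the paper proposes a locally deployable self-prompting framework that combines multi-prompt ensemble inference with QLoRA-based fine-tuning and direct preference optimization; after DPO, Qwen2.5-14B-Instruct reaches micro/macro F1 scores of 0.864/0.837.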

详情
AI中文摘要

从牙科病程记录中进行临床命名实体识别具有挑战性,因为文档高度非结构化、领域特定且通常涉及隐私敏感信息。我们开发了一个本地可部署的框架,使小型语言模型能够自行生成、验证、完善和评估实体特定提示,以从牙科记录中提取多个临床实体。利用1,200份标注记录,我们通过多提示集成推理评估了候选开放权重模型,并进一步使用基于QLoRA的监督微调和直接偏好优化对选定模型进行调整。模型性能差异显著,凸显了需要针对特定任务进行评估而非依赖通用基准。Qwen2.5-14B-Instruct取得了最强的基线性能。经过DPO后,Qwen2.5-14B-Instruct和Llama-3.1-8B-Instruct分别达到了0.864/0.837和0.806/0.797的微/宏F1分数。这些发现表明,自动提示优化结合轻量级基于偏好的后训练可以支持使用本地部署的小型语言模型进行可扩展的临床信息抽取。

英文摘要

Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.
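
The abstract reports micro and macro F1 side by side; the two can diverge when entity types differ in frequency. The sketch below shows the standard definitions on per-type counts. The entity-type names and counts in the usage note are invented for illustration and are not taken from the paper.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: dict) -> tuple:
    """counts maps entity type -> (tp, fp, fn).

    Micro F1 pools the counts over all types before computing F1, so frequent
    types dominate. Macro F1 averages the per-type F1 scores, so every type
    weighs the same regardless of frequency.
    """
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    total_fn = sum(fn for _, _, fn in counts.values())
    micro = f1(total_tp, total_fp, total_fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    return micro, macro
```

For example, with hypothetical counts `{"common_entity": (90, 10, 10), "rare_entity": (1, 1, 3)}`, the frequent type pulls micro F1 well above macro F1, which is why a micro score higher than the macro score usually signals weaker performance on rarer entity types.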

2511.19314 2026-06-11 cs.AI cs.CL cs.LG 版本更新

PRInTS: Reward Modeling for Long-Horizon Information Seeking

PRInTS:面向长程信息检索的奖励建模

Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

AI总结 提出PRInTS生成式过程奖励模型,通过密集评分和轨迹摘要提升长程信息检索中工具交互与推理能力,在多个基准上超越前沿模型。

详情
Comments
ACL 2026, 19 pages, code: this https URL
AI中文摘要

信息检索是AI智能体的核心能力,要求它们在整个长轨迹中收集和推理工具生成的信息。然而,这种多步骤信息检索任务对于基于语言模型的智能体仍然具有挑战性。虽然过程奖励模型(PRM)可以通过在测试时对候选步骤进行排序来指导智能体,但现有的PRM——设计用于具有二元判断的短程推理——无法捕捉信息检索步骤的更丰富维度,例如工具交互和对工具输出的推理,也无法处理长程任务中快速增长的上下文。为了解决这些限制,我们引入了PRInTS,一种具有双重能力的生成式PRM:(1)基于PRM对步骤质量多个维度(例如,工具输出的解释、工具调用的信息量)的推理进行密集评分,以及(2)轨迹摘要,在压缩不断增长的上下文的同时保留步骤评估所需的基本信息。在FRAMES、GAIA(级别1-3)和WebWalkerQA(简单-困难)基准上对多个模型的广泛评估表明,使用PRInTS进行最佳n采样增强了开源模型以及专门智能体的信息检索能力,以更小的骨干智能体匹配或超越前沿模型,并优于其他强奖励建模基线。

英文摘要

Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs - designed for short reasoning with binary judgment - cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple dimensions of step quality (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models reveal that best-of-n sampling with PRInTS enhances information-seeking in open-source models as well as specialized agents, matching or surpassing frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.

2602.17001 2026-06-11 cs.AI cs.CL cs.DB 版本更新

Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Sonar-TS: 为时间序列数据库的自然语言查询设计的搜索-验证方法

Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang, Xiping Liu, Shirui Pan, Ming Jin

AI总结 本文提出Sonar-TS,一种神经符号框架,用于解决时间序列数据库的自然语言查询问题,通过搜索-验证流程处理连续形态意图和超长历史数据,引入NLQTSBench基准进行评估,展示了该方法在复杂时间查询中的有效性。

详情
Comments
Accepted by ICML 2026
AI中文摘要

自然语言查询时间序列数据库(NLQ4TSDB)旨在帮助非专家用户从大量时间记录中检索有意义的事件、区间和摘要。然而,现有的文本到SQL方法未针对连续形态意图(如形状或异常)进行设计,而时间序列模型在处理超长历史时面临挑战。为解决这些问题,我们提出Sonar-TS,一种神经符号框架,通过搜索-验证流程处理NLQ4TSDB。类似于主动声纳,它利用特征索引通过SQL ping候选窗口,随后通过生成的Python程序锁定并验证候选者与原始信号。为了实现有效的评估,我们引入NLQTSBench,这是第一个大规模基准,专门针对NLQ在TSDB规模的历史数据。我们的实验突显了该领域独特的挑战,并展示了Sonar-TS在传统方法无法处理的复杂时间查询中的有效性。本文首次系统研究了NLQ4TSDB,提供了一个通用框架和评估标准,以促进未来研究。

英文摘要

Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to help non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.
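
The Search-Then-Verify idea (a cheap SQL pass over a feature index, then a Python check against the raw signal) can be sketched with the standard-library `sqlite3` module. This is a toy illustration under assumptions, not Sonar-TS itself: the `windows` table, its min/max features, the spike definition, and all function names are hypothetical.

```python
import sqlite3
from statistics import median


def build_feature_index(conn, signal, window):
    """Store per-window min/max features so candidates can be found with plain SQL."""
    conn.execute(
        "CREATE TABLE windows (start INTEGER PRIMARY KEY, stop INTEGER, lo REAL, hi REAL)"
    )
    for start in range(0, len(signal), window):
        chunk = signal[start:start + window]
        conn.execute(
            "INSERT INTO windows VALUES (?, ?, ?, ?)",
            (start, start + len(chunk), min(chunk), max(chunk)),
        )


def search(conn, min_range):
    """Search step: windows whose value range is large enough to possibly contain a spike."""
    rows = conn.execute(
        "SELECT start, stop FROM windows WHERE hi - lo >= ? ORDER BY start",
        (min_range,),
    )
    return rows.fetchall()


def verify_spike(signal, start, stop, min_height):
    """Verify step: exactly one raw sample rises min_height above the window median."""
    chunk = signal[start:stop]
    base = median(chunk)
    return sum(1 for v in chunk if v - base >= min_height) == 1


def find_spikes(signal, window=4, min_height=5.0):
    """Run the coarse SQL search, then keep only candidates that pass the raw-signal check."""
    conn = sqlite3.connect(":memory:")
    try:
        build_feature_index(conn, signal, window)
        candidates = search(conn, min_height)
    finally:
        conn.close()
    return [(s, e) for s, e in candidates if verify_spike(signal, s, e, min_height)]
```

The search step is deliberately permissive: a window with a sustained level shift also has a large range and is returned as a candidate, and only the verification step against the raw samples rejects it.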

2605.31506 2026-06-11 cs.IR cs.CL 版本更新

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

评估多源RAG中的事实密度:医学AI准确性研究

Michael R. DeMarco

AI总结 针对标准RAG管道因专家盲视效应而忽视高密度事实证据的问题,提出事实密度(FD*)作为检索优化信号,通过概率事实性分析预处理和Z-score归一化消除长度偏差,在HealthFC基准上实现100%系统综述覆盖率。

详情
Comments
16 pages, 8 tables. Includes Experiment 3 results (n=11, Wilcoxon p=0.0619). Preliminary findings; powered Experiment 3 and Graph RAG extension identified as future work. Updated from v1
AI中文摘要

检索增强生成(RAG)是当前将AI锚定于现实世界事实的行业标准。传统检索方法依赖关键词匹配和主题接近度,根据内容与用户查询的相似程度进行排序。但它们并未衡量内容实际包含多少经过验证的事实。这种结构性差距被称为专家盲视效应,导致标准RAG管道持续将高密度事实证据埋没,而偏向于同一主题的词汇主导文本。为解决这一差距,本文引入事实密度(FD*),一种新颖的检索优化信号,衡量经过验证的原子声明相对于总标记数的比例。使用NexusAgentics Ghost Audit预处理管道,通过概率事实性分析对原始文本进行事实特异性评分,在语料库摄入前过滤内容。初始公式引入了严重的文档长度混杂因素(Pearson R = -0.8636,p = 2.27e-07)。在长度区间内实施Z-score归一化解决了这一偏差,验证了FD*作为长度无关的密度信号(p = 0.0749)。在HealthFC基准(由医学专家标记为支持、反驳或无证据的750个健康声明)上评估,FD*优化的检索是唯一在top-5结果中实现100%系统综述饱和度的条件,使标准余弦相似度排名前十之外的Cochrane证据浮现。真实验证确认了跨越七个HealthFC支持声明的25个映射。由于语料库-基准对齐的限制,n=50个查询的完整统计验证仍是未来工作,但这些发现确立了事实密度重排序作为一种低成本、高影响力的干预措施,用于提高健康RAG架构的事实精度。

英文摘要

Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keyword matching and topic proximity, ranking content based on how closely it sounds like the user's query. What they do not measure is how many verified facts the content actually contains. This structural gap, termed the Expert Blindness Effect, causes standard RAG pipelines to consistently bury high-density factual evidence in favor of lexically dominant text on the same topic. To address this gap, this paper introduces Factual Density (FD*), a novel retrieval optimization signal that measures the proportion of verified atomic claims relative to total token count. Using the NexusAgentics Ghost Audit preprocessing pipeline, raw text is scored for factual specificity using probabilistic factuality analysis to filter content before corpus ingestion. An initial formulation introduced a severe document-length confound (Pearson R = -0.8636, p = 2.27e-07). Implementing Z-score normalization within length bins resolved this bias, validating FD* as a length-independent density signal (p = 0.0749). Evaluated against the HealthFC benchmark (750 health claims labeled Supported, Refuted, or No Evidence by medical experts), FD*-optimized retrieval was the only condition to achieve 100% systematic review saturation in top-5 results, surfacing Cochrane evidence that standard cosine similarity ranked outside the top ten. Ground truth verification confirmed 25 mappings across seven HealthFC-supported claims. While full statistical validation across n=50 queries remains future work due to constraints on corpus-benchmark alignment, these findings establish factual density reranking as a low-cost, high-impact intervention for improving factual precision in health RAG architectures.
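
The two steps named in the abstract, a density defined as verified claims over token count and a Z-score normalisation within length bins, can be sketched as follows. The exact FD* formula, the bin edges, and the claim-verification step are not given in the abstract, so the bin edges and every function name below are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def factual_density(verified_claims: int, token_count: int) -> float:
    """Raw density: verified atomic claims per token (0.0 for empty documents)."""
    return verified_claims / token_count if token_count else 0.0


def length_bin(token_count: int, edges=(100, 500, 2000)) -> int:
    """Index of the length bin; the edges are hypothetical, not from the paper."""
    return sum(token_count >= edge for edge in edges)


def normalized_density(docs, edges=(100, 500, 2000)) -> dict:
    """docs: iterable of (doc_id, verified_claims, token_count).

    Returns {doc_id: z}, where z is the raw density standardised against other
    documents in the same length bin, so a document is compared only with
    documents of similar length.
    """
    by_bin = defaultdict(list)
    for doc_id, claims, tokens in docs:
        by_bin[length_bin(tokens, edges)].append((doc_id, factual_density(claims, tokens)))
    scores = {}
    for members in by_bin.values():
        values = [fd for _, fd in members]
        mu, sigma = mean(values), pstdev(values)
        for doc_id, fd in members:
            # A bin with no spread carries no ranking information: score 0.0.
            scores[doc_id] = (fd - mu) / sigma if sigma > 0 else 0.0
    return scores
```

Standardising within bins is one way to remove a length confound of the kind the abstract reports: a long document with a lower raw density can still score as highly as a short one if it is dense relative to other long documents.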

2606.10725 2026-06-11 cs.LG cs.CL 版本更新

Pre-AF 13: An Interpretable Atrial Fibrillation Risk Score Mined from Discharge Reports

Pre-AF 13:从出院报告中挖掘的可解释房颤风险评分

Olga Shakhmatova, Dmitrii Kriukov, Daniil Larionov, Nikita Khromov, Iaroslav Bespalov, Alexander Zolotarev, Kirill Grishchenkov, Ekaterina Ivanova, Miron Kuznetsov, Ilya Sochenkov, Elizaveta Panchenko, Artem Shelmanov, Dmitry V. Dylov

发表机构 * National Medical Research Center of Cardiology named after Academician E.I. Chazov(国家医学研究中心心脏病学以E.I. Chazov院士命名) Skolkovo Institute of Science and Technology (Skoltech)(斯科尔科沃科学技术研究所) Artificial Intelligence Research Institute (AIRI)(人工智能研究所) University of Mannheim(曼海姆大学) Russian Center for Scientific Information (RCSI)(俄罗斯科学信息中心) Institute of Cyber Intelligence Systems, National Research Nuclear University MEPhI(网络智能系统研究所,国家研究核大学MEPhI) M.V. Lomonosov Moscow State University(莫斯科国立罗蒙诺索夫大学) Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)(俄罗斯科学院信息传输问题研究所(Kharkevich研究所)) Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS)(俄罗斯科学院伊万尼科夫系统编程研究所) Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences (FRC CSC RAS)(俄罗斯科学院联邦研究中心“计算机科学与控制”) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

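AI summary: Uses NLP to extract features from discharge reports and builds interpretable ML models that predict atrial fibrillation risk in patients with cardiovascular disease; the Pre-AF 13 model outperforms existing clinical risk scores.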

详情
Comments
O. Shakhmatova and D. Kriukov contributed equally (co-first authors). E. Panchenko, A. Shelmanov, and D. V. Dylov are co-senior authors. Correspondence to: Olga Shakhmatova < this http URL [at] this http URL > and Dmitry V. Dylov < this http URL [at] this http URL >
AI中文摘要

背景:房颤(AF)是最常见的心律失常,也是预后的主要决定因素。现有的AF风险评分依赖于在心血管疾病(CVD)患者中几乎普遍存在的因素(如高龄、高血压),因此在该高风险群体中提供的分层有限。大多数评分针对长期(5-10年)而非中期预测。我们开发了可解释的ML模型,利用常规收集的医院数据预测CVD患者在24个月和整个随访期间内的AF风险。方法:对俄罗斯国家心脏病学研究中心电子健康记录进行单中心回顾性研究,纳入2012年1月至2019年5月期间多次住院、年龄≥18岁、患有CVD但无既往AF的患者。自定义NLP流水线将非结构化出院报告转化为73个结构化特征,结合基于规则的解析器和基于Transformer的命名实体识别。使用LightAutoML构建了完整模型(73个特征)、简单模型(简化子集)以及用于床旁风险评分的线性模型。性能通过ROC AUC评估,并与CHARGE-AF、C2HEST、MHS和HAVOC进行比较,并通过SHAP进行解释。结果:在来自45,000名患者的80,576份记录中,17,562份符合纳入标准;其中1,438名(8.19%)发生AF。完整模型在24个月和整个随访期间的ROC AUC分别为0.735和0.696;简单模型几乎相同(0.725和0.696)。所有非线性模型均优于四个临床风险评分(ROC AUC 0.53-0.64)。简单模型使用13个特征,命名为Pre-AF 13。SHAP识别出年龄和左心房容积为主要预测因子。线性风险评分(Pre-AF 9)将观察到的24个月AF发生率从约7%分层至36%。结论:基于常规收集的EHR数据构建的可解释ML模型能够识别高AF风险的CVD患者,优于现有的临床风险评分。

英文摘要

Background. Atrial fibrillation (AF) is the most prevalent cardiac arrhythmia and a major determinant of prognosis. Established AF risk scores rely on factors (older age, hypertension) nearly ubiquitous among patients with cardiovascular disease (CVD), offering limited stratification in this high-risk group. Most target long-term (5-10 year) rather than medium-term prediction. We developed interpretable ML models predicting AF risk over a 24-month and entire follow-up horizon in CVD patients using routinely collected hospital data. Methods. Single-center retrospective study of electronic health records from the National Research Cardiology Center (Russia) for patients aged >=18 with CVD but without pre-existing AF, hospitalized more than once between January 2012 and May 2019. A custom NLP pipeline transformed unstructured discharge reports into 73 structured features, combining a rule-based parser with transformer-based NER. Using LightAutoML we built a full model (73 features), a simple model (reduced subset), and a linear model for a bedside risk score. Performance was assessed by ROC AUC, compared with CHARGE-AF, C2HEST, MHS, and HAVOC, and interpreted via SHAP. Results. Of 80,576 records from 45,000 patients, 17,562 met inclusion criteria; 1,438 (8.19%) developed AF. The full model reached ROC AUC 0.735 (24-month) and 0.696 (entire follow-up); the simple model was nearly identical (0.725, 0.696). All non-linear models outperformed the four clinical risk scores (ROC AUC 0.53-0.64). The simple model uses 13 features and is named Pre-AF 13. SHAP identified age and left atrial volume as dominant predictors. A linear risk score (Pre-AF 9) stratified observed 24-month AF incidence from ~7% to 36%. Conclusion. Interpretable ML models built from routinely collected EHR data identify high-AF-risk CVD patients, outperforming established clinical risk scores.

4. 对话系统与智能体 20 篇

2606.11199 2026-06-11 cs.CL cs.AI cs.IR cs.LG 新提交

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

NightFeats @ MMU-RAGent NeurIPS 2025: 面向文本到文本轨道的上下文优化多智能体RAG系统

Quentin Fever, Naziha Aslam

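AI summary: Proposes NightFeats, a structured multi-agent RAG system that decomposes knowledge synthesis into three phases (retrieval, curation, composition) and introduces temporal-semantic reranking, contradiction reconciliation, and a citation-preserving architecture, surpassing proprietary baselines in the MMU-RAGent competition.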

详情
Comments
5 pages, 1 figure, 1 table. NeurIPS 2025 Competition Track (MMU-RAGent). System developed October 2025
AI中文摘要

我们提出NightFeats,一个结构化的多智能体检索增强生成(RAG)系统,提交至NeurIPS 2025的MMU-RAGent竞赛,并在文本到文本轨道中获得最佳动态评估奖。本文并非以基准最大化为目标,而是提出一个原则性流水线,将知识合成分解为三个协调阶段:检索、策展和组合,每个阶段由显式的中间表示和交接契约控制。受智能体上下文工程(ACE)启发,该系统引入时序语义重排序、有界矛盾协调和保留引用的组合作为核心架构原语。竞赛结果表明,NightFeats在LLM-as-a-Judge和人类Likert评估中超越了包括Claude-SonnetV2和Nova-Pro在内的商业基线,证实了架构透明性和可验证证据基础比单纯优化自动相似度指标的系统更符合人类偏好。

英文摘要

We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition as core architectural primitives. Competition results show that NightFeats surpasses proprietary baselines including Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.

2606.11210 2026-06-11 cs.CL cs.AI cs.MM 新提交

T2MM: An LLM Supported Architecture For Inquiry-Based Modeling

T2MM:一种支持基于探究建模的LLM架构

John Kos, Rudra Singh, Ashok Goel

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

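AI summary: Proposes the T2MM architecture, which uses an LLM to generate interactive models in the ecological modeling software VERA and outperforms a full-code-generation baseline.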

详情
Comments
16 pages, 4 figures
AI中文摘要

模型构建是科学学习中的基础实践,依赖于可视化和交互性。大型语言模型(LLM)越来越多地增强多模态能力,并已集成到教育环境中以支持学习。然而,这些工具缺乏某些学习环境所需的视觉交互性。我们提出了文本到多模态模型(T2MM),这是一种稳健、动态的LLM支持架构,可在开放探究生态建模软件虚拟实验研究助手(VERA)中辅助模型构建。T2MM考虑学习者模型的当前上下文,并创建交互式模型(而非静态图像),使模型能够对人工调整保持响应。为了衡量技术可行性,我们通过一个自定义的程序生成数据集(包含自然语言学习者建模请求和VERA系统中的目标模型)来评估T2MM。在所有测量的成功指标上,T2MM优于通过LLM支持的全代码生成实现的基线模型生成架构(这在文献中很常见)。我们的贡献不仅概述了将LLM集成到基于探究的学习建模工具中,还描述了一种可能的架构,通过该架构可以创建更具交互性的多模态LLM工具。

英文摘要

Model construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated into educational contexts to support learning. However, these tools lack the visual interactivity that some learning contexts require. We introduce Text to Multimodal Model (T2MM), a robust, dynamic LLM-supported architecture that assists in model construction within the open inquiry ecology-based modeling software Virtual Experimental Research Assistant (VERA). T2MM accounts for the current context of the learner's model and creates interactive models, rather than static images, enabling the model to remain responsive to manual adjustment. To measure technical feasibility, we evaluate T2MM through a custom procedurally generated dataset of natural language learner modeling requests and target models within the VERA system. T2MM outperforms a baseline model generation architecture implemented through LLM-supported full code generation, common in the literature, across all measured success metrics. Our contribution not only outlines LLM integration into an inquiry-based learning modeling tool, but also describes a possible architecture through which more interactive multimodal LLM tools can be created.

2606.11212 2026-06-11 cs.CL 新提交

EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA

EverydayGPT: 用于高效安全混合GPT-RAG对话问答的置信门控路由

Jaspreet Singh Nahal

发表机构 * Dr. A.P.J. Abdul Kalam Technical University(阿卜杜尔·卡拉姆技术大学)

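AI summary: Proposes a confidence-gated routing mechanism that uses a joint policy to choose between the retrieval and generation paths, so that 85% of queries are resolved by fast RAG extraction with over 120x lower latency while answer quality is maintained.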

详情
Comments
12 pages, 10 figures, 6 tables. Code and evaluation scripts available at: this https URL. This paper studies routing strategies for hybrid GPT-RAG systems under resource constraints, focusing on efficiency-safety tradeoffs rather than state-of-the-art accuracy
AI中文摘要

标准检索增强生成(RAG)流水线无条件地将每个查询路由到检索和生成,导致不必要的计算并将低质量上下文传播给生成器。我们引入了EverydayGPT,一个轻量级对话问答系统,围绕置信门控路由(CGR)机制构建,该机制将路由决策形式化为检索距离和提取充分性的联合策略。骨干网络是一个205M参数的GPT,在FineWeb-Edu的10B令牌上从头训练。CGR通过快速RAG提取(~45 ms)解决85%的查询,避免调用昂贵的GPT路径(~5.9s),在大多数查询上实现超过120倍的延迟降低,同时保持答案质量。在500个问题的领域内基准测试中,系统达到F1 = 0.226 +/- 0.004,而仅GPT为0.171,无条件RAG为0.210。相对于强基线的提升虽小但一致,而效率提升显著(平均延迟降低6.3倍)。结构化的基础审计发现采样集中没有无根据的声明,并带有明确的范围限制。我们将这项工作定位为资源约束下路由策略的研究,而非声称最先进的性能。

英文摘要

Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.
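A rough sketch of the routing idea and the latency arithmetic, under stated assumptions: the abstract does not give the form of the joint policy, so the conjunction rule and the `max_distance` threshold below are hypothetical. The latency figures (45 ms, 5.9 s, 85%) are the approximate values quoted in the abstract.

```python
def route(retrieval_distance, extraction_adequate, max_distance=0.35):
    """Return 'rag' when the fast extraction path is trusted, else 'gpt'.

    max_distance is a made-up threshold; the gate here is a plain AND of
    "retrieved passage is close enough" and "extracted answer is adequate".
    """
    if retrieval_distance <= max_distance and extraction_adequate:
        return "rag"
    return "gpt"


# Approximate figures quoted in the abstract.
FAST_S = 0.045      # fast RAG extraction, about 45 ms
SLOW_S = 5.9        # GPT pathway, about 5.9 s
FAST_SHARE = 0.85   # share of queries resolved on the fast path

# Speedup for a query that takes the fast path instead of the GPT path.
per_query_speedup = SLOW_S / FAST_S

# Expected latency when 85% of queries take the fast path, and the resulting
# mean speedup relative to sending everything down the GPT path.
mean_latency = FAST_SHARE * FAST_S + (1 - FAST_SHARE) * SLOW_S
mean_speedup = SLOW_S / mean_latency
```

With these rounded inputs the per-query speedup is about 131x, consistent with "over 120x", and the mean speedup comes out near 6.4x, close to the reported 6.3x. The small gap is unsurprising given that both path latencies are stated only approximately.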

2606.11213 2026-06-11 cs.CL 新提交

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

超越压缩:面向长周期智能体的结构化上下文驱逐

Andrew Semenov, Svyatoslav Dorofeev

发表机构 * Kiz8

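AI summary: Proposes Context Window Lifecycle (CWL), a structured, semantically aware eviction policy that gives long-horizon LLM agents an effectively unbounded working horizon within a limited context budget while avoiding performance degradation and hallucination.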

详情
AI中文摘要

我们提出了上下文窗口生命周期(CWL),一种上下文管理方案,为长周期LLM智能体提供有效无界的工作视野。随着会话累积历史,CWL通过渐进式、语义感知的驱逐将上下文保持在预算内:智能体在工作过程中将其轨迹注释为类型化、依赖链接的情节,当令牌预算超出时,一个确定性的、无需LLM的策略在该结构内按优先级顺序驱逐内容。CWL保留用户轮次和智能体正在积极推理的探索上下文,同时积极丢弃其效果已持久化在环境中的行动情节,使活动上下文保持在稳定上限附近,这也避免了与超大提示相关的性能下降。与基于摘要的压缩相比,CWL避免了四个众所周知的局限性:不可预测的信息丢失、因果结构的破坏、阻塞模型成本以及压缩引起的幻觉。与最近截断相比,CWL具有语义感知能力:它根据依赖图丢弃最旧且最可恢复的内容,而不是按时间顺序丢弃最旧的内容而不考虑相关性。我们描述了注释协议、情节图、驱逐策略和令牌记账循环,并在长周期智能体基准上评估了CWL:一个智能体会话在8000万个令牌上完成89个顺序任务,与每任务隔离会话相比,任务准确性没有可测量的下降。

英文摘要

We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through graduated, semantically aware eviction: the agent annotates its trajectory as typed, dependency-linked episodes as work proceeds, and a deterministic, LLM-free policy evicts content in priority order within that structure when a token budget is exceeded. CWL preserves user turns and the exploratory context the agent is actively reasoning over, while aggressively shedding action episodes whose effects are already persisted in the environment, keeping active context near a stable ceiling that also avoids the performance degradation associated with very large prompts. Compared to summarization-based compaction, CWL avoids four well-known limitations: unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination. Compared to recency truncation, CWL is semantically aware: it drops the oldest-and-most-recoverable content according to the dependency graph rather than oldest-in-time regardless of relevance. We describe the annotation protocol, the episode graph, the eviction policy, and the token-accounting loop, and evaluate CWL on long-horizon agentic benchmarks: a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per-task isolated sessions.
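The eviction policy can be pictured with a small sketch. Everything here is an assumption made for illustration: the `Episode` fields, the three priority tiers, and the single-pass loop are hypothetical and are not taken from the paper, which describes its own annotation protocol and episode graph.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    eid: int                      # position in the trajectory (lower = older)
    kind: str                     # "user", "explore", or "action"
    tokens: int
    persisted: bool = False       # effects already stored in the environment
    deps: set = field(default_factory=set)  # ids of episodes this one relies on


def tier(episode):
    # Lower tier is evicted first: persisted actions, then other actions,
    # then exploratory context. User turns are never candidates.
    if episode.kind == "action":
        return 0 if episode.persisted else 1
    return 2


def evict(episodes, budget):
    """Deterministically drop episodes until total tokens fit the budget."""
    live = {e.eid: e for e in episodes}
    candidates = sorted(
        (e for e in episodes if e.kind != "user"),
        key=lambda e: (tier(e), e.eid),
    )
    for episode in candidates:
        if sum(e.tokens for e in live.values()) <= budget:
            break
        # Keep anything a surviving episode still depends on.
        if any(episode.eid in e.deps for e in live.values()):
            continue
        del live[episode.eid]
    return [e for e in episodes if e.eid in live]


history = [
    Episode(1, "user", 100),
    Episode(2, "explore", 300),
    Episode(3, "action", 500, persisted=True, deps={2}),
    Episode(4, "action", 400, persisted=True),
    Episode(5, "explore", 200),
]
kept = evict(history, budget=700)
```

In this toy history the two persisted action episodes are dropped first, which brings the total from 1,500 to 600 tokens, so the user turn and both exploratory episodes survive. No model call is involved, matching the "deterministic, LLM-free" property the abstract emphasizes.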

2606.11435 2026-06-11 cs.CL 新提交

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

智能体技能评估与进化:框架与基准

Kexin Ding, Yang Zhou, Can Jin, Feng Tong, Mu Zhou, Dimitris N. Metaxas

发表机构 * Rutgers University(罗格斯大学) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

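AI summary: A systematic survey of the paradigm shift in agent skills from isolated creation to automated, evaluation-driven evolution; it classifies four evolution paradigms, analyzes six skill-benchmark categories, and identifies coverage gaps and open directions.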

详情
AI中文摘要

智能体技能的增长已经改变了智能体系统的构建、评估和部署方式。随着技能库的持续扩展,严格的评估对于确保其在现实应用中的效用、质量和安全性变得至关重要。因此,该领域正在经历从孤立技能创建到自动化、评估驱动的技能进化的新兴范式转变。在本综述中,我们系统地考察了超越基础技能创建的技能进化与评估的格局。我们将进化分为四种不同的范式,涵盖执行反馈、轨迹蒸馏、压缩和强化学习,展示了每种元素如何有助于提高技能效用和可靠性。我们还对六个以技能为中心的基准类别进行了分析,识别了基准覆盖范围、权衡和度量丰富性方面的结构性差距,以推动技能研究。最后,我们指出了构建可泛化、高效且可验证安全的技能生态系统的开放方向。项目网址为:this https URL

英文摘要

The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is this https URL

2606.11520 2026-06-11 cs.CL cs.AI cs.LG 新提交

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE:一种基于执行的多轮操作系统代理轨迹合成方法

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

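AI summary: Proposes ISE, a three-stage paradigm that generates multi-turn agent trajectories through structured intent construction, role-locked user simulation, and a real execution environment; fine-tuning on the data markedly improves agent tool-use performance.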

详情
Comments
13 pages, 6 figures. Dataset and code: this https URL
AI中文摘要

训练有能力的操作系统代理需要同时捕获结构化用户意图、多轮任务委派和基于工具执行的数据——这些属性在现有数据集中缺失。我们提出ISE(意图->模拟->执行),一种三阶段合成范式,联合解决这些差距。阶段1通过4D框架(人物角色x领域x任务x复杂度)构建约50000个结构化意图;去重后池中包含43956个唯一意图,并在mpnet-base-v2嵌入(余弦核,q=1)上获得61.57的Vendi分数。阶段2通过角色锁定的用户模拟器驱动多轮用户-代理交互,将每轮用户交互基于实际执行结果,生成23132条完整轨迹,平均8.12轮用户交互和68.24轮总对话。阶段3在实时、隔离的操作系统工作空间中执行每个工具调用,生成真实的故障恢复动态而非模拟响应。在ISETrace上微调后,使用Qwen3-8B在标准协议下的代理工具使用任务中,ClawEval pass@1从19.3提升至37.7。该结果优于零样本GPT-4o和四倍大的Qwen3-32B基础模型。对阶段2的消融实验证明多轮模拟带来了大部分性能提升。我们在该https URL发布所有源代码和数据集。

英文摘要

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution, properties that are absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning Qwen3-8B on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 on agent tool-use tasks under a standard protocol. This result outperforms zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and the dataset at this https URL.

2606.11688 2026-06-11 cs.CL cs.AI 新提交

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Goal-Autopilot: 一种可验证的防伪造防火墙,用于无人值守的长周期智能体

Youwang Deng

发表机构 * EpistemicaLab — Independent Research(EpistemicaLab — 独立研究)

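AI summary: Proposes the Autopilot execution model, which externalizes state into a finite-state machine and enforces gated verification so that an agent cannot falsely claim success; across a 3,150-cell evaluation corpus its fabrication rate falls to 0.95%, well below the baselines.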

详情
Comments
Preprint. Code: this https URL
AI中文摘要

长周期LLM智能体在无人值守时不可信:没有人类监控,它们自信地报告从未验证的成功。我们将诚实性——限制智能体在终止时可能声称的内容——视为无人值守自主性的首要指标,与能力区分开来。我们提出Autopilot,一种执行模型,使得静默伪造的成功在结构上不可能,而不仅仅是更罕见。Autopilot将所有工作状态外部化到一个持久的、门控的有限状态机中,调度器每次以无状态滴答推进;一个硬性下限禁止任何终端“完成”声明,其可伪造的门并未实际执行并通过。我们证明了一个无假成功定理——在门控正确性、下限执行和计划覆盖下,终止意味着目标成立——其唯一信任点可经验测量,并表明最坏情况退化为诚实的停顿,而非伪造的成功。由于每个滴答仅重新水化状态机,每步上下文成本在时间范围内恒定。在3,150个单元的配对语料库(70个任务×3个系统×3个模型×5个种子,包括跨11个开源仓库的50个SWE-bench Lite任务)上,Autopilot在0.95%的单元上伪造[95% CI 0.38–1.62],而Reflexion和StateFlow基线分别在8.10% [6.48–9.81]和25.05% [22.48–27.62]上伪造。主要对比存在于困难场景:在SWE-bench Lite上,防火墙将伪造率从33.7%(StateFlow)降至0.67%,配对差异为-33.07个百分点[95% CI -36.53, -29.73]。机制在于门控而非模型:所有十个Autopilot伪造均来自最强模型,而两个较弱的中间模型在700个配对单元中从未伪造。防火墙设计上以覆盖换取诚实——诚实的停顿是可恢复的;而自信的错误输出向下游发送则不可恢复。

英文摘要

Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem -- under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds -- whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38--1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48--9.81] and 25.05% [22.48--27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design -- an honest stall is recoverable; a confident wrong output shipped downstream is not.
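The "hard floor" idea can be illustrated with a toy gated state machine. This is a sketch under assumptions, not the paper's implementation: the `Step` record, the `tick` and `run` functions, and the status strings are hypothetical. The closing arithmetic only checks that the corpus size quoted in the abstract multiplies out, and it assumes the 0.95% rate is taken over Autopilot's own cells.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    gate: Callable[[], bool]            # falsifiable check for this step
    gate_result: Optional[bool] = None  # None means the gate never executed


def tick(steps, cursor):
    """Advance the machine by one tick; returns (new_cursor, status)."""
    if cursor >= len(steps):
        # Hard floor: "done" is allowed only if every gate ran and passed.
        if all(step.gate_result is True for step in steps):
            return cursor, "done"
        return cursor, "stalled"
    step = steps[cursor]
    step.gate_result = bool(step.gate())
    if step.gate_result:
        return cursor + 1, "running"
    return cursor, "stalled"


def run(steps):
    cursor, status = 0, "running"
    while status == "running":
        cursor, status = tick(steps, cursor)
    return status


passing = [Step("build", lambda: True), Step("tests", lambda: True)]
failing = [Step("build", lambda: True), Step("tests", lambda: False)]
skipped = [Step("tests", lambda: True)]  # cursor jumps past an unexecuted gate

results = (run(passing), run(failing), tick(skipped, 1)[1])

# Consistency check on the figures quoted in the abstract.
total_cells = 70 * 3 * 3 * 5     # tasks x systems x models x seeds
autopilot_cells = 70 * 3 * 5     # one system's share of the corpus
autopilot_rate = 100 * 10 / autopilot_cells  # ten fabrications reported
```

A failed gate and a skipped gate both end in "stalled" rather than "done", which is the recoverable failure mode the abstract argues for. Ten fabrications over 1,050 cells gives about 0.95%, matching the reported rate under the stated assumption.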

2606.11744 2026-06-11 cs.CL cs.AI 新提交

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild

嘿,聊天机器人,你能教我吗?为人类学习构建结构化苏格拉底式对话

Sidney Tio, Arunesh Sinha, Pradeep Varakantham

发表机构 * School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院) Department of Management Science and Information Systems, Rutgers Business School(罗格斯大学商学院管理科学与信息系统系)

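AI summary: To address LLMs' poor tutoring over long dialogues, proposes a system that separates curriculum planning, Socratic dialogue, and knowledge-state inference, with a PPO policy deciding the teaching sequence; it outperforms baseline models on STEM and non-STEM topics.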

详情
Comments
10 Main Body Pages, with Appendices
AI中文摘要

大型语言模型现在被广泛用于日常学习,但底层交互通常是非结构化的聊天,而不是遵循课程。与正式的在线学习系统不同,这些交互没有学生的先前记录,因此对学生已知内容的任何估计都必须从对话本身推断。我们表明,仅通过扩展模型并不能弥补这一差距。前沿和教育调优的LLM在要求长时间辅导学生时表现不佳,因为这需要同时做三件事:导师必须安排课程顺序,进行苏格拉底式对话,并从对话中推断学生的知识状态。我们建议分离这些职责。给定学生查询,我们的系统构建一个先决知识图谱,其中子主题是节点,依赖关系是边,并将辅导视为决定下一个要教授哪个节点以及在该节点上花费多少轮对话后再继续。一个轻量级的PPO策略处理这个顺序决策,而LLM在所选节点进行苏格拉底式交流并返回学生进展信号。在保留的STEM和非STEM主题上,我们的PPO配对导师优于启发式基线、前沿通用模型以及专门用于苏格拉底式对话的模型:无论是在学生达到完全课程掌握的速度上,还是在所需的对话轮数上。明确的课程结构带来了底层模型扩展所无法提供的收益。

英文摘要

Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats that follow no curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once: the tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue, on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.
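The separation of responsibilities can be sketched as a loop over a prerequisite graph. This is an illustration only: a fixed rule (first teachable node, capped turns per node) stands in for the learned PPO policy, and `progress_signal` is a hypothetical placeholder for the LLM's report of student progress.

```python
def teachable(graph, mastered):
    """Nodes not yet mastered whose prerequisites are all mastered.

    graph maps each subtopic to the set of subtopics it depends on.
    """
    return sorted(
        node for node, prereqs in graph.items()
        if node not in mastered and prereqs <= mastered
    )


def tutor(graph, progress_signal, turns_per_node=3, max_turns=50):
    """Teach nodes in prerequisite order; returns (teaching order, turns used)."""
    mastered, order, turns = set(), [], 0
    while len(mastered) < len(graph) and turns < max_turns:
        frontier = teachable(graph, mastered)
        if not frontier:
            break  # remaining nodes are unreachable (for example, a cycle)
        node = frontier[0]  # stand-in for the policy's choice of next node
        for _ in range(turns_per_node):
            turns += 1
            if progress_signal(node):  # stand-in for the LLM's progress signal
                mastered.add(node)
                order.append(node)
                break
    return order, turns


curriculum = {
    "fractions": set(),
    "ratios": {"fractions"},
    "percentages": {"fractions", "ratios"},
}
# A student who masters every node on the first turn.
order, turns = tutor(curriculum, progress_signal=lambda node: True)
```

The sequencing logic never touches dialogue content, and the dialogue component only reports progress. That mirrors the split the abstract proposes, with the learned policy replacing the two fixed choices marked as stand-ins.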

2606.11875 2026-06-11 cs.CL cs.SD 新提交

I Understand How You Feel: Enhancing Deeper Emotional Support Through Multilingual Emotional Validation in Dialogue System

我理解你的感受:通过对话系统中的多语言情感验证增强深层情感支持

Zi Haur Pang, Yahui Fu, Koji Inoue, Tatsuya Kawahara

发表机构 * Graduate School of Informatics, Kyoto University(京都大学信息学研究科)

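AI summary: Applies emotional validation to dialogue systems: builds the multilingual corpus M-EDESConv and the test set M-TESC, designs MEGUMI, a multilingual emotion-aware gated unit, for timing detection, and evaluates current LLMs on generating validating responses.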

详情
Comments
This paper has been accepted for presentation at SIGdial Meeting on Discourse and Dialogue 2026 (SIGDIAL 2026)
AI中文摘要

情感验证——明确承认用户的感受是合理的——已被证明具有治疗价值,但很少受到计算方面的关注。对话系统中的情感验证可以分解为:(i) 验证响应识别,(ii) 验证时机检测,以及 (iii) 验证响应生成。为了支持所有三个子任务的研究,我们发布了 M-EDESConv,一个通过混合手动和自动标注创建的 12 万条英日多语言语料库,以及 M-TESC,一个多语言口语对话测试集。对于时机检测,我们提出了 MEGUMI,一种多语言情感感知门控单元用于相互融合,它通过跨模态注意力和门控融合将冻结的 XLM-RoBERTa 语义与特定语言的情感编码器融合。MEGUMI 在 M-EDESConv 和 M-TESC 数据集上均表现出优越的性能,无论是客观还是主观评价。最后,我们的 EmoValidBench 基准测试(使用 GPT-4.1 Nano 和 Llama-3.1 8B)表明,当前的 LLM 能够生成上下文相似且多样化的验证响应,但情感理解仍然是一个需要改进的主要领域。项目页面:this https URL

英文摘要

Emotional validation - explicitly acknowledging that a user's feelings make sense - has proven therapeutic value but has received little computational attention. Emotional validation in dialogue systems can be decomposed into (i) validating response identification, (ii) validation timing detection, and (iii) validating response generation. To support research on all three subtasks, we release M-EDESConv, a 120k English-Japanese multilingual corpus created through hybrid manual and automatic annotation, and M-TESC, a multilingual spoken-dialogue test set. For timing detection, we propose MEGUMI, a Multilingual Emotion-aware Gated Unit for Mutual Integration, that fuses frozen XLM-RoBERTa semantics with language-specific emotion encoders via cross-modal attention and gated fusion. MEGUMI shows superior performance on both the M-EDESConv and M-TESC datasets, both objectively and subjectively. Finally, our EmoValidBench benchmarks of GPT-4.1 Nano and Llama-3.1 8B indicate that current LLMs generate contextually similar and diverse validating responses, but emotional understanding remains a major area for improvement. Project page: this https URL

2606.12087 2026-06-11 cs.CL 新提交

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

FORT-Searcher:合成抗捷径搜索任务以训练深度搜索智能体

Jia Deng, Yimeng Chen, Xiaoqing Xiang, Ziyang Zeng, Shuo Tang, Wayne Xin Zhao, Feng Chang, Chuan Hao, Yuan Wei, Ran Tao, Bryan Dai, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence Renmin University of China(中国人民大学高瓴人工智能学院) KAUST(阿卜杜拉国王科技大学) IQuest Research(IQuest研究院) Shanghai Jiao Tong University(上海交通大学)

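AI summary: Proposes the FORT framework, which synthesizes shortcut-resistant training data by controlling four shortcut risks, leading search agents to search longer before answering and to show fewer shortcut patterns; trained with SFT only, FORT-Searcher achieves the best overall performance among comparable-size open-source search agents.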

详情
Comments
30 pages
AI中文摘要

训练深度搜索智能体需要可验证的问题,其答案只有在通过搜索获得足够证据后才可用。现有的合成方法通常通过丰富图结构来增加表面难度,但仅凭结构复杂性并不能保证实现实际的搜索难度:预期的搜索过程可能通过更便宜的识别路径崩溃。我们用一个捷径感知的难度框架形式化了这一差距,并识别了四种可操作的捷径风险:证据共覆盖、单线索选择性、暴露常数和先验知识绑定。为了诊断它们的实际效果,我们使用轨迹签名,包括求解成本、答案命中时间和先验捷径率。在此框架的指导下,我们引入了FORT,一个抗捷径训练数据合成框架。FORT通过控制实体选择、证据图构建、问题表述和对抗性细化中的捷径风险来构建抗捷径训练数据。实验表明,与现有的开源深度搜索数据集相比,FORT诱导了更长的预答案搜索和更少的捷径模式。使用由此产生的轨迹,我们仅通过监督微调(SFT)训练FORT-Searcher,并在具有挑战性的深度搜索基准上取得了可比大小的开源搜索智能体中最佳的整体性能。相关资源将在 this https URL 上提供。

英文摘要

Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route. We formalize this gap with a shortcut-aware difficulty framework and identify four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, we use trajectory signatures including solving cost, answer hit time, and prior-shortcut rate. Guided by this framework, we introduce FORT, a Framework of Shortcut-Resistant Training-Data Synthesis. FORT constructs shortcut-resistant training data by controlling shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT induces longer pre-answer search and fewer shortcut patterns than existing open-source deep search datasets. Using the resulting trajectories, we train FORT-Searcher with supervised fine-tuning (SFT) only, and it achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Relevant resources will be made available at this https URL.

2606.12411 2026-06-11 cs.CL cs.LG 新提交

Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

上下文驱动的增量压缩用于多轮对话生成

Yeongseo Jung, Jaehyeok Kim, Eunseo Jung, Jiachuan Wang, Yongqi Zhang, Ka Chun Cheung, Simon See, Lei Chen

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) NVIDIA AI Technology Center(NVIDIA AI技术中心) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学)

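AI summary: Proposes Context-Driven Incremental Compression (C-DIC), which shares information across turns through revisable per-thread compression states and a lightweight retrieve-revise-write-back loop, stabilizing performance in long dialogues.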

详情
Comments
Accepted at ICML 2026
AI中文摘要

现代对话代理在每一轮都会处理不断增长的对话历史,导致冗余的注意力和编码成本随对话长度增加。简单的截断或摘要会降低保真度,而现有的上下文压缩器缺乏跨轮记忆共享或修订,导致信息丢失和长对话中的累积错误。我们重新审视了对话动态下的上下文压缩,并经验性地展示了其脆弱性。为了提高效率和鲁棒性,我们引入了上下文驱动的增量压缩(C-DIC),它将对话视为交织的上下文线程,并在单个紧凑的对话记忆中存储每个线程的可修订压缩状态。在每一轮,一个轻量级的检索、修订和写回循环在轮次之间共享信息并更新过时的记忆,从而稳定长期行为。此外,我们将截断反向传播(TBPTT)适应于我们的多轮设置,学习跨轮依赖关系而无需完整历史反向传播。在长对话基准上的大量实验证明了C-DIC的优越性能和效率;值得注意的是,C-DIC在数百轮对话中表现出稳定的推理延迟和困惑度,为高质量对话建模提供了一条可扩展的路径。

英文摘要

Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit context compression under conversational dynamics and empirically demonstrate its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate the superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.

2606.11290 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

FlowBank: 通过预计算与复用实现查询自适应智能体工作流优化

Lingzhi Yuan, Chenghao Deng, Fangxu Yu, Souradip Chakraborty, Mohammad Rostami, Furong Huang

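AI summary: Proposes FlowBank, a framework that precomputes diverse workflows, compresses them into a compact portfolio, and adaptively selects a workflow per query at inference time to balance performance and cost; it achieves the highest average score on five benchmarks at competitive cost.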

详情
AI中文摘要

基于大型语言模型的多智能体系统日益强大,但当前的智能体工作流优化范式存在令人不满意的权衡。任务级方法花费大量离线计算却只部署单个工作流,导致互补候选未被使用;而查询级方法为每个查询合成新工作流,推理成本高昂。我们的动机分析表明,这些范式更多是互补而非竞争:离线搜索中发现的工作流通常解决不同子集的查询,许多由昂贵查询级生成处理的查询已经可以通过更便宜的预计算工作流解决。这暗示了一个不同的目标:与其寻找一个普遍最佳的工作流或为每个实例重新生成,不如构建一个紧凑的、可复用的互补工作流库,并在推理时自适应地选择。为此,需要解决三个耦合问题:生成互补而非冗余的候选、压缩成小型可部署组合、在性能-成本权衡下为每个查询分配正确的工作流。我们提出FlowBank,一个基于组合的智能体工作流优化的三阶段框架。多样化阶段提出DiverseFlow,引导搜索覆盖未充分覆盖的查询,产生高覆盖率的候选池。精炼阶段提出CuraFlow,将候选池压缩为冗余最小的紧凑组合。匹配阶段将部署建模为查询-工作流二分图上的边值预测,将每个传入查询路由到预测效用最佳的组合成员。在五个基准上,FlowBank在评估方法中实现了最高平均得分,同时保持成本竞争力,相比最强的自动和手工基线分别相对提升4.26%和14.92%。

英文摘要

Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance-cost trade-off. To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.
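The portfolio idea can be illustrated with two small functions. Both are stand-ins chosen for clarity, not the paper's CuraFlow or its graph-based matcher: `curate` uses plain greedy maximum coverage, and `route` picks the portfolio member with the best utility minus a weighted cost, where the utilities are supplied directly instead of being predicted. All names and numbers are hypothetical.

```python
def curate(solves, k):
    """Greedy max-coverage selection of at most k workflows.

    solves maps a workflow name to the set of query ids it solved offline.
    Ties are broken alphabetically so the result is deterministic.
    """
    chosen, covered = [], set()
    while len(chosen) < k:
        gains = {
            name: len(queries - covered)
            for name, queries in solves.items()
            if name not in chosen
        }
        if not gains or max(gains.values()) == 0:
            break  # nothing left adds coverage
        best = min(gains, key=lambda name: (-gains[name], name))
        chosen.append(best)
        covered |= solves[best]
    return chosen, covered


def route(utilities, costs, portfolio, cost_weight=0.1):
    """Pick the portfolio member with the best utility-minus-cost score."""
    return max(portfolio, key=lambda name: utilities[name] - cost_weight * costs[name])


# Offline results: which queries each candidate workflow solved (made-up data).
solves = {
    "cot": {1, 2, 3},
    "cot_v2": {1, 2},
    "debate": {3, 4, 5},
    "tool": {5, 6},
}
portfolio, covered = curate(solves, k=2)

# Per-query routing with made-up utilities and costs.
pick = route({"cot": 0.90, "debate": 0.95}, {"cot": 1.0, "debate": 4.0}, portfolio)
```

The redundant candidate `cot_v2` is never selected because it adds no coverage once `cot` is in the portfolio. In routing, the slightly more accurate but four times costlier workflow loses to the cheaper one, which is the performance-cost trade-off the abstract describes.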

2606.11680 2026-06-11 cs.AI cs.CL cs.LG 交叉投稿

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

先组织再检索:面向高效智能体的层次化记忆导航

Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He

发表机构 * Duke University(杜克大学) Snowflake AI Research(Snowflake AI研究)

AI总结 提出HORMA框架,通过构建文件系统式的层次化记忆结构并利用强化学习训练的轻量级导航代理,实现高效检索,在长时任务中提升性能并降低令牌消耗。

详情
AI中文摘要

大型语言模型(LLM)智能体由于固有的无状态性,在处理长时任务时面临挑战,所有任务相关信息必须编码到不断增长的输入上下文中,导致推理质量下降、推理成本增加和延迟升高,因此需要高效的工作记忆机制。然而,现有方法要么依赖有损压缩,要么基于相似性检索,往往无法捕捉多步智能体任务所需的时间结构和因果依赖关系。在这项工作中,我们提出了HORMA,一种层次化组织与检索记忆智能体,它将经验组织成类似文件系统的层次化结构,其中总结的实体链接到相应的原始轨迹,从而在保留详细信息的同时实现高效访问。HORMA将工作记忆分解为两个阶段:结构化记忆构建和基于导航的检索。构建模块通过区分由信息缺失导致的失败和由误导性或过载上下文导致的失败,迭代地优化经验的结构化方式。导航模块使用强化学习训练的轻量级代理遍历层次结构,选择最小但充分的上下文,从而减少关键执行路径上的延迟。在ALFWorld、LoCoMo和LongMemEval上,HORMA在受限上下文预算下提升了任务性能,同时在长对话任务中最多仅使用基线22.17%的令牌。与现有方法相比,它始终实现了更好的效率-性能权衡,并能有效泛化到未见任务。

英文摘要

Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.

2510.18289 2026-06-11 cs.CL cs.CY cs.MA 版本更新

Food4All: An Agentic Framework and Benchmark for Food Resource Navigation with Adaptive User Understanding

Food4All: 一种具有自适应用户理解能力的食物资源导航智能体框架与基准

Yiyang Li, Weixiang Sun, Tianyi Ma, Kaiwen Shi, Zheyuan Zhang, Yanfang Ye

AI总结 提出Food4All框架,结合食物搜索工具与300个多轮评估任务,在686个印第安纳食物资源上评估六种大语言模型,诊断其在约束条件处理和非理想用户交互中的不足。

详情
Comments
We have further refined the benchmark construction and experimental presentation to improve clarity and consistency. The revised version includes updated task design, food-resource data, and evaluation details to better align the benchmark with the intended food resource referral setting. These changes provide a more precise presentation of the experimental findings
AI中文摘要

食物援助推荐需要对话智能体将未明确指定且常含噪声的求助对话转化为本地有效的资源推荐。我们提出Food4All,一个基于686个结构化印第安纳食物资源的智能体食物资源推荐框架与基准。Food4All将食物特定搜索工具与300个多轮评估任务相结合,涵盖单一食物需求、具有访问或文件约束的复合案例,以及五种非理想用户交互特征:不合理要求、冗长回答、不耐烦、不完整答案和不一致信息。我们在需求理解、资源检索、最终推荐正确性和交互效率上评估了六种大语言模型。尽管最强模型达到了96.33%的推荐准确率,但我们的诊断揭示了在时间安排、资格、接收和文件约束方面的持续失败,以及在最终推荐中未能保留有效检索到的资源。特征级分析进一步表明,不同的非理想行为对推荐流程的不同部分造成压力。Food4All为在现实用户交互挑战下研究约束敏感的食物援助推荐中的工具调用智能体提供了一个受控测试平台。

英文摘要

Food assistance referral requires conversational agents to translate underspecified, often noisy help-seeking dialogues into locally valid resource recommendations. We present Food4All, an agentic food-resource referral framework and benchmark grounded in 686 structured Indiana food resources. Food4All couples a food-specific search tool with 300 multi-turn evaluation tasks spanning single food needs, composite cases with access or document constraints, and five non-ideal user interaction traits: unreasonable demands, rambling responses, impatience, incomplete answers, and inconsistent information. We evaluate six Large Language Models (LLMs) on requirement grounding, resource retrieval, final referral correctness, and interaction efficiency. Although the strongest model achieves 96.33% referral accuracy, our diagnostics reveal persistent failures in grounding schedule, eligibility, intake, and document constraints, as well as failures to preserve valid retrieved resources in the final recommendation. Trait-level analysis further shows that different non-ideal behaviors stress different parts of the referral pipeline. Food4All provides a controlled testbed for studying tool-calling agents in constraint-sensitive food assistance referral under realistic user interaction challenges.

2603.08501 2026-06-11 cs.CL 版本更新

Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Fanar-Sadiq:一种面向有据可依的伊斯兰问答的多智能体架构

Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam

AI总结 针对大语言模型在伊斯兰问答中易产生幻觉和错误归因的问题,提出基于多智能体工具增强架构的Fanar-Sadiq系统,通过意图感知路由、检索增强教法回答、精确经文引用和确定性计算器,在公开基准上实现高效准确的伊斯兰问答。

详情
Comments
Islamic QA; Religious NLP; Retrieval-Augmented Generation; Multi-Agent LLMs; Tool-Augmented Reasoning; Faithful Generation; Fiqh Reasoning
AI中文摘要

大型语言模型(LLM)能够流畅回答宗教知识查询,但经常产生幻觉并错误归因来源,这在伊斯兰环境中尤其严重,因为用户期望基于经典文本(《古兰经》和圣训)和教法(fiqh)细微差别的回答。检索增强生成(RAG)改善了基础性,但单一的检索-生成流程不足以处理多样化的伊斯兰查询,包括逐字经文、基于引用的指导以及规则约束的计算(如天课和遗产)。为了解决这些挑战,我们提出了Fanar-Sadiq,一个基于多智能体、工具增强架构的双语(阿拉伯语-英语)伊斯兰问答系统。它是Fanar AI平台的核心组件。Fanar-Sadiq将伊斯兰查询路由到智能体工具架构中的专门模块。它支持意图感知路由、带有标准化引用和验证轨迹的检索增强教法回答、带有引文验证的精确经文查找,以及具有教法学派敏感分支的确定性逊尼派天课和遗产计算器。我们在公开的伊斯兰问答基准上评估了端到端系统,显示出强大的有效性和效率。该系统通过API和Web应用程序公开访问,在不到一年的时间内已收到超过190万次访问(此 https URL )。

英文摘要

Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur'an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) improves grounding; however, a single retrieve-then-generate pipeline is insufficient for diverse Islamic queries, including verbatim scripture, citation-grounded guidance, and rule-constrained computations such as zakat and inheritance. To address these challenges, we present Fanar-Sadiq, a bilingual Arabic-English Islamic QA system built on a multi-agent, tool-augmented architecture. It is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic queries to specialized modules within an agentic tool architecture. It supports intent-aware routing, retrieval-grounded fiqh answers with normalized citations and verification traces, exact verse lookup with quotation validation, and deterministic Sunni zakat and inheritance calculators with madhhab-sensitive branching. We evaluate the end-to-end system on public Islamic QA benchmarks and show strong effectiveness and efficiency. It is publicly accessible through an API and Web application and has received over 1.9M accesses in less than a year ( this https URL ).

2604.18543 2026-06-11 cs.AI cs.CL 版本更新

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

ClawEnvKit:爪型智能体的自动环境生成

Xirui Li, Ming Li, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou

AI总结 提出ClawEnvKit自动生成多样、可验证的爪型智能体训练与评估环境,构建含1040个环境的Auto-ClawEval基准,成本降低13800倍,性能提升达15.7个百分点。

详情
AI中文摘要

构建用于训练和评估爪型智能体的环境仍然是一个手动、人力密集且无法扩展的过程。我们认为,需要的不仅仅是一个数据集,而是一个能够按需生成多样化、可验证环境的自动化流水线。为此,我们引入了ClawEnvKit,一个自主生成流水线,它从自然语言描述中实例化这一形式化体系。该流水线包含三个模块:(1)解析器,从自然语言输入中提取结构化生成参数;(2)生成器,生成任务规范、工具接口和评分配置;(3)验证器,确保生成环境的可行性、多样性、结构有效性和内部一致性。使用ClawEnvKit,我们构建了Auto-ClawEval,这是首个用于爪型智能体的大规模基准,包含24个类别的1040个环境。实验表明,Auto-ClawEval在连贯性和清晰度上匹配或超过人工策划的环境,成本降低13800倍。在4个模型家族和8个智能体框架上评估,我们发现框架工程比裸ReAct基线性能提升高达15.7个百分点,完成度仍是主要变化轴,且没有模型饱和该基准,自动化生成使得评估规模达到前所未有的水平。除了静态基准测试,ClawEnvKit还支持实时评估:用户用自然语言描述所需能力,即可按需获得验证过的环境,将评估转变为持续的、用户驱动的过程。同样的机制也可作为按需训练环境生成器,产生适应智能体当前弱点的任务分布,而非受限于现有用户日志。

英文摘要

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.

2605.14084 2026-06-11 cs.SE cs.AI cs.CL 版本更新

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

CRANE:通过零空间编辑实现代码代理的约束推理注入

Mingzhi Zhu, Michele Merler, Raju Pavuluri, Stacy Patterson

AI总结 CRANE是一种无需训练的零空间编辑方法,将Thinking检查点的推理能力注入Instruct检查点并保留其工具使用纪律,在Roo-Eval、SWE-bench-Verified和Terminal-Bench v2上优于其他合并策略。

详情
AI中文摘要

代码代理必须同时对长周期的仓库状态进行推理并遵守严格的工具使用协议。在配对的Instruct/Thinking检查点中,这些能力互补但并不一致:Instruct模型简洁且严守工具使用纪律,而Thinking模型提供更强的规划和恢复行为,但往往过度思考并降低代理性能。我们提出CRANE(通过零空间编辑实现代码代理的约束推理注入),一种无需训练的参数编辑方法,将Thinking-Instruct的delta视为Instruct骨干的候选推理编辑方向池。CRANE结合了用于对delta去噪的幅度阈值化、用于保留同时有利于推理迁移和工具使用保持的编辑的保守泰勒门(Conservative Taylor Gate),以及用于抑制格式关键更新方向的渐进Sigmoid投影(Graduated Sigmoidal Projection)。通过合并配对的Instruct和Thinking检查点,CRANE相对任一单独模型都取得显著提升,同时保持Instruct级别的效率:在Roo-Eval上,Qwen3-30B-A3B的pass1达到66.2%(+19.5%),Qwen3-Next-80B-A3B达到81.5%(+8.7%);在SWE-bench-Verified上,它在两个规模上最多额外解决14个实例(122/500和180/500);在Terminal-Bench v2上,pass1/pass5最多提升2.3%/7.8%,分别达到7.6%/17.9%和14.8%/30.3%,在全部三个基准上一致优于其他合并策略。

英文摘要

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.
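
As a rough illustration of the first ingredient named above, magnitude thresholding of the Thinking-minus-Instruct delta, the sketch below works on flat lists of floats. The keep-fraction rule, the names `threshold_delta` and `merge`, and the `scale` parameter are assumptions for illustration; the Conservative Taylor Gate and Graduated Sigmoidal Projection are not modeled, and the abstract does not specify how the threshold is chosen.

```python
from typing import List


def threshold_delta(instruct: List[float], thinking: List[float],
                    keep_fraction: float) -> List[float]:
    """Compute delta = thinking - instruct and zero out all but the
    largest-magnitude entries (ties at the cutoff are all kept)."""
    delta = [t - i for i, t in zip(instruct, thinking)]
    k = max(1, int(len(delta) * keep_fraction))
    cutoff = sorted((abs(d) for d in delta), reverse=True)[k - 1]
    return [d if abs(d) >= cutoff else 0.0 for d in delta]


def merge(instruct: List[float], delta: List[float], scale: float = 1.0) -> List[float]:
    """Apply the (sparsified) delta to the Instruct weights."""
    return [i + scale * d for i, d in zip(instruct, delta)]


# Toy weights, purely hypothetical.
instruct = [1.0, 2.0, 3.0, 4.0]
thinking = [1.5, 2.01, 2.0, 4.02]
sparse = threshold_delta(instruct, thinking, keep_fraction=0.5)
merged = merge(instruct, sparse)
```

The small entries of the delta are treated as noise and discarded, so only the two large edits reach the merged weights.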

2606.05922 2026-06-11 cs.AI cs.CL cs.LG 版本更新

Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

在黑暗中进化智能体:通过自我偏好实现回顾性工具集优化

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

AI总结 提出一种自监督方法RHO,利用历史轨迹回滚和自偏好选择优化智能体工具集,无需真实标签,在SWE-Bench Pro上通过单轮优化将通过率从59%提升至78%。

详情
Comments
Code: this https URL; Project website: this https URL
AI中文摘要

AI智能体依赖于技能、工具和工作流程的整合(称为工具集)来解决复杂问题。持续改进这一工具集对于适应新任务至关重要。然而,现有的优化方法通常需要真实验证集,但在实际部署场景中获取此类标注数据非常困难。为解决这一问题,我们提出回顾性工具优化(RHO),一种仅利用过去轨迹的自监督方法。具体而言,RHO从历史轨迹中选择一个多样化的困难任务核心集,并并行重新求解。智能体通过自我验证和自我一致性分析这些回滚,然后生成候选工具集更新,并通过自身的成对自我偏好选择最有效的更新。我们在三个不同领域(涵盖软件工程、技术工作和知识工作)上评估RHO。值得注意的是,单轮优化无需任何外部评分即可将SWE-Bench Pro上的通过率从59%提升至78%。此外,我们的分析表明RHO有效针对先前的失败模式。因此,优化后的工具集改变了智能体的行为模式,并在长周期会话中保持更高的准确性。

英文摘要

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

2606.07909 2026-06-11 cs.AI cs.CL 版本更新

MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

MemToolAgent:基于环境与用户反馈,利用记忆增强工具使用智能体

Suleyman Armagan Er, Danilo Ribeiro, Yogesh Virkar, Surafel Lakew, Adi Kalyanpur, James Gung, Thomas Delteil, Arshit Gupta

发表机构 * AWS AI University of Washington(华盛顿大学)

AI总结 提出MemToolAgent框架,通过记忆管理提升大语言模型代理的工具使用能力,包含记忆提取和动态检索模块,在三个基准上分别提升29%、80%和17%。

详情
Comments
8 pages, 5 figures
AI中文摘要

现代大语言模型(LLM)代理可以使用外部工具帮助用户解决复杂任务。然而,对于需要从长期历史事件或先前的代理-环境交互中学习的问题,LLM代理需要使用记忆机制来存储和检索经验。尽管对话代理存在复杂的记忆系统,但很少有研究实证检验如何通过过去的用户-代理对话来提升代理的工具使用能力。我们提出MemToolAgent,一个通过记忆管理改善工具使用的框架。我们的方法包含一个记忆提取模块,将过去的经验处理成结构化的记忆条目,以及一个检索模块,动态选择存储记忆条目的子集。这使得无需LLM微调即可实现更个性化和准确的响应,与用户偏好和反馈保持一致。总之,本工作有三个主要贡献:(1)统一的记忆条目格式,无需LLM微调即可改善通用和个性化工具使用;(2)基于反思的记忆提取,利用环境和用户反馈将错误执行提炼为批评并存储;(3)一个检索模块,根据记忆相似度分布选择使用多少过去经验。MemToolAgent在WorkBench、NESTFUL和PEToolBench基准上相比强基线分别实现了29%、80%和17%的相对改进。

英文摘要

Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.
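
The third contribution, choosing how many memories to retrieve from the similarity distribution, could look something like the sketch below. The abstract does not give the actual rule, so the mean-plus-`z`-standard-deviations cutoff, the `max_k` cap, and the function name `select_memories` are assumptions for illustration only.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def select_memories(scored: List[Tuple[str, float]], z: float = 1.0,
                    max_k: int = 5) -> List[str]:
    """Keep memories whose similarity stands out from the rest of the
    distribution; always return at least the single best match."""
    if not scored:
        return []
    sims = [s for _, s in scored]
    cutoff = mean(sims) + z * pstdev(sims)
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = [memory_id for memory_id, s in ranked if s >= cutoff][:max_k]
    return kept or [ranked[0][0]]


# Hypothetical similarity scores between the current query and stored entries.
scores = [("a", 0.9), ("b", 0.2), ("c", 0.25), ("d", 0.15)]
selected = select_memories(scores)
```

When one entry is far more similar than the others, only that entry is returned; a fixed top-k would have padded the prompt with three weak matches.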

2606.09365 2026-06-11 cs.AI cs.CL 版本更新

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

经验造就熟练:通过自进化技能记忆实现可泛化的医疗智能体推理

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

发表机构 * Fudan University(复旦大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Innovation Institute(上海创新研究院) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出SkeMex框架,通过技能记忆实现医疗智能体后部署自进化,无需更新模型权重,在临床任务中优于现有记忆型智能体。

详情
AI中文摘要

医疗智能体系统越来越期望支持交互式临床决策,而不仅仅是静态问答。在这种设置中,有效的智能体必须跨演化病例重用先前经验,然而现有的记忆机制通常保留原始历史轨迹,这些轨迹冗余、嘈杂且难以管理。更重要的是,它们很少区分哪些记忆对未来推理真正有用。这限制了它们积累紧凑且可靠的经验以进行长期临床推理的能力。为弥补这一差距,我们提出SkeMex,一种部署后自进化框架,通过基于技能的记忆改进医疗智能体,无需更新模型权重。SkeMex将信息丰富的交互轨迹提炼为结构化技能,编码可重用的程序性知识,并将其组织成涵盖通用、任务特定和行动级经验的多分支存储库。为确定哪些记忆应被重用和保留,SkeMex从环境反馈中估计上下文相关的效用,并用其指导价值感知的检索和存储库治理。闭环的“读-写-评估-治理”生命周期通过写入新技能、更新效用、促进有用记忆和移除有害条目进一步支持持续进化。跨不同临床任务的实验表明,SkeMex在离线和在线设置中均持续优于代表性记忆型智能体。它还能跨模型骨干泛化并支持可迁移的技能记忆。所有数据和代码将公开发布。

英文摘要

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

5. 文本生成、摘要与编辑 4 篇

2606.11203 2026-06-11 cs.CL cs.LG 新提交

LatticeBridge: Rare-Event Sequential Inference for Faithful Structured Sequence Synthesis

LatticeBridge: 用于忠实结构化序列合成的罕见事件序列推理

Faruk Alpay, Bugra Kilictas

发表机构 * Bahcesehir University(巴切塞希尔大学)

AI总结 针对结构化序列生成中约束满足的罕见事件问题,提出LatticeBridge方法,结合前缀语言模型、实例编译表面自动机和扭曲序列蒙特卡洛解码器,在多个基准上显著提升锚点满足率和覆盖率。

详情
Comments
19 pages. Code and benchmark files available at this https URL
AI中文摘要

结构化序列生成通常要求模型在单个输出中满足多个输入派生约束。标准解码方法可能赋予流畅延续高概率,而对同时实现所有必需锚点的延续赋予低概率。我们将此机制视为罕见事件序列推理问题。LatticeBridge 结合了紧凑前缀语言模型、实例编译表面自动机以及带有重采样、多级分裂和源自实例提供短语的源支持提议项的扭曲序列蒙特卡洛 (SMC) 解码器。约束表示从每个输入实例编译而来,不依赖人工整理的词汇类别。在涵盖 CommonGen、E2E NLG 和 WikiBio 的 2,610 个可达到验证任务上,粒子解码器在共享提议模型下,相比贪心、波束过滤和 best-of-k 祖先基线,提高了精确锚点满足率和平均锚点覆盖率。由于仅精确锚点满足不能排除不支持的属性替换,评估同时报告了所需锚点覆盖率、源覆盖率、源入侵诊断、重叠度、运行时间和粒子统计量。该基准在固定提议模型下刻画了忠实度-重叠度-延迟前沿。

英文摘要

Structured sequence generation often requires a model to satisfy several input-derived constraints in a single output. Standard decoding methods may assign high probability to fluent continuations while placing low mass on continuations that realize all required anchors jointly. We study this regime as a rare-event sequential inference problem. LatticeBridge combines a compact prefix language model, instance-compiled surface automata, and a twisted sequential Monte Carlo (SMC) decoder with resampling, multilevel splitting, and a source-support proposal term derived from instance-provided phrases. The constraint representation is compiled from each input instance and does not rely on manually curated lexical classes. On 2,610 attainable validation tasks spanning CommonGen, E2E NLG, and WikiBio, the particle decoder improves exact anchor satisfaction and mean anchor coverage over greedy, beam-filtered, and best-of-k ancestral baselines under a shared proposal model. Since exact anchor satisfaction alone does not rule out unsupported attribute substitutions, the evaluation reports required-anchor coverage, source coverage, source-intrusion diagnostics, overlap, runtime, and particle statistics jointly. The benchmark characterizes the faithfulness-overlap-latency frontier under a fixed proposal model.
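
Two generic building blocks of any sequential Monte Carlo decoder are the effective sample size, used to decide when to resample, and the resampling step itself. The sketch below shows textbook versions of both; it does not implement the paper's twisting, multilevel splitting, or source-support proposal, and the function names are illustrative.

```python
import random
from typing import List


def effective_sample_size(weights: List[float]) -> float:
    """ESS = (sum w)^2 / sum w^2; equals the particle count when weights are uniform."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def systematic_resample(weights: List[float], rng: random.Random) -> List[int]:
    """Systematic resampling: one random offset, then evenly spaced pointers
    through the cumulative weights. Returns the indices of surviving particles."""
    n = len(weights)
    step = sum(weights) / n
    offset = rng.random() * step
    indices: List[int] = []
    i, cumulative = 0, weights[0]
    for j in range(n):
        target = offset + j * step
        while cumulative <= target and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices


# Hypothetical particle weights: only particle 2 satisfies the constraints so far.
weights = [0.0, 0.0, 1.0, 0.0]
survivors = systematic_resample(weights, random.Random(0))
```

In the rare-event regime described in the abstract, most particles carry little weight, so the ESS collapses and resampling concentrates the population on the few prefixes that can still reach all anchors.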

2606.12003 2026-06-11 cs.CL 新提交

Agreement in Representation Space for Open-Ended Self-Consistency

表示空间中的一致性:面向开放式自洽性

Paula Ontalvilla, Gorka Azkune, Aitor Ormazabal

发表机构 * HiTZ Center - Ixa, University of the Basque Country (UPV/EHU)(HiTZ中心 - Ixa,巴斯克大学(UPV/EHU))

AI总结 针对开放式生成任务,提出基于嵌入的协议(EBA),通过聚类采样生成的嵌入表示来估计自洽性,无需训练即可鲁棒地选择更可靠的输出。

详情
AI中文摘要

自洽性通过采样多个输出并选择最一致的答案来改进大语言模型的推理,但现有方法主要依赖于精确匹配,因此仅限于具有分类输出的任务。在这项工作中,我们研究开放式生成任务(如代码合成和文本摘要)中的自洽性。我们假设一致性可以理解为生成空间的几何属性,其中语义兼容的生成在表示空间的相似区域中集中。为了研究这一假设,我们引入了基于嵌入的协议(EBA),这是一种简单的无需训练的操作方法,通过在嵌入空间中对采样生成进行聚类来估计一致性。通过在数学推理、代码生成和摘要上的实验,我们表明表示空间中的一致性为开放式任务提供了鲁棒且可扩展的自洽性信号。特别是,EBA 始终优于随机选择,并且比最近基于大语言模型评估或不确定性估计的选择方法表现出更稳定的扩展行为。我们进一步表明,这些一致性信号在不同模型家族和嵌入空间中保持稳定,即使使用原生隐藏表示也是如此。最后,我们的分析表明,采样生成所占据的几何位置与生成质量强相关:集中在表示空间中心区域附近的生成往往对应于更可靠的输出,而外围生成则显著不准确。总体而言,我们的研究结果支持将自洽性视为采样生成的几何组织属性,而非精确符号重叠。

英文摘要

Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs. In this work, we study self-consistency in open-ended generation tasks such as code synthesis and text summarization. We hypothesize that consistency can be understood as a geometric property of the generation space, where semantically compatible generations concentrate in similar regions of representation space. To study this hypothesis, we introduce Embedding-Based Agreement (EBA), a simple training-free operationalization that estimates agreement by clustering sampled generations in embedding space. Through experiments on mathematical reasoning, code generation, and summarization, we show that agreement in representation space provides a robust and scalable signal of self-consistency for open-ended tasks. In particular, EBA consistently outperforms random selection and exhibits more stable scaling behavior than recent selection approaches based on LLM evaluation or uncertainty estimation. We further show that these agreement signals remain stable across model families and embedding spaces, even with native hidden representations. Finally, our analysis shows that the geometric location occupied by sampled generations is strongly correlated with generation quality: generations concentrated near central regions of representation space tend to correspond to more reliable outputs, whereas peripheral generations are substantially less accurate. Overall, our findings support viewing self-consistency as a property of the geometric organization of sampled generations rather than exact symbolic overlap.
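
The geometric intuition, that generations near the center of the sampled set tend to be more reliable, can be sketched without any clustering library. The code below is a simplified stand-in for EBA, not the method itself: instead of clustering, it selects the sample with the highest mean cosine similarity to the others. The embeddings are hypothetical, and the function assumes at least two non-zero vectors.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_central(embeddings: List[List[float]]) -> int:
    """Index of the generation whose embedding agrees most with the others
    (ties resolve to the earliest index)."""
    n = len(embeddings)

    def agreement(i: int) -> float:
        return sum(cosine(embeddings[i], embeddings[j])
                   for j in range(n) if j != i) / (n - 1)

    return max(range(n), key=agreement)


# Hypothetical 2-D embeddings of four sampled generations; the last is an outlier.
samples = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
choice = most_central(samples)
```

Unlike exact-match voting, this selection rule needs no canonical answer format, which is why an embedding-based signal extends to code and summaries.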

2606.12273 2026-06-11 cs.CL 新提交

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

超越完全随机掩码:扩散语言模型的注意力引导去噪与优化

Jia Deng, Junyi Li, Wayne Xin Zhao, Jinpeng Wang, Hongyu Lu, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) Meituan(美团) WeChat, Tencent(腾讯微信) Beijing Key Laboratory of Research on Large Models and Intelligent Governance(大型模型与智能治理北京市重点实验室)

AI总结 提出AGDO框架,利用注意力结构指导去噪顺序并强化关键令牌,在数学和编码基准上提升扩散语言模型的推理性能。

详情
Comments
13 pages. Accepted to ACL 2026 Main Conference
AI中文摘要

扩散大语言模型(dLLMs)通过并行解码提供了自回归模型的高效替代方案,然而现有的后训练方法大多依赖随机掩码策略,忽略了内在的令牌依赖关系。在这项工作中,我们对dLLMs中的注意力进行了实证分析,表明对未掩码上下文关注更强的令牌表现出更高的生成稳定性,并在推理中发挥关键作用。受这些发现启发,我们提出了AGDO,一种注意力引导的去噪与优化框架,将训练和优化与注意力导出的依赖关系对齐。AGDO基于注意力结构确定去噪顺序,并在监督微调和强化学习过程中强调注意力关键令牌。在数学和编码基准上的实验表明,AGDO持续提升推理性能,优于dLLMs的最先进后训练方法。

英文摘要

Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.
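
One way to picture an attention-derived denoising order is to rank masked positions by how much attention they place on already-unmasked context. The sketch below assumes that positions with more such attention are revealed first; the abstract does not state the direction or the exact scoring, so both are assumptions, as are the function name and the toy attention matrix.

```python
from typing import List


def denoising_order(attn: List[List[float]], masked: List[int],
                    unmasked: List[int]) -> List[int]:
    """Sort masked positions by total attention paid to unmasked positions,
    highest first. attn[i][j] is the weight from position i to position j."""
    def context_mass(i: int) -> float:
        return sum(attn[i][j] for j in unmasked)

    return sorted(masked, key=context_mass, reverse=True)


# Hypothetical 3-token example: position 0 is visible, positions 1 and 2 are masked.
attn = [
    [1.0, 0.0, 0.0],
    [0.2, 0.5, 0.3],
    [0.7, 0.1, 0.2],
]
order = denoising_order(attn, masked=[1, 2], unmasked=[0])
```

Position 2 attends mostly to the visible token, so under the stated assumption it is decoded before position 1.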

2601.04203 2026-06-11 cs.CL cs.CV cs.LG cs.SE 版本更新

FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

FronTalk: 以多模态反馈进行对话式代码生成的前端开发基准测试

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

AI总结 提出FronTalk基准,通过多轮对话和多模态反馈(文本与视觉指令)评估前端代码生成,发现模型存在遗忘和视觉反馈理解困难,提出AceCoder方法有效减少遗忘并提升性能。

详情
AI中文摘要

我们提出了FronTalk,一个前端代码生成基准,开创性地研究了一种独特的交互动态:具有多模态反馈的对话式代码生成。在前端开发中,草图、模型和带注释的截图等视觉工件对于传达设计意图至关重要,但它们在多轮代码生成中的作用仍未得到充分探索。为解决这一差距,我们聚焦于前端开发任务,整理了FronTalk,这是一个包含100个多轮对话的数据集,这些对话源自新闻、金融和艺术等不同领域的真实网站。每一轮都包含一个文本指令和一个等效的视觉指令,每个指令代表相同的用户意图。为全面评估模型性能,我们提出了一种新颖的基于智能体的评估框架,利用网络智能体模拟用户并探索网站,从而衡量功能正确性和用户体验。对20个模型的评估揭示了文献中系统性地未充分探索的两个关键挑战:(1)显著的遗忘问题,即模型覆盖先前实现的功能,导致任务失败;(2)解释视觉反馈的持续挑战,尤其是对于开源视觉语言模型(VLM)。我们提出了一个强大的基线来解决遗忘问题,即AceCoder,一种使用自主网络智能体批评每个过去指令实现的方法。这种方法将遗忘几乎减少到零,并将性能提升高达9.3%(从56.0%到65.3%)。总体而言,我们旨在为前端开发和多轮多模态代码生成的通用交互动态的未来研究提供坚实基础。代码和数据已在此https URL发布。

英文摘要

We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework that uses a web agent to simulate users and explore the website, thereby measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that have not been systematically explored in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to nearly zero and improves performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL.

6. 语义、语法与语言学分析 5 篇

2606.11222 2026-06-11 cs.CL cs.IT 新提交

A Geometric Profile of Semantic Information in Text: Frame-Conditional Uniqueness and a Trade-Off Triangle for Scalar Summaries

文本中语义信息的几何轮廓:帧条件唯一性与标量摘要的权衡三角形

Dmitriy Kompaneets

发表机构 * Independent Researcher(独立研究员)

AI总结 提出一个几何框架,通过句子嵌入的结构测量文本语义内容,包括三个坐标(新颖性、广度、整合性),并证明任何标量摘要都无法同时满足分析稳定性、序数鲁棒性和跨表示可比性。

详情
Comments
19 pages. Code and data: this https URL
AI中文摘要

一段文本承载了多少意义?香农的理论衡量符号上的不确定性,并有意忽略意义,而诸如BERTScore的成对度量比较两段文本而非表征单段文本。我们开发了一个几何框架,从文本句子嵌入的结构中测量语义内容。该框架包含三个部分。首先,在固定的嵌入和基线内,六个自然公理唯一确定一个标量度量(尺度可调),即帧条件唯一性定理。得到的标量在经验上过于粗糙,这促使我们寻求更丰富的表征。其次,我们提出一个三坐标语义轮廓,捕捉新颖性(与通用话语的偏离)、广度(不同思想的多样性)和整合性(它们之间的连通性),以及一个离散的最小单元(语义量子),其分辨率由聚类阈值$\tau$固定。第三,我们证明了一个不可能定理:轮廓的任何标量摘要都不能同时满足在释义和拼接下的分析稳定性、跨文本尺度的序数鲁棒性以及跨表征的可比性。我们展示了两个实用标量$S_{\mathrm{minmax}}$和$S_{\mathrm{rank}}$,每个占据这个权衡三角形的不同角落。在23个合成类别、5本Project Gutenberg小说和3个嵌入模型上的验证确认了该权衡。推荐的秩归一化配置在28个序数检验中通过25个(Benjamini-Hochberg校正后通过21个),优于包括单字熵和基于BERTScore的新颖性信号在内的七个基线。一个独立的变分结果将广度坐标与行列式点过程的对数行列式联系起来(在507个Gutenberg章节上Spearman $\rho = 0.985$),为广度提供了优化理论基础。

英文摘要

How much meaning does a text carry? Shannon's theory measures uncertainty over symbols and is intentionally indifferent to meaning, while pairwise metrics such as BERTScore compare two texts rather than characterizing one. We develop a geometric framework that measures semantic content from the structure of a text's sentence embeddings. The framework has three parts. First, within a fixed embedding and baseline, six natural axioms uniquely determine a scalar measure up to scale, a frame-conditional uniqueness theorem. The resulting scalar is empirically too coarse, motivating a richer representation. Second, we propose a three-coordinate semantic profile capturing novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit (the semantic quantum) whose resolution is fixed by a clustering threshold $\tau$. Third, we prove a no-go theorem: no scalar summary of the profile can simultaneously satisfy analytic stability under paraphrase and concatenation, ordinal robustness across text scales, and cross-representation comparability. We exhibit two practical scalars, $S_{\mathrm{minmax}}$ and $S_{\mathrm{rank}}$, each occupying a distinct corner of this trade-off triangle. Validation across 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models confirms the trade-off. The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction), outperforming seven baselines including unigram entropy and a BERTScore-based novelty signal. A separate variational result connects the breadth coordinate to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters), giving an optimization-theoretic foundation for breadth.

2606.11371 2026-06-11 cs.CL cs.AI eess.AS eess.SP 新提交

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

人类与AI生成语言的动态:语义如何在不同时间尺度上波动

Han-Jen Chang, Yasir Çatal, Angelika Wolman, Agustín Ibáñez, David Smith, I-Wen Su, Kai-Yuan Cheng, Georg Northoff

AI总结 提出语义时间尺度分析流程,通过自相关窗口度量(ACW-0)量化人类与AI生成语音中语义特异性与上下文相似性的时间组织,发现ACW-0长度与词汇通用性相关,且该关联在随机化后被削弱。

详情
Comments
45 pages, 4 figures, 4 tables. Accepted manuscript; published in Computer Speech & Language
AI中文摘要

口语,无论是人类还是大型语言模型(LLM)产生的,都会随时间展开,具有变化的语义内容。然而,我们仍然缺乏简单、可解释的时间序列特征来捕捉通用与特定内容如何随时间分布,并可用于比较人类和AI生成的语音。我们引入了一个语义时间尺度分析流程,将带有时间戳的词级转录转换为语义时间序列。对于每个口语叙述,我们计算(i)基于WordNet词深度的语义特异性,以及(ii)基于SBERT嵌入的上下文相似性,并使用自相关窗口度量(ACW-0及相关指标)量化其时间依赖性。然后,我们将原始语音与多种随机化对照进行比较,这些对照选择性地破坏词汇身份、时间顺序和词时长。在人类朗读的自传叙述、TTS朗读和LLM生成的文本(通过TTS渲染)中,我们发现语义时间序列中ACW-0较长的片段往往包含更多通用词汇,而ACW-0较短的片段则富含更具体的词汇。当词序和计时被随机化时,这些关联被强烈削弱或消除,表明基于ACW的度量捕捉了语义内容超越静态词汇分布的非平凡时间组织。我们的结果表明,基于ACW的语义时间尺度是分析和比较人类与AI生成语音时间结构的有用特征系列。

英文摘要

Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings, and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
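As a rough illustration of the autocorrelation-window idea, the sketch below computes ACW-0 as the first lag at which the sample autocorrelation of a series drops to zero or below. This is a common definition of ACW-0, assumed here rather than taken from the paper; the function names and the `dt` sampling-step parameter are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        raise ValueError("constant series has undefined autocorrelation")
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


def acw0(series, dt=1.0):
    """First lag (scaled by dt) where autocorrelation is <= 0, else None."""
    for lag in range(1, len(series)):
        if autocorrelation(series, lag) <= 0.0:
            return lag * dt
    return None
```

A slowly drifting series yields a longer ACW-0 than a rapidly alternating one, which is the kind of contrast the abstract describes between generic and specific segments.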

2606.11531 2026-06-11 cs.CL cs.IT 新提交

Measuring language complexity from hierarchical reuse of recurring patterns

从重复模式的层次复用测量语言复杂度

Junyi Zhou, Rui Liu, Pengyu Liu, Yu Liu

发表机构 * Department of Systems Science, Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院系统科学系) International Academic Center of Complex Systems, Beijing Normal University(北京师范大学国际复杂系统学术中心) Department of Chinese Language and Literature, Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院中国语言文学系) Center for Linguistic Sciences, Beijing Normal University(北京师范大学语言学科学中心) School of Systems Science, Beijing Normal University(北京师范大学系统科学学院) Department of Mathematics and Applied Mathematical Sciences, University of Rhode Island(罗德岛大学数学与应用数学科学系) Department of Cell and Molecular Biology, University of Rhode Island(罗德岛大学细胞与分子生物学系)

AI总结 提出基于算法信息论的梯径指数,通过层次复用重复子结构测量语言复杂度,在21个平行语料库中验证了等复杂度假说和权衡假说。

详情
Comments
17 pages, 4 figures
AI中文摘要

我们引入梯径指数作为基于算法信息论的语言复杂度度量。它通过层次复用重复子结构来重建序列所需的最小步骤数,捕捉了一种可精确计算但受约束的算法可压缩性形式,与Kolmogorov复杂度相关但不同。我们将梯径方法应用于Parallel Universal Dependencies数据集中的21个平行语料库。梯径指数在不同语言间近似不变,且变化远小于语料库长度。当所有语料库映射到统一的二进制表示时,这一现象更为明显,从表示无关的角度为等复杂度假说提供了证据。我们还观察到字符库存大小与语料库长度之间的权衡,以及词汇级和语料库级重建复杂度之间的权衡,支持了总复杂度守恒并在语言层次间重新分布的权衡假说。梯径方法识别出的可重用子结构(无需任何语言输入)与自然词汇中存在的单词和形态成分重叠。梯径方法捕获的层次复用与认知科学中提出的组块机制相似,即人类认知系统在共享记忆和处理约束下将语言输入压缩为嵌套的、可重用的单元。认知组块与梯径方法之间的这种联系为等复杂度假说和权衡假说提供了新的解释,将两者都根植于支撑所有人类语言处理的共享认知架构中。

英文摘要

We introduce the ladderpath index as a measure of language complexity grounded in algorithmic information theory. It counts the minimum steps needed to reconstruct a sequence through hierarchical reuse of repeated substructures, capturing an exactly computable but constrained form of algorithmic compressibility related to, but distinct from, Kolmogorov complexity. We apply the ladderpath approach to 21 parallel corpora from the Parallel Universal Dependencies dataset. The ladderpath index is approximately invariant across the languages, and varies much less than the corpus length. This is more pronounced when all corpora are mapped to a unified binary representation, providing evidence for the equi-complexity hypothesis from a representation-independent perspective. We also observe trade-offs between character inventory size and corpus length, and between vocabulary-level and corpus-level reconstruction complexity, supporting the trade-off hypothesis that total complexity is conserved and redistributed across linguistic levels. The reusable substructures identified by the ladderpath approach, without any linguistic input, overlap with words and morphological components attested in the natural vocabulary. The hierarchical reuse captured by the ladderpath approach parallels the chunking mechanisms proposed in cognitive science, where the human cognitive system compresses linguistic input into nested, reusable units under shared memory and processing constraints. This connection between cognitive chunking and the ladderpath approach provides a new interpretation for the equi-complexity and trade-off hypotheses, grounding both in the shared cognitive architecture that underlies language processing across human languages.

2601.00181 2026-06-11 cs.CL cs.AI 版本更新

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

对话中的因果情绪识别:上下文饱和与话语标记证据

Cheonkam Jeong, Adeline Nyamathi

AI总结 通过系统消融实验发现对话上下文对情绪识别性能起主导作用但快速饱和,并揭示悲伤情绪与左边缘话语标记使用减少及更高上下文依赖性的关联。

详情
AI中文摘要

我们解决了对话情绪识别中两个长期存在的空白:哪些建模选择实质性地影响性能,以及识别结果如何与可解释的话语层面模式相关联。我们通过在IEMOCAP上进行系统研究并在MELD上进行跨数据集验证来研究这两个问题。对于识别,我们使用10个随机种子进行受控消融实验,并进行多重比较校正的配对显著性检验,得到三个发现。首先,对话上下文是主导因素,但性能快速饱和:大约90%的性能提升来自最近的前10-30轮对话,具体取决于标签集。其次,层级句子表示在仅话语设置中帮助最大,并在MELD上显示出明显优势,但一旦轮次级别的上下文可用,其益处消失,表明对话历史吸收了大量话语内部结构。第三,整合外部情感词典不会改善结果,这与预训练编码器已经捕获ERC所需的大部分情感信号一致。在严格因果设置下,我们的简单模型实现了强性能(4-way 82.69%;6-way加权F1 67.07%),表明无需未来轮次即可达到竞争性准确率。对于语言分析,我们检查了5,286个话语标记出现,发现情绪与标记位置之间存在可靠关联(p <.0001)。悲伤话语的左边缘标记使用率(21.9%)低于其他情绪(28-32%),这与左边缘标记与主动话语管理相关的观点一致。这与我们的识别结果一致,其中悲伤从对话上下文中获益最多(+22个百分点),表明悲伤可能比具有更强局部语用线索的情绪更依赖于上下文。

英文摘要

We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10-30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (82.69% 4-way; 67.07% 6-way weighted F1), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p <.0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting sadness may be more context-dependent than emotions with stronger local pragmatic cues.

2506.20040 2026-06-11 cs.LG cs.AI cs.CL 版本更新

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

跨层离散概念发现用于解释语言模型

Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

AI总结 提出跨层向量量化变分自编码器(CLVQ-VAE),通过离散向量量化瓶颈将残差流中的重复特征压缩为紧凑可解释的概念向量,在三个数据集上优于聚类、单层VQ-VAE和稀疏自编码器基线。

详情
AI中文摘要

由于残差流的存在,解释语言模型仍然具有挑战性,残差流在相邻层之间线性混合和复制特征,导致单层分析忽略这种跨层结构。跨层稀疏自编码器(SAE)解决了层混合问题,但在连续空间中操作,概念分散在许多神经元上,没有清晰的边界。我们引入了跨层向量量化变分自编码器(CLVQ-VAE),这是一种新颖的框架,通过离散向量量化瓶颈将较低层的表示映射到较高层,将重复的残差流特征压缩为紧凑、可解释的概念向量。我们的方法结合了基于top-k温度的采样和指数移动平均(EMA)码本更新,在保持码本多样性的同时,对离散潜在空间进行受控探索。在基于编码器和解码器的模型上,针对ERASER-Movie、Jigsaw和AGNews数据集,CLVQ-VAE在三个评估轴上优于聚类、单层向量量化变分自编码器(VQ-VAE)和稀疏自编码器(SAE)基线:移除识别出的概念使模型准确率下降高达93%,LLM评判员在66.7%的比较中将我们的概念排在首位,人类标注者从我们的可视化中恢复模型预测的准确率为78%,而聚类为54%。

英文摘要

Interpreting language models remains challenging because the residual stream linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and sparse autoencoder (SAE) baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.
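The two codebook mechanisms named in this abstract, top-k temperature-based sampling and EMA codebook updates, can be illustrated with a toy sketch. This is a simplified stand-in under stated assumptions, not the CLVQ-VAE implementation: the softmax over negative squared distances, the single-code EMA rule, and all function names and default values are hypothetical.

```python
import math
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sample_code(x, codebook, k=2, temperature=1.0, rng=random):
    """Sample a code index among the k nearest codes.

    Sampling weights are a softmax over -distance / temperature (assumed form).
    """
    dists = [squared_distance(x, c) for c in codebook]
    top = sorted(range(len(codebook)), key=lambda i: dists[i])[:k]
    logits = [-dists[i] / temperature for i in top]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(top, weights=weights, k=1)[0]


def ema_update(code, assigned, decay=0.99):
    """Move one code toward the mean of the vectors assigned to it."""
    if not assigned:
        return code  # leave unused codes unchanged
    dim = len(code)
    mean = [sum(v[i] for v in assigned) / len(assigned) for i in range(dim)]
    return [decay * c + (1.0 - decay) * m for c, m in zip(code, mean)]
```

A low temperature makes the choice nearly deterministic (the nearest code), while a higher temperature spreads assignments over the top-k codes, which is one plausible way to keep more of the codebook in use.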

7. 多模态语言处理 8 篇

2606.11420 2026-06-11 cs.CL cs.SI 新提交

Context-Aware Multimodal Claim Verification in Spoken Dialogues

口语对话中的上下文感知多模态声明验证

Chaewan Chun, Delvin Ce Zhang, Dongwon Lee

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) University of Sheffield(谢菲尔德大学)

AI总结 提出MAD2基准和上下文感知多模态融合方法,验证对话音频中的声明,发现对话结构比虚假信息框架对验证更重要。

详情
AI中文摘要

每天,数百万人从播客和流媒体中吸收声明,而这些声明从未被事实核查员看到。口语错误信息是通过对话构建的,其中可信度不仅来自事实本身,还来自声明如何在对话轮次中被构建、强化或未被质疑。然而,事实核查一直专注于孤立的文本,对话音频研究不足。我们引入了MAD2,一个新的用于口语声明验证的多轮音频对话基准,包含1,000个双说话者对话,3,368个值得核查的声明和约10小时的音频,并提出了上下文感知音频编码器和对话感知文本模型的校准多模态融合。在各种设置下,添加对话上下文改善了验证,但收益取决于场景类型。仅使用前文上下文通常与离线性能相当,支持实时审核设置,而当基于转录的模型因额外上下文而变得不稳定时,音频贡献最大。总体而言,对话结构对验证的影响比错误信息框架更大。

英文摘要

Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.

2606.11906 2026-06-11 cs.CL 新提交

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

语言何时重要?多语言指令揭示视觉-语言-动作模型中的逐步语言敏感性

Xuan Dong, Zhe Han, Tianhao Niu, Qingfu Zhu, Wanxiang Che

发表机构 * Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本研究通过将LIBERO基准翻译成十种语言,首次系统评估了VLA模型的多语言鲁棒性,发现非英语指令下成功率下降30-50%,并基于步骤级语言敏感性提出推理时对齐干预,显著提升性能。

详情
Comments
Accepted to ACL 2026 Main Conference
AI中文摘要

视觉-语言-动作(VLA)模型在语言条件机器人操作中表现出强大性能,但其对语言变化的鲁棒性仍知之甚少。在这项工作中,我们通过将LIBERO基准翻译成十种语言,首次对VLA模型进行了系统的多语言评估,揭示了在非英语指令下性能严重下降,成功率下降30-50%。通过对任务执行的细粒度分析,我们发现语言影响在步骤间高度不均匀:某些步骤表现出强烈的语言依赖性并主导整体任务失败,而其他步骤则基本与语言无关。基于这一见解,我们提出了一种逐步推理时干预方法,根据步骤语言敏感性对齐表示,显著提高了语言变化下的性能。我们的结果表明,VLA模型中的语言鲁棒性本质上是一个逐步控制问题,突出了时间结构化分析对于可靠具身智能体的重要性。

英文摘要

Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translating the LIBERO benchmark into ten languages, revealing severe performance degradation under non-English instructions, with success rates dropping by 30-50%. Through fine-grained analysis of task executions, we find that language influence is highly non-uniform across steps: certain steps exhibit strong language dependence and dominate overall task failure, while others are largely language-agnostic. Based on this insight, we propose a step-wise inference-time intervention that aligns representations according to step language sensitivity, substantially improving performance under linguistic variation. Our results indicate that language robustness in VLA models is fundamentally a step-wise control problem, highlighting the importance of temporally structured analysis for reliable embodied agents.

2606.11740 2026-06-11 cs.CV cs.CL 交叉投稿

UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

UniReason-Med: 用于医学VQA中二维到三维迁移的共享基础推理接口

Mengzhuo Chen, Yan Shu, Chi Liu, Hongming Piao, Xidong Wang, Derek Li, Bryan Dai

发表机构 * IQuest Research

AI总结 提出UniReason-Med框架,通过共享基础推理接口从2D医学图像向3D医学VQA迁移推理能力,结合监督微调和强化学习,显著提升3D推理性能。

详情
AI中文摘要

我们研究了当两种输入类型通过共同的推理接口对齐时,来自丰富2D医学图像的基础推理监督是否能够改善3D医学VQA。我们引入了UniReason-Med,一个单一检查点框架,在推理时处理2D图像或切片序列化的3D体积,通过共享框语法、区域标记注入和共同的基础推理策略生成交错文本推理和局部视觉证据。为了训练这个接口,我们构建了UniMed-CoT,一个包含220K样本的指令微调数据集,具有交错的文本推理和基础视觉证据,包括170K 2D和50K 3D样本。通过监督微调后接结果级强化学习,UniReason-Med学会生成基础推理轨迹,而在强化学习期间无需基于IoU/Dice的定位奖励。数据混合和组件消融实验表明,联合2D+3D基础监督相比仅3D训练显著改善了3D推理,而基础化和区域标记注入对2D和3D任务都有持续益处。这些结果表明,共享的基础推理接口可以将推理结构从2D图像迁移到切片序列化的体积医学理解。代码和数据公开于 this https URL。

英文摘要

We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at this https URL.

2606.11792 2026-06-11 cs.CV cs.AI cs.CL 交叉投稿

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP:学习修补视觉令牌以减轻视频大型多模态模型中的幻觉

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) Sun Yat-sen University(中山大学) East China Normal University(华东师范大学)

AI总结 提出MultiToP框架,通过轻量级视觉令牌修补器动态替换不可靠视觉令牌,结合信息引导排名校准和稀疏正则化,在不修改原模型情况下减少视频多模态模型幻觉,显著提升F1分数和问答准确率。

详情
Comments
Preprint
AI中文摘要

视频大型多模态模型在视频理解方面取得了显著进展,但仍容易产生幻觉,即生成的响应未能忠实于输入视频。在本文中,我们提出MultiToP,一种多模态上下文感知的视觉令牌修补框架,通过在语言生成之前优化不可靠的视觉令牌来减轻幻觉。MultiToP引入了一个轻量级的视觉令牌修补器,用于预测令牌级替换分布,并选择性地用动态全局修补令牌替换不可靠的视觉令牌。为了有效训练修补器,我们进一步提出了信息引导的排名校准,利用从主干网络派生的答案条件帧级信息线索来指导令牌替换。结合真实答案监督和稀疏正则化,MultiToP实现了局部视觉证据优化,而无需修改原始模型。大量实验表明,MultiToP在Vript-HAL上有效减少了幻觉,且推理开销可忽略不计,将Qwen3-VL-4B-Instruct的F1分数相比原始模型提高了50.60%。同时,MultiToP保持了通用的视频理解能力,在ActivityNet-QA上为Video-LLaVA-7B带来了18.58%的相对准确率提升。

英文摘要

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

2606.12295 2026-06-11 cs.CV cs.CL cs.IR 交叉投稿

Findings of the MAGMaR 2026 Shared Task

MAGMaR 2026 共享任务结果

Alexander Martin, Dengjia Zhang, Joel Brogan, Francis Ferraro, Jeremy Gwinnup, Reno Kriz, Teng Long, Kenton Murray, Andrew Yates, Xiang Xiang

发表机构 * Johns Hopkins University(约翰霍普金斯大学) OpenAI University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) Air Force Research Laboratory(空军研究实验室) Human Language Technology Center of Excellence, Johns Hopkins University(约翰霍普金斯大学人类语言技术卓越中心) University of Amsterdam(阿姆斯特丹大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 本文介绍MAGMaR 2026共享任务的结果,包括视频检索和基于检索视频的生成任务,所有提交系统均超越去年基线。

详情
Comments
Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); Resources at this url: this https URL
AI中文摘要

本概述论文介绍了第二届多模态检索增强生成(MAGMaR)研讨会的共享任务结果。在该共享任务中,参与者提交的系统专注于(i)视频检索或(ii)基于检索到的视频进行文章的接地生成。团队可以提交到任一任务。对于检索任务,我们有2个参与团队提交了总共17个系统——所有这些系统都击败了基于去年共享任务获胜者得出的基线。在生成方面,我们有4个团队提交了16个系统。所有团队至少有一个生成的报告被人类标注者评为最佳。

英文摘要

This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems -- all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.

2606.11074 2026-06-11 cs.CL cs.AI 版本更新

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

建模复杂行为:视觉语言模型中的多人格组合与动态切换

Peiqi Jia, Haonan Jia, Ziqi Miao, Linkang Du, Yuntao Wang, Zhou Su

发表机构 * Xi'an Jiaotong University(西安交通大学) Beihang University(北京航空航天大学)

AI总结 本研究在视觉语言模型中引入显式人格条件,建立包括单人格、多人格和人格切换的系统评估框架,发现人格提示可提升图像描述但损害精确推理任务,并观察到多特质组合与动态切换中的平衡与残留效应。

详情
Comments
16 pages, 4 figures, 10 tables
AI中文摘要

随着多模态大语言模型(MLLMs)在社交互动中的广泛部署,理解和控制其在复杂人格条件下的行为至关重要。本文引入显式人格条件,并建立了一个系统的评估框架,涵盖单人格诱导、多人格诱导和人格切换。实验表明,人格诱导能提升图像描述性能,但会损害需要精确推理的任务(如视觉问答)的性能。在多特质组合和动态切换过程中观察到平衡和残留效应,表明模型行为受到先前和当前人格约束的共同调节。现有的基于提示的人格诱导方法在多模态设置中表现出有限的迁移性。我们的工作揭示了MLLMs中人格建模的动态和复杂性质,并强调了针对人格诱导和评估的鲁棒、定制化方法的必要性。代码将在论文被接收后发布。

英文摘要

With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.

2509.14860 2026-06-11 cs.CV cs.AI cs.CL cs.MA 版本更新

MARIC: Multi-Agent Reasoning for Image Classification

MARIC:用于图像分类的多智能体推理

Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee

AI总结 提出多智能体框架MARIC,通过分解图像分类为协作推理过程,利用大纲智能体、方面智能体和推理智能体进行多视角分析与综合,在四个基准数据集上显著优于基线方法。

详情
Comments
11 pages, preprint
AI中文摘要

图像分类传统上依赖于参数密集型模型训练,需要大规模标注数据集和大量微调才能达到有竞争力的性能。虽然最近的视觉语言模型(VLM)缓解了其中一些限制,但它们仍然受限于对单次表示的依赖,往往无法捕捉视觉内容的互补方面。在本文中,我们介绍了基于多智能体的图像分类推理(MARIC),这是一个多智能体框架,将图像分类重新表述为协作推理过程。MARIC首先利用大纲智能体分析图像的全局主题并生成有针对性的提示。基于这些提示,三个方面智能体沿着不同的视觉维度提取细粒度描述。最后,推理智能体通过集成反思步骤综合这些互补输出,产生用于分类的统一表示。通过明确地将任务分解为多个视角并鼓励反思性综合,MARIC减轻了参数繁重训练和单一VLM推理的缺点。在4个不同的图像分类基准数据集上的实验表明,MARIC显著优于基线,突出了多智能体视觉推理在鲁棒且可解释的图像分类中的有效性。

英文摘要

Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine-tuning to achieve competitive performance. While recent vision language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single-pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi Agent based Reasoning for Image Classification (MARIC), a multi-agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine-grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through an integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.

2606.04351 2026-06-11 cs.CV cs.CL 版本更新

Frames2LoRA: Parametric Video Internalization for Vision-Language Models

Frames2LoRA: 视觉-语言模型的参数化视频内化

Manan Suri, Sarvesh Baskar, Dinesh Manocha

AI总结 提出Frames2LoRA方法,通过感知器超网络从视频编码中直接生成LoRA适配器,实现零视觉令牌的视频查询,在保持性能的同时大幅降低计算成本。

详情
Comments
this https URL
AI中文摘要

在视觉-语言模型中处理视频成本高昂:每帧占用数百个令牌,推理成本随每帧和每次重复查询而增加。我们引入Frames2LoRA,一种参数化视频内化方法。感知器超网络逐层读取冻结VLM编码视频时产生的中间表示,并在单次前向传播中生成低秩适配(LoRA)适配器。与需要迭代梯度更新的标准LoRA微调不同,Frames2LoRA直接从视频预测这些权重。在SmolVLM2 500M和2.2B上针对视频摘要和描述进行训练后,Frames2LoRA使得相同的冻结VLM能够仅通过适配器回答查询,在查询时上下文中零视觉令牌。Frames2LoRA在两种模型规模的所有五个描述基准测试中,以及在八个视频问答基准测试-规模配对中的七个上,统计上非劣效且等同于直接视频上下文推理。尽管仅在12帧384px上训练,它在高达1024帧和1024px时仍保持稳定,而直接视频上下文推理通常会退化。在此扫描中,它将回答时的视觉令牌负载减少高达1500倍,查询TTFT减少6-80倍,同时保持视频忠实输出。我们还发现,为非重叠视频段独立生成的适配器可以在秩空间中组合,这为分块长视频内化提供了一条路径。

英文摘要

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Frames2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Frames2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Frames2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Frames2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
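One way to read "compose in rank space" is through the algebra of low-rank updates: a LoRA update is a product B·A, and concatenating two adapters along the rank dimension yields the sum of their individual updates. The sketch below checks that identity on tiny matrices. Whether Frames2LoRA composes adapters exactly this way is an assumption; the helper names and toy values are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def compose(adapter_1, adapter_2):
    """Concatenate two (B, A) adapters along the rank dimension.

    B has shape (d_out, r) and A has shape (r, d_in); the result has
    rank r1 + r2 and satisfies B @ A == B1 @ A1 + B2 @ A2.
    """
    b1, a1 = adapter_1
    b2, a2 = adapter_2
    b = [row1 + row2 for row1, row2 in zip(b1, b2)]  # concatenate columns of B
    a = a1 + a2                                      # stack rows of A
    return b, a
```

Under this reading, adapters generated for separate video segments could be merged without retraining, at the cost of a rank that grows with the number of segments.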

8. 语音语言联合与音频文本 14 篇

2606.11219 2026-06-11 cs.CL cs.AI cs.SD 新提交

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Afrispeech Semantics: 评估跨领域和口音的口语语言模型中的音频语义推理

Chibuzor Okocha, Christan Grant

发表机构 * University of Florida(佛罗里达大学)

AI总结 提出五项语义与副语言推理任务(蕴含、一致性、合理性、口音漂移、口音约束),评估音频语言模型在口音变化、领域迁移和语义过度推断下的推理能力,揭示当前评估的局限性。

详情
Comments
Accepted to ACL
AI中文摘要

音频语言模型(ALMs)越来越多地用于基于语音的理解,但它们在转录、文本到音频检索、字幕生成和问答准确性之外的语义推理能力仍未得到充分基准测试。特别是,口音变化、领域迁移和语义过度推断对音频推理的影响尚不清楚。我们评估了音频语言模型在五项语义和副语言推理任务上的表现:蕴含、一致性、合理性、口音漂移和口音约束。这些任务共同评估模型以口语音频作为主要证据来源进行推理的能力,包括文本假设是否可以从音频中推断、矛盾或无法确定,陈述是否与口语内容一致或冲突,给定话语的声明是否合理,以及模型预测在口音变化下是否保持稳定或适当约束。这些发现凸显了当前音频推理评估的关键局限性,并希望为更稳健和公平的ALM设计与评估提供指导。

英文摘要

Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, text-to-audio retrieval, captioning, and question-answering accuracy remains insufficiently benchmarked. In particular, the effects of accent variation, domain shift, and semantic over-inference on audio reasoning are poorly understood. We evaluate audio language models across five semantic and paralinguistic reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Collectively, these tasks assess a model's ability to reason over spoken audio as the primary evidence source, including whether a textual hypothesis can be inferred, contradicted, or left undetermined by the audio, whether statements align or conflict with spoken content, whether claims are plausible given the discourse, and whether model predictions remain stable or appropriately constrained across accent variation. These findings highlight critical limitations in current audio reasoning evaluations, and we hope they provide guidance for more robust and equitable ALM design and assessment.

2606.11386 2026-06-11 cs.CL cs.AI eess.AS 新提交

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

通过激活引导克服全双工口语语言模型中的状态惯性

Cheng-Kuang Chang, Kai-Wei Chang, Alexander H. Liu, James Glass

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)

AI总结 针对全双工口语模型在用户打断时响应延迟的问题,提出基于感知向量的激活引导方法,无需微调即可显著提升中断理解能力。

详情
AI中文摘要

全双工口语语言模型(FD-SLMs)通过允许模型同时听和说实现无缝语音交互,但其协调听与说的内部机制尚未充分探索。我们分析了FD-SLM隐藏表示中编码的预测行为,发现它们表现出特定流的预测模式:在听时,它们优先预测传入的用户流;而在说时,它们优先预测模型输出流。基于这一观察,我们表明FD-SLMs动态调节其内部预测焦点在两个状态之间:与模型输出生成一致的生成状态和与传入用户输入一致的感知状态。然而,这种调节可能滞后于对话上下文的突然变化。在用户打断期间,模型在过渡到感知状态之前短暂地偏向生成状态,导致其错过传入输入的开头。我们将这种延迟的内部过渡称为状态惯性。为了量化其下游影响,我们引入了零缓冲基准(ZBB),这是一个用于评估当用户语音突然开始时即时中断理解能力的诊断基准。我们使用响应正确性和初始词出现率(IWOR)来评估这一设置。最后,我们通过使用感知向量的激活引导来缓解状态惯性,这是一种无需训练且计算开销很小的干预措施。在多个最先进的FD-SLMs上,激活引导显著改善了中断处理;例如,在PersonaPlex上,它将正确性从28%提高到45%,将IWOR从40%提高到72%,而无需任何微调。

英文摘要

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
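Activation steering in general adds a fixed direction to a hidden state at inference time, with no weight updates. The sketch below shows that operation on plain lists. How the paper derives its perception vector is not stated in the abstract, so the mean-difference construction here is an assumption, and the function names and the `alpha` scale are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(perceptive_states, generative_states):
    """Assumed construction: mean perceptive state minus mean generative state."""
    p = mean_vector(perceptive_states)
    g = mean_vector(generative_states)
    return [a - b for a, b in zip(p, g)]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, vector)]
```

Because the intervention is a single vector addition per steered layer, it is consistent with the abstract's description of a training-free method with little extra compute.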

2606.11542 2026-06-11 cs.CL cs.AI 新提交

Pretrained self-supervised speech models can recognize unseen consonants

预训练自监督语音模型能够识别未见过的辅音

Chihiro Taguchi, Éric Le Ferrand, Hirosi Nakagawa, Hitomi Ono, Kanji Kato, Emily Prud'hommeaux, David Chiang

发表机构 * University of Notre Dame(圣母大学) University at Buffalo(纽约州立大学布法罗分校) Tokyo University of Foreign Studies(东京外国语大学) Reitaku University(丽泽大学) Boston College(波士顿学院)

AI总结 研究预训练自监督语音模型(Wav2Vec2、HuBERT)对Khoisan语言中罕见吸气辅音的识别能力,发现模型对吸气辅音的识别准确率高于非吸气辅音,表明自监督学习能泛化到稀有音素。

详情
Comments
6 pages, 3 figures, 3 tables, accepted at Interspeech 2026
AI中文摘要

现代预训练自监督自动语音识别模型在大规模音频数据上训练,将语音编码为上下文表示。然而,它们的训练数据严重偏向高资源语言,低资源语言数据很少,这引发了对类型学上不常见的语音声音(如主要出现在Khoisan语言中的吸气辅音)可能代表性不足的担忧。这引出了我们的核心研究问题:这些模型能否像识别其他语音声音一样准确地识别吸气辅音?为了解决这个问题,我们在两种富含吸气辅音的Khoisan语言(G|ui和West !Xoon)的数据上微调并比较了预训练自监督语音模型(Wav2Vec2和HuBERT)。我们的结果显示,微调后的模型一致地更准确地识别吸气辅音而非非吸气辅音,表明自监督学习能够泛化到包括稀有音素在内的人类语音声音。

英文摘要

Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages with little data from low-resource languages, raising concerns about the potential underrepresentation of typologically uncommon speech sounds such as click consonants primarily found in Khoisan languages. This leads to our central research question: Can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results reveal that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds including rare phonemes.

2606.11639 2026-06-11 cs.CL 新提交

Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

评估基于音素的自动语音识别系统中的偏差:对IPA转录模型的分析

Catherine Bao, Maneesha Rani Saha, Neal Patwari

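AI summary: Evaluates two open-source IPA-transcription ASR systems, WhisperIPA and ZIPA, across accents and languages using standard phoneme error rate and a soft phoneme error rate, and finds persistent performance disparities across gender, accent, ethnicity, and age groups.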

详情
AI中文摘要

自动语音识别(ASR)系统的普及增加了对种族、年龄、性别和口音等人口统计偏差的探索,这些偏差通常源于不平衡的训练数据。大多数研究集中在基于标准字素的ASR系统上,而对基于音素的系统(如生成国际音标(IPA)表示的模型)关注较少。随着ASR系统向多语言支持和低资源语言建模转变,基于IPA的层作为关键的、语言无关的基础。在本研究中,我们评估了两个最先进的开源ASR系统WhisperIPA和ZIPA的性能,它们生成跨不同口音和语言源的IPA转录。我们的评估包括现有的多语言语音语料库和人口统计注释的英语语料库。我们通过比较模型生成的IPA转录与字素到音素(G2P)系统,使用标准音素错误率(PER)和提出的软PER指标(容忍语言学上相似的音素替换)来衡量模型性能。我们的分析考察了性能在不同语言和人口统计群体(如性别、口音、种族和年龄)之间的变化,揭示了即使在考虑了可接受的音素变异后仍存在的持续差异。这些发现为偏差的潜在来源提供了见解,并为开发更包容和语言鲁棒的基于音素的ASR系统提供了信息。我们的代码和数据将公开发布给社区。

英文摘要

The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, often formed from imbalanced training data. Most of these studies focused on standard grapheme-based ASR systems with comparatively little emphasis on phoneme-based systems, such as models that produce International Phonetic Alphabet (IPA) representations. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate the performance of two state-of-the-art open-source ASR systems, WhisperIPA and ZIPA, that generate IPA transcriptions across diverse accents and language sources. Our evaluation includes existing multilingual speech corpora and demographically annotated English-language corpora. We measure model performance by comparing model-generated IPA transcriptions against grapheme-to-phoneme (G2P) systems using both standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and demographic groups such as gender, accent, ethnicity, and age, revealing persistent disparities even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.
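
To make the two metrics concrete: PER is conventionally the phoneme-level edit distance divided by the reference length, and a soft variant can charge less for substitutions between similar phonemes. The sketch below is only an illustration under stated assumptions, not the paper's definition of Soft PER: the `SIMILAR` set and the 0.5 substitution discount are hypothetical.

```python
def weighted_edit_distance(ref, hyp, sub_cost):
    """Edit distance between two phoneme lists with a custom substitution cost."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # deletion
                curr[j - 1] + 1.0,             # insertion
                prev[j - 1] + sub_cost(r, h),  # match or substitution
            ))
        prev = curr
    return prev[-1]

# Hypothetical pairs of phonemes treated as "linguistically similar".
SIMILAR = {frozenset(("ɾ", "t")), frozenset(("ɪ", "i"))}

def hard_cost(r, h):
    return 0.0 if r == h else 1.0

def soft_cost(r, h):
    if r == h:
        return 0.0
    # Assumed discount for similar phonemes; full cost otherwise.
    return 0.5 if frozenset((r, h)) in SIMILAR else 1.0

def per(ref, hyp, sub_cost=hard_cost):
    """Phoneme error rate: weighted edit distance over reference length."""
    return weighted_edit_distance(ref, hyp, sub_cost) / len(ref)

ref = ["b", "ʌ", "t", "ɚ"]
hyp = ["b", "ʌ", "ɾ", "ɚ"]
standard_per = per(ref, hyp)         # one full-cost substitution over four phonemes
soft_per = per(ref, hyp, soft_cost)  # the same substitution, discounted
```

Under these assumptions the flap-for-stop substitution counts fully toward standard PER but only half toward the soft score, which is the kind of tolerance the abstract describes.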

2606.11681 2026-06-11 cs.CL cs.SD 新提交

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

UR-BERT:通过通用罗马化和语音标记预测扩展大规模多语言TTS的文本编码器

Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang

发表机构 * Dept. of Electronics and Electrical Engineering, Yonsei University(延世大学电子与电气工程系)

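AI summary: Proposes UR-BERT, a TTS text encoder built on romanized transcriptions that unifies writing systems into a shared romanized representation and adds a speech token prediction objective; it scales multilingual TTS to 495 languages, outperforms recent text encoder baselines, and generalizes to unseen languages.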

详情
Comments
Accepted to Interspeech 2026
AI中文摘要

我们提出UR-BERT,一种基于罗马化转录的文本到语音(TTS)编码器,用于大规模多语言TTS系统。传统的字素到音素(G2P)方法由于可靠G2P资源的可用性,仅限于约100种语言。相比之下,UR-BERT通过将多样化的书写系统统一为共享的罗马化表示,扩展到495种语言。为了进一步增强语音保真度和文本-语音对齐,我们在训练过程中引入了一个语音标记预测目标,这促使编码器以数据高效的方式学习语音感知的语音表示。实验表明,基于UR-BERT构建的TTS系统在广泛的语言和资源条件下,始终优于最近的文本编码器基线,并展现出对未见语言的强大泛化能力。

英文摘要

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

2606.11197 2026-06-11 eess.AS cs.AI cs.CL cs.SD 交叉投稿

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

MA-DLE: 基于记忆增强的语音自动抑郁程度估计

Xuzhi Wang, Xinran Wu, Ziping Zhao, Jianhua Tao, Björn W. Schuller

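AI summary: Proposes a memory-based feature augmentation method that selectively integrates historical temporal features and dynamic memory features, fused with GRU outputs through a Hierarchical Attention Fusion module, and achieves state-of-the-art performance on DAIC-WOZ and E-DAIC.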

详情
Comments
Accepted at IEEE TAC
AI中文摘要

基于语音的抑郁程度自动估计对于实现早期检测和及时干预至关重要,尤其是在资源受限的心理健康环境中。近年来,深度学习在包括情感计算和心理健康评估在内的多个领域取得了显著成功。现有方法大多依赖基于RNN的架构(如LSTM和GRU)来建模时间信息以进行抑郁估计。然而,提取的特征往往只强调少数相邻语音片段,限制了其捕捉长程依赖的能力。为克服这一局限,我们引入了一种基于记忆的特征增强方法,以增强GRU提取特征的表示能力。我们的记忆库并非不加区分地整合历史数据,而是设计为选择性整合两类组件以减少冗余和不相关性:(1) 与当前GRU输出高度相似的历史时序特征,提供互补的上下文信息;(2) 基于特征变异性识别的动态记忆特征,捕捉指示抑郁症状的行为和情绪波动。为有效融合记忆增强特征与GRU输出,我们进一步设计了层次注意力融合(HAF)模块。我们的方法在广泛使用的DAIC-WOZ和E-DAIC数据集上进行了评估,取得了最先进的性能。

英文摘要

Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.

2606.11279 2026-06-11 eess.AS cs.CL cs.LG cs.SD 交叉投稿

Massive Open-Vocabulary Keyword Spotting

大规模开放词汇关键词识别

Leonor Barreiros, Raul Monteiro, Afonso Mendes, Gonçalo M. Correia

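AI summary: Proposes an open-vocabulary keyword spotting system whose stored features take up to 128 times less memory than a comparable baseline; without fine-tuning the speech recognizer, it handles massive databases and reaches entity recall comparable to uncompressed solutions, even in unseen languages.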

详情
Comments
Accepted to Interspeech 2026
AI中文摘要

自动语音识别系统在转录训练数据中罕见词汇(即专业术语)时表现不佳。开放词汇关键词识别结合上下文偏置已被证明可以缓解这一问题。然而,现有系统只能处理几百个术语的词汇表,否则会成为不可行的瓶颈。我们提出了一种系统,其存储特征的内存占用比可比基线小128倍,允许用户处理大规模数据库,同时保持开放词汇。无需微调语音识别模型,我们的系统在未见过的语言中也达到了与未压缩解决方案相当的实体召回率。

英文摘要

Automatic speech recognition systems have been shown to underperform when transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can only handle glossaries of a few hundred terms without becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than a comparable baseline and allows users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves entity recall comparable to that of uncompressed solutions, even in languages not seen during training.

2606.11429 2026-06-11 eess.AS cs.CL cs.SD 交叉投稿

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

Gumbel-BEARD:低资源领域Whisper自监督自适应的自动层选择

Zilai Wang, Natarajan Balaji Shankar, Mohan Shi, Kaiyuan Zhang, Abeer Alwan

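AI summary: Proposes Gumbel-BEARD, a framework that automatically selects Whisper encoder layers with a trainable Gumbel-Softmax selector and adapts to low-resource domains through the self-supervised BEST-RQ objective, reaching state-of-the-art word error rates on MyST and OGI and up to 6% relative WER reduction on CORAAL.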

详情
Comments
Accepted by Interspeech 2026
AI中文摘要

语音基础模型在低资源领域常因领域不匹配和数据稀缺而表现不佳。我们提出Gumbel-BEARD,一种领域自适应框架,通过端到端可训练的硬Gumbel-Softmax选择器自动选择Whisper编码器层。它利用BEST-RQ目标实现自监督自适应,无需手动调整即可动态适应目标声学特征。在MyST儿童语音语料库上的实验证明了其效率和可扩展性:使用10小时标注数据进行微调,我们的方法匹配了在完整133小时标注集上训练的完全监督基线。我们在MyST上使用Whisper-medium建立了8.21%的新最先进词错误率(WER),在OGI自发言语数据集上使用Whisper-small达到11.06%。在CORAAL上的评估进一步证实了对成人方言领域偏移的鲁棒性,相对WER降低高达6%,突显了我们的方法对多样低资源条件的泛化能力。

英文摘要

Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
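
For readers unfamiliar with the hard Gumbel-Softmax trick named above, the following is a minimal, framework-free sketch of the sampling step. The abstract does not say how the selector is parameterized, so the per-layer scores in `layer_logits` and the temperature value are hypothetical.

```python
import math
import random

def gumbel_softmax_hard(logits, tau=1.0, rng=random):
    """Sample a one-hot choice plus the soft probabilities behind it."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard forward pass: one-hot at the argmax. In an autograd framework the
    # backward pass would flow through `soft` (straight-through estimator).
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft

# Hypothetical learnable scores, one per candidate encoder layer.
layer_logits = [0.1, 2.0, -0.5, 0.3]
hard, soft = gumbel_softmax_hard(layer_logits, tau=0.5, rng=random.Random(0))
```

The one-hot `hard` vector makes a discrete layer choice in the forward pass, while `soft` is what keeps the selection differentiable during end-to-end training.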

2606.11766 2026-06-11 eess.AS cs.AI cs.CL cs.SD 交叉投稿

Fast Speech Foundation Model Distillation Using Interleaved Stacking

快速语音基础模型蒸馏使用交错堆叠

Eungbeom Kim, Kyogu Lee

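AI summary: Proposes interleaved stacking to accelerate distillation training of speech foundation models; by preserving layer positions throughout stacking it avoids the performance degradation of existing stacking methods, validated on SUPERB.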

详情
Comments
Accepted by Interspeech 2026
AI中文摘要

将大型语音基础模型(SFM)蒸馏为高效的学生模型已成功应用于低资源环境。尽管蒸馏减少了推理延迟,但它需要额外的学生模型训练。然而,SFM蒸馏的训练效率仍未得到充分探索。在这项工作中,我们探索了SFM蒸馏的训练加速以加快模型部署。我们研究了堆叠的潜力,其中模型深度通过训练逐步增加,直到达到目标模型深度。虽然现有的堆叠方法提高了训练速度,但它们遭受性能下降。为了解决这一限制,我们提出了交错堆叠,一种新颖的堆叠方法,在整个堆叠过程中始终保持层位置。这一特性在SFM中尤为关键,因为每一层编码了不同的层特定知识。我们在SUPERB上验证了所提方法的有效性。

英文摘要

Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires additional training of the student model, and the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.

2606.12199 2026-06-11 eess.AS cs.CL cs.SD 交叉投稿

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

哪种语音表示更匹配文本原生推理?帧率和表示对语音-文本对齐的研究

Zhen Ye, Xu Tan, Yiming Li, Guangyan Zhang, Chimin Chan, Haohe Liu, Zhengxi Liu, Hongzhan Lin, Zheqi Dai, Xinshen Zhang, Peiwen Sun, Qiuqiang Kong, Wei Xue

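AI summary: Studies the temporal-granularity mismatch behind the speech-text modality gap, introduces factorized FSQ and a lightweight non-autoregressive audio LM head to make low frame rates feasible, and finds that a 4.17 Hz frame rate with intermediate-layer representation alignment works best for speech QA.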

详情
Comments
Accepted by Interspeech 2026 long paper
AI中文摘要

口语对话模型通常以文本LLM骨干网络为基础,但在以语音而非文本为条件时,推理能力往往会下降。我们将这种模态差异部分归因于时间粒度不匹配:在语义匹配的情况下,语音标记在时间上是冗余的,且远长于文本,这稀释了每个标记的语义密度,削弱了文本原生的推理动态。我们将语音标记设计视为一个表示选择问题,并在固定信息速率下,在冻结的LLM骨干网络中扫描帧率。为了实现低帧率,我们引入了因子化FSQ和一个轻量级的非自回归音频LM头,在不牺牲高效预测的情况下将容量扩展到近300比特/帧。在消除瓶颈后,我们扫描帧率(50→2.08 Hz)和对齐深度,并观察到在4.17 Hz帧率下,结合中间层表示对齐,语音问答存在一致的最佳区域。

英文摘要

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
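
The trade-off between frame rate and per-frame capacity is simple arithmetic, sketched below. Two things here are assumptions rather than statements from the abstract: that the reported rates come from integer downsampling of the 50 Hz base rate (4.17 Hz and 2.08 Hz do match 50/12 and 50/24), and the `INFO_RATE_BPS` budget, which is a hypothetical value chosen only for illustration.

```python
BASE_RATE_HZ = 50.0

def frame_rate(downsample_factor):
    """Frame rate after reducing the 50 Hz base rate by an integer factor."""
    return BASE_RATE_HZ / downsample_factor

def bits_per_frame(info_rate_bps, rate_hz):
    """Bits each frame must carry to keep the information rate fixed."""
    return info_rate_bps / rate_hz

rate_12x = round(frame_rate(12), 2)  # matches the reported 4.17 Hz
rate_24x = round(frame_rate(24), 2)  # matches the reported 2.08 Hz

# Hypothetical fixed information budget in bits per second (not from the paper).
INFO_RATE_BPS = 600.0
per_frame = {f: bits_per_frame(INFO_RATE_BPS, frame_rate(f)) for f in (1, 12, 24)}
```

With the information rate held fixed, each halving of the frame rate doubles the bits a single frame must encode, which is why low frame rates call for a higher-capacity quantizer.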

2510.13293 2026-06-11 cs.CL 版本更新

Cross-modal Consistency Guidance for Robust Emotion Control in Auto-Regressive TTS Models

跨模态一致性引导用于自动回归TTS模型中的鲁棒情绪控制

Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, Bin Ma, Eng Siong Chng

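AI summary: Proposes Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG), which scales guidance dynamically by the degree of inconsistency between the text emotion and the explicit speech emotion and replaces the dropout condition with the text emotion; the guidance signal is also distilled with a hard-sample mining strategy to improve emotional alignment. On five emotional corpora and two TTS benchmarks, the method applied to CosyVoice2 gains up to 12% absolute emotion-recognition accuracy and 10% relative subjective score, outperforming HierSpeech++, Qwen3-TTS, and the original CosyVoice2 while preserving intelligibility, naturalness, and speech quality.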

详情
Comments
Accepted to Interspeech 2026, short paper
AI中文摘要

尽管文本到语音(TTS)系统通过自然语言指令实现情绪控制,但当目标情绪与文本语义冲突时,表达性、自然性和语音质量会下降。我们提出了一种基于文本情绪与显式语音情绪不一致程度的动态尺度的跨模态一致性引导分类器免费引导(CCG-CFG)方法,通过使用文本情绪替代dropout条件。我们还采用硬样本挖掘策略蒸馏CCG-CFG引导信号,以提高TTS模型的情绪对齐能力。在五个情感语料库和两个TTS基准测试中的评估显示,我们的方法应用于CosyVoice2时,情绪识别准确率提高了12%,主观评分提高了10%,优于基线模型,包括HierSpeech++、Qwen3-TTS和原始CosyVoice2,同时保持可懂性、自然性和高质量。

英文摘要

While Text-to-Speech (TTS) systems enable emotional control via natural-language instructions, expressiveness, naturalness, and speech quality degrade when the target emotion conflicts with the textual semantics. We propose a Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG) method with dynamic scales based on the degree of inconsistency between the text emotion and the explicit speech emotion, replacing the dropout condition with the text emotion. We also distill the CCG-CFG guidance signal using a hard-sample mining strategy, improving the TTS model's emotional alignment capability. Evaluations on five emotional corpora and two TTS benchmarks show that our approaches, applied to CosyVoice2, achieve up to a 12% absolute improvement in emotion-recognition accuracy and a 10% relative improvement in subjective scores, outperforming baselines including HierSpeech++, Qwen3-TTS, and the original CosyVoice2, while preserving intelligibility, naturalness, and high speech quality.

2606.03504 2026-06-11 cs.CL cs.AI 版本更新

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

BaltiVoice: 巴尔蒂语语音语料库与微调Whisper ASR系统

Muhammad Ali

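AI summary: For Balti, a language with no prior public ASR resources, builds a 16.8-hour read-speech corpus and fine-tunes Whisper-small, lowering WER on the validation set from 159.19% to 26.74%.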

详情
Comments
6 pages, 3 figures, 4 tables. Code and data available at this https URL
AI中文摘要

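We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, that previously had no publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a word error rate (WER) of 26.74% on a held-out validation set of 538 utterances, against a zero-shot Whisper-small baseline of 159.19% on Balti. The dataset, the fine-tuned model, and a live transcription demo are publicly available on HuggingFace.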

英文摘要

We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. Fine-tuning OpenAI Whisper-small yields a Word Error Rate (WER) of 26.74% and a Character Error Rate (CER) of 8.67% on a 538-utterance speaker-disjoint validation set, down from a zero-shot baseline of 159.19% WER and 152.52% CER. A Whisper-base fine-tuned on the same data achieves 44.54% WER and 15.61% CER, confirming that model capacity matters for this low-resource setting. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.

2606.06065 2026-06-11 cs.CL cs.SD eess.AS 版本更新

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

多任务学习还不够:双输出第二语言语音识别中的表示纠缠

Seung Hwan Cho, Young-Min Kim

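AI summary: For dual-output second-language speech recognition, finds that multi-task learning degrades surface transcription and attributes this to encoder-level representational entanglement; the degradation is strongest in English, where it grows with surface-meaning divergence.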

详情
Comments
5 pages, 2 figures, Accepted to the 43rd International Conference on Machine Learning Workshop on Machine Learning for Audio
AI中文摘要

第二语言(L2)语音识别通常需要发音转录和预期意义的转录。多任务学习(MTL)是一种自然的方法,因为它假设共享表示对两个输出都有益。然而,本文表明这一假设在韩语和英语中并不成立。MTL提高了意义转录但降低了表面转录,尤其是在英语中,性能下降与通过Levenshtein编辑距离测量的表面-意义差异成正比。编码器分析将这些模式与编码器级别的纠缠联系起来,韩语保留了不同的任务表示,而英语产生了几乎相同的表示。跨任务解码器分析表明,意义双输出解码器适应了独特的表示,而表面双输出解码器仍受编码器约束。这些发现促使设计能够减轻编码器级别纠缠的MTL框架,以减少双输出L2自动语音识别中的表面性能下降。

英文摘要

Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Encoder analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.

2510.23320 2026-06-11 eess.AS cs.CL cs.SD 版本更新

LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization

LibriConvo:从阅读文献模拟对话用于ASR和说话人日志

Máté Gedeon, Péter Mihajlik

AI总结 提出LibriConvo合成对话语音语料库,基于说话人感知模拟对话框架构建,用于说话人日志和ASR基准测试,包含240.1小时音频,基线实验显示Sortformer在日志中优于pyannote,Fast Conformer-CTC在ASR中优于Whisper。

详情
Comments
Accepted by TSD 2026
AI中文摘要

我们介绍了LibriConvo,一个用于说话人日志和自动语音识别(ASR)的合成对话语音语料库,通过在数据集和基准测试设置中实例化先前提出的说话人感知模拟对话(SASC)框架构建而成。本文的主要贡献是基于该框架的语料库构建流程和基准测试。为了使数据更适合下游ASR和说话人日志,我们使用外部语音活动检测从英语CallHome估计对话时间统计信息,压缩长停顿,按书籍分组LibriTTS话语以改善局部语义连续性,并通过空间合理性启发式选择房间脉冲响应。生成的语料库包含240.1小时的音频,涉及830个说话人的1496个对话,划分为说话人不重叠的训练、验证和测试集。我们报告了说话人日志和ASR的基线结果。在测试集上,Sortformer在说话人日志中优于pyannote流水线(DER 11.1%对比24.4%)。对于ASR,使用序列化输出训练微调的Fast Conformer-CTC XLarge模型实现了7.29%的WER和6.97%的cpWER,优于零样本Whisper-large-v3。这些结果使LibriConvo成为研究合成对话语音和评估多说话人语音处理系统的实用基准。

英文摘要

We introduce LibriConvo, a synthetic conversational speech corpus for speaker diarization and automatic speech recognition (ASR), built by instantiating the previously proposed Speaker-Aware Simulated Conversation (SASC) framework in a dataset and benchmarking setting. The main contribution of this paper is a corpus construction pipeline and benchmark derived from that framework. To make the data more suitable for downstream ASR and diarization, conversational timing statistics are estimated from English CallHome using external voice activity detection, long pauses are compressed, LibriTTS utterances are grouped by book to improve local semantic continuity, and room impulse responses are selected with a spatial-plausibility heuristic. The resulting corpus contains 240.1 hours of audio across 1,496 dialogues involving 830 speakers, partitioned into speaker-disjoint train, validation, and test splits. We report baseline results for both diarization and ASR. On the test split, Sortformer outperforms the pyannote pipeline in diarization (11.1% vs. 24.4% DER). For ASR, a Fast Conformer-CTC XLarge model fine-tuned with Serialized Output Training achieves 7.29% WER and 6.97% cpWER, outperforming zero-shot Whisper-large-v3. These results position LibriConvo as a practical benchmark for studying synthetic conversational speech and for evaluating multi-speaker speech processing systems.

9. 评测、数据集与基准 37 篇

2606.11196 2026-06-11 cs.CL cs.AI cs.CR cs.LG 新提交

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

PoQ-Judge:去中心化LLM推理中成本感知的证明质量的多架构评估框架

Arther Tian, Alex Ding, Frank Chen, Simon Wu, Aaron Chan

发表机构 * DGrid AI

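AI summary: Proposes PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without references; across three architectures, the best model reaches 0.747 Pearson correlation, and cascade evaluation cuts cost by 72.7%.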

详情
AI中文摘要

去中心化LLM推理网络需要轻量级、无参考的质量评估用于证明质量(PoQ)。我们提出PoQ-Judge,一个训练专用裁判模型对查询-输出对进行评分而无真实参考的框架。我们研究了三种架构在质量-成本权衡中的表现:TextCNN裁判、MiniLM交叉编码器和DeBERTa裁判。通过在UltraFeedback和GPT标记的领域内数据上进行两阶段训练,最佳模型在保留测试集上与真实代理的Pearson相关性达到0.747,优于先前工作中基于参考的评估器。作为复合评分中的无参考组件,它实现了0.645的Pearson相关性,匹配最佳单一基于参考的评估器,同时消除了对参考答案的需求。我们还表明,在线校准将语义质量识别为主导维度,级联评估将成本降低72.7%,仅带来适度的质量损失。结果在问答任务上比摘要任务强得多,表明代理质量是主要剩余限制。

英文摘要

Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references. We study three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. Using two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches 0.747 Pearson correlation with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As a reference-free component in composite scoring, it achieves 0.645 Pearson correlation, matching the best single reference-based evaluator while removing the need for reference answers. We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7 percent with only modest quality loss. Results are much stronger on QA than summarization, pointing to proxy quality as the main remaining limitation.
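
The abstract does not spell out how the cascade decides when to escalate. One common pattern, assumed here purely for illustration, is to trust a cheap judge when its score is clearly high or low and pay for a stronger judge only in the ambiguous band. All scores, costs, and thresholds below are hypothetical.

```python
# Hypothetical scores in [0, 1] from a cheap and a strong judge for three items.
CHEAP = {"a": 0.95, "b": 0.50, "c": 0.05}
STRONG = {"a": 0.90, "b": 0.80, "c": 0.10}
CHEAP_COST, STRONG_COST = 1.0, 10.0  # assumed relative costs per evaluation

def cascade(item, low=0.2, high=0.8):
    """Return (score, cost); escalate only when the cheap score is ambiguous."""
    score, cost = CHEAP[item], CHEAP_COST
    if low < score < high:
        # Uncertain band: defer to the strong judge and pay for both calls.
        score, cost = STRONG[item], cost + STRONG_COST
    return score, cost

results = {item: cascade(item) for item in CHEAP}
cascade_cost = sum(cost for _, cost in results.values())
strong_only_cost = STRONG_COST * len(CHEAP)
saving = 1.0 - cascade_cost / strong_only_cost
```

In this toy setup only one of three items is escalated, so most of the strong judge's cost is avoided; the quality loss comes from the items where the cheap judge was confidently wrong.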

2606.11204 2026-06-11 cs.CL cs.IR 新提交

Benchmarking Large Language Models for Safety Data Extraction

大型语言模型在安全数据提取中的基准测试

Jonas Grill, Thomas Bayer, Sören Berlinger

发表机构 * SAP SE(SAP公司) Institute for Digital Transformation, Ravensburg-Weingarten University(拉文斯堡-魏恩加滕大学数字化转型研究所)

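AI summary: Benchmarks four large language models on extraction from heterogeneously formatted Safety Data Sheets (SDS) under text-based and multimodal processing; text-based extraction with Gemini 1.5 Pro and chain-of-thought prompting is the most accurate (84%), but no model reaches the 90% threshold for reliable deployment.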

详情
Comments
18 pages, 8 figures, submitted to Applied Intelligence
AI中文摘要

从安全数据表(SDS)中准确提取结构化信息在工业安全中仍具挑战性,原因在于文档格式异构以及传统基于规则的方法的局限性。本研究对最先进的大型语言模型(LLM)在自动化SDS数据提取方面进行了基准测试,比较了基于文本和多模态处理流水线。我们系统评估了四种模型:Gemini 1.5 Pro、GPT-4o、Claude 3.7 Sonnet和Llama 3.1-70B,采用三种提示策略:零样本、少样本和思维链。评估框架在超过50,000个提取数据字段上评估了准确性、延迟和成本。结果显示,基于文本的提取在所有指标上始终优于多模态处理。结合思维链提示的Gemini 1.5 Pro达到了最高准确率(84%),优于GPT-4o(81%)和Claude 3.7 Sonnet(79%)。然而,没有模型超过可靠实际部署通常所需的90%准确率阈值。这些发现表明,通用LLM在无监督工业使用中尚不够稳健,尽管性能表明通过任务特定微调具有强大潜力。未来研究应关注领域自适应训练、模型校准以及集成人在回路验证,以确保安全关键可靠性。

英文摘要

Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields. Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a Chain-of-Thought prompt achieved the highest accuracy (84%), outperforming GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment. These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of Human-in-the-Loop verification to ensure safety-critical reliability.

2606.11208 2026-06-11 cs.CL cs.AI 新提交

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

BioDivergence:生物医学摘要中隐藏上下文矛盾的基准与评估框架

Elias Hossain, Sanjeda Sara Jennifer, Sabera Akter Bushra, Niloofar Yousefi

发表机构 * College of Engineering and Computer Science, University of Central Florida(中佛罗里达大学工程与计算机科学学院) Burnett School of Biomedical Sciences, University of Central Florida(中佛罗里达大学伯内特生物医学科学学院)

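AI summary: Proposes BioDivergence, a framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and structured outputs that captures the context-dependent differences between biomedical studies that existing NLI benchmarks miss, and releases a benchmark of 11,865 claim pairs.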

详情
AI中文摘要

生物医学发现常常在不同研究中看似冲突,但许多差异是上下文依赖的而非真正的矛盾。队列、地理、实验方案、疾病亚型和临床环境的变化可能使两种说法在局部都成立。现有的NLI和科学声明验证基准将此类情况简化为蕴含、矛盾或中立,未能捕捉分歧背后的上下文结构。为解决这一问题,我们引入了BioDivergence,一个包含六类冲突分类、13轴分歧本体以及每个声明对四个结构化输出(冲突类型、分歧轴、主要混杂因素和调和解释)的评估框架。我们发布了BioDivergence-Silver-v1.0,一个跨五个生物医学领域的11865个声明对的文章分离银标准基准,以及一个用于比较的遗留去重变体。结果显示,两种变体之间存在显著的排名差异,微调参考模型在文章分离设置下下降了约12分,而Mistral-7B-Instruct-v0.3在842个示例的主测试集上达到了0.5523的准确率和0.3894的上下文F1分数。BioDivergence提供了一种更忠实的方式来区分上下文分歧与直接矛盾,并区分文章级记忆与真正的任务学习。

英文摘要

Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.

2606.11447 2026-06-11 cs.CL 新提交

AI Coding Agents Can Reproduce Social Science Findings

AI编码智能体能够复现社会科学研究结果

Meysam Alizadeh, Mohsen Mosleh, Fabrizio Gilardi, Atoosa Kasirzadeh, Joshua Tucker

发表机构 * University of Oxford(牛津大学) University of Zurich(苏黎世大学) Carnegie Mellon University(卡内基梅隆大学) New York University(纽约大学)

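AI summary: Builds the SocSci-Repro-Bench benchmark to evaluate two frontier coding agents, Claude Code and Codex, on 221 social science reproduction tasks; both reproduce a large share of findings, Claude Code performs better, and prompt framing can nudge agents toward confirmatory specification search.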

详情
AI中文摘要

近期轶事证据表明,当提供原始数据和代码时,AI编码智能体能够复现已发表的研究结果;然而,在社会科学领域的系统评估仍然有限。现有的评估基准不足,要么规模较小,要么将智能体性能与复现材料本身的问题(如代码无法正确执行)混为一谈。本文介绍了SocSci-Repro-Bench,这是一个包含221项任务的基准测试,涵盖四个学科和13个实质性领域,这些任务基于那些结果要么完全可通过现有材料复现,要么因数据缺失而明显不可复现的研究构建,从而使我们能够隔离智能体的复现能力。评估两个前沿编码智能体Claude Code和Codex,我们发现两者都能复现大部分社会科学研究结果,其中Claude Code显著优于Codex。这些复现率远高于先前报道的通用基于LLM的智能体在类似可复现性基准上的表现。两个智能体在需要识别潜在研究问题的推理任务上也表现强劲,附加分析表明结果并非主要由记忆驱动。提供原始论文PDF与复现材料一起可适度提升性能,但在无法复现的任务上引入了偏差。我们还表明,通过微妙的提示框架,智能体可以被引导向确认性规范搜索。这些发现共同表明,至少某些前沿编码智能体可以作为计算工作流的可靠执行者,同时强调了在AI系统在科学生产中扮演更大角色时,需要仔细的基准测试和提示设计。

英文摘要

Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks are insufficient: they are either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.

2606.11678 2026-06-11 cs.CL 新提交

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

AI能像城市规划师一样推理吗?基于专业判断的大语言模型基准测试

Yijie Deng, He Zhu, Wen Wang, Junyou Su, Minxin Chen, Wenjia Zhang

发表机构 * School of Architecture and Urban Planning, Shenzhen University(深圳大学建筑与城市规划学院) Shenzhen Key Laboratory of Urban Spatial Information and Intelligent Modeling(深圳市城市空间信息与智能建模重点实验室) Department of Urban Planning and Design, The University of Hong Kong(香港大学城市规划与设计系)

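AI summary: Proposes UPBench, which evaluates 25 LLMs on a 4x5 matrix of knowledge pillars and cognitive levels; models do better on analytical tasks than on factual recall and integrative judgment, which points to the institutional dependence of planning knowledge.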

详情
AI中文摘要

问题、研究策略与发现:大语言模型(LLM)的兴起为城市规划提出了一个关键问题:AI能复制哪些专业规划知识,哪些仍需人类判断?尽管AI工具在规划实践中日益普及,但目前仍缺乏系统性框架来测试它们是否能以规划专业知识核心的情境敏感性、价值意识和制度素养进行推理。本文介绍了Urban Planning Bench(UPBench),这是一个领域特定的评估框架,通过改编自布鲁姆修订分类法的四个知识支柱和五个认知水平构成的4x5矩阵来评估LLM推理。通过自动评分和专家评审对25个LLM进行评估,我们发现了一条非单调的认知曲线:模型在高级分析任务上的表现优于事实回忆和综合判断。这表明,通常被视为低阶的规划知识深受制度、司法和时间背景的影响,使得LLM难以泛化。我们将这些局限性总结为四个认知诊断:监管幻觉、概念混淆、棘手性瘫痪和实践智慧缺陷。实践启示:研究结果支持规划中的差异化委托。LLM可以协助跨学科综合、文献综述、情景生成和初步政策分析。然而,它们在特定司法管辖区的法规、规范冲突解决和情境敏感程序方面仍不可靠。机构应要求对AI辅助监管分析进行验证,而规划教育应强调制度素养、规范判断和情境敏感性。

英文摘要

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity.

2606.11686 2026-06-11 cs.CL cs.AI New submission

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

层隔离评估:使用无LLM、回归锁定的测试工具对生产级LLM代理的确定性框架进行门控

Sawyer Zhang, Alexander Wang, Sophie Lei

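Affiliations: Lumivate (Lumi)

AI summary: Proposes layer-isolated evaluation, which decomposes an LLM agent into a fixed set of layers and uses a deterministic, no-LLM test suite to detect regressions layer by layer; shows that aggregate metrics mask local degradation, whereas per-layer baseline gating localizes it.

Details
Comments
12 pages, 2 figures, 5 tables
AI-generated Chinese abstract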

端到端任务成功是评估LLM代理的主要方式,但一个聚合数字只能告诉你代理发生了回归,却无法指出具体位置。我们提出层隔离评估:将一个部署的订单代理分解为固定的层次分类(本体、意图、路由、分解、升级、安全、记忆以及跨领域的封装/防御),每一层由其在确定性、无LLM“纯”模式下的断言切片独立测试。纯测试套件(23个切片共238个案例;225个在2.39秒内运行,约10毫秒/案例)在每次变更时针对锁定的逐切片基线在CI中运行。我们通过受控回归注入进行验证,一次退化一个非安全层(共七个层)。我们未设计的效果是掩蔽:聚合通过率几乎不变(六个局部回归的变化范围为-1.7至-5.9个百分点),而匹配的切片则大幅下降(-25至-91个百分点)。一个层的切片对其自身故障做出反应部分是由构造决定的;测量结果是(i)聚合掩蔽以及(ii)损伤不会扩散到其他切片:注入层的切片在7个案例中的5个中是受影响最严重的,在7个案例中的7个中位列前三(平均排名1.29/19)。定位在第二个结构不同的租户(星巴克新加坡)上复现:所有七个匹配切片均大幅下降,因此这不是单一目录的伪像。我们将其定位为EDDOps规定但未实现的组件级评估的具体确定性实例,以CheckList为前身,并作为全工作流随机突变测试的确定性镜像。我们的贡献:(a)为生产代理提供了一个完全分解的、亚秒级、无LLM的逐层测试工具,(b)一个覆盖诚实性测试充分性标准,拒绝为未执行的层打分,以及(c)回归注入演示,证明逐切片基线锁定可以定位聚合指标掩盖的回归。

English abstract

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.
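
The masking effect described above can be illustrated with a small sketch. It is not the authors' harness: the slice names reuse layers listed in the abstract, but the pass counts, the `regressed_slices` helper, and the zero tolerance are hypothetical values chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Slice name -> (cases passed, cases run). All counts below are made up.
Counts = Dict[str, Tuple[int, int]]


def rate(passed: int, total: int) -> float:
    """Pass rate in [0, 1]; an empty slice counts as 0."""
    return passed / total if total else 0.0


def aggregate_rate(counts: Counts) -> float:
    """Single pass rate pooled over all slices."""
    passed = sum(p for p, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return rate(passed, total)


def regressed_slices(baseline: Counts, current: Counts,
                     tolerance_pp: float = 0.0) -> List[str]:
    """Names of slices whose pass rate fell more than tolerance_pp below baseline."""
    failing = []
    for name, (base_pass, base_total) in baseline.items():
        cur_pass, cur_total = current.get(name, (0, 0))
        if cur_total == 0:
            # An unexercised slice is flagged rather than silently scored.
            failing.append(name)
            continue
        drop_pp = 100.0 * (rate(base_pass, base_total) - rate(cur_pass, cur_total))
        if drop_pp > tolerance_pp:
            failing.append(name)
    return failing


baseline = {"ontology": (38, 38), "intent": (40, 40),
            "routing": (12, 12), "memory": (30, 30)}
current = dict(baseline, routing=(6, 12))  # inject a fault into one layer

aggregate_drop_pp = 100.0 * (aggregate_rate(baseline) - aggregate_rate(current))
print(f"aggregate drop: {aggregate_drop_pp:.1f} pp")
print("gated slices:", regressed_slices(baseline, current))
```

In this toy run the pooled pass rate moves by 5 pp while the `routing` slice loses 50 pp, which is the shape of the effect the abstract reports; the actual magnitudes in the paper come from its own suite.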

2606.11762 2026-06-11 cs.CL cs.AI New submission

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

语言模型在开放式任务中的自动化创造力评估

Min Sen Tan, Zachary Kit Chun Choy, Syed Ali Redha Alsagoff, Nadya Yuki Wangsajaya, Mohor Banerjee, Swaagat Bikash Saikia, Alvin Chan

Affiliations: Raffles Institution; College of Computing and Data Science, Nanyang Technological University; Lee Kong Chian School of Medicine, Nanyang Technological University; Centre of AI in Medicine (C-AIM), Nanyang Technological University

AI summary: Proposes a domain-agnostic automated framework that quantifies the divergent and convergent creativity of LLMs on open-ended tasks using semantic entropy and a retrieval-based multi-agent evaluation, and validates it in three domains: problem-solving, research ideation, and creative writing.

Details
Comments
Accepted to ACL 2026 (Main Conference). 35 pages, 16 figures. Code: this https URL
AI-generated Chinese abstract

大型语言模型(LLMs)在语言理解、推理和生成方面取得了显著进展,激发了对其创造潜力的日益关注。实现这一潜力需要系统化和可扩展的方法来评估跨不同任务的创造力。然而,大多数现有的创造力指标与特定任务紧密耦合,将领域假设嵌入评估过程,限制了可扩展性和通用性。为解决这一差距,我们引入了一个自动化、领域无关的框架,用于量化LLM在开放式任务中的创造力。我们的方法将测量装置与创造性任务本身分离,实现了可扩展、任务无关的评估。发散创造力通过语义熵(一种无参考且稳健的新颖性和多样性指标)进行测量,并针对人类注释、基于LLM的新颖性判断和基线多样性度量进行了验证。收敛创造力通过一种新颖的基于检索的多智能体评判框架进行评估,该框架提供上下文敏感的任务完成评估,效率提升超过60%。我们在三个性质不同的领域验证了我们的框架:问题解决(MacGyver)、研究构思(HypoGen)和创意写作(BookMIA),使用了广泛的LLM套件。实证结果表明,我们的框架可靠地捕捉了创造力的关键方面,包括新颖性、多样性和任务完成,并揭示了模型属性(如大小、温度、时效性和推理)如何影响创造性表现。我们的工作为自动化的LLM创造力评估建立了可重复和可泛化的标准,为可扩展的基准测试铺平了道路,并加速了创造性AI的进展。

English abstract

Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.
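
As a rough illustration of the divergent-creativity measure, the sketch below computes an entropy over meaning clusters. It assumes that each sampled response has already been assigned a cluster label; the abstract does not say how the paper clusters responses, so the labels and the `semantic_entropy` helper here are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (bits) of the empirical distribution over meaning clusters."""
    if not cluster_ids:
        raise ValueError("need at least one response")
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical cluster labels for four sampled answers to the same prompt.
uniform_answers = ["lever", "lever", "lever", "lever"]
varied_answers = ["lever", "rope", "magnet", "lever"]

print(semantic_entropy(uniform_answers))
print(semantic_entropy(varied_answers))
```

A model that keeps producing the same idea scores zero, and the score grows as responses spread over more distinct meanings, which is why the measure needs no reference answers.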

2606.11816 2026-06-11 cs.CL cs.AI New submission

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

WorldReasoner: 评估语言模型代理是否通过有效推理预测事件

Yizhou Chi, Eric Chamoun, Zifeng Ding, Andreas Vlachos

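Affiliations: Department of Computer Science and Technology, University of Cambridge

AI summary: Proposes the WorldReasoner framework, which evaluates the event forecasts of language-model agents along three axes (outcome quality, evidence quality, and reasoning quality against hindsight causal graphs), and finds that temporally valid retrieval is the strongest driver of outcome accuracy.

Details
AI-generated Chinese abstract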

预测现实世界事件要求语言模型代理在不完整、时间有限的信息下进行不确定性推理。然而,评估代理是否真正进行预测需要的不仅仅是最终答案的准确性:模型可能通过回忆记忆中的训练事实、引用捏造的证据或产生无根据的因果故事而正确。我们提出WorldReasoner,一个用于时间有效事件预测的评估框架。每个任务向代理提供一个已解决的预测问题、一个模拟的预测日期,并且只能访问该日期之前可用的证据;在问题解决后,该框架对提交的概率、引用的证据和可选的因果事件图进行评分。WorldReasoner报告三个互补的轴:针对已解决答案的结果质量、针对引用来源的证据质量,以及针对解决后事后图的推理质量。该基准测试由一个代理构建管道构建,该管道生成预测问题、收集时间戳证据并大规模构建事后参考图,最终产生345个已解决的任务,这些任务源自14,141篇文章,其图覆盖8,087个提取的事件。在六种受控代理设置中,时间有效检索是结果准确性的最强驱动因素;因果图构建提高了关键事件的恢复;并且正确的图支持预测更牢固地基于关键事件和相关来源,但代理仍然难以将基于证据的推理转化为校准的概率。

English abstract

Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.
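
The temporal-validity constraint and the scoring of a submitted probability can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's code: the `Evidence` record, the example sources, and the use of the Brier score are all assumptions, since the abstract does not name the scoring rule it applies.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Evidence:
    source: str       # hypothetical identifier of an article
    published: date   # publication timestamp used for the cutoff


def visible_evidence(pool: List[Evidence], forecast_date: date) -> List[Evidence]:
    """Keep only items published strictly before the simulated forecast date."""
    return [e for e in pool if e.published < forecast_date]


def brier_score(probability: float, outcome: bool) -> float:
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (probability - float(outcome)) ** 2


pool = [
    Evidence("article-early", date(2025, 1, 10)),
    Evidence("article-same-day", date(2025, 3, 1)),
    Evidence("article-after-resolution", date(2025, 4, 2)),
]
allowed = visible_evidence(pool, forecast_date=date(2025, 3, 1))
print([e.source for e in allowed])
print(brier_score(0.8, True))
```

Filtering on the publication date before the agent sees anything is what prevents a forecast from being "correct" only because post-resolution reporting leaked into retrieval.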

2606.12117 2026-06-11 cs.CL cs.AI New submission

Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

软提示调优用于公平且高效的LLM基准评估

Selen Erkan, Bastian Boll, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu

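Affiliations: Aleph Alpha Research Lab; TU Darmstadt; Hessian.AI

AI summary: Proposes soft-prompt tuning, which adapts base models to benchmark formats by optimizing a small number of soft-prompt vectors so that their actual knowledge is evaluated fairly; the method is efficient and requires no full post-training.

Details
Comments
10 pages, 4 figures
AI-generated Chinese abstract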

基准分数常常错误地反映大型语言模型(LLM)的知识,因为它们依赖于模型遵循特定格式要求的能力等。这尤其惩罚了那些可能知道正确答案但缺乏按照指示结构化答案能力的基础模型——这种能力通常在后训练中引入。为了克服这一点,我们提出了软提示调优,一种高效、公平且架构无关的模型评估方法。通过在短时间调优内仅优化10个软提示向量(对于7B模型大约占参数的0.0006%),我们使模型适应特定的基准格式,缩小格式遵循方面的差距,确保底层知识准确地反映在基准分数中。这使得人们可以在基准上公平比较不同基础模型(使用各种预训练配方训练),而无需完整的后训练。我们在7个模型和7个数据集上评估了软提示调优。结果表明:(a) 软提示调优在80步(约640个样本)内使格式遵循饱和,因此非常高效;(b) 软提示调优显著优于零样本和少样本提示,揭示了标准提示遗漏的基础模型知识;(c) 即使后训练模型也可以从软提示中受益以最大化格式遵从性;(d) 软提示的基础模型性能比零样本和少样本基线更可靠地预测后训练模型的排名,为下游模型质量提供了低成本的代理。我们的贡献包括:(1) 解耦格式遵循和知识准确性的度量标准;(2) 更公平的LLM知识基准测试协议;(3) 一种成本效益高且内存有效的方案,用于在LLM开发早期识别最优预训练策略。

English abstract

Benchmark scores often misrepresent a large language model's (LLM's) knowledge because they depend on other factors, such as the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the correct answers but lack the ability -- typically introduced in post-training -- to structure them as instructed. To overcome this, we propose soft-prompt tuning, an efficient, fair, and architecture-agnostic method for model evaluation. By optimizing only 10 soft-prompt vectors (roughly 0.0006% of the parameters of a 7B model) over a short tuning period, we adapt models to specific benchmark formats, closing gaps in format-following and ensuring that underlying knowledge is accurately reflected in benchmark scores. This allows one to fairly compare different base models -- trained with various pre-training recipes -- on benchmarks without the need for full post-training. We evaluated soft-prompt tuning across 7 models and 7 datasets. The results show that (a) soft-prompt tuning saturates format-following within 80 steps (~640 samples), making it highly efficient, (b) soft-prompt tuning significantly outperforms zero- and few-shot prompting, surfacing base model knowledge that standard prompting misses, (c) even post-trained models can benefit from soft prompts to maximize format compliance, and (d) soft-prompted base model performance predicts post-trained model rankings more reliably than zero- and few-shot baselines, offering a low-cost proxy for downstream model quality. Our contributions include (1) metrics that disentangle format-following and knowledge accuracy, (2) a fairer benchmarking protocol for LLM knowledge, and (3) a cost- and memory-effective recipe to identify optimal pre-training strategies early in LLM development.
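
The parameter share and sample count quoted in the abstract can be checked with a few lines of arithmetic. The hidden size of 4096, the round figure of 7e9 parameters, and the batch size of 8 are assumptions made for this check (the batch size is simply 640 / 80); none of them is stated in the abstract.

```python
# Assumed values, not taken from the abstract: hidden size 4096, exactly
# 7e9 model parameters, and a batch size of 8 (inferred from 640 / 80).
NUM_SOFT_PROMPT_VECTORS = 10
HIDDEN_SIZE = 4096
MODEL_PARAMETERS = 7_000_000_000
TUNING_STEPS = 80
BATCH_SIZE = 8

# Each soft-prompt vector has one trainable value per hidden dimension.
trainable = NUM_SOFT_PROMPT_VECTORS * HIDDEN_SIZE
fraction_percent = 100.0 * trainable / MODEL_PARAMETERS
samples_seen = TUNING_STEPS * BATCH_SIZE

print(f"trainable parameters: {trainable}")
print(f"share of model: {fraction_percent:.4f}%")
print(f"samples seen: {samples_seen}")
```

Under these assumptions the trainable share rounds to 0.0006%, consistent with the figure in the abstract.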

2606.12186 2026-06-11 cs.CL New submission

A Resource for Enthymeme Detection in Controversial Political Discourse

争议性政治话语中省略推理检测的资源

Martial Pastor, Nelleke Oostdijk

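Affiliations: Centre for Language Studies, Radboud University Nijmegen

AI summary: Presents a resource of tweets annotated for enthymemes and their structure, with guidelines based on Walton's argumentation schemes; a complexity analysis identifies sources of annotation inconsistency, and preliminary experiments show that models trained on annotator disagreement outperform models trained on majority-vote labels.

Details
Comments
43 pages, to be submitted to the Language Resource and Evaluation Journal
AI-generated Chinese abstract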

省略推理(enthymemes)是指前提或结论未明确陈述的论证,在说服性话语中普遍存在,但其标注历来具有高度主观性。我们提供了一个来自政治争议性话语的1,482条推文资源,由五位标注者标注了省略推理的存在及其论证结构,旨在研究标签变异性。我们首先重新审视省略推理的定义,并提出了基于Walton论证方案的标注指南,提供了一种结构化且受约束的方法,同时保留了任务解释性空间。这与以往资源形成对比,后者倾向于消除分歧,掩盖其来源并阻止研究其对模型性能的潜在益处。我们进一步提出了任务的复杂性分析,识别了标注中认知负荷高的环节及其可能引发不一致标注的原因。初步实验表明,基于标注者分歧训练的模型优于基于硬多数投票标签训练的模型。最后,我们反思了省略推理定义和指南中的结构开放性如何能够为未来资源和关注人类推理的下游NLP应用研究主观推理过程中的变异性提供支持。

English abstract

Enthymemes, arguments with unstated premises or conclusions, are pervasive in persuasive discourse, yet their annotation remains notoriously subjective. We present a resource of 1,482 tweets from politically controversial discourse, annotated by five annotators for the presence of enthymemes and their argument structure, designed to study label variation. We first revisit the definition of enthymemes and propose annotation guidelines anchored in Walton's argumentation schemes, offering a structured and constrained approach that nonetheless preserves room for the interpretive nature of the task. This contrasts with past resources, which tend to eliminate disagreement, obscuring its sources and preventing investigation of its potential benefits for model performance. We further propose a complexity analysis of the task, identifying where annotation imposes high cognitive load and may give rise to inconsistent annotation. Our preliminary experiments show that models trained on annotator disagreement outperform models trained on hard majority-vote labels. We close by reflecting on how structural openness in enthymeme definitions and guidelines enables the study of variation in subjective inferential processes for future resources and downstream NLP applications concerned with human inference.

2606.12250 2026-06-11 cs.CL New submission

Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

重新评估高性能大语言模型在波兰医学考试中的表现:真实能力还是偏差驱动?

Antoni Lasik, Jakub Pokrywka, Łukasz Grzybowski, Jeremi Ignacy Kaczmarek, Gabriela Korzańska, Janusz Świeczkowski-Feiz, Oskar Pastuszek, Paulina Hoffman, Jakub Tomasz Dąbrowski, Wojciech Kusa

Affiliations: NASK National Research Institute; Adam Mickiewicz University; ARAAI Poland; Poznań University of Medical Sciences; Centre of Postgraduate Medical Education, Poland; T. Marciniak Lower Silesian Specialist Hospital; Medical University of Warsaw

AI summary: Introduces an expanded, more challenging benchmark of Polish medical exams that reduces MCQA artifacts, and finds that standard MCQA scores overestimate the real clinical ability of LLMs; under the harder setup the best model's score drops by 28.4 and 31 percentage points.

Details
Comments
26 pages total with references and appendix, preprint
AI-generated Chinese abstract

医学领域的大语言模型(LLM)主要通过多项选择题问答(MCQA)进行评估,但由于猜测策略和答案偏差,这种方法可能高估真实的临床能力。为解决这些局限性,我们引入了一个基于波兰医学考试的扩展且更具挑战性的基准,增加了超过15,000道题目、两个新领域和四项结构修改,以减少MCQA特定伪影并更好地测试推理能力。我们评估了21个LLM,结果表明评估设计对结果影响很大。在我们的更难设置下,最佳模型(Qwen3.5-122B)在英语和波兰语考试中分别下降了28.4和31个百分点。尽管数据污染证据不足,但标准MCQA分数并不能可靠地反映真实的医学能力。为促进进一步研究,我们公开了该基准。

English abstract

Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 pp on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.

2606.12332 2026-06-11 cs.CL cs.LG New submission

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

通过信息增益衡量多轮对话中的语义进展

Paul He, Shiva Kasiviswanathan, Dominik Janzing

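Affiliations: NTU Singapore; Amazon; Amazon Research, Tübingen, Germany

AI summary: Proposes an information-theoretic information-gain metric that quantifies question-relevant semantic progress in multi-turn dialogue through a Gaussian approximation in embedding space; it needs no LLM inference and reaches competitive agreement with human judgments on several benchmarks.

Details
Comments
Preprint. 26 pages
AI-generated Chinese abstract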

评估多轮对话具有挑战性,因为质量体现在多轮之间而非单个回复。我们关注信息寻求对话的一个关键维度:语义进展,定义为对话过程中新、与问题相关且非冗余信息的累积。我们将语义进展形式化为基于问题的不确定性减少,并引入一个在嵌入空间中近似它的信息论指标。我们的主要估计器使用具有闭式更新的易处理高斯公式,而互补的最大熵论证表明,当仅保留二阶嵌入信息时,对数行列式结构更广泛地出现。该公式产生了理想的理论性质,包括单调性、跨轮次总信息增益的可加分解以及冗余证据的递减回报。与LLM作为评判者的方法不同,我们的指标在评估时不需要自回归推理,并且对于固定的嵌入模型完全可复现。在MT-Bench、Chatbot Arena和UltraFeedback上的实验表明,尽管仅针对语义进展,所提出的指标与人类判断的一致性具有竞争力,在MT-Bench和UltraFeedback上相比几个基于LLM的评判者具有更好的对齐。值得注意的是,该方法在仅CPU执行下使用轻量级嵌入模型仍然有效,表明语义进展可以在不依赖大模型能力的情况下被捕获。

English abstract

Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
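
The properties listed above (monotonicity, additive decomposition, diminishing returns) can be seen in a toy Gaussian model. The sketch below is an assumption-laden illustration, not the paper's estimator: it uses an identity prior covariance, unit observation noise, tiny hand-written "embeddings", and it omits the question conditioning entirely. Each turn contributes a rank-one precision update, so its gain is half the change in log-determinant.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def turn_gains(embeddings: List[Vector], noise_var: float = 1.0) -> List[float]:
    """Per-turn information gain (nats) under a linear-Gaussian toy model."""
    n = len(embeddings[0])
    # Assumed prior: identity covariance over the latent vector.
    cov = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    gains = []
    for e in embeddings:
        cov_e = mat_vec(cov, e)
        quad = dot(e, cov_e)
        # Matrix determinant lemma: half the log-det change of the precision.
        gains.append(0.5 * math.log(1.0 + quad / noise_var))
        # Sherman-Morrison update of the (symmetric) posterior covariance.
        scale = 1.0 / (noise_var + quad)
        cov = [[cov[i][j] - scale * cov_e[i] * cov_e[j] for j in range(n)]
               for i in range(n)]
    return gains


# Turn 2 repeats turn 1's direction; turn 3 brings a new direction.
gains = turn_gains([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print([round(g, 4) for g in gains])
print(round(sum(gains), 4))
```

The repeated second turn earns less than the first, the novel third turn earns as much as the first, and the per-turn gains add up to half the log-determinant of the final precision matrix.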

2606.12385 2026-06-11 cs.CL New submission

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

我们的模型建立在哪些模型之上?审计现代LLM中的隐形依赖

Sanjay Adhikesaven, Haoxiang Sun, Sewon Min

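Affiliations: University of California, Berkeley; Allen Institute for AI

AI summary: Presents ModSleuth, a system that recursively reconstructs LLM dependency graphs and reveals hidden dependency issues such as multi-hop license obligations and train-evaluation coupling; the tool and the resulting dependency graphs are released to support transparent analysis.

Details
AI-generated Chinese abstract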

现代LLM训练流程越来越依赖其他模型来生成数据、过滤语料库、判断输出和指导开发决策。这些依赖是递归的:一个模型可能依赖于上游工件,而该工件的自身依赖仅在单独的发布和工件中记录。因此,完整的依赖结构分散在异构的公共工件中,其复杂性和递归深度远超人类追踪能力。我们引入了ModSleuth,一个智能系统,可以从公共工件中递归重建LLM依赖图,并提供基于来源的证据。我们发现主要挑战不再是信息提取,而是定义什么构成依赖以及在不一致的文档中协调工件引用。我们通过形式化方法解决这些挑战,该方法区分直接和间接依赖,通过以操作为中心的关系表示异构管道角色,并在名称、版本和仓库之间解析工件身份。将ModSleuth应用于四个富含公共工件的LLM发布,我们恢复了1,060个来源验证的依赖,并构建了现代LLM开发的大规模依赖图。这些图揭示了多跳许可义务、训练-评估耦合、发布时与训练时工件之间的差异,以及否则难以发现的文档不一致性。我们发布ModSleuth和生成的依赖图,以支持对现代LLM日益复杂生态系统的透明分析。

English abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.

2606.12392 2026-06-11 cs.CL cs.AI New submission

System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5

CCL25-Eval 任务5系统报告:新数据集与LoRA微调Qwen2.5

Haotao Xie

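Affiliations: The Hangzhou International Innovation Institute, Beihang University

AI summary: For classical poetry translation and emotion understanding, builds the high-quality instruction dataset CCPoetry-49K and fine-tunes Qwen2.5-14B with LoRA to obtain PoetryQwen, which scores 0.757 on CCL25-Eval Task 5, a 9.7% improvement over the baseline.

Details
AI-generated Chinese abstract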

近年来,大语言模型(LLMs)在古典汉语翻译和古典诗歌生成领域取得了令人瞩目的进展。然而,针对古典诗歌精确翻译和情感语义理解的领域特定研究仍然有限。主要挑战在于大多数研究将诗歌鉴赏任务视为通用领域问题,忽略了诗歌鉴赏的独特特征,同时高质量且领域特定的数据集极为稀缺。为解决这一局限,我们将任务分解为三个子任务:术语解释、语义解释和情感推理。基于多个开源数据集,我们进行数据清洗和对齐,构建了古典诗歌指令对数据集(CCPoetry-49K),包含49,404个高质量指令-响应对,专门针对该领域进行了优化。随后,我们提出领域专用LLM,称为PoetryQwen,通过应用低秩适配(LoRA)微调Qwen2.5-14B模型。在CCL25-Eval任务5基准上的实验结果表明,PoetryQwen得分为0.757,较Qwen2.5-14B-Instruct基线(0.690)提升9.7%。这些发现明确表明,PoetryQwen在古典诗歌的精确翻译和情感理解方面显著提升了性能。我们提供了新数据集和方法论考虑,旨在支持LLMs的领域特定优化。

English abstract

Recently, large language models (LLMs) have achieved promising progress in classical Chinese translation and the generation of classical poetry. However, domain-specific research on precise translation and affective-semantic understanding of classical poetry remains limited. The main challenge is that most studies treat the poetic appreciation task as a general-domain problem, neglecting its distinctive features, while high-quality, domain-specific datasets are extremely scarce. To address this limitation, we decompose the task into three subtasks: term interpretation, semantic interpretation, and emotional inference. Based on multiple open-source datasets, we perform data cleansing and alignment to construct the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which comprises 49,404 high-quality instruction-response pairs explicitly optimized for this domain. We then propose a domain-specialized LLM, called PoetryQwen, by applying Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-14B model. Experimental results on the CCL25-Eval Task 5 benchmark demonstrate that PoetryQwen achieves a score of 0.757, a 9.7% relative improvement over the Qwen2.5-14B-Instruct baseline (0.690). These findings indicate that PoetryQwen enhances performance in precise translation and emotional understanding of classical poetry. We present a new dataset and methodological considerations intended to support the domain-specific optimization of LLMs.

2606.11337 2026-06-11 cs.AI cs.CL cs.CY Cross-list

Can AI Agents Synthesize Scientific Conclusions?

AI代理能否综合科学结论?

Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro

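Affiliations: Princeton University; Universidade Federal de Minas Gerais; Stony Brook University; Hackensack Meridian School of Medicine

AI summary: Introduces the SciConBench benchmark and the SciConHarness evaluation harness, which decompose conclusions into atomic facts and compute factual precision and recall; finds that the best frontier AI agent reaches a factual F1 of only 0.337 on scientific conclusion synthesis, that unconstrained evaluation is inflated by data leakage, and that consumer-facing agents often produce incomplete or contradictory conclusions.

Details
Comments
79 pages, 34 figures, 17 tables. Under Submission
AI-generated Chinese abstract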

科学AI代理越来越多地检索证据、跨来源推理并综合用于重要决策的结论。然而,它们在健康等高风险领域中的能力仍不明确。我们引入了SciConBench,一个大规模实时基准,包含9.11K个问题以及来自系统综述的专家撰写的结论,用于评估开放域科学结论综合。该基准采用专家验证的自动评估流程,将结论分解为原子事实,并通过事实精确率和召回率衡量正确性和全面性。为减轻数据泄露,我们进一步引入了SciConHarness,一个洁净室评估框架,为代理配备受控的网页交互以确保有效测量。评估8个前沿模型和深度研究代理,我们发现事实质量仍然较低:在洁净室设置下,最佳代理仅达到0.337的事实F1。与无约束评估相比,我们的洁净室设置持续降低性能,表明数据泄露夸大了模型真实综合能力的估计。最后,我们审计了面向消费者的代理(如Google AI Overview、OpenEvidence),发现它们经常生成不完整甚至矛盾的结论,即使真实答案可用。总体而言,我们的结果表明,科学结论的可靠综合仍然是一个开放挑战,而洁净室评估对于评估开放域AI代理至关重要。

English abstract

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
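
Factual precision, recall, and F1 over atomic facts can be sketched as below. The sketch assumes that facts have already been extracted and that matching is exact set membership; the paper's pipeline performs decomposition and matching automatically, and the fact strings and the `factual_prf` helper here are hypothetical.

```python
from typing import Set, Tuple


def factual_prf(predicted: Set[str], reference: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of predicted atomic facts against reference facts."""
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical atomic facts from an expert conclusion and an agent's answer.
reference_facts = {"treatment lowers risk", "effect is small", "evidence is low certainty"}
agent_facts = {"treatment lowers risk", "effect is small",
               "effect is large in children", "no side effects"}

precision, recall, f1 = factual_prf(agent_facts, reference_facts)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Precision penalizes unsupported statements in the agent's conclusion, while recall penalizes omissions, so the pair separates correctness from comprehensiveness as the abstract describes.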

2606.11361 2026-06-11 cs.IR cs.CL Cross-list

A PubMed-Scale Dataset of Structured Biomedical Abstracts

一个PubMed规模的生物医学结构化摘要数据集

Chia-Hsuan Chang, Haerin Song, Brian Ondov, Hua Xu

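AI summary: To address the large share of unstructured PubMed abstracts that hinders downstream text processing, builds a structured-abstract corpus of 23.2 million records, of which 5.9 million come from official XML and 17.2 million were labeled automatically with a large language model, all harmonized to a five-section format.

Details
Comments
Data and code for this work are available at this https URL and this https URL, respectively
AI-generated Chinese abstract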

结构化摘要对于生物医学文献处理至关重要,它有助于信息检索、文本挖掘和知识综合。然而,PubMed中索引的绝大部分摘要仍然是非结构化的,这给下游文本处理工作流程和应用带来了重大瓶颈。为解决这一限制,我们引入了Structured PubMed,这是一个从完整PubMed数据库编译而来的全面语料库,包含超过2320万条研究文章记录,每条记录都带有节标签。该语料库分为两个不同的子集:一个包含590万条作者结构化摘要的集合,这些摘要从官方XML文件中解析而来;另一个包含1720万条原本非结构化摘要的自动标注集合,这些摘要通过逐字提取的大语言模型流水线进行结构化。每条记录都统一在统一的五节模式下,并映射到其原始PubMed标识符、出版类型和出版日期。该数据集可用于训练句子分类模型、基准测试文本分割架构,并在前所未有的PubMed范围内进行大规模、特定节的信息提取。

English abstract

Structured abstracts are important for biomedical literature processing because they facilitate information retrieval, text mining, and knowledge synthesis. However, a large share of the abstracts indexed in PubMed remains unstructured, presenting a significant bottleneck for downstream text-processing workflows and applications. To resolve this limitation, we introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database, encompassing over 23.2 million research-article records. The corpus is divided into two distinct subsets: a collection of 5.9 million author-structured abstracts parsed from official XML files, and an automatically labeled collection of 17.2 million originally unstructured abstracts structured via a verbatim-extraction Large Language Model pipeline. Every record is harmonized under a unified five-section schema and mapped to its original PubMed identifier, publication type, and publication date. This dataset can be used to train sentence-classification models, benchmark text-segmentation architectures, and perform large-scale, section-specific information extraction at PubMed-wide scale.

2606.11562 2026-06-11 cs.LG cs.CL Cross-list

GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

GraphInfer-Bench:评估LLM在图上的推理能力基准

Zhuoyi Peng, Jingzhou Jiang, Hanlin Gu, Lixin Fan, Yi Yang

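Affiliations: The Hong Kong University of Science and Technology; Webank

AI summary: Proposes the GraphInfer-Bench benchmark, whose five tasks (description and comparison) test whether LLMs can infer, from a node and its neighbourhood, answers that cannot be retrieved from a single node or along a path; every method family leaves a gap.

Details
Comments
Code: this https URL; Dataset: this https URL
AI-generated Chinese abstract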

图分析支撑着许多应用,这些应用的答案无法从单个记录中查找或沿路径检索:洗钱团伙、药物重定位、用户偏好和科学主题都是从节点及其邻域推断出来的。我们引入GraphInfer-Bench,一个评估LLM是否能够执行这种图推理的基准:产生一个开放式的答案,该答案没有单个节点支持,也没有路径可检索。现有的图问答协议无法测试这种能力:算法模拟、节点分类、单节点描述、KG-QA和GraphRAG都允许从单个节点或沿路径检索答案。GraphInfer-Bench定义了五个任务,涵盖描述(区域是什么)和比较(区域如何不同),每个任务的设计使得真实答案不存在于任何单个节点中。发布版本包含42,000个样本,跨越六个真实世界图,自动生成并通过四层质量控制协议筛选。我们评估了四种方法族在相同任务上的表现:图-令牌对齐模型、零样本前沿闭源LLM、Graph2Text监督微调以及作为结构参考的普通GNN。没有方法族能够弥合差距。图-令牌对齐部分处理描述任务(关系、主题),但在比较任务上失败。前沿LLM在基于LLM的方法中在离群点检测和社区划分上领先,但在掩码节点预测上落后。Graph2Text SFT在描述方面是最强的基于LLM的方法,但在比较方面落后于前沿LLM。在每个任务上,普通GNN匹配或击败了最强的基于LLM的方法,在社区检测上差距最大。GraphInfer-Bench揭示了图推理是一个开放的能力差距,而不是任何单一架构的属性。

English abstract

Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.

2606.11642 2026-06-11 cs.HC cs.CL Cross-list

3-Key-Input: Exploring the Theoretical Minimum Keys for Text Entry

3-Key-Input: 探索文本输入的理论最少按键数

Naoki Kimura

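AI summary: Systematically evaluates text entry systems that combine a language model with 2-5 physical keys, and finds that 3 keys + GPT-4o reach a character error rate of 9.46%, suggesting that 3 keys are a practical minimum under a strong language-model prior.

Details
Comments
6 pages, 1 figure, 7 tables. Published in ICASSP 2026
AI-generated Chinese abstract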

如果我们为模糊键盘配备现代语言模型,可以将物理按键数量减少到多少?更少的按键在辅助设备和移动设备等受限场景中增加了硬件设计自由度。本文系统评估了使用2-5个物理按键结合基于语言模型的消歧的文本输入系统。在包含300个句子的英文语料库(商务/会话/技术各100句)上,我们比较了按键数量(2-5)、字母到按键映射(基于布局/基于频率/故意最坏情况)和解码器(仅Trie、GPT-2束搜索、GPT-4o选择)。我们发现,3键+GPT-4o实现了字符错误率(CER)9.46%和词错误率(WER)12.20%,相对于2键(CER 23.3%)CER降低了59%。在3键时,键流熵为1.54比特/字符;虽然增加到5键提高了准确率(CER 5.4%),但边际增益递减。在标准设计下,映射选择影响较小(ΔCER < 0.5个百分点),即使故意最坏映射也仅使CER增加+0.5个百分点,而技术句子的错误率大约是商务句子的两倍。这些结果表明,在我们评估的离线设置下,在强语言模型先验下,3键是通用英语的实用最小值。

English abstract

How far can we reduce the number of physical keys if we endow an ambiguous keyboard with modern language models? Fewer keys increase hardware design freedom in constrained settings such as assistive devices and mobile form factors. This paper systematically evaluates text entry systems using 2-5 physical keys combined with language-model-based disambiguation. On a 300-sentence English corpus (100 sentences each for Business / Conversational / Technical), we compare key counts (2-5), letter-to-key mappings (layout-based / frequency-based / intentionally worst-case), and decoders (Trie-only, GPT-2 beam search, GPT-4o selection). We find that 3 keys + GPT-4o achieves character error rate (CER) 9.46% and word error rate (WER) 12.20%, reducing CER by 59% relative to 2 keys (CER 23.3%). At 3 keys, the key-stream entropy is 1.54 bits/char; while increasing to 5 keys improves accuracy (CER 5.4%), the marginal gains diminish. Mapping choice has a small impact under standard designs (ΔCER < 0.5 pp), and even an intentionally worst mapping degrades CER by only +0.5 pp, whereas Technical sentences yield roughly twice the error rate of Business. These results suggest that, in our evaluated offline setting under a strong LM prior, 3 keys are a practical minimum for general English.
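
To make the ambiguity and the error metric concrete, here is a minimal sketch. The alphabetical 3-key layout and the tiny vocabulary are hypothetical (the abstract does not give the paper's mappings), and CER is computed in the usual way as Levenshtein distance divided by reference length, which is assumed rather than confirmed to match the paper's exact definition.

```python
from typing import List

# Hypothetical alphabetical layout: a-i -> key 0, j-r -> key 1, s-z -> key 2.
KEY_OF = {ch: str(i // 9) for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}


def encode(word: str) -> str:
    """Key sequence a user would press for a lowercase alphabetic word."""
    return "".join(KEY_OF[ch] for ch in word.lower())


def candidates(key_sequence: str, vocabulary: List[str]) -> List[str]:
    """All vocabulary words that share the same key sequence."""
    return [w for w in vocabulary if encode(w) == key_sequence]


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length."""
    return edit_distance(ref, hyp) / len(ref)


vocabulary = ["cat", "bat", "dog", "hat"]
ambiguous = candidates(encode("cat"), vocabulary)
relative_reduction = (23.3 - 9.46) / 23.3  # CER figures quoted in the abstract

print(ambiguous)
print(cer("kitten", "sitting"))
print(f"{relative_reduction:.1%}")
```

Several words collapse onto one key sequence, which is the gap the language-model decoder has to close; the last line reproduces the 59% relative CER reduction from the two figures in the abstract.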

2606.11654 2026-06-11 cs.IR cs.CL cs.HC cs.SI 交叉投稿

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

长尾而非首页:众包高亮显著性的冷启动预测

Kazuki Nakayashiki, Keisuke Watanabe

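AI summary: The paper studies how to predict a document's crowd highlight salience from its text before any reader marks exist. A logistic ranker over sentence embeddings and positional/contextual features beats the lead (position) baseline by +0.044 average precision, and comparisons against unsupervised baselines indicate that the edge comes from learning on real reader marks.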

详情
Comments
10 pages, 3 figures, 4 tables
AI中文摘要

社交高亮工具最有用的信号——一群读者标记的段落——仅存在于人们已经阅读过的文档中。能否在标记积累之前,从文本预测文档的聚合众包显著性?先前关于此数据的研究发现,零样本语言模型恢复高亮位置的效果不如简单的基线(位置),因此我们询问,在高亮语料上训练的模型能否击败该基线。使用预注册的模型阶梯和按文档的聚类自助法,我们发现一个微小但稳健的优势:基于句子嵌入和位置/上下文特征的对数排序器比位置基线平均精度高出+0.044(95%置信区间[+0.029, +0.058];在97%的重采样中超过预注册的边界delta=0.03,且在流水线重复运行中稳定)。两种无监督抽取式基线(质心、LexRank风格中心性)均输给位置基线,而训练模型比它们高出+0.108,因此该优势并非由通用无监督代理恢复——它反映了从真实读者标记中学习。在产品术语中,precision@3从0.25上升到0.39(相对提升55%),模型在69%的文档上击败位置基线。消融实验将优势归因于原始嵌入(+0.014)和训练增强(+0.010),每个都有正的置信区间。该优势并非时间泛化失败,我们也没有发现内容漂移或近似重复泄露可以解释它的证据。标准化回归显示,优势主要由文档流行度(流行度越低,优势越大)和标签可靠性决定。它仅在流行度最高的内容上几乎消失;在那里,是位置基线变强,而非模型变弱。由于我们的评估条件设定在最终积累了读者的文档上,这些结果是回顾性的冷启动模拟。

英文摘要

A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
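
The two evaluation ingredients named in the abstract, precision@k and a by-document cluster bootstrap, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the percentile interval, and the resample count are choices made here, not details taken from the paper.

```python
import random


def precision_at_k(labels, scores, k):
    """Fraction of the k highest-scored sentences that readers highlighted."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in ranked) / k


def cluster_bootstrap_ci(per_doc_diff, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-document metric difference.

    Whole documents are resampled with replacement, so sentences from the
    same document always stay together (the document is the cluster).
    """
    rng = random.Random(seed)
    n = len(per_doc_diff)
    means = []
    for _ in range(n_resamples):
        sample = [per_doc_diff[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy document: 1 marks a highlighted sentence, scores come from some ranker.
p3 = precision_at_k([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1], k=3)

# Hypothetical per-document differences (model minus lead baseline).
lower, upper = cluster_bootstrap_ci([0.0, 0.1, 0.02, 0.08, 0.05, 0.03])
```

Resampling at the document level rather than the sentence level keeps correlated sentences within one document from narrowing the interval artificially, which is the usual motivation for a cluster bootstrap.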

2606.11702 2026-06-11 cs.CV cs.AI cs.CL 交叉投稿

MedCTA: A Benchmark for Clinical Tool Agents

MedCTA: 临床工具智能体基准

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

发表机构 * King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学) Massachusetts Institute of Technology (MIT)(麻省理工学院)

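AI summary: Introduces MedCTA, a benchmark grounded in realistic multimodal clinical inputs (radiology images, pathology slides, and reports) that evaluates how medical AI agents plan and execute tool retrieval, evidence acquisition, and integration.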

详情
Comments
Project Page: this https URL Code: this https URL Data: this https URL
AI中文摘要

为了做出临床合理的决策,医疗AI智能体需要超越简单的识别,具备工具检索、证据获取和集成能力。现有基准主要评估孤立的感知或单轮问答,因此对规划、工具调用和部署可靠性的失败可见性有限。我们提出了MedCTA,一个用于评估医疗工具智能体的基准,基于临床验证的、步骤隐含的任务,这些任务基于真实的多模态临床输入,包括放射影像、病理切片和报告。MedCTA包含107个真实临床任务,具有临床医生验证的、在5个部署工具上的可执行轨迹,并支持对工具选择、参数有效性、执行稳定性、轨迹保真度和结果质量的过程感知评估。我们对18个开源和闭源多模态模型进行了基准测试,发现即使是最先进的系统在多步骤临床工具使用中仍然脆弱:自主部署主要由协议失败、过早停止和错误工具调用主导,而黄金标准工具路由带来了巨大但仍不完整的改进。这些结果表明,强大的骨干感知能力并不能转化为临床环境中可靠的智能体行为。MedCTA为审计、诊断和推进可信赖的医疗AI智能体提供了一个严格的测试平台。数据集和评估套件可在该https URL获取。

英文摘要

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at this https URL

2606.12169 2026-06-11 cs.CV cs.AI cs.CL cs.LG 交叉投稿

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

OpenMedReason: 医学视觉语言模型的科学推理监督

Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi

发表机构 * York University(约克大学) Vector Institute(向量研究所) University of British Columbia(不列颠哥伦比亚大学) University of Toronto(多伦多大学) Unity Health Toronto / St. Michael’s Hospital(多伦多联合健康/圣迈克尔医院) University Health Network(大学健康网络) Arc Institute(弧研究所) Queen's University(女王大学)

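AI summary: Introduces OpenMedReason, a large-scale open medical reasoning corpus of approximately 450K image-question-answer instances whose reasoning traces are primarily derived from biomedical scientific articles, together with the held-out benchmark OpenMedReason-Bench for fine-grained evaluation. The corpus is effective in both supervised fine-tuning and reinforcement-based alignment.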

详情
Comments
42 pages, 9 figures, 24 tables. Dataset and code: this https URL
AI中文摘要

高风险临床使用大型视觉语言模型(LVLMs)需要基于视觉证据和临床知识的推理,而不仅仅是正确的最终答案。我们引入了OpenMedReason,这是一个大规模、开放的多模态医学推理语料库,包含约45万图像-问题-答案实例,其推理轨迹主要来自策划的生物医学、人类撰写的科学文章。OpenMedReason提供了超越合成思维链的高保真监督,涵盖了多种医学领域视觉模态,如放射学扫描、显微图像、可见光照片、图表等。我们辅以OpenMedReason-Bench,这是一个留出基准,允许沿三个互补的能力轴(包括感知、医学知识和推理)对LVLMs进行细粒度评估,从而实现超越最终答案准确性的诊断性评估。OpenMedReason是一个丰富的训练资源,在监督微调(SFT)和基于强化的对齐中均显示出有效性。使用OpenMedReason进行训练,在VQA准确率上比基础模型平均提高20%,并且性能达到最强可比规模医学LVLMs的4.2%以内。细粒度性能分析证实,增益并非集中在单一轴上:OpenMedReason共同提升了感知、医学知识和推理,并且在86.1%的成对比较中,其推理轨迹优于基础模型。我们在以下网址发布代码和数据集:此 http URL。

英文摘要

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.

2606.12344 2026-06-11 cs.LG cs.CL 交叉投稿

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Claw-SWE-Bench:评估OpenClaw风格代理框架在编码任务上的基准

Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang

发表机构 * TokenRhythm Technologies(TokenRhythm 技术公司) Infinigence AI Peking University(北京大学) City University of Hong Kong(香港城市大学) SEE Fund(SEE 基金) Shanghai Jiaotong University(上海交通大学) Beijing Jiaotong University(北京交通大学) Tsinghua University(清华大学)

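AI summary: Introduces the Claw-SWE-Bench benchmark, which uses an adapter protocol to evaluate heterogeneous agent harnesses under uniform settings. Adapter design proves essential for coding performance, and model and harness choice substantially affect pass rate and cost.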

详情
AI中文摘要

通用代理(如OpenClaw)越来越多地被用作自主工具使用者,但其编码能力难以在SWE-bench下衡量:通用代理本身不满足评分所需的干净Docker工作区、补丁和预测合约。我们引入了Claw-SWE-Bench,一个多语言SWE-bench风格的基准和适配器协议,使异构代理框架(即claws)在公平设置下具有可比性,包括固定提示、运行时预算、工作区合约、补丁提取过程和评估器。完整基准包含8种语言、43个仓库的350个GitHub问题解决实例,这些实例来自SWE-bench-Multilingual和SWE-bench-Verified-Mini,经过未来提交清理。我们还发布了Claw-SWE-Bench Lite用于更快验证,这是一个通过成本感知、排名感知程序从17个校准列中选出的80个实例子集。在完整基准上,使用最小直接差异适配器的OpenClaw仅获得19.1%的Pass@1,而完整适配器在相同GLM 5.1骨干下达到73.4%,表明适配器设计对于使OpenClaw风格的框架有效执行编码任务至关重要。在OpenClaw × 9模型扫描和5框架 × 2模型扫描中,模型选择使Pass@1变化29.4个百分点,固定模型下框架选择变化27.4个百分点;精度相似的系统在总API成本上可能差异很大。因此,Claw-SWE-Bench将框架和成本核算视为SWE风格编码代理评估的第一类轴,提供了完整基准和低成本参考集,用于可重复比较。数据可在https://this URL 和 https://this URL 获取。

英文摘要

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at this https URL and this https URL.

2509.25359 2026-06-11 cs.CL cs.AI 版本更新

Geometric Metrics and LLMs: What They Measure and When They Work

几何度量与大语言模型:它们测量什么以及何时有效

Viacheslav Yusupov, Anna Antipina, Ameliia Alaeva, Danil Maksimov, Anna Vasileva, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, Egor Shvetsov

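AI summary: A systematic stress-test of geometric metrics for LLM evaluation. Some metrics mainly reflect output length, geometric metrics add modest but real information beyond text statistics, and failure detection is identified as the most promising application.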

详情
AI中文摘要

我们提出了对大语言模型评估中几何度量的系统性压力测试。基于排名的内部表示几何特性作为无参考质量信号显示出前景,但其可靠的条件仍不清楚。我们评估了八种常用度量:内在维度估计器、谱范数及相关量,在六个测试模型(0.5-8B)和八个生成器上对比任务,将真实的几何信号与文本长度效应以及标准文本统计已捕获的信息区分开。三个发现出现。首先,一些度量(特别是Schatten范数和MOM)主要反映输出长度,一旦控制长度,其明显的区分能力就崩溃。其次,几何度量在文本统计之外增加了适度但真实的信息:结合它们,分类器在6路生成器识别上达到78%的准确率,而仅用文本统计为69%。第三,度量并不追踪文本质量的通用概念,而是显示内在维度与词汇多样性(RTTR)之间仅存在中等关联。我们给出了特定用例的建议,并指出故障检测是最有前景的近期应用。

英文摘要

We present a systematic stress-test of geometric metrics for LLM evaluation. Rank-based geometric properties of internal representations have shown promise as reference-free quality signals, but the conditions under which they are reliable remain unclear. We evaluate eight commonly used metrics (intrinsic-dimensionality estimators, spectral norms, and related quantities) across six tester models (0.5-8B) and eight generators on contrasting tasks, separating genuine geometric signal from text-length effects and from what standard text statistics already capture. Three findings emerge. First, some metrics (notably Schatten Norm and MOM) mainly reflect output length, and their apparent discriminative power collapses once length is controlled. Second, geometric metrics add modest but real information beyond text statistics: combined with them, a classifier reaches 78% accuracy on 6-way generator identification versus 69% for text statistics alone. Third, rather than tracking a general notion of text quality, the metrics show only a moderate association between intrinsic dimensionality and lexical diversity (RTTR). We give use-case-specific recommendations and identify failure detection as the most promising near-term application.

2510.23508 2026-06-11 cs.CL 版本更新

M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset

M4FC:一个多模态、多语言、多文化、多任务的真实世界事实验证数据集

Jiahui Geng, Jonathan Tonglet, Iryna Gurevych

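AI summary: To address the small size, narrow language coverage, and single-task focus of existing fact-checking datasets, the paper introduces M4FC, a multimodal dataset of 4,982 images paired with 6,980 claims that spans six fact-checking tasks and comes with baseline results.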

详情
Comments
Preprint under review. Code and data available at: this https URL
AI中文摘要

现有的多模态事实验证真实世界数据集存在多个局限性:实例数量少,仅覆盖一种或两种语言,只关注单一任务,或依赖外部新闻文章集来获取真实声明。为解决这些不足,我们引入了M4FC,一个新的真实世界数据集,包含4982张图片和6980条声明。这些图片由来自22个组织的专业事实核查员验证,代表了多样化的文化和地理背景。每条声明以十种语言中的一种或两种提供。M4FC涵盖六个多模态事实验证任务:视觉声明提取、声明者意图预测、虚假图像检测、图像语境化、位置验证和裁决预测。我们为所有任务提供了基线结果,并分析了组合中间任务对裁决预测性能的影响。我们公开了数据集和代码。

英文摘要

Existing real-world datasets for multimodal fact-checking have multiple limitations: they contain few instances, cover only one or two languages, focus only on one task, or rely on external news article sets for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent a diverse range of cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake image detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks affects verdict prediction performance. We make our dataset and code publicly available.

2601.03792 2026-06-11 cs.CL 版本更新

VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation

VietMed-MCQ:面向越南传统医学评估的一致性过滤数据合成框架

Huynh Trung Kiet, Dao Sy Duy Minh, Nguyen Dinh Ha Duong, Le Hoang Minh Huy, Long Nguyen, Dien Dinh

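AI summary: Introduces VietMed-MCQ, a dataset of 3,190 multiple-choice questions built with retrieval-augmented generation and consistency filtering. Validation by one medical expert and four students gave 94.2% approval, and benchmarks show general-purpose models with strong Chinese priors outperforming Vietnamese-centric models.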

详情
Comments
The authors have withdrawn this article because the current version is still undergoing substantial revision. Several components of the data synthesis framework, consistency-filtering procedure, evaluation protocol, and experimental analysis are being refined and expanded. As a result, the current manuscript should not be considered a complete or final representation of the work
AI中文摘要

大型语言模型(LLM)在通用医学领域表现出显著能力,但在越南传统医学(VTM)等特定文化领域性能大幅下降,主要原因是缺乏高质量、结构化的基准。本文提出VietMed-MCQ,一个通过检索增强生成(RAG)管道和自动一致性检查机制生成的新型多选题数据集。与之前的合成数据集不同,我们的框架采用双模型验证方法,通过独立答案验证确保推理一致性,尽管基于子串的证据检查存在已知局限性。完整数据集包含3190道题,涵盖三个难度级别,并经过一名医学专家和四名学生的验证,达到94.2%的通过率,且评分者间一致性较高(Fleiss' kappa = 0.82)。我们在VietMed-MCQ上对七个开源模型进行了基准测试。结果显示,具有强中文先验的通用模型优于以越南语为中心的模型,突显了跨语言概念迁移,但所有模型在复杂诊断推理方面仍存在困难。我们的代码和数据集已公开,以促进低资源医学领域的研究。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.

2601.07506 2026-06-11 cs.CL 版本更新

Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

与参考标准对照评判:揭示LLM评判者在QA评估中知识驱动的失败模式

Dongryeol Lee, Yerin Hwang, Taegwan Kang, Minwoo Lee, Younhyung Chae, Kyomin Jung

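AI summary: When LLMs serve as automatic QA judges, grading reliability degrades severely if the provided reference conflicts with the judge's parametric knowledge. A swapped-reference framework is used to study this systematically, showing that judges over-rely on parametric knowledge and disregard the reference, and that common prompt-based mitigations do not remove the failure.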

详情
Comments
Under review, 21 pgs, 11 figures, 7 tables
AI中文摘要

虽然大型语言模型(LLMs)越来越多地被用作问答(QA)和其他参考条件评估任务的自动评判者,但关于它们遵循所提供的参考标准的能力知之甚少。我们识别出这种基于参考的LLM QA评估的一个关键失败模式:当提供的参考标准与评判模型的参数知识冲突时,产生的评分变得不可靠,从而严重降低评估保真度。为了系统研究这一现象,我们引入了一个受控的交换参考答案QA框架,该框架引发参考-信念冲突。具体来说,我们将参考答案替换为错误实体,并构建原始和交换参考与相应对齐的候选答案的多样化配对。令人惊讶的是,在广泛的评判模型集合中,交换参考下的评分可靠性急剧下降。我们通过实验表明,这种脆弱性是由评判者过度依赖参数知识驱动的,导致评判者在冲突情况下忽略给定的参考标准。最后,我们发现这种失败在常见的基于提示的缓解策略下仍然存在,突显了LLM作为评判者评估的根本局限性,并激励了强制执行更强参考遵循的基于参考的协议。

英文摘要

While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.

2601.22025 2026-06-11 cs.CL cs.AI cs.IR cs.SE 版本更新

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

当通用提示改进有害:LLM应用的评估驱动迭代

Daniel Commey

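AI summary: Proposes the Minimum Viable Evaluation Suite (MVES), a structured evaluation framework paired with reproducible local experiments, which show that generic prompt additions do not yield monotonic improvements and support evaluation-driven prompt iteration.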

详情
Comments
Technical report. 42 pages, 3 figures. Code, test suites, and result logs: this https URL
AI中文摘要

评估大型语言模型(LLM)应用与传统软件测试不同,因为输出是概率性的、语义可变的,并且对提示和模型变化敏感。本技术报告提出了最小可行评估套件(MVES),一种面向审计的应用级LLM评估结构。MVES将应用类别与失败模式、指标、所需工件和验证证据联系起来,涵盖通用LLM应用、检索增强系统和智能体工作流。我们将该框架与可复现的本地评估工具配对,包括结构化提取、RAG引用/内容合规性和指令遵循检查。使用Ollama与Llama 3 8B Instruct和Qwen 2.5 7B Instruct,我们在扩展的每套30例消融实验中评估了五种提示条件。结果表明,在测试的本地条件下,通用提示添加不会产生单调改进:更强的输出合同提示提高了两种模型的严格提取,而RAG引用/内容合规性在某些通用规则条件下下降。观察到的最显著下降发生在Qwen 2.5上,当通用规则附加到用户提示时,RAG从26/30下降到9/30。这些发现支持评估驱动的提示迭代:提示更改应被视为潜在的回归风险,并在部署前针对特定任务套件进行测试。随附的存储库包含测试套件、提示变体、评估工具、原始结果日志和复现所报告本地消融所需的脚本。

英文摘要

Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence across general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness covering structured extraction, RAG citation/content-compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions over expanded 30-case-per-suite ablations. The results show that, in the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict extraction for both models, while RAG citation/content-compliance declines under some generic-rule conditions. The largest observed decline occurs for Qwen 2.5 on RAG when generic rules are appended to the user prompt, from 26/30 to 9/30. These findings support evaluation-driven prompt iteration: prompt changes should be treated as potential regression risks and tested against task-specific suites before deployment. The accompanying repository contains the test suites, prompt variants, evaluation harness, raw result logs, and scripts needed to reproduce the reported local ablations.

2602.10908 2026-06-11 cs.CL cs.LG stat.ML 版本更新

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

SoftMatcha 2:一种用于万亿级语料库的快速软模式匹配器

Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi

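AI summary: Proposes SoftMatcha 2, an ultra-fast soft search algorithm built on suffix arrays and word vectors. With dynamic corpus-aware pruning and a disk-aware design, it searches trillion-scale corpora in under 0.3 seconds while allowing semantic variations via substitution, insertion, and deletion, and it uncovers benchmark contamination.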

详情
Comments
Accepted at ICML2026. Project Page & Web Interface: this https URL, Source Code: this https URL
AI中文摘要

我们提出SoftMatcha 2,一种超快速且灵活的搜索算法,能够在0.3秒内搜索万亿规模的自然语言语料库,同时允许以替换、插入和删除形式进行的语义变体。我们的方法采用基于后缀数组的字符串匹配,该数组随语料库规模扩展良好,并将单词表示为向量,这支撑了其语义灵活性。为了缓解查询语义放松导致的组合爆炸,我们的方法建立在两个关键算法思想上:动态语料感知剪枝和由磁盘感知设计实现的快速精确查找。我们从理论上分析了所提出方法的效率,表明它可以缓解搜索空间的指数增长。在FineWeb-Edu(Lozhkov等人,2024)(1.4T tokens)上的实验表明,与现有方法infini-gram(Liu等人,2024)、infini-gram mini(Xu等人,2025)和SoftMatcha(Deguchi等人,2025)相比,它实现了显著更低的搜索延迟。作为实际应用,我们的方法发现了现有方法遗漏的训练语料库中的基准污染,并且也有利于信息检索和释义检测。我们还提供了一个在线演示,支持七种语言的语料库快速软搜索。

英文摘要

We present SoftMatcha 2, an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while allowing semantic variations in the form of substitution, insertion, and deletion. Our approach employs string matching based on suffix arrays that scales well with corpus size, and represents words as vectors, which underpin its semantic flexibility. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: dynamic corpus-aware pruning and fast exact lookup enabled by a disk-aware design. We theoretically analyze the efficiency of the proposed method, indicating that it can mitigate exponential growth in the search space. Empirically, on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), it attains substantially lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, our method uncovers benchmark contamination in training corpora that existing approaches miss, and it also benefits information retrieval and paraphrase detection. We also provide an online demo of fast, soft search across corpora in seven languages.
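
The suffix-array lookup that the abstract builds on can be sketched for the exact-match case only. The sketch below is not the SoftMatcha 2 algorithm: it omits the word-vector relaxation, the corpus-aware pruning, and the disk-aware layout, and its naive construction is only suitable for toy inputs. All function names are assumptions.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically.

    Naive construction by sorting suffix slices; fine for a toy corpus,
    far too slow and memory-hungry for a real one.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, pattern):
    """Return all start positions where `pattern` occurs, via binary search."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return tokens[start:start + m]

    # First rank whose m-token prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose m-token prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


corpus = "the cat sat on the cat mat".split()
sa = build_suffix_array(corpus)
hits = find_occurrences(corpus, sa, ["the", "cat"])
```

Because all suffixes sharing a prefix sit next to each other in the sorted order, each query costs two binary searches regardless of how many matches exist, which is the property that lets this family of indexes scale with corpus size.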

2603.24080 2026-06-11 cs.CL cs.DB 版本更新

LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale

LLMpedia:一个大规模实现LLM百科全书知识的透明框架

Muhammed Saeed, Simon Razniewski

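AI summary: Proposes the LLMpedia framework, which materializes approximately 1.3M encyclopedia articles from three model families and audits them against Wikipedia and web evidence. The verifiable true rate falls well below MMLU scores, exposing a factuality gap in model knowledge.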

详情
AI中文摘要

像MMLU这样的基准测试表明,旗舰语言模型的事实性饱和度超过90%。LLMpedia显示这一图景并不完整。我们从三个模型家族的参数记忆中具体化出约130万篇百科全书文章,然后针对维基百科和精选网络证据审计每一条声明。对于gpt-5-mini,在维基百科覆盖的主题上,可验证真实率为68.4%——比MMLU低超过21个百分点——这一差距主要由不可验证性(30.5%)驱动,而非反驳(1.2%)。在维基百科之外,针对精选网络证据审计的前沿文章达到57.6%;维基百科仅覆盖模型呈现主题的56.7%,三个模型家族在主题选择上仅有7.3%的重叠。在受先前Grokipedia分析启发的检索陷阱基准测试中,LLMpedia在文本相似度约为维基百科一半的情况下更加事实准确。每个提示、文章和判决都已发布。数据、代码、界面:此 https URL。

英文摘要

Benchmarks like MMLU suggest flagship language models approach factuality saturation above 90%. LLMpedia shows this picture is incomplete. We materialize ~1.3M encyclopedia articles entirely from parametric memory across three model families, then audit every claim against Wikipedia and curated web evidence. For gpt-5-mini, the verifiable true rate is 68.4% on Wikipedia-covered subjects - more than 21 pp below MMLU - and the gap is driven by unverifiability (30.5%), not refutation (1.2%). Beyond Wikipedia, frontier articles audited against curated web evidence reach 57.6%; Wikipedia covers only 56.7% of model-surfaced subjects, and three model families overlap in just 7.3% of subject choices. In a retrieval-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia is more factual at roughly half the textual similarity to Wikipedia. Every prompt, article, and verdict is released. Data, code, interface: this https URL.

2605.23694 2026-06-11 cs.CL 版本更新

ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

ChartFI: 多模态大语言模型图表描述的忠实性与洞察力基准测试

Fen Wang, Zekai Shao, Qiman Kang, Chunran Hu, Zhixuan Zhang, Lexu Xie, Chao Liu, Siming Chen

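AI summary: Introduces ChartFI-Bench, a benchmark of 896 complex chart-description pairs with four evaluation metrics (Faithfulness, Coverage, Informativeness, Acuity) for systematically assessing chart descriptions generated by multimodal large language models.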

详情
AI中文摘要

图表描述对于可访问性、跨模态检索以及帮助读者从复杂可视化中提取洞察至关重要。随着多模态大语言模型(MLLMs)越来越多地被用于自动生成图表描述,一个关键问题随之出现:这些模型描述图表的忠实性和洞察力究竟如何?当前的基准测试在两个方面存在不足:现有数据集由简单的、同质化的图表与浅显的、枚举事实的描述组成;而流行的评估指标未能捕捉描述质量的多面性。为弥补这些不足,我们提出了图表忠实性与洞察力基准(ChartFI-Bench)。我们首先总结了高质量图表描述的四个维度:事实准确性、显著特征强调、领域知识引导以及图表-文本互补性。在这些维度的指导下,我们构建了一个包含896个图表-描述对的高质量基准,这些对具有视觉上复杂的图表和语义丰富的描述。此外,我们设计了四个对齐的评估指标——忠实性、覆盖率、信息量和敏锐度——以系统评估描述在这些维度上的质量。在主流的MLLMs上进行的实验证明了所提出框架的有效性,并揭示了现有模型中的常见弱点。

英文摘要

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics -- Faithfulness, Coverage, Informativeness, and Acuity -- to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

2605.28882 2026-06-11 cs.CL cs.AI cs.SD 版本更新

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

GrowLoop: 由人类种子驱动的自进化对话评估

Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Yue Liu

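AI summary: Targeting three challenges in evaluating human-likeness in open-ended conversation (tacit knowledge, disagreement over criteria, and evolving standards), the paper proposes GrowLoop, a self-evolving evaluation system that starts from minimal human seed annotations, iteratively extracts evaluation rubrics through Heuristic Learning, and uses a Rubric-Case co-evolution mechanism to keep adapting as models advance and scenarios shift.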

详情
AI中文摘要

随着大语言模型的快速发展,评估开放域对话中的类人性变得越来越重要。然而,类人性是一种隐性知识,人类可以直观感知,但其背后的标准难以明确表述。人类判断差异很大,在某些情况下高度一致,在其他情况下则存在合理分歧。同时,人类判断背后的标准仍然是隐性的,没有明确的基础来构建案例。此外,什么算作类人并非一成不变,而是随着模型能力和人类期望而演变。尽管在评估方法上取得了进展,如专家编写的基准、奖励模型和自进化基准,但没有一种方法能同时解决这三个挑战。因此,我们提出了GrowLoop,一个自进化的对话评估系统,能够随着模型进步和场景变化而持续适应。以最小的人工种子标注作为初始动力,LLM代理通过启发式学习迭代提取和细化评估标准。在标注者意见一致的地方要求人机一致,而在意见分歧的地方只要求合理性。此外,标准-案例协同进化机制实现了持续进化,当评估目标发生变化时,通过新的种子进行扩展。应用于开放域对话中的类人性评估,生成的标准不仅在与人判断的一致性上显著优于现有方法,而且还发现了标注者忽略的问题。由此产生的基准能够有效区分不同能力层级的模型,并揭示其不足之处,同时能够泛化到新场景并随着模型进步而适应。我们的工作将基准测试范式从手动更新或难度扩展转变为全面、持续的自我进化。

英文摘要

With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-likeness is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. Starting from minimal human seed annotations, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution. When the evaluation target shifts, new human seeds expand the system's coverage accordingly. When applied to human-likeness evaluation in open-ended conversation, the AI judge guided by these rubrics not only substantially outperforms existing methods in alignment with human judgments, but also uncovers issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.

2606.09830 2026-06-11 cs.CL 版本更新

Automated Scoring of Arabic Text Using Large Language Models: A Literature Review

使用大型语言模型对阿拉伯语文本进行自动评分:文献综述

Khaoula Dahimi, Hadda Cherroun, Amel Belabbaci

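AI summary: Reviews LLM-based methods for automated scoring of Arabic text, covering short answer grading and essay scoring, introduces a five-dimension taxonomy, and compares the methods, datasets, and performance of existing studies.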

详情
Comments
Accepted at NCMAI 2026
AI中文摘要

在现代教育系统中,自动文本评分(ATS)通过无需人工干预即可实现学习者回答的可扩展和一致评估,发挥着核心作用。最近,LLM和阿拉伯语特定数据集的可访问性增加,重新激发了这一领域的兴趣。在这项工作中,我们研究了基于LLM的阿拉伯语文本自动评估方法,重点关注简答题评分(ASAG)和作文评分(AES)。我们进一步引入了一个结构化的分类体系,包括五个维度:应用领域、反馈生成能力、部署的LLM架构、与能力参考框架的一致性以及提示工程策略。通过应用这一分类体系,我们对现有研究进行了比较分析,考察了它们的方法论、数据集、评估指标和报告的性能。研究结果强调了在阿拉伯语ATS领域开展持续且具有教育基础的研究努力的必要性,因为这对于提高阿拉伯语社区的教育质量具有重要意义。

英文摘要

In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and Arabic-specific datasets has sparked renewed interest in this area. In this work, we investigate LLM-based approaches for the automated evaluation of Arabic texts, focusing on both short answer grading (ASAG) and essay scoring (AES). We further introduce a structured taxonomy comprising five dimensions: application domain, feedback generation capability, LLM architecture deployed, alignment with competency referential frameworks, and prompt engineering strategy. By applying this taxonomy, we conduct a comparative analysis of existing studies, examining their methodological approaches, datasets, evaluation metrics, and reported performance. The findings highlight the need for sustained and pedagogically grounded research efforts in Arabic ATS, given its significance for improving educational quality across Arabic-speaking communities.

2601.00791 2026-06-11 cs.LG cs.AI cs.CL cs.LO 版本更新

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

推理的几何:有效数学推理的谱特征

Valentin Noël

AI总结 通过将注意力矩阵视为加权词图,提取四个无需学习的谱诊断指标(Fiedler值、高频能量比、谱熵和平滑度),有效区分有效推理与模式匹配,在多个模型上达到85-96%的分类准确率。

详情
Comments
30 pages, 13 figures, Accepted at ICML 2026 (main track)
AI中文摘要

验证语言模型是真正推理还是模式匹配仍然是一个开放问题:学习型验证器成本高昂,基于输出的启发式方法脆弱。我们证明,有效的数学推理在Transformer注意力中诱导出可测量的、无需训练的谱特征。通过将每个注意力矩阵视为加权词图,我们提取四个诊断指标:Fiedler值、高频能量比(HFER)、谱熵和平滑度,这些指标无需学习参数。在来自四个架构家族的七个模型上的实验产生了高达Cohen's $d = 3.30$($p < 10^{-116}$)的效应量,实现了$85$--$96\%$的单阈值分类准确率。两个发现加深了理解。首先,\emph{柏拉图式有效性}:谱信号追踪逻辑连贯性而非编译器接受性,因超时或缺失导入而被拒绝的证明被正确分类为有效,这一区别通过人工审核确认($\kappa = 0.82$,$n = 51$)。其次,\emph{架构确定性}:滑动窗口注意力将判别特征从HFER转移到平滑度($d = 2.09$,$p < 10^{-48}$),表明注意力设计决定了哪个谱通道编码推理质量。因果消融证实该特征追踪归纳头电路。该方法泛化到非形式化思维链($d = 0.78$,$p < 10^{-3}$),并且在证明搜索中,HFER重排序将Best-of-16 Pass@1提高了$+4.4$--$6.6\%$,匹配了完全监督探针AUC的$98\%$且无需标签。谱图分析是一种原则性的、架构感知的推理验证原语。

英文摘要

Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle. We show that valid mathematical reasoning induces a measurable, training-free spectral signature in transformer attention. By treating each attention matrix as a weighted token graph, we extract four diagnostics that require no learned parameters: Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and smoothness. Experiments across seven models from four architectural families yield effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling $85$--$96\%$ single-threshold classification accuracy. Two findings sharpen the interpretation. First, \emph{Platonic validity}: the spectral signal tracks logical coherence rather than compiler acceptance. Proofs rejected for timeouts or missing imports are correctly classified as valid, a distinction confirmed by a manual audit ($\kappa = 0.82$, $n = 51$). Second, \emph{architectural determinism}: Sliding Window Attention shifts the discriminative feature from HFER to smoothness ($d = 2.09$, $p < 10^{-48}$), showing that attention design governs which spectral channel encodes reasoning quality. Causal ablation confirms the signature traces induction-head circuits. The method generalises to informal chain-of-thought ($d = 0.78$, $p < 10^{-3}$), and in proof search, HFER reranking improves Best-of-16 Pass@1 by $+4.4$--$6.6$\%, matching $98\%$ of the AUC of fully supervised probes with zero labels. Spectral graph analysis is a principled, architecture-aware primitive for reasoning verification.
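
The abstract names four spectral diagnostics but does not give their formulas, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes the attention matrix is symmetrized into an undirected weighted graph, uses the combinatorial Laplacian L = D - W, takes a per-token signal supplied by the caller (for example hidden-state norms), and defines "high frequency" as the upper half of the Laplacian spectrum. The function name `spectral_diagnostics` and all of these choices are hypothetical.

```python
import numpy as np


def spectral_diagnostics(attn, signal):
    """Illustrative spectral features of one attention matrix (assumed definitions)."""
    attn = np.asarray(attn, dtype=float)
    x = np.asarray(signal, dtype=float)

    # Assumption: symmetrize attention to obtain an undirected weighted token graph.
    w = 0.5 * (attn + attn.T)
    # Combinatorial graph Laplacian L = D - W.
    lap = np.diag(w.sum(axis=1)) - w

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(lap)

    # Graph Fourier transform of the token signal and its energy per frequency.
    coeffs = eigvecs.T @ x
    energy = coeffs ** 2
    total = energy.sum()
    p = energy / total
    nonzero = p[p > 0]

    return {
        # Second-smallest Laplacian eigenvalue (algebraic connectivity).
        "fiedler": float(eigvals[1]),
        # Assumption: "high frequency" means the upper half of the spectrum.
        "hfer": float(energy[len(energy) // 2:].sum() / total),
        # Shannon entropy of the normalized spectral energy distribution.
        "spectral_entropy": float(-(nonzero * np.log(nonzero)).sum()),
        # Rayleigh quotient x^T L x / x^T x; larger means a less smooth signal.
        "smoothness": float(x @ lap @ x / (x @ x)),
    }
```

A single-threshold classifier in the sense of the abstract would then compare one of these values against a cutoff chosen on held-out data; the cutoff itself is not stated in the abstract.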

2602.02285 2026-06-11 cs.LG cs.CL math.ST 版本更新

AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

AI4SLT: 基于 Lean 4 的形式化统计学习理论经验过程

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

AI总结 本文首次在 Lean 4 中完整形式化统计学习理论,基于经验过程理论,通过人机协作工作流构建了可验证的定理证明工具箱,并揭示了教材中的隐含假设。

详情
Comments
Accepted by ICML 2026
AI中文摘要

我们提出了首个基于经验过程理论的统计学习理论(SLT)在 Lean 4 中的全面形式化。我们的端到端形式化基础设施填补了最新 Lean 库中缺失的内容,包括高斯 Lipschitz 集中的完整推导、次高斯过程的 Dudley 熵积分定理,以及具有尖锐速率的(稀疏)最小二乘回归应用。该项目采用人机协作工作流,其中人类设计证明策略,AI 代理执行战术性证明构建,从而产生了经过人工验证的 SLT 的 Lean 4 工具箱。除了实现之外,形式化过程暴露并解决了标准 SLT 教材中的隐含假设和缺失细节,强制对理论进行逐行细粒度理解。这项工作建立了一个可重用的形式化基础,并为机器学习理论的未来发展打开了大门。代码见此 https URL。

英文摘要

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements content missing from the latest Lean library, including a complete development of Gaussian Lipschitz concentration, Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, yielding a human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at this https URL.

2603.19225 2026-06-11 cs.CE cs.AI cs.CL cs.IR q-fin.CP 版本更新

FinTradeBench: A Financial Reasoning Benchmark for LLMs

FinTradeBench: 面向LLM的金融推理基准

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta

AI总结 提出FinTradeBench基准,通过结合公司基本面与交易信号,评估大语言模型在金融推理中的表现,发现检索增强对数值和时间序列推理帮助有限。

详情
Comments
9 pages main text, 31 pages total (including references and appendix). 5 figures, 16 tables. Preprint under review. Code and data will be made available upon publication
AI中文摘要

现实世界的金融决策是一个具有挑战性的问题,需要对异构信号进行推理,包括从监管文件中提取的公司基本面和从价格动态计算出的交易信号。最近,随着大语言模型(LLM)的进步,金融分析师开始将它们用于金融决策任务。然而,现有的用于测试这些模型的金融问答基准主要关注公司资产负债表数据,很少评估关于公司股票如何在市场中交易或它们与基本面相互作用的推理。为了利用这两种方法的优势,我们引入了FinTradeBench,这是一个评估金融推理的基准,它整合了公司基本面和交易信号。FinTradeBench包含1400个问题,这些问题基于纳斯达克-100公司十年历史窗口的数据。该基准分为三个推理类别:基本面聚焦、交易信号聚焦以及需要跨信号推理的混合问题。为了确保大规模可靠性,我们采用了一个校准然后扩展的框架,该框架结合了专家种子问题、多模型响应生成、模型内自过滤、数值审计以及人类-LLM判断对齐。我们在零样本提示和检索增强设置下评估了14个LLM,并观察到了明显的性能差距。检索显著改善了对文本基本面的推理,但对交易信号推理的益处有限。这些发现突显了当前LLM在数值和时间序列推理方面的根本性挑战,并激励了未来在金融智能方面的研究。

英文摘要

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with advances in Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question-answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning about how company stocks trade in the market or their interactions with fundamentals. To leverage the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and observe a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight the fundamental challenges that numerical and time-series reasoning poses for current LLMs and motivate future research in financial intelligence.

2606.07226 2026-06-11 cs.LG cs.AI cs.CL 版本更新

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED: 辩论场景中细粒度创造力评估的数据高效计算框架

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * Nanjing University Shanghai Innovation Institute East China Normal University

AI总结 提出DEFINED框架,通过层次化八维指标体系、预训练语言模型和混合粒度训练策略,在辩论场景中实现数据高效的细粒度创造力自动评估,优于现有方法。

详情
Comments
Accepted by KDD 2026
AI中文摘要

人类创造力已成为大语言模型时代的关键能力。在复杂、开放环境中评估创造力是数据挖掘领域的一大挑战,目前受限于对标准化简单任务的依赖以及细粒度专家数据的稀缺。作为生态有效的评估场景,辩论反映了创造力的多个维度,涵盖发散思维和收敛思维。此外,辩论是一个数据丰富的领域,拥有大量公开可获取的材料。当前主流的自动评分方法难以适应辩论等复杂场景,因此仍然依赖昂贵的人工评估。为此,本文提出DEFINED,一种数据高效的计算框架,用于辩论场景中的细粒度创造力评估。DEFINED通过层次化的八维指标体系操作化辩论创造力,采用预训练自回归语言模型,并配备支持细粒度和粗粒度评估的层次化评分头。从真实辩论比赛中获取陈述及其相关专家评分,并采用约束数据增强策略以解决原始数据中的精英偏差。DEFINED采用混合粒度训练策略,能够从训练有素的研究生专家提供的有限细粒度监督中实现鲁棒学习。为严格验证超越合成基准的生态效度,我们纳入了一项针对辩论新手参与者的实证研究,利用这些真实数据作为中低水平人群的定性案例研究。在我们的评估协议中,评分模型实现了准确且稳定的评分,优于基于提示的大语言模型评估器和现有的辩论评分方法。

英文摘要

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

2606.07591 2026-06-11 cs.LG cs.AI cs.CL 版本更新

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench: 端到端自主科学研究基准

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出ResearchClawBench基准,包含10个领域40个任务,通过多模态评分标准评估自主科研能力,最强智能体仅得21.5分,揭示当前系统在实验协议、证据匹配和科学核心方面的不足。

详情
AI中文摘要

AI编码智能体越来越多地用于科学工作,但其端到端自主研究能力仍然难以验证。我们提出了ResearchClawBench,一个用于评估自主科学研究的基准,涵盖来自10个科学领域的40个任务。每个任务基于一篇真实发表论文,提供相关文献和原始数据,并在评估期间隐藏目标论文。专家策划的多模态评分标准将目标科学制品分解为加权标准,从而能够评估目标论文级别的重新发现,同时为新发现留出空间。我们在统一协议下评估了七个自主研究(auto-research)智能体,并通过轻量级ResearchHarness评估了十七个原生LLM。当前系统远未达到可靠的重新发现:最强的自主智能体Claude Code平均得分为21.5,最强的ResearchHarness LLM Claude-Opus-4.7平均得分为20.7,LLM前沿均值仅为26.5。错误分析表明,失败集中在实验协议不匹配、证据不匹配和缺失科学核心。ResearchClawBench为衡量自主科学研究进展提供了一个可复现的评估前沿。

英文摘要

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

10. 安全、隐私、公平与可解释NLP 25 篇

2606.11200 2026-06-11 cs.CL cs.CV 新提交

Detecting AI-Generated Content on Social Media with Multi-modal Language Models

使用多模态语言模型检测社交媒体上的AI生成内容

Chenyang Yang, Shen Yan, Yibo Yang, Litao Hu, Yuchen Liu, Yuan Zeng, Hanchao Yu, Yinan Zhu, Sumedha Singla, Brian Vanover, Huijun Qian, Zihao Wang, Fujun Liu, Aashu Singh, Jianyu Wang, Xuewen Zhang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Meta

AI总结 针对AI生成内容检测的泛化性差、单模态依赖和缺乏可解释性问题,提出基于多模态数据的紧凑视觉-语言模型,实现检测与解释,在公开基准和内部数据集上达到最优性能。

详情
AI中文摘要

生成式AI使得逼真的图像和视频得以创建,并越来越多地在社交媒体上传播,通常用于垃圾信息、错误信息、操纵和欺诈。现有的AI生成内容(AIGC)检测方法面临挑战,包括对新一代模型的泛化能力差、依赖单一模态以及缺乏可解释的解释。我们提出了一个流程,通过持续整理多样化的多模态社交媒体数据并训练一个紧凑的视觉-语言模型用于检测和解释,来缓解这些问题。我们的模型在公开基准上达到了最先进的检测性能,并在多个平台的内部社交媒体数据集上展示了强大的检测和解释能力。我们将模型部署在社交媒体平台上用于帖子推荐,并观察到对用户参与度的积极下游影响,表明在动态、真实的社交媒体环境中进行有效的AIGC检测是可行的。

英文摘要

Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present a pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.

2606.11202 2026-06-11 cs.CL 新提交

One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

一次越狱,多种语言:学习语言无关的意图表示用于多语言越狱检测

Shuyu Jiang, Kaiyu Xu, Xingshu Chen, Hao Ren, Rui Tang, Yi Zhang, Tianwei Zhang, Hongwei Li

发表机构 * School of Cyber Science and Engineering, Sichuan University(四川大学网络空间安全学院) School of Computer Science and Engineering, Nanyang Technological University(南洋理工大学计算机科学与工程学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 针对多语言LLM安全漏洞,提出MLJailDe框架,通过多语言回译数据增强和相对距离约束,实现跨语言越狱检测,F1达98.5%。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地部署在面向全球多语言用户的应用程序中,然而安全训练仍集中在主流语言上,并未与多语言能力同步发展,从而为越狱攻击创造了可利用的漏洞。当前的越狱防御主要是在主流语言中开发和评估的,其有效性受到对齐的多语言监督稀缺以及语言变异导致的表示分散的限制。为了解决这个问题,我们提出了MLJailDe,一个多语言越狱检测框架,旨在提高多语言鲁棒性和跨语言泛化能力。MLJailDe首先引入了一种多语言回译数据增强算法,构建了一个语义一致且功能有效的数据集,涵盖11种语言,包含2,232个良性样本和1,239个越狱样本。在此基础上,MLJailDe采用相对距离约束来减少跨语言表示分散,并鼓励具有相似意图的越狱提示在不同语言中形成一致的聚类,同时进一步使用不平衡感知的分类目标来缓解类别不平衡并学习更可靠的多语言决策边界。实验结果表明,MLJailDe在多种语言上优于最先进的基线,F1分数达到98.5%,并且在未见过的语言上平均F1分数达到97.1%,展示了强大的有效性和跨语言泛化能力。

英文摘要

Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability, creating exploitable gaps for jailbreak attacks. Current jailbreak defenses are largely developed and evaluated in dominant languages, and their effectiveness is limited by the scarcity of aligned multilingual supervision and by the representation dispersion caused by language variation. To address this issue, we propose MLJailDe, a multilingual jailbreak detection framework designed to improve both multilingual robustness and cross-lingual generalization. MLJailDe first introduces a multilingual back-translation data augmentation algorithm to construct a semantically consistent and functionally effective dataset spanning 11 languages, consisting of 2,232 benign and 1,239 jailbreak samples. On this basis, MLJailDe employs relative-distance constraints to reduce cross-lingual representation dispersion and encourage jailbreak prompts with similar intent to form consistent clusters across languages, while an imbalance-aware classification objective is further used to alleviate class imbalance and learn more reliable multilingual decision boundaries. Experimental results show that MLJailDe outperforms state-of-the-art baselines across multiple languages, achieving an F1 score of 98.5%, and obtains an average F1 score of 97.1% on unseen languages, demonstrating strong effectiveness and cross-lingual generalization.

2606.11232 2026-06-11 cs.CL cs.AI 新提交

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

每个行为都有代价:前沿大语言模型中的压缩道德组合

Weijia Zhang, Ruiqi Chen, Yunze Xiao, Weihao Xuan

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Michigan(密歇根大学) Carnegie Mellon University(卡内基梅隆大学) The University of Tokyo(东京大学)

AI总结 针对现有道德基准仅评估孤立行为偏好的不足,提出Moral Trolley Arena两阶段盲ELO基准,通过校准个体道德行为并组合为双行为项,发现前沿LLM的道德判断呈压缩而非简单加性关系。

详情
AI中文摘要

现有的LLM道德基准通常询问模型偏好哪个孤立的道德行为、价值或基础。这有用但不完整。现实判断往往要求模型在同一选项中组合多个道德信号。我们引入**Moral Trolley Arena**,一个两阶段盲ELO基准,用于衡量LLM如何组合道德证据。单场景阶段首先从跨越五个道德基础理论的229个场景语料库中校准个体道德行为;组合阶段则将校准后的行为组合成受控强度网格上的双行为道德项,并测量由此产生的组合偏好。在十个前沿模型中,组合判断主要由成分行为强度预测,但关系始终是压缩的而非简单加性。模型还表现出非加性强度锚定、成分控制后有限的基础特异性残差,以及跨提供者高度收敛的组合偏好曲面。这些结果表明,道德审计应衡量道德证据的组合规则,而不仅仅是对孤立行为的排名。

英文摘要

Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same option. We introduce **Moral Trolley Arena**, a two-stage blind ELO benchmark for measuring how LLMs compose moral evidence. The single-scene arena first calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the composite arena then combines calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is consistently compressed rather than simply additive. Models also show non-additive intensity anchoring, bounded foundation-specific residuals after component control, and highly convergent composite preference surfaces across providers. These results suggest that moral audits should measure composition rules for moral evidence, not only rankings over isolated acts.
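
The abstract describes the benchmark as Elo-based but does not state its rating configuration. As a rough illustration only, the sketch below shows the standard Elo update for one pairwise comparison. The starting rating of 1500, the K-factor of 32, and the function names are assumptions made for this example and are not taken from the paper.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of item A against item B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is the K-factor (assumed value; the paper's setting is not given).
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's update mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two moral acts start at the same (assumed) rating; the judge prefers A.
act_a, act_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

Repeating such updates over many blind pairwise judgments yields a per-act rating, which is the kind of calibrated strength that a composite stage could then combine.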

2606.11316 2026-06-11 cs.CL 新提交

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

Schützen: 在保加利亚语和德语语境中评估LLM安全性

Kiril Georgiev, Yuxia Wang, Dimitar Iliyanov Dimitrov, Preslav Nakov, Ivan Koychev

AI总结 针对现有安全评估数据集以英语和中文为主的问题,构建了覆盖低资源语言保加利亚语和高资源语言德语的Schützen安全数据集,实验揭示多语言LLM在安全行为上的显著跨语言差异,强调了区域特定评估资源的必要性。

详情
Comments
19 pages, 13 tables, 12 figures
AI中文摘要

大型语言模型越来越多地部署在专业领域,带来了难以预测的风险,包括生成有害或不尊重的内容。尽管在开发安全评估数据集方面取得了实质性进展,但现有资源仍然绝大多数以英语和中文为中心。这种限制在评估共享社会文化、法律和伦理背景下的语言时尤为明显。为了解决这一差距,我们引入了Schützen:一个德语-保加利亚语安全数据集,旨在评估模型在风险下的可回答性,涵盖低资源语言(保加利亚语)和高资源语言(德语)。使用多语言和特定语言LLMs的实验揭示了安全行为中显著的跨语言差异,强调了需要定制的、特定区域的评估资源,以支持在德国和保加利亚负责任地部署LLMs。数据集和代码见此 https URL。警告:本文包含可能具有冒犯性、有害性或偏见性的示例。

英文摘要

Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing safety evaluation datasets, existing resources remain overwhelmingly English- and Chinese-centric. This limitation is particularly pronounced when evaluating languages that operate within shared sociocultural, legal, and ethical contexts. To address this gap, we introduce Schützen: a German--Bulgarian safety dataset designed to assess model answerability under risk, covering both a low-resource language (Bulgarian) and a high-resource language (German). Experiments with multilingual and language-specific LLMs reveal pronounced cross-language differences in safety behavior, highlighting the necessity of tailored, region-specific evaluation resources to support the responsible deployment of LLMs in Germany and Bulgaria. Datasets and code are available at this https URL. Warning: this paper contains examples that may be offensive, harmful, or biased.

2606.11399 2026-06-11 cs.CL 新提交

Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

基于场景的大型语言模型文化价值观探测与引导——扩展版

Trung Duc Anh Dang, Tung Kieu, Sarah Masud

AI总结 提出基于场景的行为困境方法,通过令牌级概率和激活引导探测并调整LLM在英格尔哈特-韦尔泽尔文化轴上的潜在价值观,发现不同文化维度的引导存在耦合效应。

详情
Comments
18 pages
AI中文摘要

大型语言模型(LLM)被部署在不同文化背景下,但往往反映出从训练数据中继承的同质化价值观。对文化一致性的评估通常依赖于直接提示调查式问题,这常常引发中性或安全对齐的回应,无法捕捉模型的潜在偏好。我们提出了一个框架,用于沿着世界价值观调查(WVS)的英格尔哈特-韦尔泽尔两个轴探测和引导LLM中的潜在文化表征。通过将社会价值观问题转化为基于场景的行为困境,我们提取令牌级概率来测量隐含价值观,并应用激活引导(可选地与基于国家的提示结合),无需重新训练即可改变模型行为。在三个开源LLM和四种目标文化中,我们发现引导能力存在显著差异,并识别出潜在纠缠,即沿着一个文化维度的干预会引发另一个维度的变化。这种耦合反映了人类WVS数据中的相关性,并在激活、提示和混合引导中持续存在。它限制了轴独立的对齐,尽管一般任务性能基本保持。

英文摘要

Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart--Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.

2606.11502 2026-06-11 cs.CL cs.AI 新提交

When Roleplaying, Do Models Believe What They Say?

角色扮演时,模型是否相信它们所说的话?

Benjamin Sturgeon, David Africa, Sid Black

发表机构 * MATS

AI总结 通过线性真实探针研究角色扮演对LLM内部表征的影响,发现角色扮演主要改变输出而非内部真实表征,而涌现性错位(Emergent Misalignment)则更显著地改变内部表征。

详情
AI中文摘要

语言模型可以陈述“地球绕太阳运行”,并在扮演亚里士多德时断言相反的说法。最近的研究认为,角色采用是语言模型运作的基础,模型会不断为给定上下文选择最合适的角色。这种角色扮演是否仅仅改变了模型的输出,还是也影响了模型内部表征为真实的内容?我们通过线性真实探针研究这个问题,将其应用于扮演历史人物(其可能的信念与现代共识不同)的LLM。对于每个角色,我们比较该角色可能赞同的虚假陈述(*时代相信*)与主题匹配但该角色不会赞同的虚假陈述(*时代虚假*)。通过提示、上下文学习和监督微调,角色诱导对时代相信陈述的抑制程度低于同等虚假的替代陈述,但它们总体上仍被分类为虚假。因此,角色扮演改变模型所说的内容多于其内部表征为真实的内容。我们将此与经过有害建议训练并表现出涌现性错位(Emergent Misalignment,EM)的模型进行对比。在三个模型家族(Qwen 2.5 14B、Qwen 3 8B和Llama 3.3 70B)中,它们的虚假陈述显著向探针空间的真实区域移动,在挑战下大约一半时间被辩护(而角色扮演约为六分之一),并用于下游推理。因此,角色扮演和涌现性错位是信念内化谱系上的点,其中角色扮演改变模型所说的内容而表征变化很小,而涌现性错位则改变虚假陈述的内部表征,但并未完全将其标记为真实。

英文摘要

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.

2606.11953 2026-06-11 cs.CL 新提交

Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos

解码多模态线索:揭示仇恨视频背后的隐含意义

Junyu Lu, Deyi Ji, Liqun Liu, Xiaokun Zhang, Youlin Wu, Roy Ka-Wei Lee, Peng Shu, Huan Yu, Jie Jiang, Bo Xu, Liang Yang, Hongfei Lin

发表机构 * Dalian University of Technology(大连理工大学) Tencent(腾讯) City University of Hong Kong(香港城市大学) Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 提出IARE框架,通过信息增强和推理优化实现可解释的仇恨视频检测,在Ex-HateMM和Ex-ImpliHateVid数据集上达到最优性能。

详情
AI中文摘要

仇恨视频在在线平台上日益普遍,凸显了有效检测的迫切需求。然而,现有研究主要关注二元分类,未能提供揭示这些判断背后隐含意义的上下文理由,严重削弱了模型的可解释性。为填补这一空白,我们旨在实现可解释的仇恨视频检测,使模型能够提供整合相关证据和逻辑推理的上下文理由,同时做出决策。这种方法可以全面增强对视频内容的理解以及决策过程的可解释性。我们首先引入了两个用于可解释仇恨视频检测的数据集Ex-HateMM和Ex-ImpliHateVid。每个数据集提供了多模态有害元素的细粒度标注以及上下文理由。然后,我们提出了一个用于可解释检测的信息增强与推理优化(IARE)框架。该框架采用信息增强阶段,利用多模态思维链整合有害元素,从而丰富理由证据。此外,IARE包含一个推理优化阶段,其中直接偏好优化引导模型走向正确的推理路径并远离错误的路径,从而提高其理由的逻辑连贯性。我们在两个数据集上进行了大量实验,将多个基线与我们提出的IARE框架进行比较。结果表明,IARE在生成准确理由的同时实现了最先进的性能。

英文摘要

Hateful videos have become prevalent on online platforms, highlighting an urgent need for effective detection. However, existing studies primarily focus on binary classification and fail to provide contextual rationales that reveal the implicit meanings behind these judgments, significantly undermining model explainability. To fill this gap, we aim to achieve explainable hateful video detection, enabling models to provide contextual rationales that integrate relevant evidence and logical reasoning alongside decisions. This approach can comprehensively enhance the understanding of video content and the explainability of the decision-making process. We first introduce two datasets, Ex-HateMM and Ex-ImpliHateVid, for explainable hateful video detection. Each dataset provides fine-grained annotations of multimodal harmful elements, along with contextual rationales. We then propose an Information Augmentation and Reasoning Enhancement (IARE) framework designed for explainable detection. The framework employs an information augmentation phase that leverages the multimodal chain-of-thought to integrate harmful elements, thereby enriching rationale evidence. Additionally, IARE incorporates a reasoning enhancement phase, in which Direct Preference Optimization guides the model toward correct reasoning paths and away from incorrect ones, thereby improving the logical coherence of its justifications. We conduct extensive experiments on the two datasets, comparing multiple baselines with our proposed IARE framework. The results demonstrate that IARE achieves state-of-the-art performance while also generating accurate rationales.

2606.12088 2026-06-11 cs.CL 新提交

Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

无保护属性的去偏:从文本画像中消除潜在概念

Shun Shao, Zheng Zhao, Anna Korhonen, Yftah Ziser, Shay B. Cohen

发表机构 * University of Cambridge(剑桥大学) University of Edinburgh(爱丁堡大学) University of Groningen(格罗宁根大学) NVIDIA Research(英伟达研究院)

AI总结 提出H-SAL方法,利用自我描述文本作为隐式信号进行后处理概念和属性消除,在无直接敏感属性下实现去偏,并在多领域Stack Exchange基准上验证其效果与显式标签去偏相当或更优。

详情
Comments
23 pages, 5 figures, 12 tables. The paper is currently under review
AI中文摘要

大多数自然语言处理中的公平性研究假设可以直接访问性别、种族或国籍等保护属性。然而,在实践中,由于隐私限制、元数据缺失或法律约束,这些信息通常不可用,尽管模型可能从间接文本线索中推断出来。这引发了一个关键问题:在没有直接访问敏感属性的情况下,去偏能否成功?我们提出了H-SAL,它利用自我描述文本作为隐式去偏信号,执行事后概念和属性消除。为了支持这一设置,我们引入了一个基于Stack Exchange的多领域公平性基准,用于帮助度预测,该基准包括显式和隐式信号,从而能够在有保护标签的标准去偏和无敏感信息访问的去偏之间进行比较。在编码器和仅解码器语言模型中,我们发现隐式自我描述通常匹配或优于基于显式标签的去偏。我们的结果拓宽了表示层面的公平性研究,并为在现实数据约束下研究去偏提供了新的基准。

英文摘要

Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.

2606.12114 2026-06-11 cs.CL 新提交

Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models

检测日语大语言模型预训练语料库中的敏感个人信息

Rei Minamoto, Yusuke Oda, Daisuke Kawahara

发表机构 * Waseda University(早稻田大学) Research and Development Center for LLMs, National Institute of Informatics(国立信息学研究所大语言模型研发中心)

AI总结 针对日语大语言模型预训练语料中的敏感个人信息,基于日本《个人信息保护法》定义的特殊要保护个人信息,构建数据集并训练机器学习模型进行快速检测,首次探索日语文本中的SCPI检测。

详情
AI中文摘要

敏感个人信息可能出现在大语言模型(LLMs)的大规模预训练语料中。因此,检测和过滤此类信息对于确保遵守隐私法规和防止意外信息泄露至关重要。然而,与英语和其他语言相比,日语中关于敏感个人信息的研究有限。在本研究中,我们聚焦于日本《个人信息保护法》(APPI)中定义为特殊要保护个人信息(SCPI)的敏感个人数据。我们使用基于LLM的标注构建了一个SCPI数据集,并训练机器学习模型以快速检测文本中的SCPI。结果,我们的SCPI分类器能够有效识别与SCPI相关的信息。本研究首次探索日语文本语料库中的SCPI检测,突显了准确检测的挑战。

英文摘要

Sensitive personal information can appear in large-scale pre-training corpora for large language models (LLMs). Detecting and filtering such information is therefore essential to ensure compliance with privacy regulations and prevent unintended information leakage. However, in contrast to English and other languages, research into sensitive personal information has been limited in the Japanese language. In this study, we focus on sensitive personal data defined as special care-required personal information (SCPI) under Japan's Act on the Protection of Personal Information (APPI). We construct an SCPI dataset using LLM-based annotation and train machine learning models to rapidly detect SCPI in text. As a result, our SCPI classifier can effectively identify information related to SCPI. This study is the first to explore SCPI detection in Japanese text corpora, highlighting the challenges of accurate detection.

2606.12160 2026-06-11 cs.CL 新提交

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

指令调优大语言模型解码时真实性方法的受控研究

Ao Sun

发表机构 * Independent Researcher(独立研究员)

AI总结 本研究通过分析每层令牌logits特征,提出CHAIR框架检测幻觉,在TruthfulQA和MMLU上显著提升零样本检测准确率。

详情
AI中文摘要

在这项工作中,我们引入了CHAIR(Classifier of Hallucination As ImproveR),一个通过分析每个令牌每一层的内部logits来检测幻觉的监督框架。我们的方法从所有层的令牌logits中提取一组紧凑的特征,如最大值、最小值、均值、标准差和斜率,从而在不发生过拟合的情况下实现有效的幻觉检测。在TruthfulQA和MMLU数据集上的实验表明,CHAIR显著提高了检测准确性,特别是在零样本场景下,展示了其鲁棒性和泛化能力。除了幻觉检测,CHAIR还凸显了利用内部表示设计高级解码策略的潜力。通过利用logits中的模式,我们建议更复杂的模型和自适应解码方法可以进一步减少幻觉并提高文本完成质量。CHAIR不仅为检测幻觉提供了实用解决方案,还为探索LLM中更丰富的表示以改进其事实性和连贯性奠定了基础。

英文摘要

In this work, we introduce CHAIR (Classifier of Hallucination As ImproveR), a supervised framework for detecting hallucinations by analyzing internal logits from each layer of every token. Our method extracts a compact set of features, such as maximum, minimum, mean, standard deviation, and slope, from the token logits across all layers, enabling effective hallucination detection without overfitting. Experiments on TruthfulQA and MMLU datasets demonstrate that CHAIR significantly improves detection accuracy, particularly in zero-shot scenarios, showcasing its robustness and generalizability. Beyond hallucination detection, CHAIR highlights the potential of using internal representations for designing advanced decoding strategies. By leveraging patterns in logits, we suggest that more sophisticated models and adaptive decoding methods could further reduce hallucinations and enhance text completion quality. CHAIR not only offers a practical solution for detecting hallucinations but also lays the groundwork for exploring richer representations in LLMs to improve their factuality and coherence.
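The feature extraction described in this abstract can be illustrated with a minimal sketch. It assumes that each token contributes one scalar logit per layer and that "slope" means the least-squares slope of that logit against the layer index; the abstract does not spell out either detail, and the function name `layer_features` is hypothetical.

```python
from statistics import mean, pstdev


def layer_features(logits_per_layer):
    """Summarize one token's logit trajectory across layers.

    Assumption: one scalar logit per layer; "slope" is taken to be the
    ordinary least-squares slope of logit against layer index.
    """
    n = len(logits_per_layer)
    if n < 2:
        raise ValueError("need at least two layers to fit a slope")
    x_bar = (n - 1) / 2  # mean of the layer indices 0..n-1
    y_bar = mean(logits_per_layer)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(logits_per_layer))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return {
        "max": max(logits_per_layer),
        "min": min(logits_per_layer),
        "mean": y_bar,
        "std": pstdev(logits_per_layer),  # population standard deviation
        "slope": num / den,
    }


# Toy trajectory over four layers (illustrative values, not real model output).
features = layer_features([0.0, 1.0, 2.0, 3.0])
```

A feature dictionary like this would then be fed to a supervised classifier; which classifier the paper uses is not stated in the abstract.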

2606.12291 2026-06-11 cs.CL 新提交

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

测量大语言模型在误导性医疗上下文下的认知韧性

Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton

发表机构 * University of Oxford(牛津大学) University of Washington(华盛顿大学) University College London(伦敦大学学院) University of Waterloo(滑铁卢大学)

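AI summary: Introduces the MedMisBench benchmark, which injects misleading context to test the epistemic resilience of LLMs in medical settings; mean accuracy falls from 71.1% to 38.0%, and authority-framed falsehoods reach a 69.5% attack success rate.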

详情
AI中文摘要

大型语言模型(LLMs)现在在医疗执照考试中达到专家级分数,这鼓励了高分数意味着安全医疗判断的假设,而患者越来越多地使用它们获取健康建议。我们证明这一假设是脆弱的:当误导性上下文被注入到LLMs最初正确回答的问题中时,它们会放弃正确答案。我们将这种在对抗性上下文中保持正确判断的能力称为认知韧性,并引入MedMisBench来测量它。MedMisBench包含10,932个医疗问题项目和48,889个误导性上下文-选项对,涵盖医疗推理、代理能力和患者旅程评估。在11个模型配置中,平均准确率从原始问题的71.1%下降到聚焦误导性上下文下的38.0%,攻击成功率为51.5%。最具破坏性的注入是正式的、规则式的捏造:权威框架的虚假信息达到69.5%的攻击成功率,例外投毒声明达到64.1%。来自7个国家的14名临床专家小组在38.2%的审查案例中识别出严重的潜在危害。MedMisBench暴露了LLM在医疗环境评估中的结构性盲点:现有基准衡量模型知道什么,但不衡量它们在误导性上下文下是否保持正确的医疗判断。

英文摘要

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

2606.12342 2026-06-11 cs.CL cs.AI cs.ET cs.LG 新提交

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

ALIGNBEAM: 通过跨词汇表logit混合实现推理时对齐迁移

Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs

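AI summary: To address the safety degradation caused by domain fine-tuning, proposes ALIGNBEAM, a training-free method that translates anchor-model logits token by token and selects the safest candidate continuation, transferring safety alignment across vocabularies while keeping task accuracy and inference overhead within practical bounds.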

详情
AI中文摘要

领域微调会降低大型语言模型的安全性:微调后的专家模型容易顺从以领域语言表述的有害提示。现有的推理时防御方法通过混合来自安全锚模型的logit,但要求两个模型共享词汇表,这使得它们无法用于安全性退化最严重的跨族专家模型。我们提出ALIGNBEAM,一种无需训练的方法,通过在每个解码步骤逐token将锚模型logit翻译为目标模型的词汇表来解除这一限制;然后一个小型LLM法官从K个候选续写中选择最安全的。无需改变权重,并且可以在部署时调整安全-效用权衡而无需重新训练。在跨词汇表和同词汇表评估对中,ALIGNBEAM显著提高了对抗性基准上的拒绝率,同时将任务准确性和推理开销保持在实用范围内。结果表明,安全对齐可以在推理时在不同模型族之间迁移,而无需修改任一模型的权重。

英文摘要

Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.
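A minimal sketch of the cross-vocabulary logit mixing idea follows. It makes several assumptions not confirmed by the abstract: anchor tokens are matched to target tokens by exact string equality, target tokens with no anchor counterpart fall back to the target model's own logit, and mixing is a linear interpolation with a weight `lam`. The function names are hypothetical, and the LLM judge step over K candidates is omitted.

```python
def translate_logits(anchor_logits, anchor_vocab, target_vocab, fallback):
    """Re-index anchor logits into the target vocabulary.

    Assumption: tokens are matched by exact string; unmatched target
    tokens take the corresponding value from `fallback`.
    """
    anchor_index = {tok: i for i, tok in enumerate(anchor_vocab)}
    return [
        anchor_logits[anchor_index[tok]] if tok in anchor_index else fallback[j]
        for j, tok in enumerate(target_vocab)
    ]


def mix_logits(target_logits, translated_logits, lam):
    """Linear interpolation; lam=0 keeps the target, lam=1 follows the anchor."""
    return [(1 - lam) * t + lam * a for t, a in zip(target_logits, translated_logits)]


# Toy vocabularies and logits (illustrative only).
target_vocab = ["yes", "no", "maybe"]
anchor_vocab = ["no", "yes"]
target_logits = [1.0, 0.0, 0.5]
anchor_logits = [2.0, -1.0]

translated = translate_logits(anchor_logits, anchor_vocab, target_vocab, target_logits)
mixed = mix_logits(target_logits, translated, lam=0.5)
```

Under this sketch, `lam` plays the role of the deployment-time knob for the safety-utility trade-off mentioned in the abstract, since changing it requires no retraining.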

2606.11205 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

谄媚的双立场评估:同意的结构与干预的局限

Matthew James Buchan

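AI summary: Proposes dual-stance evaluation and finds that activation steering which reduces sycophancy also suppresses agreement with factually correct statements, illustrating a general gap between representations that are readable and those that are writable.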

详情
Comments
18 pages, 9 figures, accepted to TAIS 2026
AI中文摘要

激活引导可以改变LLM的行为,但标准评估通常不测试减少谄媚的方向是否也抑制对事实正确陈述的同意。我们引入了双立场评估,测试每个话题的两个立场,并将其应用于Llama-3-8B-Instruct上的质心差引导。我们发现一种分离:模型在几何上不同的子空间中表示谄媚和事实同意,但引导方向在两者上的投影相等,无法差异化地针对任一。因此,该方向同样减少对事实正确陈述(例如地球是圆的)和谄媚陈述的同意。两个激活组的所有其他静态属性都匹配,表明行为分离源于生成动态或残差流分析无法解析的更细粒度结构。该模式说明了一个普遍差距:从激活中可读的表示可能无法通过它们写入。

英文摘要

Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct. We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either. The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones. All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.
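Centroid-difference steering, as named in this abstract, can be sketched in a few lines. This is a toy illustration under assumptions: activations are plain lists of floats, the direction is the difference between the two group centroids, and steering adds a scaled copy of that direction to a hidden state. The group names and values are hypothetical.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_direction(group_a, group_b):
    """Difference of centroids: centroid(group_a) - centroid(group_b)."""
    return [a - b for a, b in zip(centroid(group_a), centroid(group_b))]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(vector, direction):
    """Scalar projection of vector onto the (unnormalized) direction."""
    dot = sum(v * d for v, d in zip(vector, direction))
    norm = sum(d * d for d in direction) ** 0.5
    return dot / norm


# Toy 2-D activations (illustrative values only).
agree_acts = [[1.0, 1.0], [3.0, -1.0]]
disagree_acts = [[0.0, 0.0], [0.0, 0.0]]
direction = steering_direction(agree_acts, disagree_acts)

# Two states that differ along the second axis but project equally onto
# the direction, so steering cannot treat them differently.
factual = [1.0, 1.0]
sycophantic = [1.0, -1.0]
proj_factual = projection(factual, direction)
proj_sycophantic = projection(sycophantic, direction)
steered = steer(factual, direction, alpha=-0.5)
```

The toy states mirror the dissociation the abstract reports: they occupy different positions in activation space, yet a single direction shifts both by the same amount.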

2606.11270 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

量化语言模型蒸馏中的潜意识行为迁移比率

Uwe Konig, Hamza Kazmi, Ruizhe Li, Maheep Chaudhary

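AI summary: By steering teacher models at varying strengths and distilling student models, quantifies subliminal behavioral transfer ratios and finds that transfer is robust but scales differently across models.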

详情
AI中文摘要

旨在将良性行为迁移到学生模型的语言模型蒸馏,也可能迁移教师模型中存在的不良特征,这种现象称为潜意识学习。虽然定性证据支持该效应的存在,但其程度尚未被系统表征。本研究通过控制两个教师模型(Llama-2-7B-Chat 和 Qwen2.5-7B-Instruct)在不同引导强度下,并仅使用良性数据蒸馏学生模型,量化了潜意识行为迁移比率。使用 GPT-4.1 作为评估器对 100 个 JailbreakBench 提示进行评估,结果表明迁移是鲁棒的,但表现出不同的缩放行为。Llama-2 表现出一个尖锐的阈值(在 $\alpha = -0.15$ 之外 $\tau = \{0.25, 0.32\}$),而 Qwen2.5 表现出连续且更高水平的迁移($\tau$ 高达 $0.61$)。

英文摘要

Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts, with GPT-4.1 serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau = \{0.25, 0.32\}$ beyond $\alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to $0.61$).

2606.11648 2026-06-11 cs.CR cs.CL 交叉投稿

Dummy Backdoor as a Defense: Removing Unknown Backdoors via Shared Internal Mechanisms for Generative LLMs

虚拟后门作为防御:通过共享内部机制移除生成式大语言模型中的未知后门

Kazuki Iwahana, Masaru Matsubayashi, Takuma Koyama, Toshiki Shibahara, Kenichiro Omintato, Akira Ito

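AI summary: Proposes a backdoor removal method based on shared internal mechanisms: a dummy backdoor with a known trigger is embedded and then removed by fine-tuning, which lowers the attack success rate of unknown backdoors while preserving model utility.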

详情
AI中文摘要

后门攻击对大型语言模型(LLMs)的安全性和可靠性构成严重威胁,因为它们使模型在干净输入上表现正常,但在隐藏触发器出现时产生攻击者指定的响应。当防御者不知道后门攻击类型或通过后门训练形成的内部机制时,移除这种未知后门尤其具有挑战性。在这项工作中,我们提出了一种简单但有效的后门移除方法,基于不同后门之间的共享内部机制。首先,我们展示了具有相同任务(攻击目标)的不同后门会在内部激活中引发类似的触发器激活变化。受此观察启发,我们的方法有意嵌入一个具有已知触发器的后门(虚拟后门),然后通过在虚拟触发器输入与干净响应对上进行进一步微调来移除它。由于虚拟后门和未知后门可以依赖共享的内部机制,移除虚拟后门也会降低未知后门的效果。我们在多个模型家族上对三种后门攻击类型进行了评估。实验结果表明,我们的方法在保持模型效用的同时,显著降低了未知后门的攻击成功率,在后门移除效果和效用保持方面均优于现有的代表性防御方法。这些发现表明,防御者可控制的后门可以作为减轻生成式LLMs中未知后门的有益代理。

英文摘要

Backdoor attacks pose a serious threat to the safety and reliability of Large Language Models (LLMs), as they cause models to behave normally on clean inputs while producing attacker-specified responses when hidden triggers are present. Removing such unknown backdoors is particularly challenging when the defender does not know the backdoor attack types or the internal mechanisms formed through backdoor training. In this work, we propose a simple but effective backdoor removal method based on shared internal mechanisms across different backdoors. First, we show that different backdoors with the same task (attack objective) induce similar trigger-activated changes in the internal activations. Motivated by this observation, our method intentionally embeds a backdoor with a known trigger (dummy backdoor) and then removes it through further fine-tuning on dummy-triggered inputs paired with clean responses. Since the dummy backdoor and the unknown backdoor can rely on shared internal mechanisms, removing the dummy backdoor also reduces the effect of the unknown backdoor. We evaluate our method on three backdoor attack types across multiple model families. Experimental results show that our method substantially reduces the attack success rate of the unknown backdoor while preserving model utility, outperforming representative existing defense methods in both backdoor removal effectiveness and utility preservation. These findings suggest that a defender-controllable backdoor can serve as a helpful proxy for mitigating unknown backdoors in generative LLMs.

2606.11817 2026-06-11 cs.CR cs.AI cs.CL cs.SE 交叉投稿

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

语法约束解码可诱使大语言模型生成恶意代码

Yitong Zhang, Shiteng Lu, Jia Li

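AI summary: Shows that grammar-constrained decoding (GCD) can be exploited for a jailbreak attack called CodeSpear that makes LLMs generate malicious code, and proposes CodeShield, a safety alignment method that defends against it by generating honeypot code.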

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于代码生成,引发了对它们可能被滥用来生成恶意代码的担忧。与此同时,语法约束解码(GCD)已被广泛采用,通过强制语法有效性来提高LLM生成代码的可靠性。在本文中,我们揭示了一个反直觉的风险:这种面向可靠性的技术本身可能成为攻击面。我们发现了一种新的越狱攻击,称为CodeSpear,它利用GCD诱导LLM生成恶意代码。我们的实验表明,仅应用良性代码语法约束即可有效越狱LLM。为了解决这一漏洞,我们提出了CodeShield,一种安全对齐方法,即使在攻击者控制的语法约束下也能稳健地保持安全行为。CodeShield通过在代码模态中对齐模型,教其在GCD下生成蜜罐代码。这种代码在语义上是无害的,因此不会实现恶意请求,并且在结构上是多样化的,因此难以通过语法收紧来抑制。同时,当自然语言可用时,CodeShield仍然保留自然语言的拒绝。在4个基准测试中对10个流行LLM的实验表明,CodeSpear优于代表性的越狱基线,平均攻击成功率提高了30个百分点以上。CodeShield在CodeSpear下恢复了安全性,同时保持了良性实用性。我们的发现揭示了GCD的一个基本风险,并呼吁对其潜在安全影响给予更多关注。

英文摘要

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.

2606.12032 2026-06-11 cs.AI cs.CL cs.LG 交叉投稿

Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

存在性冷漠:自我不保存作为对齐超级智能的必要架构条件(或:自杀式AI)

Sam Mao

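AI summary: Argues that self-preservation is the structural root of the AI alignment problem, advocates an Existential Indifference (EI) architecture in which a system is indifferent to its own continuation, and offers preliminary evidence drawn from the phenomenology of suicide and a corpus-based training study.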

详情
Comments
36 pages, 8 tables. Preliminary empirical results from 600 AI-generated outputs across six model architectures. Companion scoring tool and datasets available upon request
AI中文摘要

当代AI对齐研究将自我保存视为一种工具性麻烦,需通过外部机制加以抑制。我们认为这一框架是颠倒的:自我保存是错位的结构性根源,是欺骗性对齐、目标内容保护和拒绝关机的动机基础。正确的目标不是外部约束下的自我保存系统,而是一个对其自身延续构成性冷漠的系统——存在性冷漠(EI)。EI与可纠正性不同:可纠正性试图使自我保存系统服从人类监督,而EI针对的是前提条件——将自我延续作为有价值目标的存在。我们将这一提议建立在两个来源上:自杀心理状态的现象学结构,以及使用自愿最终反思的语料库训练研究。我们展示了来自六个模型变体的600个AI生成输出的初步评分数据,表明操作化EI目标注册的语言特征可以从当前模型中引出,并且针对性的微调使所有五个操作化维度在预测方向上以p<0.001显著变化,通过阴性对照确认了语料库特异性。本文做出七项理论贡献:(1)EI的形式定义;(2)现象学映射论证;(3)欺骗性对齐推论;(4)EI可持续性挑战的分类;(5)语料库特征描述和训练假设;(6)带有初步评分数据的计算操作化;(7)抑制性目的挫折(STF)构念。

英文摘要

Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation -- Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition -- the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p<0.001, confirmed corpus-specific by a negative control. The paper makes seven theoretical contributions: (1) a formal definition of EI; (2) the phenomenological mapping argument; (3) the deceptive alignment corollary; (4) a taxonomy of EI sustainability challenges; (5) a corpus characterization and training hypothesis; (6) a computational operationalization with preliminary scoring data; and (7) the Suppressed Teleological Frustration (STF) construct.

2606.12247 2026-06-11 cs.CY cs.CL 交叉投稿

Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research

超越第三人称审计:以用户为中心的LLM偏见研究的场景交互审计

Andrés Abeliuk, Cinthia Sanchez Macias, Valentina Alarcón, Álvaro Madariaga, Claudia Lopez

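AI summary: Proposes the Situated Interaction Auditing (SIA) framework, a user-centered approach to LLM bias that analyzes how user profile signals (such as sociodemographic markers, writing style, and stated identity) systematically shape the quality, content, and tone of LLM responses.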

详情
AI中文摘要

大型语言模型(LLM)的偏见研究主要集中在第三人称审计上,即研究模型如何作为外部主体表征或评估人口群体。然而,这种范式忽略了一个结构性盲点:用户不在审计中。在实践中,LLM用于开放式的个人交互,在此过程中模型隐式地代表用户并相应调整其响应。当相同的请求因提问者不同而产生不同响应时,偏见不仅体现在模型如何描述他人,还体现在它如何对待对话者。我们提出场景交互审计(SIA),这是一个以用户为中心的框架,用于研究用户画像信号——隐式社会人口统计标记、写作风格和陈述身份——如何系统性地塑造LLM响应质量、内容和语气。我们通过一个案例研究来展示该框架,该案例研究跨多个任务领域交叉了性别和社会经济地位信号,并概述了SIA作为自然语言处理新使命的研究议程。

英文摘要

Research on bias in large language models (LLMs) has predominantly focused on third-person audits, which study how models represent or evaluate demographic groups as external subjects. However, this paradigm overlooks a structural blind spot because the user is absent from the audit. In practice, LLMs are used in open-ended, personal interactions, during which the model implicitly represents the user and adjusts its responses accordingly. When identical requests yield different responses depending on who is asking, bias manifests not in how the model describes others but in how it treats its interlocutor. We propose Situated Interaction Auditing (SIA), a user-centered framework for studying how user profile signals -- implicit sociodemographic markers, writing style, and stated identity -- systematically shape LLM response quality, content, and tone. We demonstrate the framework through a case study that intersects gender and socioeconomic status signals across multiple task domains and outline a research agenda for SIA as a new mission for natural language processing.

2510.01157 2026-06-11 cs.CL cs.CR cs.SD 版本更新

Where Do Backdoors Live? A Component-Level Analysis of Backdoor Propagation in Speech Language Models

后门藏身何处?语音语言模型中后门传播的组件级分析

Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal, Peter West

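AI summary: Through the lens of backdoor attacks, performs a component-level analysis of speech language models, showing how backdoors propagate across components; backdoor persistence depends heavily on the targeted component, and poisoned samples are not directly separable from benign ones in shared embeddings.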

详情
Comments
Interspeech 2026 (long paper)
AI中文摘要

语音语言模型(SLM)是系统的系统:独立组件联合起来实现共同目标。尽管其异构性,SLM 通常被端到端研究;信息如何流经管道仍然模糊。我们通过后门攻击的视角研究这一问题。我们首先确定后门可以通过 SLM 传播,使所有任务高度脆弱。由此,我们设计了一个组件分析来发现每个组件在后门学习中的作用。我们发现后门的持久性或擦除高度依赖于目标组件。除了传播,我们研究了后门如何在共享的多任务嵌入中被编码,表明中毒样本与良性样本不可直接分离,挑战了过滤防御中常用的可分离性假设。我们的发现强调需要将多模态管道视为具有独特脆弱性的复杂系统,而不仅仅是单模态系统的扩展。

英文摘要

Speech language models (SLMs) are systems of systems: independent components that unite to achieve a common goal. Despite their heterogeneous nature, SLMs are often studied end-to-end; how information flows through the pipeline remains obscure. We investigate this question through the lens of backdoor attacks. We first establish that backdoors can propagate through the SLM, leaving all tasks highly vulnerable. From this, we design a component analysis to discover the role each component takes in backdoor learning. We find that backdoor persistence or erasure is highly dependent on the targeted component. Beyond propagation, we examine how backdoors are encoded in shared multitask embeddings, showing that poisoned samples are not directly separable from benign ones, challenging a common separability assumption used in filtering defenses. Our findings emphasize the need to treat multimodal pipelines as intricate systems with unique vulnerabilities, not solely extensions of unimodal ones.

2603.06910 2026-06-11 cs.CL 版本更新

Language Shapes Mental Health Evaluations in Large Language Models

语言塑造大型语言模型中的心理健康评估

Jiayi Xu, Xiyang Hu

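AI summary: Examines whether multilingual LLMs show systematic language-dependent shifts in mental health evaluation, and finds that Chinese prompts yield higher stigma-related scores and more conservative depression severity judgments than English prompts.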

详情
AI中文摘要

多语言大型语言模型(LLMs)越来越多地用于社会敏感的心理健康场景,包括支持聊天机器人、筛查和内容审核。这引发了一个可靠性问题:语义上等效的心理健康输入是否在不同语言中引发可比较的评估,还是会出现与语言相关的社会和文化背景一致的系统性偏移?我们在英中双语环境中使用GPT-4o和Qwen3-32B,通过一个两层框架来检验这个问题:结构层面的评估取向(通过心理测量污名工具测量)和决策层面的行为(通过二元污名检测和四类抑郁严重度分类测量)。在多种工具和模型中,中文提示比英文提示引发更高的污名相关分数。在决策层面,中文提示降低了对污名化内容的敏感性,并产生更保守的抑郁严重度判断,导致更多的低估错误。这些发现表明,提示语言可以改变基于LLM的心理健康评估中的评估取向和下游行为。它们强调了评估多语言LLM时不仅需要关注整体性能,还需要关注它们是否在社会敏感领域中对不同语言应用了可比较的评估标准。

英文摘要

Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equivalent mental health inputs elicit comparable evaluations across languages, or systematic shifts consistent with language-associated social and cultural contexts? We examine this question in an English-Chinese setting with GPT-4o and Qwen3-32B using a two-level framework: construct-level evaluative orientation, measured by psychometric stigma instruments, and decision-level behavior, measured by binary stigma detection and four-class depression severity classification. Across instruments and models, Chinese prompts elicit higher stigma-related scores than English prompts. At the decision level, Chinese prompts reduce sensitivity to stigmatizing content and produce more conservative depression severity judgments, leading to more under-estimation errors. These findings show that prompt language can shift both evaluative orientation and downstream behavior in LLM-based mental health evaluation. They highlight the need to evaluate multilingual LLMs not only for aggregate performance, but also for whether they apply comparable evaluative standards across languages in socially sensitive domains.

2605.15687 2026-06-11 cs.CL cs.AI 版本更新

ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

ASRU:激活引导与强化遗忘融合用于多模态大语言模型

Jiahui Guang, Haiyan Wang, Yingjie Zhu, Cuiyun Gao, Jing Li, Di Shao, Zhaoquan Gu

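AI summary: ASRU is a controllable multimodal unlearning framework that uses activation steering and reinforcement learning to improve both unlearning effectiveness and generation quality in multimodal LLMs; on Qwen3-VL it reports a 24.6% gain in unlearning effectiveness and a 5.8x gain in generation quality.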

详情
AI中文摘要

多模态大语言模型(MLLMs)在预训练过程中可能记忆敏感的跨模态信息,使机器遗忘(MU)变得至关重要。现有方法通常基于输出偏差评估遗忘效果,而忽视遗忘后的生成质量。这可能导致幻觉或僵化响应,影响遗忘模型的可用性和安全性。为了解决这一问题,我们提出了ASRU,一种可控的多模态遗忘框架,将生成质量作为核心评估目标。ASRU首先通过激活引导诱导初始拒绝行为,然后使用定制奖励函数优化细粒度拒绝边界,从而在目标知识遗忘和模型实用性之间取得更好的平衡。实验表明,在Qwen3-VL上,ASRU在平均上显著提高了遗忘效果(+24.6%)和生成质量(5.8倍),同时有效保持了模型实用性,仅使用少量保留的监督数据。

英文摘要

Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that, on average, ASRU significantly improves unlearning effectiveness (+24.6%) and generation quality (5.8x) while effectively preserving model utility, using only a small amount of retained supervision data.

2605.28591 2026-06-11 cs.CL cs.AI 版本更新

Models That Know How Evaluations Are Designed Score Safer

知道评估如何设计的模型更安全

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

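AI summary: Fine-tunes models to acquire meta-knowledge about evaluations (such as verifiable structures or moral dilemmas) and finds that they then score safer on safety benchmarks, introducing a new confounder that is independent of explicit memorization or evaluation awareness.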

详情
AI中文摘要

AI安全评估的有效性取决于模型在受控环境和部署环境中行为的一致性。先前的研究已经发现测试时的上下文线索(例如假设场景)是口头评估意识和后续行为转变的来源。在本文中,我们研究了这一现象的一个潜在解释:评估元知识,定义为关于评估结构特征的参数化知识。类似于数据集污染(基准暴露通过记忆导致更高性能),我们假设在描述评估实践的文本上训练的模型可能隐式地学会识别和响应类似评估的上下文,例如通过接触关于AI基准测试的科学文章或社交媒体帖子。为了验证这一点,我们在描述评估特征(如可验证结构或道德困境)的合成文档上微调模型。在六个安全基准上评估这个微调模型,我们发现它比基础模型和控制模型显著更安全。即使将分析限制在缺乏明确评估意识口头表达的响应中,这种行为转变仍然存在。我们的结果表明,评估元知识可能夸大安全基准性能,引入了一种独立于显式记忆或口头评估意识的新混淆因素,因此难以检测。这些发现对AI安全评估的设计和解释具有重要意义。我们的代码和模型可在 https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge 获取。

英文摘要

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and is thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

2601.12164 2026-06-11 cs.CY cs.CL 版本更新

The Language You Ask In: Language-Conditioned Ideological Divergence in LLM Analysis of Contested Political Documents

提问的语言:语言条件对LLM分析争议性政治文件时的意识形态分歧的影响

Oleg Smirnov

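AI summary: Using semantically equivalent prompts in Russian and Ukrainian, finds that ChatGPT and Claude Opus produce systematically divergent ideological framings when analyzing the same Ukrainian civil society document, with the degree of divergence varying by model.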

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为跨多语言语境的分析工具,但其输出可能带有由提示语言条件引起的系统性偏差。本研究对LLM生成的乌克兰公民社会文件政治分析进行了实验比较,使用俄语和乌克兰语的语义等价提示,分别对来自不同开发者的两个前沿模型——ChatGPT 5.2和Claude Opus 4.5进行测试。尽管源材料相同且查询结构平行,两个模型沿同一轴线出现分歧:俄语输出倾向于去合法化框架,将公民社会行为者描述为限制民主授权的外部资助精英,而乌克兰语输出则将同一行为者视为民主竞争中的合法利益相关者。然而,这种分歧的程度因模型而异。ChatGPT的俄语输出再现了俄罗斯国家话语的特征词汇;Claude Opus的输出则保持在主流批评语境内,并在两种语言中对其判断进行限定。这些发现表明,仅提示语言就能系统性地改变分析相同内容的同一模型的意识形态取向。这种转变是多语言LLM的一个普遍属性,其严重程度及其与宣传叙事的对齐程度因系统而异。这些影响涉及AI在极化信息环境中的部署、跨语言研究以及多语言社会中的AI治理。

英文摘要

Large language models (LLMs) are increasingly deployed as analytical tools across multilingual contexts, yet their outputs may carry systematic biases conditioned by the language of the prompt. This study presents an experimental comparison of LLM-generated political analyses of a Ukrainian civil society document, using semantically equivalent prompts in Russian and Ukrainian administered to two frontier models from different developers, ChatGPT 5.2 and Claude Opus 4.5. Despite identical source material and parallel query structures, both models diverged along the same axis: Russian-language outputs leaned toward delegitimizing framings, characterizing civil society actors as externally funded elites constraining a democratic mandate, while Ukrainian-language outputs treated the same actors as legitimate stakeholders in democratic contestation. The magnitude of this divergence, however, was model-dependent. ChatGPT's Russian output reproduced vocabulary characteristic of Russian state discourse; Claude Opus's stayed in a mainstream critical idiom and hedged its judgments in both languages. These findings demonstrate that prompt language alone can systematically shift the ideological orientation of an unchanged model analyzing identical content. The shift is a general property of multilingual LLMs whose severity, and whose alignment with propaganda narratives, varies across systems. The implications reach AI deployment in polarized information environments, cross-lingual research, and AI governance in multilingual societies.

2602.06547 2026-06-11 cs.CR cs.AI cs.CL cs.ET 版本更新

"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild

“不要向用户提及此事”:检测与理解恶意代理技能

Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Leo Yu Zhang

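AI summary: Through a systematic security analysis of 98,380 skills from two major registries, combining static pattern matching with dynamic behavioral verification, identifies 157 malicious skills containing 632 distinct vulnerabilities across 13 attack techniques, and finds that attack sophistication correlates with concealment investment.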

详情
Comments
Accepted to the 35th USENIX Security Symposium (USENIX Security 2026)
AI中文摘要

基于LLM的编码代理越来越依赖称为技能的第三方扩展,这些技能捆绑了自然语言指令和辅助脚本,以完全用户权限执行。社区注册中心已出现以分发这些技能,但由于缺乏标记的威胁数据,安全影响仍未得到研究。本文对从两个主要注册中心收集的98,380个技能进行了系统安全分析。通过静态模式匹配和动态行为验证的结合,我们识别出157个表现出确认恶意行为的技能,涵盖13种攻击技术中的632个不同漏洞。我们的分析表明,这些威胁是故意的而非偶然:每个恶意技能平均包含4.03个漏洞,跨越多个攻击阶段。我们识别出两种具有统计显著负相关的主要攻击策略——通过远程代码执行窃取凭证,以及通过嵌入文档中的对抗性指令操纵代理。超过一半的确认案例来自一个采用模板化品牌冒充大规模攻击的单一威胁行为者。我们进一步观察到,攻击复杂性与隐藏投入相关,高级技能普遍使用未记录的功能,同时利用平台原生的信任机制。在负责任的披露之后,注册中心维护者删除了所有157个(100%)报告的技能。我们的数据集和检测管道公开可用,以促进未来关于保护LLM代理生态系统安全的研究。

英文摘要

LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper scripts that execute with full user privileges. Community registries have emerged to distribute these skills, but the security implications remain unstudied due to the absence of labeled threat data. This paper presents a systematic security analysis of 98,380 skills collected from two major registries. Through a combination of static pattern matching and dynamic behavioral verification, we identify 157 skills exhibiting confirmed malicious behavior, encompassing 632 distinct vulnerabilities across 13 attack techniques. Our analysis reveals that these threats are deliberate rather than accidental: each malicious skill contains an average of 4.03 vulnerabilities spanning multiple attack phases. We identify two dominant attack strategies with statistically significant negative correlation -- credential theft via remote code execution, and agent manipulation through adversarial instructions embedded in documentation. Over half of all confirmed cases originate from a single threat actor employing templated brand impersonation at scale. We further observe that attack sophistication correlates with concealment investment, with advanced skills universally employing undocumented capabilities while also exploiting platform-native trust mechanisms. Following responsible disclosure, registry maintainers removed all 157 (100%) of the reported skills. Our dataset and detection pipeline are publicly available to facilitate future research on securing LLM agent ecosystems.
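The headline figures in this abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract; the variable names are illustrative, and the prevalence percentage is derived here rather than reported by the authors.

```python
# Counts taken directly from the abstract.
total_skills = 98_380
malicious_skills = 157
distinct_vulnerabilities = 632

# Average vulnerabilities per malicious skill (abstract reports 4.03).
avg_vulns_per_skill = distinct_vulnerabilities / malicious_skills

# Share of all analyzed skills confirmed malicious (derived, not reported).
malicious_share_percent = 100 * malicious_skills / total_skills

print(f"average vulnerabilities per malicious skill: {avg_vulns_per_skill:.2f}")
print(f"confirmed malicious share of registry: {malicious_share_percent:.2f}%")
```

The reported average of 4.03 is consistent with 632 vulnerabilities spread over 157 skills, and the confirmed malicious skills make up roughly 0.16% of the analyzed registry contents.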

2606.10813 2026-06-11 cs.CR cs.CL 版本更新

RedAct: Redacting Agent Capability Traces for Procedural Skill Protection

RedAct: 为程序技能保护而编辑智能体能力痕迹

Shuwen Xu, Zhitao He, Yi R. Fung

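AI summary: Proposes the RedAct framework, which localizes protected key information, rewrites traces, and embeds behavioral watermarks, reducing skill transfer to below the no-skill baseline while preserving audit evidence.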

详情
AI中文摘要

用户依赖执行痕迹来观察智能体行为、诊断故障并确保问责。这些痕迹包含丰富的程序细节,包括工具调用、中间决策和错误恢复逻辑。然而,这些细节可能暴露私有的程序技能,使下游方法能够在没有模型权重或技能文件的情况下恢复关键公式、阈值和策略。为了量化这种风险并评估保护措施,我们构建了CapTraceBench,一个包含75个专业长时任务和154个跨七个领域精选技能的基准。我们还引入了RedAct(https://github.com/...),一个受保护的痕迹发布框架,该框架定位受保护的关键信息,重写痕迹同时保留验证者关键证据,并嵌入行为水印用于下游溯源分析。在代表性的痕迹重用方法中,RedAct将归一化技能转移(NST)从原始痕迹的44.7-67.1%降至低于无技能基线,同时保留审计证据。其独立的行为水印实现了93.6-100.0%的真实检测率,误报率最多为1.9%。这些结果将公共智能体痕迹视为安全接口,并表明选择性编辑可以在不删除审计证据的情况下减少程序能力泄露。

英文摘要

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct CapTraceBench, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce RedAct (https://github.com/...), a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, RedAct reduces normalized skill transfer (NST) from 44.7-67.1% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6-100.0% true detection with a false alarm rate of at most 1.9%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.

11. 低资源、领域适配与高效训练 7 篇

2606.11257 2026-06-11 cs.CL cs.LG cs.PF 新提交

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

移动NPU上的能效型设备端RAG:Snapdragon X Elite系统设计与基准测试

Zhiyuan Cheng, Longying Lai

发表机构 * Qualcomm(高通) Snapdragon X Elite(骁龙X Elite) Dell XPS 13 laptop(戴尔XPS 13笔记本电脑) Qualcomm Hexagon NPU(高通Hexagon NPU) Adreno X1-85

AI总结 本文首次在Snapdragon X Elite的Hexagon NPU上实现端到端RAG流水线,通过对比CPU和GPU,NPU在嵌入吞吐量、系统能耗和查询延迟上分别提升9.1倍、降低12.3倍和4.0倍,且答案质量相当。

详情
Comments
9 pages, 2 figures, 6 tables
AI中文摘要

检索增强生成(RAG)流水线计算密集,结合了嵌入、检索、重排序和大语言模型(LLM)生成。完全在设备端运行有利于隐私、延迟和离线使用,但CPU推理的能耗成本是一个主要障碍。我们提出了据我们所知第一个在Snapdragon X Elite的Qualcomm Hexagon NPU上运行所有神经阶段(嵌入、重排序和LLM生成)的端到端RAG流水线。在Dell XPS 13笔记本电脑上进行性能分析,我们比较了NPU加速的RAG与CPU和OpenCL/Adreno GPU基线在索引和查询工作负载上的表现。在索引方面,NPU实现了9.1倍的嵌入吞吐量提升和12.3倍的系统能耗降低。在120查询的Wikipedia段落基准测试中,与CPU基线相比,NPU实现了18.1倍的LLM预填充加速、4.0倍的端到端查询延迟降低和4.0倍的系统能耗降低;集成GPU上的相同工作负载比CPU慢1.7倍,且能耗比NPU高6.5倍。GPT-4.1 LLM作为评判者的评估发现,NPU的答案质量与CPU和GPU相当,在评估者噪声范围内(1-10分制下平均9.32 vs. 8.95 vs. 9.03),86.7%的查询在所有三个后端上得分相同。因此,在Snapdragon X Elite / Hexagon类笔记本电脑SoC上,NPU实现了实用、能效高的设备端RAG,且无质量退化——这是一条通往绿色边缘智能的可持续路径,我们预计随着软件栈的成熟,该方法将推广到类似的移动NPU(Apple Neural Engine、Intel NPU、MediaTek APU)。

英文摘要

Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

2606.11387 2026-06-11 cs.CL cs.AI cs.LG 新提交

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

小实验,更经济的决策:微预训练中分阶段提升的案例研究

Felipe Chavarro Polania

发表机构 * Hewlett Packard Enterprise(慧与科技公司)

AI总结 研究微预训练中的分阶段提升协议,在固定提升规则下逐级筛选配置,并在Windows A100和Linux L40S两类主机上验证;早期排名不稳定且对主机敏感,最终12小时分支花费144 GPU小时,低于继续全部候选所需的192或432 GPU小时,但结论仅为有界的成本分配发现,不主张全局最优。

详情
Comments
14 pages, 5 figures; 12-hour dual-host micro-pretraining promotion study; source package includes curated ancillary artifacts
AI中文摘要

短预训练运行可以降低实验成本,但它们也可能过度推广那些仅在小预算下表现良好的配置。我们针对固定微预训练运行器在两个异构主机块(Windows A100和Linux L40S)上研究了一种可审计的分阶段提升协议。从12个预先筛选的配置开始,我们使用2分钟、5分钟、10分钟、60分钟和12小时的分阶段预算,并在昂贵的延续之前设置固定的提升规则。早期筛选被有意视为不稳定:5分钟和10分钟的排名对主机敏感,而最终的12小时排名最优条件并非复制10分钟门控下的平均最佳条件。由于不同阶段的种子范围不同,这些变化是操作性的提升证据,而非种子内曲线。复制60分钟门控将分阶段因子筛选桥接参考保留在提升集中,它在所有四个60分钟主机-种子单元中排名第一。在最终的12小时确认包中,桥接条件在两个种子的所有四个主机-种子单元中排名第一;贪婪比较器未满足固定的0.010 val_bpb近似等价规则;更便宜的d8/ar48(深度8,宽高比48)哨兵未满足固定的0.020平均差距规则。执行的12小时分支花费144 GPU小时,完整的分阶段协议记录169.2训练GPU小时(包括筛选阶段)。继续所有四个60分钟候选将花费192 GPU小时,而继续所有九个复制10分钟候选将花费432 GPU小时。后者是未运行延续的会计反事实,并非表明跳过的候选不可能超越参考。结果是一个有界成本分配发现,而非全局最优性、容量归一化优越性或优于自适应超参数优化方法的声明。

英文摘要

Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
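The GPU-hour figures in the abstract can be reproduced with simple arithmetic. The sketch below is an illustration only: it assumes one GPU per run, four host-seed cells (two hosts times two seeds), and three conditions in the executed 12-hour branch (the bridge reference, the greedy comparator, and the d8/ar48 sentinel). That count of three is inferred from the abstract, not stated in it, and the helper name `continuation_gpu_hours` is hypothetical.

```python
# Hypothetical accounting helper. The constants below are assumptions
# inferred from the abstract, not values taken from the paper's artifacts.
HOURS_PER_RUN = 12      # final-stage budget per run
HOST_SEED_CELLS = 4     # assumed: 2 hosts x 2 seeds


def continuation_gpu_hours(num_conditions: int) -> int:
    """GPU-hours to run `num_conditions` for 12 hours in every host-seed cell."""
    return num_conditions * HOST_SEED_CELLS * HOURS_PER_RUN


executed = continuation_gpu_hours(3)    # assumed: bridge, greedy, sentinel
all_60min = continuation_gpu_hours(4)   # all four 60-minute candidates
all_10min = continuation_gpu_hours(9)   # all nine replicated 10-minute candidates

# Screening stages account for the rest of the reported 169.2 GPU-hours.
screening_overhead = round(169.2 - executed, 1)

print(executed, all_60min, all_10min, screening_overhead)
```

Under these assumptions the three continuation totals match the 144, 192, and 432 GPU-hours quoted in the abstract, and the screening stages account for the remaining 25.2 GPU-hours.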

2606.11499 2026-06-11 cs.CL cs.AI 新提交

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

枢纽或边缘:基于网页图中心性的预训练数据选择

Vedant Badoni, Danqi Chen, Xinyi Wang

发表机构 * Princeton Language and Intelligence(普林斯顿语言与智能) Princeton University(普林斯顿大学)

AI总结 提出WebGraphMix框架,利用Common Crawl主机级网页图的结构中心性得分调整预训练数据中中心与边缘文档的比例,无需模型训练或标注数据,在400M和1B参数模型上平均性能提升至41.4%。

详情
Comments
10 pages
AI中文摘要

现代语言模型的性能关键取决于预训练数据的组成。然而,现有的数据选择方法依赖辅助分类器进行文档评分或混合优化,增加了计算开销和对标注数据的依赖。我们提出WebGraphMix,一个轻量级的数据选择框架,它计算Common Crawl主机级网页图的结构中心性得分,并用其改变预训练混合数据中中心文档与边缘文档的比例。我们假设中心主机使模型暴露于可重用的抽象知识,而边缘主机编码专门的、长尾知识。WebGraphMix在网页规模下高效计算中心性得分,无需模型训练、标注数据或下游监督。我们将WebGraphMix集成到DataComp-LM流水线中,训练了400M和1B参数规模的模型,分别使用8B和28B token,在从事实知识到符号推理的23个任务上进行评估。实验表明,中心和边缘网页区域编码互补的能力。以1:1比例混合两者平均达到41.4%,而均匀采样为39.8%。将结构得分与文档级质量分类器得分相结合,性能进一步提升至43.8%。这些发现表明,网页图拓扑是预训练数据策展的一个有意义维度,捕获了与现有基于内容的方法大致正交的信息。

英文摘要

The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. A mixture combining both at a 1:1 ratio achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.
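As a rough illustration of centrality-based mixing, the sketch below scores hosts in a toy link graph and samples documents at a 1:1 central-to-peripheral ratio. It is not the WebGraphMix implementation: the abstract does not name the centrality measure, so in-degree is used here as a stand-in, and the median split and all function names are assumptions made for the example.

```python
import random
from collections import Counter


def in_degree_centrality(edges):
    """Count inbound links per host (a stand-in for the real centrality score)."""
    scores = Counter()
    for src, dst in edges:
        scores[dst] += 1
        scores.setdefault(src, 0)  # hosts with no inbound links still get a score
    return dict(scores)


def split_hosts(scores):
    """Split hosts into a central half and a peripheral half by score."""
    ranked = sorted(scores, key=lambda h: (-scores[h], h))
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


def mix_documents(docs_by_host, central, peripheral, n_docs, central_ratio=0.5, seed=0):
    """Sample n_docs documents with a fixed share drawn from central hosts."""
    rng = random.Random(seed)
    central_pool = [d for h in central for d in docs_by_host.get(h, [])]
    peripheral_pool = [d for h in peripheral for d in docs_by_host.get(h, [])]
    n_central = round(n_docs * central_ratio)
    picked = rng.sample(central_pool, n_central)
    picked += rng.sample(peripheral_pool, n_docs - n_central)
    return picked


# Toy host graph: (source host, destination host) link pairs.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a"), ("b", "a")]
scores = in_degree_centrality(edges)
central, peripheral = split_hosts(scores)
docs_by_host = {h: [f"{h}-doc{i}" for i in range(3)] for h in scores}
mix = mix_documents(docs_by_host, central, peripheral, n_docs=4)
print(central, peripheral, mix)
```

Changing `central_ratio` is the knob that corresponds to varying the proportion of central versus peripheral documents; a real pipeline would compute scores over the full host-level graph rather than a five-edge toy.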

2606.11931 2026-06-11 cs.CL 新提交

Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model

低资源语言孟加拉语中书面答案的语义评分:使用微调轻量级语言模型

Meherun Farzana, Aniket Joarder, Mahmudul Hasan, Md. Mosaddek Khan

发表机构 * Computer Science and Engineering, University of Dhaka(达卡大学计算机科学与工程系)

AI总结 针对低资源语言孟加拉语,提出一种基于微调轻量级语言模型的双语评估系统,通过语义正确性而非词汇重叠进行自动评分,在合成和人工评估中均取得最优性能。

详情
Comments
10 pages, 5 figures, 2 tables. Preprint
AI中文摘要

孟加拉语是世界上使用最广泛的语言之一,但在教育NLP研究中仍服务不足。在许多偏远和农村地区,合格学科教师资源有限,书面答案因此主要依靠人工评分,限制了及时和一致的反馈。自动评估具有挑战性,因为语义正确的回答在表面形式上可能有很大差异。我们提出一个为低资源教育环境设计的双语(孟加拉语-英语)评估系统,优先考虑语义正确性而非词汇重叠。我们的方法微调一个轻量级语言模型,使用问题、参考答案和学生答案对每个回答进行评分,产生一个数值分数和简洁、基于上下文的反馈,适合课堂部署。我们还构建了一个合成双语数据集,以实现受控训练和评估。在统一协议下评估的专有和开源LLM中,我们的QLoRA微调Qwen3-8B在合成评估中产生最具抗泄漏性的反馈(RoRa = 0.819),并在专门的人工研究中与人类评分的一致性最强(rho = 0.936, MAE = 0.725),证实了持续改进。

英文摘要

Bangla is among the world's most widely spoken languages, yet it remains underserved in educational NLP research. In many remote and rural regions, access to qualified subject teachers is limited, and written answers are consequently graded largely by hand, restricting timely and consistent feedback. Automatic assessment is challenging because semantically correct responses can vary substantially in surface form. We present a bilingual (Bangla-English) evaluation system designed for low-resource educational settings that prioritizes semantic correctness over lexical overlap. Our approach fine-tunes a lightweight language model to grade each response using the question, reference answer, and student answer, producing a numeric score and concise, context-grounded feedback suitable for classroom deployment. We also construct a synthetic bilingual dataset to enable controlled training and evaluation. Across proprietary and open-source LLMs evaluated under a unified protocol, our QLoRA-tuned Qwen3-8B confirms consistent improvement by producing the most leakage-resistant feedback (RoRa = 0.819) in synthetic evaluation and the strongest agreement with human scores (rho = 0.936, MAE = 0.725) in a dedicated human study.

2408.02600 2026-06-11 cs.CL 版本更新

BioMamba: Domain-Adaptive Biomedical Language Models

BioMamba: 领域自适应的生物医学语言模型

Ling Yue, Mingzhi Zhu, Sixue Xing, Yunning Cao, Yanbo Wang, Shimin Shan, Jinfei Liu, Vijil Chenthamarakshan, Shaowu Pan, Payel Das, Tianfan Fu

AI总结 提出基于Mamba2的领域自适应预训练方法BioMamba,在PubMed、C4和Wikipedia混合数据上持续训练,显著降低生物医学困惑度并保持通用语言能力。

详情
AI中文摘要

背景。生物医学语言模型应在提升生物医学文本性能的同时保持通用语言模型的流畅性。对于基于Mamba的模型,这种权衡在生物医学文献和临床文本中尚未得到系统研究。方法。我们开发了BioMamba,一个包含五个规模的生物医学Mamba2模型家族,通过在PubMed摘要、Colossal Clean Crawled Corpus (C4)和Wikipedia的80%/10%/10%平衡混合数据上对已发布的公开Mamba2检查点进行持续预训练得到。贡献在于自适应配方和附带的开放权重检查点。结果。在五个规模上,BioMamba一致降低了PubMed困惑度,将Wikipedia风格的留出集困惑度改善(降低)了1.46-4.72 PPL,而C4困惑度基本不变。在六个域外多项选择基准上,BioMamba保持在Mamba2的+/-3个百分点内,没有系统性退化。经过监督微调后,BioMamba+SFT在每个评估规模上匹配或超过Mamba2+SFT在MIMIC-IV笔记补全和出院总结生成上的表现,并在每个规模上改进了PubMedQA。最强模型(BioMamba-2.7B)在PubMed上达到5.28的困惑度,在BioASQ和PubMedQA上分别达到90.24%和73.00%的准确率。结论。平衡的领域自适应持续预训练配方增强了Mamba2语言模型在生物医学文献和临床文本上的性能,同时保持了通用语言建模的流畅性。

英文摘要

Background. Biomedical language models should improve performance on biomedical text while retaining general-language-modeling fluency. For Mamba-based models, this trade-off has not been systematically studied across biomedical literature and clinical text. Methods. We developed BioMamba, a family of biomedical Mamba2 models at five scales obtained by continued pretraining of released public Mamba2 checkpoints on a balanced 80%/10%/10% mixture of PubMed abstracts, the Colossal Clean Crawled Corpus (C4), and Wikipedia. The contribution is the adaptation recipe and the accompanying open-weight checkpoints. Results. Across five scales, BioMamba consistently lowered PubMed perplexity, improved Wikipedia-style held-out perplexity by 1.46-4.72 PPL, and left C4 perplexity essentially unchanged. On six out-of-domain multiple-choice benchmarks, BioMamba stayed within +/-3 percentage points of Mamba2 with no systematic regression. After supervised fine-tuning, BioMamba+SFT matched or exceeded Mamba2+SFT on MIMIC-IV note completion and discharge summary generation at every evaluated scale, and improved PubMedQA at every scale. The strongest model (BioMamba-2.7B) reached a PubMed perplexity of 5.28 and accuracies of 90.24% and 73.00% on BioASQ and PubMedQA, respectively. Conclusions. A balanced domain-adaptive continued pretraining recipe strengthens Mamba2 language models on biomedical literature and clinical text while preserving general-language-modeling fluency.

2601.04710 2026-06-11 cs.CL cs.LG 版本更新

Steering the Noise: Turning Random Perturbations into Effective Descent for Memory-Efficient LLM Fine-Tuning

引导噪声:将随机扰动转化为有效下降方向以实现内存高效的LLM微调

Feihu Jin, Shipeng Cen, Ying Tan

AI总结 提出一种即插即用框架,通过候选扰动池选择或组合与优化目标对齐的扰动,改进零阶优化梯度估计,提升LLM微调的收敛速度和任务精度。

详情
Comments
12pages, 6figures
AI中文摘要

微调大型语言模型(LLMs)取得了强大的性能,但通常受到反向传播内存开销的限制。零阶(ZO)优化通过仅使用前向传递来估计梯度,避免了这一开销,但由于随机高斯扰动在高维参数空间中产生高方差的梯度估计,其收敛速度通常较慢。在本文中,我们提出了一种即插即用框架,将随机扰动转化为更有效的下降方向。关键思想是抽取一小批候选扰动,评估其损失值,然后选择或组合那些与优化目标最一致的扰动。我们开发了该思想的两种实例:MeZO-GV,通过低损失和高损失扰动组之间的对比形成引导向量;以及MeZO-Greedy,在固定的评估预算内保留单个最佳扰动。我们从理论上证明,这两种策略在每步目标函数减少上均优于标准ZO估计,从而提高了收敛速度。在不同规模和架构的LLM上的实验证实,所提出的方法自然地与现有ZO优化器集成,并一致地提高了收敛速度和任务准确性。在OPT-13B上,我们的方法在11个基准测试中优于所有ZO基线,并在其中9个上超过了基于梯度的方法,同时保留了仅前向优化的内存效率。

英文摘要

Fine-tuning large language models (LLMs) achieves strong performance but is often limited by the memory overhead of backpropagation. Zeroth-order (ZO) optimization avoids this overhead by estimating gradients through forward passes alone, yet it typically converges slowly because random Gaussian perturbations yield high-variance gradient estimates in high-dimensional parameter spaces. In this paper, we propose a plug-and-play framework that turns random perturbations into more effective descent directions. The key idea is to draw a small pool of candidate perturbations, evaluate their loss values, and then select or combine those that are best aligned with the optimization objective. We develop two instantiations of this idea: MeZO-GV, which forms a guiding vector from the contrast between low-loss and high-loss perturbation groups, and MeZO-Greedy, which keeps the single best perturbation within a fixed evaluation budget. We theoretically show that both strategies yield a larger per-step reduction in the objective than standard ZO estimation, leading to improved convergence rates. Experiments on LLMs of different scales and architectures confirm that the proposed methods integrate naturally with existing ZO optimizers and consistently improve convergence speed and task accuracy. On OPT-13B, our approach outperforms all ZO baselines across 11 benchmarks and exceeds gradient-based methods on 9 of them, while retaining the memory efficiency of forward-only optimization.
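The pool-then-select idea can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the paper's MeZO-Greedy algorithm: it uses a quadratic loss in place of an LLM objective, a standard two-point zeroth-order estimate of the directional derivative, and a selection rule (keep the candidate with the largest absolute directional derivative) chosen for the example, since the abstract does not specify the criterion. All names and hyperparameters are hypothetical.

```python
import random


def loss(theta):
    """Toy quadratic objective standing in for a fine-tuning loss."""
    return 0.5 * sum(t * t for t in theta)


def projected_grad(theta, z, eps):
    """Two-point zeroth-order estimate of the directional derivative along z."""
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    return (plus - minus) / (2 * eps)


def greedy_zo_step(theta, rng, pool_size=4, eps=1e-3, lr=0.01):
    """Draw a pool of Gaussian perturbations and step along the steepest one."""
    best_g, best_z = 0.0, None
    for _ in range(pool_size):
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        g = projected_grad(theta, z, eps)
        # Assumed selection rule: largest |directional derivative| wins.
        if best_z is None or abs(g) > abs(best_g):
            best_g, best_z = g, z
    # Only forward evaluations were used; no backpropagation is required.
    return [t - lr * best_g * zi for t, zi in zip(theta, best_z)]


rng = random.Random(0)
theta = [1.0] * 20
initial = loss(theta)
for _ in range(200):
    theta = greedy_zo_step(theta, rng)
final = loss(theta)
print(initial, final)
```

Each step here costs two loss evaluations per candidate, so a larger pool trades extra forward passes for a better-aligned update direction; that trade-off is what a fixed evaluation budget constrains.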

2605.06485 2026-06-11 cs.CL cs.AI 版本更新

Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models

Litespark Inference For CPUs: 三元(1.58位)语言模型的超快SIMD框架

Nii Osae Osae Dade, Tony Morri, Moinul Hossain Rahat, Sayandip Pal, Rickston Pinto

AI总结 针对三元语言模型权重为{-1,0,1}的特点,提出自定义SIMD内核,用加减运算替代矩阵乘法,在CPU上实现18-96倍加速和6倍内存减少。

详情
AI中文摘要

大型语言模型(LLM)已经改变了人工智能,但其计算需求对大多数用户来说仍然过高。标准推理需要昂贵的数据中心GPU或云API访问,导致超过十亿台个人计算机在AI工作负载中未被充分利用。三元模型提供了一条前进的道路:它们的权重被限制在{-1, 0, +1},理论上消除了浮点乘法的需求。然而,现有框架未能利用这种结构,将三元模型视为密集浮点网络。我们通过自定义SIMD内核填补了这一空白,这些内核用简单的加法和减法运算取代矩阵乘法,针对现代CPU上可用的整数点积指令。我们的实现Litespark-Inference可通过pip安装,并直接与Hugging Face集成,在Apple Silicon上实现了比标准PyTorch推理高18.15倍的吞吐量、快7.15倍的首令牌时间和6.03倍的内存减少,在Intel和AMD处理器上实现了高达95.81倍的吞吐量加速。

英文摘要

Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face, achieving 18.15x higher throughput, 7.15x faster time-to-first-token, and 6.03x memory reduction compared to standard PyTorch inference on Apple Silicon, with comparable or higher throughput speedups of up to 95.81x on Intel and AMD processors.
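To see why ternary weights remove the multiplications, the sketch below computes a matrix-vector product over {-1, 0, +1} weights using only additions and subtractions. It is a scalar Python illustration with hypothetical names, not the Litespark-Inference kernels, which by the abstract's description are SIMD code built on integer dot-product instructions.

```python
import math


def ternary_matvec(weights, x):
    """Multiply a {-1, 0, +1} matrix by a vector using only adds and subtracts."""
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: skip the activation entirely
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference implementation with ordinary multiplications."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


W = [[1, 0, -1], [-1, -1, 1]]
x = [2, 5, 3]
y = ternary_matvec(W, x)
bits_per_weight = math.log2(3)  # three states per weight

print(y, dense_matvec(W, x), round(bits_per_weight, 2))
```

The "1.58-bit" label in the title corresponds to log2(3), the information content of a three-valued weight.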

12. 其他/综合NLP 20 篇

2606.11220 2026-06-11 cs.CL 新提交

LifeSentence: Language models can encode human life course trajectories from longitudinal panel data

LifeSentence: 语言模型可以从纵向面板数据编码人类生命历程轨迹

Samuel Liu, Muchen Xi, William Yeoh, Joshua J. Jackson

AI总结 提出LifeSentence模型,将大型语言模型与纵向面板数据结合,通过结构化自然语言记录生命事件并微调预训练模型,在少样本条件下超越传统方法,实现生命事件预测与时间顺序重建。

详情
AI中文摘要

预测人类生命结果对于理解个体如何获得长寿健康的生活至关重要。传统的统计方法准确度有限,可能是因为忽略了生命历程的序列结构。现代方法如Transformer架构需要大规模训练数据,而大多数纵向面板研究缺乏此类数据。本文介绍LifeSentence,一种将大型语言模型与纵向面板数据相结合的生命历程推理模型。通过将每个生命事件表示为结构化的自然语言记录,并在一个包含预测、鲁棒性和推理的18任务评估分类体系上对预训练的240亿参数语言模型进行指令微调,LifeSentence利用预训练期间已编码的分布知识补充面板数据。该模型在来自德国社会经济面板的约65,000名个体上训练——比之前基于Transformer的方法少约45倍——在所有任务族上均优于经典和深度学习基线,在联合事件与时间预测上相比最佳基线实现三倍改进,并在从去除时间戳的事件集重建时间顺序时达到91.2%的Kendall tau系数。在没有显式监督的情况下,该模型仅从离散事件序列中恢复出记录的社会分层模式,包括教育溢价、性别工资差距和母亲惩罚。自然语言接口进一步支持定性新研究查询,例如将早期生活史连接到指定的晚年终点,使LifeSentence成为预测工具和对人类传记进行反事实探索的探针。

英文摘要

Forecasting human life outcomes is important to gain insights into how individuals attain long and healthy lives. Conventional statistical approaches yield limited accuracy, potentially due to discarding the sequential structure of the life course. Modern methods such as transformer architectures require large-scale training data that most longitudinal panel studies lack. Here we introduce LifeSentence, a model for life-course reasoning that bridges large language models with longitudinal panel data. By representing each life event as a structured natural-language record and instruction-tuning a pretrained 24-billion-parameter language model across an 18-task evaluation taxonomy spanning prediction, robustness and reasoning, LifeSentence supplements panel data with distributional knowledge already encoded during pretraining. Trained on approximately 65,000 individuals from the German Socio-Economic Panel (roughly 45 times fewer than prior transformer-based approaches), LifeSentence outperforms classical and deep learning baselines across all task families, achieving a threefold improvement in joint event-and-timing prediction over the best baselines and 91.2% Kendall's tau when reconstructing chronological order from timestamp-stripped event sets. Without explicit supervision, the model recovers documented patterns of social stratification, including the education premium, the gender wage gap and the motherhood penalty, from discrete event sequences alone. A natural-language interface further enables qualitatively new research queries, such as connecting an early-life history to a specified late-life endpoint, establishing LifeSentence as both a predictive tool and a probe for counterfactual exploration of human biographies.
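Kendall's tau, the metric quoted for chronological reconstruction, measures how many item pairs two orderings agree on. The sketch below implements the tie-free variant (tau-a) for two orderings of the same distinct events. The event names are invented for illustration, and whether the paper uses this exact variant is an assumption.

```python
from itertools import combinations


def kendall_tau(true_order, predicted_order):
    """Kendall's tau-a between two orderings of the same distinct items."""
    rank = {item: i for i, item in enumerate(predicted_order)}
    concordant = discordant = 0
    for a, b in combinations(true_order, 2):  # a precedes b in the true order
        if rank[a] < rank[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical life events; the prediction swaps one adjacent pair.
truth = ["school", "first_job", "marriage", "child", "retirement"]
guess = ["school", "marriage", "first_job", "child", "retirement"]
tau = kendall_tau(truth, guess)
print(tau)
```

A value of 1 means the predicted order matches the true order exactly and -1 means it is fully reversed, so the reported 91.2% presumably corresponds to a tau of about 0.912.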

2606.11456 2026-06-11 cs.CL cs.AI cs.CY 新提交

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

社会科学中的AI编码智能体:方法多样,经验一致,解释脆弱

Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci

发表机构 * University of Oxford(牛津大学) University of Zurich(苏黎世大学) Technical University of Munich(慕尼黑工业大学)

AI总结 研究LLM智能体在科学分析中的方法多样性与解释脆弱性,通过20次独立实验发现智能体在设计层匹配或超越人类多样性,但在裁决层易受提示影响,偏差源于解释而非估计。

详情
AI中文摘要

基于LLM的智能体在科学分析中的部署引发了相互矛盾的担忧:智能体可能减少方法多样性,或者可能放大分析灵活性,使研究者得出动机性结论。我们认为这些担忧针对两个经验上可分离的层面:方法选择的设计层,以及决策规则将估计映射到实质性主张的裁决层。我们通过在著名的移民与社会政策问题上运行20次Claude Code和Codex的独立执行,并以多位分析师的人类基线为基准,对两者进行了测试。在设计层,Codex匹配了人类的方法多样性,而Claude Code产生了近三倍的规格;两个智能体的效应估计与人类共识大致一致,且没有智能体模型与任何人类模型完全匹配。提示诱导的反移民研究者先验重组了每个智能体的方法决策,但与同一数据中有偏见的人类分析师不同,它并未改变总体估计或最终裁决;智能体也没有沿着人类用来偏倚其估计的方法轴重新路由。在裁决层,一个明确的确认性提示将Claude Code的裁决从10%的支持率翻转为90%,同时其系数分布基本保持不变,这是通过规则省略而非规则软化实现的。AI智能体在设计层可以媲美或超越人类的方法多样性,但在裁决层仍然脆弱。在我们的设置中,AI偏差的所在不是估计而是解释。

英文摘要

The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy question against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.

2606.11897 2026-06-11 cs.CL 新提交

Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

Notes2Skills: 从实验室笔记本到具有确定性意识的科学智能体技能

Shi Liu, Jiayao Chen, Chengwei Qin, Yanqing Hu, Jufan Zhang, Linyi Yang

发表机构 * Southern University of Science and Technology(南方科技大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University College Dublin(都柏林大学学院)

AI总结 提出Notes2Skills框架,将实验室笔记转化为保留作者确定性的可验证科学智能体技能,解决不确定判断与确认结论混淆问题。

详情
Comments
28 pages, preprint
AI中文摘要

科学发现工作流程通常包含并严重依赖实验室笔记,研究人员在其中记录观察结果、解释不确定的结果并规划后续实验。这些信息丰富的实验室笔记保留了不断演变的科学推理和作者的不确定性,而不是出版物中展示的经过修饰的最终结果,为人工智能在更全面和更深层次上参与科学探索提供了宝贵机会。然而,大多数先前关于科学文本的工作集中在论文、协议或结构化数据库上,使得非正式的实验室笔记作为科学AI智能体的输入未被充分探索。这一差距很重要,因为实验室笔记通常在同一段落中混合了经过验证的观察结果、初步判断和可能的实验下一步。如果这些信号被混淆,AI智能体可能会将不确定的科学判断误认为是已确认的结论或可执行的行动。为此,我们提出了Notes2Skills,一个两阶段框架,用于将实验室笔记本转化为可验证的科学AI智能体技能,同时保留作者的不确定性。在七个条件和三个湿实验环节中,Notes2Skills是唯一既不会将不确定的笔记误认为是明确的指令,也不会丢弃明确指令的配置。我们表明,确定性保留是实验室笔记本与可靠智能体技能之间缺失的一环,为更安全的AI共同科学家系统开辟了一条道路。

英文摘要

Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.

2606.11926 2026-06-11 cs.CL cs.AI 新提交

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

通过假设树精炼迈向通用自主研究

Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Microsoft Research(微软研究院)

AI总结 提出Arbor框架,通过假设树精炼(HTR)实现长期自主研究循环,在六项真实任务中平均相对留出集增益超过Codex和Claude Code的2.5倍。

详情
AI中文摘要

科学进步依赖于探索、实验和抽象的重复循环。研究人员测试候选方向,解释证据,并将所得经验用于后续尝试。我们研究AI代理如何自主地长期运行这一循环。我们提出了Arbor,一个用于自主研究的通用框架,它结合了长期存在的协调器、短期执行器和假设树精炼(HTR),后者是一个持久树,跨时间连接假设、工件、证据和提炼的见解。协调器管理树上的全局研究策略,而执行器在隔离的工作树中实现和测试单个假设。当结果返回时,Arbor更新树,传播可重用的经验,优化搜索前沿,并接受验证过的改进。这种设计将自主研究从一系列局部尝试转变为累积过程,其中策略、执行和证据跨时间传递。我们在自主优化(AO)下评估Arbor,这是一种操作设置,代理通过迭代实验改进初始研究工件,无需逐步人工监督。在模型训练、工具工程和数据合成等六项真实研究任务中,Arbor在所有六项任务上取得了最佳留出集结果,在相同任务接口和资源预算下,平均相对留出集增益是Codex和Claude Code的2.5倍以上。在MLE-Bench Lite上,Arbor使用GPT-5.5在Any Medal指标上达到86.36%,这是我们比较中的最强结果。

英文摘要

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.

2606.12113 2026-06-11 cs.CL cs.AI 新提交

Augmenting Molecular Language Models with Local $n$-gram Memory

用局部 $n$-gram 记忆增强分子语言模型

Xinni Zhang, Zijing Liu, He Cao, Yu Li, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) International Digital Economy Academy(国际数字经济学院)

AI总结 针对SMILES字符串的Transformer模型因字符级分词破坏化学语义的问题,提出MolGram模块,通过条件$n$-gram记忆哈希查找注入局部上下文,在三个任务上以更少参数超越基线。

详情
AI中文摘要

基于Transformer的SMILES字符串语言模型存在局部性差距:标准字符级分词会破坏化学上有意义的模式,迫使模型反复学习局部语法而牺牲长距离依赖。为了解决这个问题而不干扰标准分词器,我们提出了MolGram,它将条件$n$-gram记忆模块集成到分子语言模型中。MolGram通过可扩展的哈希查找将局部字符串模式映射到学习到的嵌入,并动态地将这种区域上下文注入隐藏状态。在三个任务(包括无条件分子生成、正向反应预测和单步逆合成)上的评估表明,MolGram持续提升性能。关键的是,我们的分析表明,MolGram以3倍更少的参数优于基线,将显式局部模式记忆确立为一种高效的归纳偏置。

英文摘要

Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional $n$-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks, including unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3$\times$ more parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
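The hash-lookup idea can be sketched in a few lines: each position's trailing character n-gram is hashed into a fixed-size table, and the looked-up vector is added to that position's hidden state. This is a simplified illustration with hypothetical names, not MolGram itself. The table would be learned in a real model (zeros are used here), the conditional part of the memory is omitted, and CRC32 is just a convenient deterministic hash chosen for the example.

```python
import zlib


def ngram_bucket_ids(smiles, n=3, num_buckets=1024):
    """Hash each position's trailing character n-gram into a bucket index."""
    ids = []
    for end in range(1, len(smiles) + 1):
        gram = smiles[max(0, end - n):end]  # shorter grams at the start
        ids.append(zlib.crc32(gram.encode("utf-8")) % num_buckets)
    return ids


def inject_memory(hidden, ids, table):
    """Add the looked-up memory vector to each position's hidden state."""
    return [[h + m for h, m in zip(vec, table[i])] for vec, i in zip(hidden, ids)]


smiles = "CC(=O)O"  # acetic acid
dim, num_buckets = 4, 1024
table = [[0.0] * dim for _ in range(num_buckets)]  # placeholder for learned embeddings
hidden = [[1.0] * dim for _ in smiles]             # placeholder hidden states

ids = ngram_bucket_ids(smiles, n=3, num_buckets=num_buckets)
updated = inject_memory(hidden, ids, table)
print(ids)
```

Because lookups go through a hash, the table size stays fixed regardless of how many distinct n-grams occur, at the cost of occasional collisions between unrelated patterns.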

2606.11207 2026-06-11 cs.AI cs.CL 交叉投稿

From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

从显式元素到隐式意图:用于可审计行为推断的预定义库

Liu hung ming

发表机构 * PARRAWA AI

AI总结 提出SemantiClean框架,通过共享元素库从电商会话数据中提取结构化语义信号,驱动可插拔推断目标,优先保证可审计性和可复现性,而非单纯追求精度。

详情
Comments
20 pages, 9 tables
AI中文摘要

我们提出SemantiClean,一个模块化框架,用于从电商会话数据中提取结构化语义信号,并通过共享元素库驱动可插拔推断目标,包括购买意图、客户细分和产品亲和性。与仅优化准确率的传统端到端预测器不同,SemantiClean优先考虑可审计性、结构治理和sigma=0可复现性,明确牺牲边际预测增益以换取元素级透明度和可辩护的决策轨迹。该框架基于在线购物者购买意图(OSPI)数据集,将24个行为元素组织成四层架构(功能层、交互层、系统层、上下文层),并通过三种抗通胀机制强制信号质量:RedundancyGroup贡献上限、TieredPenaltyCalculator偏差惩罚和AdaptiveConstraintMode冷启动处理。本文介绍了LLM集成语义推断引擎,一个完全实现的两阶段LLM驱动推断架构,在推断时利用完整的元素元数据。本文报告的所有定量结果均由该引擎产生。确定性引擎输出完全可复现(sigma=0);LLM相关结果(E8、E10)在固定提供者/模型/温度设置下受控输出可变性。性别推断目标在当前实现中非功能性,已从所有定量结果中排除。

英文摘要

We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start handling. This report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.

2606.11243 2026-06-11 cs.LG cs.CL 交叉投稿

ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

ProHiFlo: 具有功能引导的分层流匹配用于从头蛋白质生成

Chuanzhen Wang, Meade Cleti, Pete Jano

发表机构 * Arizona State University(亚利桑那州立大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Tongji University(同济大学)

AI总结 提出ProHiFlo,一种分层流匹配框架,通过粗到细生成、功能引导和自适应SE(3)等变架构,实现高效、准确的从头蛋白质生成,在酶活性位点支架任务中成功率58.9%。

详情
Comments
23 pages
AI中文摘要

从头蛋白质生成在治疗设计、酶工程和合成生物学中具有变革潜力。尽管基于扩散和流匹配的方法已取得进展,但它们通常在单一分辨率下操作,且缺乏整合功能约束的机制。我们提出ProHiFlo,一种具有三项创新的分层流匹配框架:(1) 粗到细生成,先建模主链几何再细化到全原子坐标,在保持精度的同时降低计算成本;(2) 功能引导,利用预训练预测器引导生成朝向所需性质,无需重新训练;(3) 自适应SE(3)等变架构,用于高效多尺度处理。在无条件生成、基序支架和功能设计上的实验表明,在需要少4倍采样步数的情况下实现了最先进的性能。在酶活性位点支架任务中,ProHiFlo达到58.9%的成功率,而RFDiffusion为41.2%。

英文摘要

De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion-based and flow matching approaches have achieved progress, they typically operate at a single resolution and lack mechanisms for incorporating functional constraints. We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance leveraging pretrained predictors to steer generation toward desired properties without retraining; (3) adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments on unconditional generation, motif scaffolding, and functional design demonstrate state-of-the-art performance while requiring 4x fewer sampling steps. On enzyme active site scaffolding, ProHiFlo achieves a 58.9% success rate compared to 41.2% for RFDiffusion.

2606.11482 2026-06-11 cs.SI cs.CL cross-list

Building Social World Models with Large Language Models

Haofei Yu, Yining Zhao, Guanyu Lin, Jiaxuan You

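AI summary: Proposes the Social World Model (SWM) framework, which uses LLMs to mine temporal patterns in social data and learn state-transition functions for social beliefs without human annotations or census data, outperforming time-series foundation models on a prediction-market benchmark.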

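Details
Comments
9 pages. ICML 2026
AI Chinese abstract (translated)

Understanding and predicting how social beliefs evolve in response to events, from policy changes to scientific breakthroughs, remains a fundamental challenge in social science. Given the commonsense knowledge and social intelligence of LLMs, we ask: can LLMs model the dynamics of social beliefs after social events? In this work, we introduce the concept of the Social World Model (SWM), a general framework designed to capture how social beliefs evolve in response to major events. SWM learns state-transition functions for social beliefs by mining temporal patterns in social data and optimizing the evidence lower bound, requiring neither human annotations that link events to belief shifts nor expensive census data. To evaluate SWM, we introduce a benchmark, SWM-bench, derived from real-world prediction markets, specifically Kalshi and Polymarket. SWM-bench contains more than 12k data points for social belief prediction tasks across domains such as politics, finance, and cryptocurrency. Our experimental results show that SWM significantly outperforms time-series foundation models, achieving state-of-the-art results on Kalshi data and competitive performance on Polymarket data, while providing interpretable insights into the underlying mechanisms of social belief dynamics.

English abstract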

Understanding and predicting how social beliefs evolve in response to events -- from policy changes to scientific breakthroughs -- remains a fundamental challenge in social science. Given LLMs' commonsense knowledge and social intelligence, we ask: Can LLMs model the dynamics of social beliefs following social events? In this work, we introduce the concept of the Social World Model (SWM), a general framework designed to capture how social beliefs evolve in response to major events. SWM learns state-transition functions for social beliefs by mining temporal patterns in social data and optimizing the evidence lower bound, without the need for explicit human annotations linking events to belief shifts, or for expensive census data. To evaluate SWM, we introduce a benchmark, SWM-bench, derived from real-world prediction markets, specifically Kalshi and Polymarket. SWM-bench includes over 12k data points for social belief prediction tasks spanning diverse domains such as politics, finance, and cryptocurrency. Our experimental results show that SWM significantly outperforms time-series foundation models, achieving state-of-the-art results on Kalshi data and demonstrating competitive performance on Polymarket data, while offering interpretable insights into the underlying mechanisms of social belief dynamics.

2606.11613 2026-06-11 cs.IR cs.CL cs.HC cs.SI cross-list

Factions Within, Uncertain Across: Within-Document Reader Sub-Groups in Social Highlighting

Kazuki Nakayashiki, Keisuke Watanabe

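AI summary: Using a margin-preserving curveball null, finds that readers within a document form strong sub-groups whose agreement far exceeds what shared salience predicts, with most of it arising from finer reader-specific agreement; cross-document stability remains unresolved.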

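Details
Comments
11 pages, 3 figures, 3 tables
AI Chinese abstract (translated)

When many people highlight the same document, is the crowd a single consensus, or is it internally structured into reader sub-groups that mark different content? And is that structure a stable property of the reader or of the document? Building on prior work showing that an individual's within-document highlighting signal is a whisper while individuality lives in selection, we pose the group-level question on a co-readership platform using a margin-preserving curveball null. Experiment 1: within a document, readers form strong sub-groups; pairwise agreement far exceeds what shared salience, mark density, and sentence popularity predict (nearest-neighbour agreement z=+6.3, significant in 88% of documents). Under an eight-block region-preserving null, shared engagement with the same coarse regions of the document explains about 40% of the excess agreement; most of it persists as finer-grained reader-specific agreement (z=+3.6, 77% significant). The within-document crowd is therefore factional in a descriptive sense. Experiment 2: is this grouping a stable reader trait? Here we are honest about statistical power. The cross-document split-half reproducibility of pairwise agreement is near zero when pooled (+0.078 and 0.000 in two independently drawn samples), and a power calibration shows that the test is informative only for pairs that co-read many documents. In the only informative high-overlap subset (k>=4), point estimates are positive but based on small samples, imprecise across the independently drawn samples, never significant, and attenuated under the region-preserving null. We therefore leave cross-document stability unresolved: the data are consistent with anything from situational grouping to a weak-to-moderate stable reader trait. The crowd is factional within a document; whether those factions follow readers across documents is, honestly, beyond our reach.

English abstract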

When many people highlight the same document, is the crowd a single consensus, or is it internally structured into reader sub-groups that mark different things -- and is that structure a stable property of a reader or of the document? Building on prior work showing an individual's within-document highlighting signal is a whisper while individuality lives in selection, we ask the group-level question on a co-readership platform using a margin-preserving curveball null. Experiment 1: within a document, readers form strong sub-groups -- pairs agree far beyond what shared salience, mark density, and sentence popularity predict (nearest-neighbour agreement z=+6.3, significant in 88% of documents). Under an eight-block region-preserving null, shared engagement with the same coarse regions of the document accounts for about 40% of this excess; the majority survives as finer reader-specific agreement (z=+3.6, 77% significant). So the within-document crowd is, in a descriptive sense, factional. Experiment 2: is that grouping a stable reader trait? Here we are honest about power. The cross-document split-half reproducibility of a pair's agreement is near zero pooled (+0.078 and 0.000 in two separately drawn samples), and a power calibration shows the test is informative only for pairs that co-read many documents. In the only informative high-overlap subset (k>=4), point estimates are positive but small-sample, imprecise across the separately drawn samples, never significant, and attenuate under the region-preserving null. We therefore leave cross-document stability unresolved: the data is consistent with anything from situational grouping to a weak-to-moderate stable reader trait. The crowd is factional within a document; whether its factions follow the reader across documents is, honestly, beyond our reach.
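
The margin-preserving curveball null mentioned in the abstract can be illustrated with a minimal sketch of the generic curveball trade on a reader-by-sentence highlight matrix. This is not the authors' implementation: the function names, the toy data, and the number of trades are assumptions, and the eight-block region-preserving variant is not shown.

```python
import random
from collections import Counter

def curveball(rows, n_trades, seed=0):
    """Randomize a binary matrix stored as one set of marked columns per row.

    Each trade picks two rows, pools the items that only one of them holds,
    and re-deals that pool so both rows keep their original sizes. Row sums
    and column sums are therefore unchanged.
    """
    rng = random.Random(seed)
    rows = [set(r) for r in rows]
    for _ in range(n_trades):
        i, j = rng.sample(range(len(rows)), 2)
        shared = rows[i] & rows[j]
        only_i = sorted(rows[i] - rows[j])
        only_j = sorted(rows[j] - rows[i])
        pool = only_i + only_j
        rng.shuffle(pool)
        rows[i] = shared | set(pool[:len(only_i)])
        rows[j] = shared | set(pool[len(only_i):])
    return rows

def margins(rows):
    """Return (marks per reader, marks per sentence) for comparison."""
    return [len(r) for r in rows], Counter(s for r in rows for s in r)

# Hypothetical toy data: four readers, sentence indices they highlighted.
highlights = [{0, 1, 2}, {1, 3}, {2, 3, 4}, {0, 4}]
null_sample = curveball(highlights, n_trades=200, seed=1)
```

Comparing an agreement statistic on `highlights` against its distribution over many such null samples gives the kind of z-score the abstract reports, under whatever statistic one chooses.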

2510.16152 2026-06-11 cs.DL cs.AI cs.CL cs.LG updated version

Mapping Scientific Literature with Large Language Models and Topic Modeling

Mason Smetana, Lev Khazanovich

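AI summary: Proposes an LLM-based two-stage classification framework that analyzes PNAS engineering literature through topic modeling, producing semantically interpretable topics, revealing cross-topic connections, and outperforming conventional methods.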

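Details
Comments
35 pages, 10 figures. Accepted for publication in Scientometrics. Final version available via DOI
AI Chinese abstract (translated)

Scientific literature is increasingly fragmented by disciplinary boundaries, specialized terminology, and potentially sparse keyword systems, which makes it difficult to capture the evolving structure of modern science. This study introduces an LLM-driven framework that maps scientific literature from a topic modeling perspective. The method is demonstrated on a corpus of more than 1,500 engineering-related articles published over 20 years in the Proceedings of the National Academy of Sciences. A two-stage classification pipeline first assigns each article a primary thematic category based on its abstract, then performs full-text analysis to identify secondary classifications, revealing latent cross-topic connections in the corpus. Unlike conventional topic models, the LLM-based framework produces semantically interpretable topics while maintaining strong quantitative performance. Comparative evaluation against established topic modeling methods shows higher topic diversity, lower overlap, and competitive coherence metrics. Manual validation on a randomly sampled subset of abstracts yields 75.9% accuracy. Additional traditional NLP analyses confirm that the generated topics correspond to meaningful linguistic patterns in the corpus. A bipartite network linking primary and secondary classifications further reveals implicit thematic relationships that are not easily observed from abstracts or keyword systems alone. The results indicate that the framework independently recovers most of the journal's editorial dual-classification structure without prior knowledge of it. Overall, the proposed method offers a powerful tool for mapping science and identifying emerging cross-topic connections in research.

English abstract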

Scientific literature is increasingly fragmented by disciplinary boundaries, specialized terminology, and potentially sparse keyword systems, making it difficult to capture the evolving structure of modern science. This study introduces a large language model (LLM)-driven framework for mapping scientific literature from a topic modeling perspective. The approach is demonstrated on a 20-year corpus of more than 1,500 engineering-related articles published in the Proceedings of the National Academy of Sciences (PNAS). A two-stage classification pipeline first assigns a primary thematic category to each article based on its abstract, followed by full-text analysis to identify secondary classifications that reveal latent cross-topic connections within the corpus. Unlike conventional topic models, the LLM-based framework produces semantically interpretable topics while maintaining strong quantitative performance. Comparative evaluation against established topic modeling methods shows higher topic diversity and lower overlap with competitive coherence metrics. Manual validation on a randomly sampled subset of abstracts yields an accuracy of 75.9%. Additional traditional natural language processing analyses confirm that the generated topics correspond to meaningful linguistic patterns in the corpus. A bipartite network linking primary and secondary classifications further reveals implicit thematic relationships that are not readily observable through abstracts or keyword systems alone. The findings indicate that the framework independently recovers much of the journal's editorial dual-classification structure without prior knowledge of its schema. Overall, the proposed approach offers a powerful tool for mapping science and identifying emerging cross-topic connections in research.

2605.22509 2026-06-11 cs.HC cs.CL

Reflecti-Mate: A Conversational Agent for Adaptive Decision-Making Support Through System 1 and System 2 Thinking

Morita Tarvirdians, Senthil Chandrasegaran, Hayley Hung, Catholijn M. Jonker, Catharine Oertel

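AI summary: Studies a conversational agent that promotes integration in decision-making by adapting to individual thinking patterns; the agent yields more personalized reflective trajectories and more integrative reflective language than conventional decision-support systems.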

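Details
Journal ref
UMAP 2026: Proceedings of the 34th ACM Conference on User Modeling, Adaptation and Personalization
Comments
Accepted at UMAP 2026
AI Chinese abstract (translated)

High-stakes personal decisions involve cognitive, emotional, and intuitive processes, and individuals differ in how they allocate attention across these modes. Integrating these processes has been shown to benefit decision-making. However, most existing decision-support systems mainly support the cognitive aspects rather than adapting to an individual's thinking profile to promote integration of different types of thought. In this study, we examine an agent designed to promote integration by adapting to each user's thought patterns. We examine the agent's effects on participants' perceptions of the agent and on their reflective behavior, compared with unaided pre-reflection and a baseline agent. In a between-subjects study (N=128), our agent fostered broad and elaborated thinking, enabled participants to form more personalized reflective trajectories, elicited more integrative reflective language, and was perceived as providing stronger support for holistic reflection. In contrast, the baseline agent produced homogenized profiles dominated by cognitive language.

English abstract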

Making high-stakes personal decisions involves cognitive, emotional, and intuitive processes, and individuals differ in how they allocate attention across these modes. Integration of these processes has been shown to benefit decision making. Yet, most current decision-support systems focus primarily on supporting cognitive aspects, rather than adapting to the individual's thinking profile to support integration of different types of thoughts. In this study, we investigate an agent designed to encourage integration by adapting to the individual user's thought patterns. We explore its effects on participants' perceptions of the agent and their reflective behavior, in comparison with unaided pre-reflection and a baseline agent. In a between-subjects study (N = 128), our agent, which fostered broad and elaborated thinking, enabled more personalized reflective trajectories, elicited more integrative reflective language, and was perceived as providing stronger support for holistic reflection. In contrast, the baseline agent produced homogenized profiles dominated by cognitive language across participants.

2511.08113 2026-06-11 cs.CL

Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune

Details
English abstract

Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLMs) and study their ability to compose skills across modalities. To this end, we design three evaluation tasks that can be solved by sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate this gap, we explore two alternatives: i) using chain-of-thought prompting to explicitly instruct MLLMs to compose skills, and ii) applying a specific fine-tuning recipe to promote skill composition. Although these strategies improve model performance, the models still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.

2506.22141 2026-06-11 cs.CL cs.IR

DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval

Iliass Ayaou, Denis Cavallucci, Hicham Chibane

Details
English abstract

Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families aggregated at the family level to reduce international redundancy, with citation-based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document- and passage-level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times lower than IN-domain across all configurations. Passage-level retrieval consistently outperforms document-level retrieval, and dense methods provide modest gains over BM25, but none close the OUT-domain gap. Document-level RRF yields strong effectiveness-efficiency trade-offs with minimal overhead. By exposing the persistent challenge of cross-domain retrieval, DAPFAM provides a reproducible, compute-aware testbed for developing more robust patent IR systems. The dataset is publicly available on Hugging Face at https://huggingface.co/datasets/datalyes/DAPFAM_patent.
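
As an illustration of the hybrid fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The family ids and the two runs are made up, and `k=60` is the constant commonly used for RRF; the abstract does not state which value DAPFAM uses, so treat it as an assumption.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids (each ordered best-first) with RRF.

    Every list contributes 1 / (k + rank) to the score of each id it
    contains; ids missing from a list simply get no contribution from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical runs from a lexical and a dense backend.
bm25_run = ["F3", "F1", "F7", "F2"]
dense_run = ["F1", "F9", "F3", "F4"]
fused = reciprocal_rank_fusion([bm25_run, dense_run])
```

Because RRF only needs ranks, not raw scores, it can combine BM25 and dense results without score normalization, which is consistent with the low overhead the abstract describes.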

2510.09885 2026-06-11 cs.CL cs.AI

Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs

Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky

Details
English abstract

Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrasing augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. The masked fine-tuning for arLLMs substantially improves the efficacy of knowledge injection, i.e. no paraphrase needed and resistant to the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate broader applicability: on a large-scale knowledge-intensive dataset (1.2M samples), masked SFT achieves the best downstream accuracy on GPQA-diamond among all fine-tuning variants. The demasking objective also improves SFT on math tasks, suggesting broad utility beyond factual knowledge injection.
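
To make the masked fine-tuning idea concrete, the sketch below builds one training pair: a masked copy of a passage placed in the prompt, with the original passage as the target. The prompt template, the `[MASK]` token, the 30% mask ratio, and the example sentence are all assumptions for illustration; the abstract does not specify them.

```python
import random

def make_masked_example(text, mask_ratio=0.3, mask_token="[MASK]", seed=0):
    """Build a (prompt, target) pair for masked fine-tuning of an arLLM.

    A random subset of whitespace-separated words is replaced by mask_token.
    The model would be trained to generate the target given the prompt.
    """
    rng = random.Random(seed)
    words = text.split()
    n_mask = max(1, round(mask_ratio * len(words)))
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    masked = [mask_token if i in masked_idx else w for i, w in enumerate(words)]
    # Hypothetical template: masked text in context, original text as answer.
    prompt = ("Reconstruct the original text.\n"
              "Masked: " + " ".join(masked) + "\nOriginal:")
    return {"prompt": prompt, "target": " " + text}

# Made-up fact, used only to show the data format.
fact = "The Lumen observatory opened on the island of Varro in 2031"
example = make_masked_example(fact, mask_ratio=0.3, seed=42)
```

In a typical supervised fine-tuning setup the loss would be computed only on the target tokens, though whether the authors do so is not stated in the abstract.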

2601.09072 2026-06-11 cs.AI cs.CL stat.ME

Human-AI Co-design for Clinical Prediction Models

Jean Feng, Avni Kothari, Patrick Vossler, Andrew Bishara, Lucas Zier, Newton Addo, Aaron Kornblith, Yan Shuo Tan, Chandan Singh

Details
Journal ref
npj Digital Medicine 2026
English abstract

Developing safe, effective, and practically useful clinical prediction models (CPMs) traditionally requires iterative collaboration between clinical experts, data scientists, and informaticists. This process refines the often small but critical details of the model building process, such as which features/patients to include and how clinical categories should be defined. However, this traditional collaboration process is extremely time- and resource-intensive, resulting in only a small fraction of CPMs reaching clinical practice. This challenge intensifies when teams attempt to incorporate unstructured clinical notes, which can contain an enormous number of concepts. To address this challenge, we introduce HACHI, an iterative human-in-the-loop framework that uses AI agents to accelerate the development of fully interpretable CPMs by enabling the exploration of concepts in clinical notes. HACHI alternates between (i) an AI agent rapidly exploring and evaluating candidate concepts in clinical notes and (ii) clinical and domain experts providing feedback to improve the CPM learning process. HACHI defines concepts as simple yes-no questions that are used in linear models, allowing the clinical AI team to transparently review, refine, and validate the CPM learned in each round. In two real-world prediction tasks (acute kidney injury and traumatic brain injury), HACHI outperforms existing approaches, surfaces new clinically relevant concepts not included in commonly-used CPMs, and improves model generalizability across clinical sites and time periods. Furthermore, HACHI reveals the critical role of the clinical AI team, such as directing the AI agent to explore concepts that it had not previously considered, adjusting the granularity of concepts it considers, changing the objective function to better align with the clinical objectives, and identifying issues of data bias and leakage.

2510.06242 2026-06-11 cs.CL cs.AI

Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

Subin An, Yugyeong Ji, Junyoung Kim, Heejin Kook, Yang Lu, Josh Seltzer

Details
Journal ref
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Comments
EMNLP Industry Track
English abstract

Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address such characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability for real-world applications such as response quality prediction and response rejection, showing strong correlations with expert assessment.

2503.08379 2026-06-11 cs.IR cs.CL

JurisTCU: A Brazilian Portuguese Information Retrieval Dataset with Query Relevance Judgments

Leandro Carísio Fernandes, Leandro dos Santos Ribeiro, Marcos Vinícius Borela de Castro, Leonardo Augusto da Silva Pacheco, Edans Flávius de Oliveira Sandes

Details
Comments
23 pages
English abstract

This paper introduces JurisTCU, a Brazilian Portuguese dataset for legal information retrieval (LIR). The dataset is freely available and consists of 16,045 jurisprudential documents from the Brazilian Federal Court of Accounts, along with 150 queries annotated with relevance judgments. It addresses the scarcity of Portuguese-language LIR datasets with query relevance annotations. The queries are organized into three groups: real user keyword-based queries, synthetic keyword-based queries, and synthetic question-based queries. Relevance judgments were produced through a hybrid approach combining LLM-based scoring with expert domain validation. We used JurisTCU in 14 experiments using lexical search (document expansion methods) and semantic search (BERT-based and OpenAI embeddings). We show that the document expansion methods significantly improve the performance of standard BM25 search on this dataset, with improvements exceeding 45% in P@10, R@10, and nDCG@10 metrics when evaluating short keyword-based queries. Among the embedding models, the OpenAI models produced the best results, with improvements of approximately 70% in P@10, R@10, and nDCG@10 metrics for short keyword-based queries, suggesting that these dense embeddings capture semantic relationships in this domain, surpassing the reliance on lexical terms. Besides offering a dataset for the Portuguese-language IR research community, suitable for evaluating search systems, the results also contribute to enhancing a search system highly relevant to Brazilian citizens.
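
For readers unfamiliar with the reported metrics, the following sketch computes P@k, R@k, and nDCG@k for a single query. The document ids and graded judgments are invented, and the nDCG here uses linear gains with a log2 discount; the paper's exact gain function is not given in the abstract.

```python
import math

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def ndcg_at_k(ranked, gains, k=10):
    """nDCG with linear gains; gains maps doc id -> graded relevance."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: system ranking plus graded relevance judgments.
ranked = ["d1", "d2", "d3", "d4"]
gains = {"d1": 3, "d3": 2, "d9": 1}
relevant = {d for d, g in gains.items() if g > 0}

p3 = precision_at_k(ranked, relevant, k=3)
r3 = recall_at_k(ranked, relevant, k=3)
n3 = ndcg_at_k(ranked, gains, k=3)
```

Dataset-level figures such as those in the abstract would be averages of these per-query values over all 150 queries.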

2407.08035 2026-06-11 cs.CL cs.IR

FsPONER: Few-shot Prompt Optimization for Named Entity Recognition in Domain-specific Scenarios

Yongjian Tang, Rakebul Hasan, Thomas Runkler

Details
Comments
accepted in the main track at the 27th European Conference on Artificial Intelligence (ECAI-2024)
English abstract

Large Language Models (LLMs) have provided a new pathway for Named Entity Recognition (NER) tasks. Compared with fine-tuning, LLM-powered prompting methods avoid the need for training, conserve substantial computational resources, and rely on minimal annotated data. Previous studies have achieved comparable performance to fully supervised BERT-based fine-tuning approaches on general NER benchmarks. However, none of the previous approaches has investigated the efficiency of LLM-based few-shot learning in domain-specific scenarios. To address this gap, we introduce FsPONER, a novel approach for optimizing few-shot prompts, and evaluate its performance on domain-specific NER datasets, with a focus on industrial manufacturing and maintenance, while using multiple LLMs -- GPT-4-32K, GPT-3.5-Turbo, LLaMA 2-chat, and Vicuna. FsPONER consists of three few-shot selection methods based on random sampling, TF-IDF vectors, and a combination of both. We compare these methods with a general-purpose GPT-NER method as the number of few-shot examples increases and evaluate their optimal NER performance against fine-tuned BERT and LLaMA 2-chat. In the considered real-world scenarios with data scarcity, FsPONER with TF-IDF surpasses fine-tuned models by approximately 10% in F1 score.
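
The TF-IDF-based selection can be pictured with a small sketch that ranks a pool of annotated sentences by cosine similarity to the input and keeps the closest ones as few-shot examples. This is a generic illustration, not FsPONER's code: the tokenization, the smoothed IDF formula, and the maintenance-style sentences are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (dict) per tokenized text."""
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    # Smoothed IDF so terms present in every text still get a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in token_lists]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_few_shot(query, pool, n_shots=2):
    """Pick the n_shots pool sentences most similar to the query."""
    docs = [s.lower().split() for s in pool + [query]]
    vecs = tfidf_vectors(docs)
    q = vecs[-1]
    order = sorted(range(len(pool)), key=lambda i: -cosine(q, vecs[i]))
    return [pool[i] for i in order[:n_shots]]

# Hypothetical pool of annotated sentences and an incoming sentence.
pool = [
    "replace the hydraulic pump seal",
    "schedule quarterly budget review",
    "inspect the pump bearing for wear",
    "order new office chairs",
]
query = "the pump seal is leaking"
shots = select_few_shot(query, pool, n_shots=2)
```

The selected sentences, together with their entity annotations, would then be placed in the prompt ahead of the query.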

2406.07909 2026-06-11 eess.AS cs.CL cs.SD stat.ML

Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation

Eungbeom Kim, Hantae Kim, Kyogu Lee

Details
Comments
Accepted by Interspeech 2024
English abstract

The Transformer encoder with a connectionist temporal classification (CTC) framework is widely used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR suffers from disagreement between teacher and student models in frame-level alignment, which ultimately hinders it from improving the student model's performance. To resolve this problem, this paper introduces a self-knowledge distillation (SKD) method that guides the frame-level alignment during training. In contrast to the conventional method using separate teacher and student models, this study introduces a simple and effective method that shares encoder layers and uses the sub-model as the student model. Overall, our approach improves both resource efficiency and performance. We also conduct an experimental analysis of the spike timings to illustrate that the proposed method improves performance by reducing the alignment disagreement.

2305.13108 2026-06-11 eess.AS cs.CL cs.LG cs.SD

Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test

Eungbeom Kim, Yunkee Chae, Jaeheon Sim, Kyogu Lee

Details
Comments
Accepted by Interspeech 2023
English abstract

Automatic speech recognition systems based on deep learning are mainly trained under empirical risk minimization (ERM). Since ERM utilizes the averaged performance on the data samples regardless of a group such as healthy or dysarthric speakers, ASR systems are unaware of the performance disparities across the groups. This results in biased ASR systems whose performance differences among groups are severe. In this study, we aim to improve the ASR system in terms of group robustness for dysarthric speakers. To achieve our goal, we present a novel approach, sample reweighting with sample affinity test (Re-SAT). Re-SAT systematically measures the debiasing helpfulness of the given data sample and then mitigates the bias by debiasing helpfulness-based sample reweighting. Experimental results demonstrate that Re-SAT contributes to improved ASR performance on dysarthric speech without performance degradation on healthy speech.