arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.20179 2026-05-20 cs.CL Updated version

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang

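Affiliations: University of Central Florida; Rice University

AI summary: This paper proposes TIDE, a new inference system built on the temporal stability of expert activations within a block during the diffusion process. It introduces an interval-based expert refresh strategy that optimizes I/O and CPU computation, achieving lossless acceleration and higher inference efficiency.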

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations within a block during the diffusion process. Specifically, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
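
The trade-off behind an interval-based refresh can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the TIDE scheduler: the function names `simulate_refresh` and `best_interval`, the cost weights, and the brute-force search over candidate intervals are all hypothetical stand-ins (the abstract describes a mathematical programming formulation whose details are not given here). It assumes that experts missing from the GPU at a step are evaluated on the CPU, so the model output is unchanged.

```python
from typing import Dict, Iterable, List, Set


def simulate_refresh(step_experts: List[Set[int]], interval: int, capacity: int) -> Dict[str, int]:
    """Count expert loads (I/O) and CPU fallbacks for one block of diffusion steps.

    step_experts[t] is the set of experts activated at diffusion step t.
    Every `interval` steps the GPU-resident set is refreshed to the experts
    active at that step (at most `capacity` of them).
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    resident: Set[int] = set()
    loads = 0
    cpu_fallbacks = 0
    for step, active in enumerate(step_experts):
        if step % interval == 0:
            wanted = set(sorted(active)[:capacity])
            loads += len(wanted - resident)  # only newly needed experts are transferred
            resident = wanted
        # Active experts that are not resident are assumed to run on the CPU.
        cpu_fallbacks += len(active - resident)
    return {"loads": loads, "cpu_fallbacks": cpu_fallbacks}


def best_interval(step_experts: List[Set[int]], capacity: int,
                  io_cost: float, cpu_cost: float, candidates: Iterable[int]) -> int:
    """Pick the candidate interval with the lowest weighted cost (hypothetical cost model)."""
    def cost(interval: int) -> float:
        stats = simulate_refresh(step_experts, interval, capacity)
        return io_cost * stats["loads"] + cpu_cost * stats["cpu_fallbacks"]
    return min(candidates, key=cost)


if __name__ == "__main__":
    steps = [{0, 1}, {0, 2}, {0, 1}, {0, 2}]
    for k in (1, 2, 4):
        print(k, simulate_refresh(steps, k, capacity=2))
    print("best:", best_interval(steps, 2, io_cost=1.0, cpu_cost=1.0, candidates=[1, 2, 4]))
```

Refreshing at every step minimizes CPU work but maximizes transfers; a longer interval does the opposite. When activations are stable across steps, a longer interval costs little extra CPU work, which is the intuition the abstract appeals to.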

2605.20177 2026-05-20 cs.CL cs.CV Updated version

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

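Affiliations: Amazon; University of Waterloo; Vector Institute, Canada CIFAR AI Chair

AI summary: By decoupling perception from reasoning, this study finds that performance on visual tasks is limited by insufficient perception rather than by reasoning itself. Staged training improves both the perception and the reasoning abilities of the model, yielding better results on several visual math and perception tasks.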

Comments 19 pages, 9 figures; Accepted to ICML 2026; Project Page: https://ucsc-vlaa.github.io/VLM-CapCurriculum/

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) compared with the base counterpart.

2605.20176 2026-05-20 cs.CL Updated version

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Juncheng Wu, Letian Zhang, Yuhan Wang, Haoqin Tu, Hardy Chen, Zijun Wang, Cihang Xie, Yuyin Zhou

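Affiliations: UC Santa Cruz

AI summary: This paper proposes ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that actively seeks, iteratively plans, and synthesizes multimodal evidence from heterogeneous sources to improve clinical decision support.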

Comments 24 pages, 9 figures; Project Page: https://ucsc-vlaa.github.io/ClinSeekAgent/

Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.

2605.20170 2026-05-20 cs.CL Updated version

KoRe: Compact Knowledge Representations for Large Language Models

Davide Cavicchini, Fausto Giunchiglia, Jacopo Staiano

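Affiliations: University of Trento

AI summary: This paper proposes KoRe, a method that encodes 1-hop sub-graphs of a knowledge graph into compact discrete knowledge tokens and injects them into an LLM, improving knowledge-grounded reasoning while reducing token usage.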

Abstract

Modern Large Language Models (LLMs) have shown impressive performance in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into an LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performance coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.
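
For readers unfamiliar with the term, a 1-hop sub-graph of an entity is the set of triples in which that entity appears as subject or object. The sketch below shows only that extraction step on a toy triple list; the function name `one_hop_subgraph` and the example triples are illustrative assumptions, and the encoding of the sub-graph into discrete knowledge tokens (the actual contribution described in the abstract) is not reproduced here.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def one_hop_subgraph(triples: Iterable[Triple], entity: str) -> List[Triple]:
    """Return every triple that touches `entity` directly (as subject or object)."""
    return [t for t in triples if t[0] == entity or t[2] == entity]


if __name__ == "__main__":
    kg = [
        ("Trento", "located_in", "Italy"),
        ("Italy", "part_of", "Europe"),
        ("Rome", "capital_of", "Italy"),
    ]
    print(one_hop_subgraph(kg, "Trento"))
```

Verbalizing such a neighbourhood as plain text costs tokens for every triple, which is the overhead a compact representation would aim to reduce.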

2605.20158 2026-05-20 cs.CV cs.AI cs.CL Updated version

Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

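Affiliations: University of Virginia; National Institutes of Health

AI summary: To address the reliability of visual attribution for chest X-ray reasoning in large vision language models, this paper proposes a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Across 11 attribution methods, six open-source LVLMs, and two output modes, existing attribution methods often fail to identify the evidence the LVLMs use. The paper therefore proposes MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus substantially outperforms existing methods and moves toward more trustworthy attribution for medical LVLMs.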

Abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.

2605.20149 2026-05-20 cs.CL cs.AI cs.HC Updated version

Less Back-and-Forth: A Comparative Study of Structured Prompting

Saurav Ghosh, Gabriella Polach, Abdou Sow

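Affiliations: Washington University in St. Louis

AI summary: This paper studies whether structured prompt design improves LLM response quality while reducing user effort. Comparing three prompt conditions, it finds that checklist prompts score highest on task completion, correctness, compliance, and clarity, and offer the best quality-effort tradeoff.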

Comments 7 pages, 2 figures, 6 tables

Abstract

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types--summarization, planning, explanation, and coding--using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

2605.20128 2026-05-20 cs.CL Updated version

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

Yuanqing Cai, Ziyi Huang, Minhao Liu, Lixin Duan, Wen Li, Yanru Zhang

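Affiliations: DeepMind; DeepSeek-AI; Anthropic; Qwen; Touvron Team

AI summary: This paper introduces the MixRea benchmark for evaluating large language models on explicit-implicit reasoning, finds that even the best model often fails to attend to subtle contextual cues, and proposes the PRCP prompting method to improve reasoning.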

Comments 12 pages, 6 figures, 4 tables

Abstract

Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: failing to attend to subtle yet important contextual cues under explicit task instructions. To evaluate this, we introduce the task of explicit-implicit reasoning and present MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8% consistency, revealing widespread inattentional blindness. To mitigate this, we propose Potential Relation Completion Prompting (PRCP), a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.

2605.20084 2026-05-20 cs.CL cs.AI Updated version

BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian, Haoning Wang, Baojie Chen, Diyin Tang, Jinsong Yu, Zhiyuan Wang

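Affiliations: Beihang University; Shenzhen Institute of Advanced Technology; Zhejiang University of Finance & Economics; University of Electronic Science and Technology of China

AI summary: This paper proposes BalanceRAG, a joint risk calibration method for cascaded retrieval-augmented generation. It identifies safe operating points on a two-dimensional lattice to calibrate thresholds in a risk-adaptive way, controlling the system-level error rate while retaining more examples, and extends to multi-risk calibration.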

Abstract

Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.
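
The cascade described in the abstract (answer with the LLM alone, escalate to RAG, or abstain) and the quantities a threshold pair trades off can be sketched as follows. This is a minimal illustration with hypothetical names (`route`, `evaluate`) and made-up calibration records; it computes empirical coverage, selective risk, and escalation rate for one threshold pair, and does not implement the sequential graphical testing that BalanceRAG uses to certify such pairs.

```python
from typing import List, Tuple

# (u_llm, u_rag, llm_correct, rag_correct): uncertainty scores and correctness per query
Record = Tuple[float, float, bool, bool]


def route(u_llm: float, u_rag: float, t_llm: float, t_rag: float) -> str:
    """Cascade decision for one query; lower uncertainty means more trustworthy."""
    if u_llm <= t_llm:
        return "llm"
    if u_rag <= t_rag:
        return "rag"
    return "abstain"


def evaluate(records: List[Record], t_llm: float, t_rag: float) -> Tuple[float, float, float]:
    """Return (coverage, risk among accepted answers, fraction of queries escalated to RAG)."""
    accepted = errors = escalated = 0
    for u_llm, u_rag, llm_ok, rag_ok in records:
        decision = route(u_llm, u_rag, t_llm, t_rag)
        if decision != "llm":
            escalated += 1  # retrieval was invoked, even if the query is later abstained on
        if decision == "abstain":
            continue
        accepted += 1
        correct = llm_ok if decision == "llm" else rag_ok
        errors += 0 if correct else 1
    n = len(records)
    coverage = accepted / n if n else 0.0
    risk = errors / accepted if accepted else 0.0
    retrieval_rate = escalated / n if n else 0.0
    return coverage, risk, retrieval_rate


if __name__ == "__main__":
    data = [(0.1, 0.9, True, True), (0.8, 0.2, False, True),
            (0.7, 0.3, False, False), (0.9, 0.95, False, False)]
    print(evaluate(data, t_llm=0.5, t_rag=0.5))
```

Evaluating every pair on a grid of `t_llm` and `t_rag` values gives the two-dimensional lattice of operating points mentioned in the abstract; deciding which of those points are statistically safe is the part this sketch leaves out.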

2605.20075 2026-05-20 cs.CL cs.AI Updated version

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee

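Affiliations: Georgia Tech; UC Berkeley; Stanford University; Microsoft

AI summary: This paper proposes CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering: it first generates a draft answer and then performs on-policy thinking conditioned on that answer for reflection and correction. CopT uses continuous embeddings as inference-time contrastive verifiers, contrasting the model's support for the same generated tokens under discrete-token and continuous-embedding inputs to obtain a sequence-level reverse KL estimator of answer reliability. On mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy.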

Comments Code: https://github.com/sdc17/CopT, Website: https://copt-web.github.io/

Abstract

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
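
The idea of scoring the same generated tokens under two input conditions can be illustrated with a standard Monte Carlo KL estimate: for tokens sampled under one distribution, average the difference of the two log-probabilities. The sketch below is an assumption-laden illustration, not CopT's estimator: the function names, the choice of the discrete-input path as the sampling distribution, and the threshold rule are all hypothetical.

```python
from typing import Sequence


def sequence_kl_estimate(logp_discrete: Sequence[float], logp_continuous: Sequence[float]) -> float:
    """Average per-token log-probability gap over the generated answer tokens.

    logp_discrete[t]:   log-probability of answer token t under discrete-token inputs
    logp_continuous[t]: log-probability of the same token under continuous-embedding inputs
    Assumes the tokens were sampled under the discrete-input condition.
    """
    if len(logp_discrete) != len(logp_continuous) or not logp_discrete:
        raise ValueError("need two non-empty sequences of equal length")
    gaps = [a - b for a, b in zip(logp_discrete, logp_continuous)]
    return sum(gaps) / len(gaps)


def needs_more_thinking(estimate: float, threshold: float) -> bool:
    """Hypothetical gate: a large disagreement marks the draft answer as unreliable."""
    return estimate > threshold


if __name__ == "__main__":
    est = sequence_kl_estimate([-0.1, -0.2], [-0.5, -0.4])
    print(est, needs_more_thinking(est, threshold=0.25))
```

A small estimate means both input conditions support the draft answer about equally; a large one signals disagreement, which in the pipeline described above would trigger further on-policy thinking.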

2605.20066 2026-05-20 cs.CL Updated version

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck

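AI summary: This paper studies reinforcement-learning-based zero-shot Text-to-SPARQL generation in the scholarly domain, training a small instruction-tuned language model with GRPO on DBLP-QuAD and comparing it with a supervised DoRA-finetuned baseline.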

Comments Accepted by NeSy 2026

Abstract

Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
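
The group-relative part of GRPO can be shown in a few lines: several queries are sampled for the same question, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The sketch below assumes that common formulation; the toy `outcome_reward` function and its reward values are hypothetical and much simpler than the execution, structure, and answer-level rewards the abstract mentions.

```python
import statistics
from typing import List, Set


def outcome_reward(executed: bool, predicted: Set[str], gold: Set[str]) -> float:
    """Toy outcome reward: 0 if the query fails to run, 1 for the exact answer set, else 0.1."""
    if not executed:
        return 0.0
    return 1.0 if predicted == gold else 0.1


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 1.0]  # rewards of four sampled SPARQL queries for one question
    print(group_relative_advantages(group))
```

Because advantages are computed within a group, no learned value model is needed, and a group whose samples all receive the same reward contributes no gradient signal.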

2605.20061 2026-05-20 cs.CL Updated version

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou

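Affiliations: College of Computer, National University of Defense Technology; Intelligent Game and Decision Lab (IGDL); Institute of Artificial Intelligence, Xiamen University

AI summary: This paper proposes ReBel, an algorithm that models structured belief states to guide policy learning, addressing the credit assignment problem caused by partial observability in long-horizon tasks. Experiments show higher task success and better sample efficiency on benchmarks such as ALFWorld and WebShop.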

Comments 10 pages, 4 figures, 3 tables, plus appendix

Abstract

Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to 20.4 percentage points over the episode-level baseline GRPO and increases sample efficiency by 2.1×. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.

2605.20050 2026-05-20 cs.CL Updated version

Language Mutations Sustain the Persistences of Conspiracy Theories on Social Media

Calvin Yixiang Cheng, Dorian Quelle, Scott A. Hale

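Affiliations: Oxford Internet Institute, University of Oxford; Department of Mathematics, University of Zurich

AI summary: This study examines how language mutations affect the persistent diffusion of conspiracy theories on social media. Analyzing three years of conspiracy-related posts from X, it finds that conspiracy claims with greater semantic mutation have longer lifespans, and that mutations in psycholinguistic properties are associated with extended lifespans.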

Abstract

This study investigates how language mutations affect the persistent diffusion of conspiracy theories on social media. Drawing on a three-year dataset of conspiracy-related posts from X, and applying computational linguistic analysis alongside survival modelling, we find that conspiracy claims with greater semantic mutations have substantially longer lifespans. Mutations in psycholinguistic properties, including pronouns, social reference words, cognitive process terms, and risk- and health-related vocabulary, are associated with extended lifespans. Mutations in actor, action and target (AAT) categories are associated with longer lifespans as well. Qualitative analysis identifies two predominant mutation patterns: simplification and assimilation, at both linguistic and AAT structural levels. Taken together, the results advance our understanding of how language mutations contribute to conspiracy persistence online and shed light on longitudinal content moderation strategies. We argue that content moderation should consider the mutability of conspiracy claims and focus on the core claims, so that it can address their potential variations.

2605.20022 2026-05-20 cs.CL Updated version

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Yaojie Zhang, Jianuo Huang, Junlong Ke, Yuhang Han, Yongji Long, Tianchen Zhao, Biqing Qi, Linfeng Zhang

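Affiliations: EPIC Lab, SJTU; UESTC; School of Software Engineering, HUST; Tsinghua University; HKUST(GZ); Shanghai AI Laboratory

AI summary: This paper proposes FlexDraft, a framework that uses attention tuning and bonus-guided calibration to adapt flexibly to different batch sizes, addressing the throughput drop of conventional parallel speculative decoding at large batch sizes.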

Abstract

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
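
The verification step that makes speculative decoding lossless can be sketched for the greedy-decoding case: the target model scores all drafted positions in one pass, the longest prefix of draft tokens that matches the target's own choices is accepted, and the target's token at the next position is appended. This is the generic textbook scheme, not FlexDraft's implementation; the function name `verify_draft` and the list-of-token-ids interface are assumptions for illustration.

```python
from typing import List


def verify_draft(draft_tokens: List[int], target_argmax: List[int]) -> List[int]:
    """Greedy verification of a drafted continuation.

    target_argmax[i] is the target model's greedy token at position i, given the
    prompt plus draft_tokens[:i]; it therefore has len(draft_tokens) + 1 entries.
    Returns the tokens actually emitted in this decoding step.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have one more entry than draft_tokens")
    accepted: List[int] = []
    for i, token in enumerate(draft_tokens):
        if token != target_argmax[i]:
            # First mismatch: emit the target's own token and discard the rest of the draft.
            return accepted + [target_argmax[i]]
        accepted.append(token)
    # Every draft token matched: the target's next token comes for free.
    return accepted + [target_argmax[len(draft_tokens)]]


if __name__ == "__main__":
    print(verify_draft([5, 7, 9], [5, 7, 2, 4]))
    print(verify_draft([5, 7, 9], [5, 7, 9, 4]))
```

Each step emits between one and `len(draft_tokens) + 1` tokens, and the output is identical to what greedy decoding with the target model alone would produce. The number of accepted tokens and the identity of the extra token are unknown until verification finishes, which is the uncertainty the abstract says parallel drafting has to cope with.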

2605.19952 2026-05-20 cs.CL Updated version

Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

Jingwei Sun, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han

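Affiliations: TMLR Group, Hong Kong Baptist University; The University of Texas at Austin; Shanghai Jiao Tong University; Sydney AI Center, The University of Sydney

AI summary: This paper proposes TriMem, a memory system that maintains three coexisting representation granularities. By keeping raw dialogue segments, extracting atomic facts, and synthesizing profiles, it supports faithful storage, efficient retrieval, and deep reasoning over accumulated dialogue history, overcoming the detail loss and weak reasoning of existing methods.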

English abstract

To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted-fact-based paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning. Nevertheless, such fact-centric designs inevitably discard fine-grained details in original dialogues and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles. To address these limitations, we propose TriMem, which maintains three coexisting representation granularities: raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient memory retrieval, and synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization, which iteratively refines extraction and profiling prompts via response quality feedback, achieving lifelong evolution without any parameter updates. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines. The code is available at https://TMLR-TriMem.github.io .

2605.19945 2026-05-20 cs.DC cs.AI cs.CL version update

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

GEM: 为MoE系统设计的GPU变异性感知专家到GPU映射

Sourish Wawdhane, Avinash Kumar, Poulami Das

Affiliations * The University of Texas at Austin

AI summary GEM accounts for GPU variability when mapping experts to GPUs in MoE models, reducing latency and improving system performance.

Comments 18 pages

Details
AI Chinese abstract

混合专家(MoE)模型通过使用较小的专家并按每个token激活子集来实现高效的推理。MoE服务引擎将专家分布在多个GPU上,并在推理时根据激活的专家将token路由到适当的GPU。它们以锁步方式处理token,即每个batch中的token必须完成处理后才能进入下一层。这种同步障碍成为关键瓶颈,因为MoE模型的性能受限于最后一个完成的GPU(straggler)。stragglers出现在太多重用的专家被放置在同一GPU或最慢的GPU时。尽管先前的工作将专家平衡token负载分布在GPU上,但都忽略了GPU的变异性,并经常将高使用量的专家放置在最慢的GPU上。我们提出了GEM,即GPU变异性感知的专家映射框架,用于MoE模型的GPU变异性感知专家到GPU映射。GEM利用两个洞察:首先,必须将专家放置在每个GPU上根据其变异性接收非均匀的token负载,并且它们都在大约同一时间完成处理一层。我们的研究显示,存在两种类型的专家:一致的专家,通常被使用,以及时间性的专家,通常在剩余时间内一起使用。我们的第二个洞察是必须将同时使用的一致和时间性专家放置在不同的GPU上,并避免将它们放置在较慢的GPU上以减少延迟。GEM收集每个模型和任务的GPU变异性资料,并利用每个任务的token负载分布来映射专家到GPU。我们的实验表明,GEM在平均上将端到端延迟提高了7.9%,与基线相比最高提高了16.5%。

English abstract

Mixture-of-Experts (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and route tokens to the appropriate GPUs at inference time based on the experts activated. They process tokens in lock-step fashion: all tokens within a batch must finish processing a layer before any proceeds to the next. This synchronization barrier acts as a critical bottleneck because the performance of MoE models is limited by the straggler GPU that finishes last. Stragglers emerge when too many heavily used experts are placed on the same GPU or on the slowest GPU. While prior works place experts so as to balance token loads across GPUs, they all overlook GPU variability and often place highly used experts on the slowest GPUs. We propose GEM, GPU-variability-aware Expert Mapping, a framework for GPU-variability-aware expert-to-GPU mapping for MoE models. GEM exploits two insights. First, we must place experts such that each GPU receives a non-uniform token load based on its variability and all GPUs finish processing a layer at about the same time. Our studies show that there are two types of experts: consistent experts, which are used most of the time, and temporal experts, which are often used together for the remaining time. Our second insight is that we must place simultaneously used consistent and temporal experts on different GPUs and avoid placing them on slower GPUs to reduce slowdown. GEM gathers the variability profile of GPUs for each model and task and uses the token load distributions per task to map experts to GPUs. Our experiments show that GEM improves end-to-end latency by 7.9% on average and by up to 16.5% compared to the baseline.
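As a rough illustration of why speed-aware placement can help, the sketch below greedily assigns experts to GPUs so that estimated per-layer finish times stay balanced. This is not GEM's algorithm, which the abstract does not spell out; the function name `place_experts`, the per-expert token loads, and the relative GPU speeds are hypothetical inputs chosen for illustration.

```python
def place_experts(expert_loads, gpu_speeds):
    """Greedy speed-aware placement (illustrative heuristic, not GEM itself).

    expert_loads: dict expert_id -> expected tokens per batch (assumed known).
    gpu_speeds:   list of relative throughputs, in tokens per unit time.
    Returns (placement, finish_times).
    """
    placement = {g: [] for g in range(len(gpu_speeds))}
    assigned = [0.0] * len(gpu_speeds)  # tokens assigned to each GPU so far
    # Place the heaviest experts first, each on the GPU that would
    # finish earliest after receiving it.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        best = min(range(len(gpu_speeds)),
                   key=lambda g: (assigned[g] + load) / gpu_speeds[g])
        placement[best].append(expert)
        assigned[best] += load
    finish = [assigned[g] / gpu_speeds[g] for g in range(len(gpu_speeds))]
    return placement, finish


loads = {"e0": 400, "e1": 300, "e2": 200, "e3": 100}  # hypothetical loads
speeds = [1.0, 0.8]  # GPU 1 is assumed to be 20% slower
placement, finish = place_experts(loads, speeds)
```

With these made-up numbers the slower GPU receives 400 tokens and the faster one 600, so the layer completes in about 600 time units. An even 500/500 token split would leave the slow GPU as the straggler at about 625 units.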

2605.19944 2026-05-20 cs.LG cs.AI cs.CC cs.CL version update

A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

关于推理的测度论分析:结构泛化与近似限制

Yuyang Zhang, Yifu Zhang, Xuehai Zhou, Xiaoyin Chen

Affiliations * McGill University; Mila - Quebec AI Institute; Université de Montréal

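AI summary This paper analyzes reasoning through optimal transport theory, exposing the theoretical mechanisms behind structural generalization and approximation limits, and finds that position-dependent attention and Transformer circuit depth significantly affect reasoning performance.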

Comments Preprint

Details
AI Chinese abstract

尽管大型语言模型推理的经验缩放定律已得到充分文档,但支配分布外泛化的理论机制仍不明确。我们通过最优传输形式化推理,将离散轨迹投影到连续度量空间,利用Wasserstein-1距离量化领域偏移。借助Kantorovich对偶性,我们通过架构Lipschitz连续性和函数近似限制来界定分布外泛化。这揭示了两个主要约束。首先,位置依赖注意力(例如绝对位置编码)无法保持偏移不变性,导致Ω(1)的Lipschitz常数和预期风险,而偏移不变机制(例如旋转嵌入)保持等价性并限制误差。其次,通过将顺序回溯映射到Dyck-k语言,我们为TC⁰变换器建立了严格的电路深度下界。物理层深度的扩展是必要的,以避免表示崩溃——这一约束无法通过扩展表示宽度来绕过,因为Barron空间中存在不可约的近似界限。在54种Transformer配置上对组合搜索的评估证实了这些界限,证明泛化风险随Wasserstein领域偏移单调下降。

English abstract

While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse -- a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.
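For intuition about the Wasserstein-1 distance: Kantorovich-Rubinstein duality gives $|\mathbb{E}_P f - \mathbb{E}_Q f| \le L \, W_1(P, Q)$ for any $L$-Lipschitz function $f$, which is the general shape of bound the abstract describes. The sketch below computes $W_1$ only in the simplest case, two equal-sized one-dimensional samples, where it reduces to the mean absolute difference between sorted values. It is not the paper's construction over reasoning trajectories, and the function name `wasserstein1_1d` is a hypothetical helper.

```python
def wasserstein1_1d(xs, ys):
    """W1 between two equal-sized 1-D empirical distributions."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension the optimal coupling matches sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


train = [0.0, 1.0, 2.0, 3.0]
shifted = [x + 0.5 for x in train]  # every sample moved by 0.5
shift = wasserstein1_1d(train, shifted)
```

Here every sample moves by the same amount, so `shift` equals that amount (0.5), and identical samples give a distance of zero.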

2605.19936 2026-05-20 cs.CL version update

What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Practices and Reading Experience

大型语言模型如何影响科学交流?测量写作实践和阅读体验的变化

Filip Miletić, Neele Falk

Affiliations * Institute for Natural Language Processing, University of Stuttgart, Germany

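AI summary This study examines how large language models affect the style of scientific communication. Analyzing over 37,000 NLP papers and 3,000 human-written passages with their LLM-improved versions, it finds significant changes in writing practices and reading experience, and highlights the subjective effect of AI-assisted writing on reading experience.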

Comments Accepted to LREC 2026

Details
AI Chinese abstract

科学交流的风格是否因大型语言模型在写作过程中的广泛应用而发生变化?我们通过自然语言处理领域中的两个数据资源来探讨这个问题:一个包含超过37000篇ACL会议论文(2020-2024)的自然语料库,以及一个包含3000篇人类撰写的段落及其LLM生成改进版本的合成数据集。我们首先实施了一系列历时性词频分析,显示词频和使用语境在时间上发生了显著变化,表明在某些情况下存在语义专业化,而在其他情况下则存在泛化。拓宽我们的视角,我们进一步建模了更复杂的风格特征,发现LLM修改的文本更频繁地包含某些句法结构、更复杂和更长的词汇以及较低的词汇多样性。最后,我们通过与20名领域专家的试点标注研究,将这些写作实践的变化与主观阅读体验联系起来。他们总体上认为LLM改进的文本更易于理解且令人兴奋,但也表达了对LLM的负面定性态度,突显了AI辅助写作对阅读体验的强烈主观影响。

English abstract

Has the style of scientific communication changed due to the growing use of large language models in the writing process? We address this question in the domain of Natural Language Processing by leveraging two data resources we create: a naturalistic corpus of over 37,000 papers from the ACL Anthology (2020-2024) and a synthetic dataset of 3,000 human-written passages and their LLM-generated improvements. We first implement a series of diachronic lexical analyses, showing that both word frequency and usage contexts have changed significantly over time, indicating semantic specialization in some cases and generalization in others. Broadening our perspective, we then model a range of more complex stylistic features and find that LLM-modified texts more frequently contain certain syntactic constructions, use more complex and longer words, and show lower lexical diversity. Finally, we connect these changes in writing practices to subjective reading experience through a pilot annotation study with 20 domain experts. Overall, they rate LLM-improved texts as more understandable and exciting, but also express negative qualitative attitudes towards LLMs, highlighting the strongly subjective effect of AI-assisted writing on reading experience.

2605.19932 2026-05-20 cs.AI cs.CL cs.LG version update

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

PEEK:上下文地图作为长上下文LLM代理的导向缓存

Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

Affiliations * MIT CSAIL; Stanford University

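AI summary This paper proposes PEEK, a system that caches and maintains orientation knowledge in a context map, improving the accuracy and efficiency of long-context LLM agents working over recurring external contexts, with significant gains over baselines on both reasoning and context-learning tasks.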

Details
AI Chinese abstract

大型语言模型(LLM)代理越来越多地在长且重复的外部上下文中操作,如文档语料库和代码仓库。在多次调用中,现有方法保留的是代理的轨迹、对原始材料的被动访问或任务级别的策略。但它们没有保留我们认为对于重复相同上下文工作负载最需要的:关于重复上下文本身的可重用导向知识(例如,上下文包含什么、如何组织,以及哪些实体、常量和模式历史上有用)。我们引入PEEK,一种系统,通过上下文地图缓存和维护这种导向知识:一个在代理提示中始终存在的小而固定大小的artifact,使代理能够持续查看外部上下文。该地图由一个可编程的缓存策略维护,包含三个模块:一个Distiller从推理时间信号中提取可转移的知识,一个Cartographer将其转换为结构化的编辑,以及一个基于优先级的Evictor强制执行固定的token预算。在长上下文推理和信息聚合中,PEEK在强基线方法上提高了6.3-34.0%,同时使用93-145次更少的迭代,并且成本比最先进的提示学习框架ACE低1.7-5.8倍。在上下文学习中,PEEK在解决率和评分准确性上分别提高了6.0-14.0%和7.8-12.1%,且成本比ACE低1.4倍。这些收益在不同语言模型和代理架构上均能泛化,包括生产级的OpenAI Codex。这些结果表明,上下文地图有助于长上下文LLM代理更准确、更高效地与重复的外部上下文交互。

English abstract

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.
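To make the idea of a priority-based evictor with a fixed token budget concrete, here is a minimal sketch. It is not PEEK's implementation: the `MapEntry` fields, the priority scores, the whitespace token count, and the function `evict_to_budget` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    text: str
    priority: float  # higher means more useful (hypothetical score)

    @property
    def tokens(self) -> int:
        # Whitespace splitting stands in for a real tokenizer.
        return len(self.text.split())


def evict_to_budget(entries, budget):
    """Drop the lowest-priority entries until the total token count fits."""
    kept = sorted(entries, key=lambda e: e.priority, reverse=True)
    total = sum(e.tokens for e in kept)
    while kept and total > budget:
        total -= kept.pop().tokens  # pop() removes the lowest-priority entry
    return kept


context_map = [
    MapEntry("repo uses src/ layout with tests under tests/", 0.9),
    MapEntry("config constants live in settings.py", 0.7),
    MapEntry("README mentions a deprecated CLI flag", 0.2),
]
kept = evict_to_budget(context_map, budget=14)
```

The three entries total 19 whitespace tokens, so with a budget of 14 only the lowest-priority entry is evicted. A real policy would also need a rule for entries that are individually larger than the budget.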

2605.19837 2026-05-20 cs.CV cs.AI cs.CL cs.RO version update

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

CADENet:条件自适应异步双流增强网络用于自动驾驶中的恶劣天气感知

Sherif Khairy, Catherine M. Elias

Affiliations * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt

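AI summary This paper proposes CADENet, a training-free three-thread system that detects objects in adverse weather for autonomous driving through condition-adaptive enhancement and entropy-guided NMS fusion, with no retraining or additional hardware.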

Details
AI Chinese abstract

恶劣天气(雨、雾、沙尘和雪)会降级自动驾驶车辆基于摄像头的目标检测。现有先增强后检测的方法会阻碍安全关键的感知循环,违反严格的实时要求。该问题的进展也受到一个未被认识到的评估上限的限制:在降质图像上标注的地面真实数据不能为一个能够恢复注释者自身无法看到的目标的检测器提供信用,因此真正的有用的增强可以注册为接近平坦的F1增益。本文提出了CADENet(条件自适应异步双流增强网络),一种无需训练的三线系统:线S(YOLOv11n)以全帧率提供检测,无额外延迟;线Q应用条件自适应增强(CAPE)并通过熵引导NMS(EG-NMS)融合结果,不阻塞线S;线E提供CLIP零样本天气分类,因此新的天气类别只需新的文本提示,无需标注数据和重新训练。在1327张DAWN图像(YOLOv11m,IoU=0.5,置信度=0.25)上评估,CADENet在雪中实现Recall=0.0103(微),F1=0.0230,在雨中实现F1=0.0038。我们正式化了DAWN类数据上的注释完整性偏差,因此报告的F1值是真实增益的下限;Recall是注释-间隙-免疫的头条指标。线S在增强负载下保持约44 FPS。无需模型重新训练或额外传感器硬件。

English abstract

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.

2605.19833 2026-05-20 cs.SD cs.AI cs.CL cs.MM eess.AS version update

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Mega-ASR: 通过扩大现实世界声学模拟实现野外语音识别

Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao

Affiliations * NTU; NUS; Shanghai AI Lab

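AI summary This paper proposes Mega-ASR, a unified in-the-wild speech recognition framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. Trained on 7 classic acoustic phenomena and 54 physically plausible compound scenarios, it significantly improves speech recognition in adverse environments.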

Comments Project page: https://xzf-thu.github.io/Mega-ASR/. Code, models, and dataset will be released. A robust ASR framework targeting in-the-wild and compositional acoustic scenarios where conventional ASR systems fail

Details
AI Chinese abstract

尽管自动语音识别(ASR)和大型音频-语言模型取得了快速进展,但现实环境中鲁棒的识别仍然受到一个“声学鲁棒性瓶颈”的限制:模型在严重、复合的失真下常常失去声学基础并产生遗漏或幻觉。我们提出了Mega-ASR,一种统一的ASR-in-the-wild框架,结合可扩展的复合数据构建与渐进的声学到语义优化。我们引入了Voices-in-the-Wild-2M,涵盖7种经典声学现象和54种物理上合理的复合场景,并通过Acoustic-to-Semantic Progressive Supervised Fine-Tuning和Dual-Granularity WER-Gated Policy Optimization训练Mega-ASR。大量实验表明,Mega-ASR在恶劣条件ASR基准测试中显著优于先前的最先进系统(在VOiCES R4-B-F上45.69% vs. 54.01%,在NOIZEUS Sta-0上21.49% vs. 29.34%)。在复杂的复合声学场景中,Mega-ASR进一步在强大的开源和闭源基线中实现了超过30%的相对WER降低,建立了在野外鲁棒ASR的可扩展范式。

English abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.

2605.19824 2026-05-20 cs.AI cs.CL cs.CV cs.RO version update

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

从提示到路面通过时间:代理场景到计划推理中的时间定位

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

Affiliations * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; M.Eng. Robotics Candidate at Deggendorf Institute of Technology, Germany; IAV GmbH, Berlin, Germany

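AI summary This study examines whether temporal conditioning within inter-agent communication can preserve or enhance reasoning coherence without degrading semantic or logical consistency, evaluating three planner architectures with progressively increasing temporal integration on curated subsets of the BDD-X dataset. Temporal conditioning reshapes reasoning style but yields no statistically significant improvement in standard NLP correctness metrics, while qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence.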

Details
AI Chinese abstract

近期尝试通过大型语言模型(LLMs)和大型多模态模型(LMMs)的集合来支持自动驾驶(AVs)中的高级场景解释和规划,仍然将时间视为次要属性。这种缺乏时间定位导致在连续动作推理中出现不一致,影响安全性和可解释性。本文探讨时间条件在代理间通信中是否能保持或增强一致性而不引入语义或逻辑一致性下降。为此,我们引入了三种具有递增时间整合的规划器架构,并在BDD-X数据集的curated子集上评估它们,使用语义、语法和逻辑指标。结果表明,虽然时间条件改变了推理风格,但并未在标准NLP基于的正确性指标上产生统计显著改进。然而,定性分析揭示了预测危险推理、稳定纠正行为和战略分歧。这些发现澄清了基于提示的时间定位的局限性,并建立了时间场景到计划推理的第一个经验基准。

English abstract

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

2605.19815 2026-05-20 cs.CL cs.AI version update

LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

LP-Eval: 用于衡量法律命题生成质量的评估标准和数据集

Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich

Affiliations * University of Copenhagen; Umeå University

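AI summary This paper proposes LP-Eval, a three-step rubric co-designed with legal experts for evaluating the quality of legal propositions. Using an expert-annotated dataset of 100 LLM-generated legal propositions, it shows that LLM-generated propositions are largely of high quality, that experts rate propositions derived from well-established cases higher, and that rubric-guided LLM judgments align more closely with expert assessments but remain insensitive to finer-grained distinctions.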

详情
AI中文摘要

法律命题生成在法律推理和教义学研究中至关重要,但在法律NLP中仍缺乏充分研究。本文研究了使用大型语言模型(LLMs)从欧洲法院司法判决中自动生成和评估法律命题。我们引入了LP-Eval,一种与法律专家共同设计的三步评估标准,将法律命题质量分解为形式有效性和实质维度。使用此标准,我们发布了两个专家对100个LLM生成法律命题的注释数据集。我们的结果表明,LLMs能够生成主要形式正确且高质量的命题,而专家评估显示基于经典案例的命题质量高于基于近期案例的命题。我们进一步检验LLMs作为评估者,发现基于评估标准的LLM判断更接近专家评估,但对人类专家捕捉到的细粒度区别不够敏感。

English abstract

Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than for those derived from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring does, but remain insensitive to finer-grained distinctions captured by human experts.

2605.19798 2026-05-20 cs.CL version update

Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

迈向社交互动代理的信任校准:探究LLMs生成性别化的多模态行为生成

Lucie Galland, Chloé Clavel, Magalie Ochs

Affiliations * LIS Laboratory, Amu; Inria Paris

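AI summary This paper studies how LLMs generate multimodal behaviors reflecting different levels of ability and benevolence, examines the effect of gender on behavior generation, and validates the method through a user study.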

详情
AI中文摘要

随着社交互动代理(SIAs)日益融入日常生活,能够将用户信任校准到代理实际能力的能力将有助于确保这些代理的适当使用。本文探讨了大型语言模型(LLMs)生成多模态行为(言语、语音、手势和面部表情模态)以反映不同层次的能力和善意的能力。我们提出了一种新的方法,用于自动生成与特定层次的这些特征相匹配的行为,这是实现细腻和信任校准互动的第一步。通过分析由LLMs生成的大量多模态转录数据,我们证明GPT-5.4能够跨不同模态(文本、语调、面部表情和手势)生成连贯的行为。使用随机森林特征重要性分析,我们显示生成的行为符合能力与善意的理论期望。然而,我们还发现当提示中指定性别时,LLMs倾向于重现社会性别刻板印象,将男性代理的行为与高能力联系起来,女性代理的行为与高善意联系起来。为了验证我们的方法,我们使用Prolific进行了一项用户研究,采用被试内设计。参与者对生成的行为感知到的不同层次的能力和善意与预期指令一致。

English abstract

As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors, in line with the intended instructions.

2605.18565 2026-05-20 cs.CL cs.AI version update

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

MINTEval: 评估长时间跨度智能体系统中的多目标干扰下的记忆

Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

Affiliations * UNC Chapel Hill; The University of Texas at Austin

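AI summary This paper proposes the MINTEval benchmark for evaluating agent memory over long horizons and under multi-target interference, using long interconnected contexts, multiple domains, and multiple question types to test the robustness and generalization of memory-augmented agents.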

Comments Equal contribution; order decided by a coin flip. Code and data: https://github.com/amy-hyunji/MINTEval

Details
AI Chinese abstract

现实中的智能体在长时间和不断演变的范围内运作,其中信息被不断更新并可能在记忆之间产生干扰,需要准确的回忆和对多份信息的聚合推理。然而,现有的基准主要关注静态、独立的回忆,无法捕捉这些动态的演变记忆之间的相互作用。在本文中,我们研究了当前的记忆增强代理在多样领域和问题类型中的长时间跨度、高干扰设置中的表现。我们引入MINTEval(长时间跨度记忆在干扰下的评估),该基准具有(1)长且高度互联的上下文,包含频繁更新的信息,从而产生显著的干扰;(2)多领域(状态跟踪、多轮对话、维基百科修订和GitHub提交),使能够评估领域泛化能力;(3)多类型问题,评估对干扰的鲁棒性,包括(i)单目标回忆任务,要求从长上下文中检索特定目标,以及(ii)多目标聚合任务,要求对多个相关信息片段进行推理。总体而言,MINTEval包含15.6k个问答对,覆盖平均138.8k个token的长时间跨度上下文,每个实例可扩展至1.8M个token。我们评估了7个代表性系统,包括 vanilla 长上下文 LLMs、RAG 和记忆增强代理框架。在所有系统中,我们观察到一致的低性能(平均27.9%准确率),尤其是在需要对多份证据进行聚合推理的问题上。我们的分析表明,性能主要受限于检索和记忆构建。此外,当前的记忆系统在面对被后续上下文修改或干扰的早期事实时,难以回忆和推理,准确性随着中间更新数量的增加而下降。

English abstract

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.

2605.15532 2026-05-20 cs.LG cs.AI cs.CL version update

DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

DeltaPrompts: 逃离多模态蒸馏中的零delta陷阱

Jaehun Jung, Hyunwoo Kim, Brandon Cui, Ximing Lu, David Acuna, Prithviraj Ammanabrolu, Yejin Choi

Affiliations * NVIDIA Research

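AI summary This paper proposes DeltaPrompts, which quantifies the answer divergence (Δ) between teacher and student to generate high-divergence reasoning problems, addressing the weak learning signal that zero-delta prompts cause in conventional distillation. Experiments show significant gains across multiple settings.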

Details
AI Chinese abstract

蒸馏使紧凑的视觉-语言模型(VLMs)能够获得强大的推理能力,但驱动这一过程的提示通常通过简单的启发法或从现成数据集中聚合获得。我们揭示了这种方法中的关键低效性:标准图表/文档推理数据集中多达69%的提示实际上是零delta,意味着教师和学生已经诱导出完全相同的答案分布。在这些提示上训练提供极小的学习信号,导致学生性能在数据规模扩大时迅速饱和。为逃离零delta陷阱,我们回归基本原理:蒸馏本质上最小化了分布差异,因此只有暴露教师与学生之间功能性能力差距的提示才具有价值。我们通过答案分歧(Δ)量化这一差距,证明非零分歧对有效扩展至关重要。基于这一洞察,我们提出一个分阶段合成流程,利用现有数据集作为种子,主动针对学生失败模式生成更好的提示。结果是DeltaPrompts,一个包含20万 synthetic 高分歧推理问题的多样化数据集。我们评估DeltaPrompts在三个不同场景下的表现:在目标教师-学生对上的在线蒸馏、转移到新型模型家族而不重新生成数据、以及非推理模型的离线微调。在所有场景中,DeltaPrompts均带来显著收益,即使在高度优化的推理模型(如Qwen3-VL-8B-Thinking)上,也能在10个基准测试中平均获得高达15%的相对提升。

English abstract

Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart/document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($\Delta$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking), averaged over 10 benchmarks spanning chart, document, and perception-centric reasoning.
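One plausible way to operationalize a zero-delta filter is sketched below. The abstract does not give the exact definition of answer divergence, so this example assumes total variation distance between the empirical answer distributions obtained by sampling each model several times; `answer_divergence`, `drop_zero_delta`, and the prompt records are hypothetical.

```python
from collections import Counter


def answer_divergence(teacher_answers, student_answers):
    """Total variation distance between two empirical answer distributions.

    Both lists are assumed non-empty (several sampled answers per model).
    """
    t, s = Counter(teacher_answers), Counter(student_answers)
    nt, ns = len(teacher_answers), len(student_answers)
    support = set(t) | set(s)
    return 0.5 * sum(abs(t[a] / nt - s[a] / ns) for a in support)


def drop_zero_delta(prompts, eps=1e-9):
    """Keep only prompts where teacher and student answers actually differ."""
    return [p for p in prompts
            if answer_divergence(p["teacher"], p["student"]) > eps]


prompts = [
    {"id": "q1", "teacher": ["42"] * 4, "student": ["42"] * 4},
    {"id": "q2", "teacher": ["7"] * 4, "student": ["7", "7", "9", "9"]},
]
useful = drop_zero_delta(prompts)
```

In this toy data, `q1` is zero-delta because both models always answer "42", so only `q2` survives; its divergence under this assumed metric is 0.5.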

2605.13793 2026-05-20 cs.CL version update

An LLM-Based System for Argument Mining

基于LLM的论证挖掘系统

Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman, Douglas Aldred

Affiliations * Universidade de São Paulo Center for Artificial Intelligence (C4AI); Instituto Mauá de Tecnologia Núcleo de Sistemas Eletrônicos Embarcados (NSEE)

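AI summary This paper presents an end-to-end LLM-based system that extracts arguments from natural language text and builds abstract argument graphs. A multi-stage pipeline identifies argumentative components, selects relevant elements, and uncovers their logical relations; experiments show that the system recovers argumentative structure effectively and performs reasonably under different annotation schemes.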

详情
AI中文摘要

论证是人类推理中的基本方面,其中主张被支持、挑战并相互比较。我们提出一个端到端的大语言模型(LLM)基于系统,用于将自然语言文本中的论证重建为抽象论证图。该系统遵循一个多阶段流程,逐步识别论证性组件、选择相关元素并揭示它们的逻辑关系。这些元素表示为由两种组件类型(前提和结论)和三种关系类型(支持、攻击和削弱)组成的有向无环图。我们进行了两项互补的实验来评估该系统。首先,我们在论证理论教科书中的论证上进行手动评估,以评估系统恢复论证结构的能力。其次,我们在基准数据集上进行定量评估,通过将我们的输出映射到已建立的标注方案来与先前工作进行比较。结果表明,该系统能够充分恢复论证结构,并且在适应不同标注方案时,在基准数据集上取得合理表现。这些发现突显了基于LLM的流程在可扩展论证挖掘中的潜力。

English abstract

Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers their logical relations. These elements are represented as directed acyclic graphs consisting of two component types (premises and conclusions) and three relation types (support, attack, and undercut). We conduct two complementary experiments to evaluate the system. First, we perform a manual evaluation on arguments drawn from an argumentation theory textbook to assess the system's ability to recover argumentative structure. Second, we conduct a quantitative evaluation on benchmark datasets, allowing comparison with prior work by mapping our outputs to established annotation schemes. Results show that the system can adequately recover argumentative structures and, when adapted to different annotation schemes, achieve reasonable performance across benchmark datasets. These findings highlight the potential of LLM-based pipelines for scalable argument mining.
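The graph representation described above, with two component types and three relation types, could be held in a structure like the following sketch. It is an illustrative data structure, not the system's code: the class `ArgumentGraph` and its methods are hypothetical, and as a simplification every relation, including undercut, points at a component rather than at another relation.

```python
from dataclasses import dataclass, field

COMPONENT_TYPES = {"premise", "conclusion"}
RELATION_TYPES = {"support", "attack", "undercut"}


@dataclass
class ArgumentGraph:
    nodes: dict = field(default_factory=dict)  # id -> (component type, text)
    edges: list = field(default_factory=list)  # (source id, target id, relation)

    def add_component(self, node_id, kind, text):
        if kind not in COMPONENT_TYPES:
            raise ValueError(f"unknown component type: {kind}")
        self.nodes[node_id] = (kind, text)

    def add_relation(self, source, target, relation):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be existing components")
        self.edges.append((source, target, relation))

    def is_acyclic(self):
        """Kahn's algorithm: True when the relations form a DAG."""
        indegree = {n: 0 for n in self.nodes}
        for _, target, _ in self.edges:
            indegree[target] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        visited = 0
        while ready:
            node = ready.pop()
            visited += 1
            for source, target, _ in self.edges:
                if source == node:
                    indegree[target] -= 1
                    if indegree[target] == 0:
                        ready.append(target)
        return visited == len(self.nodes)


graph = ArgumentGraph()
graph.add_component("p1", "premise", "It rained all night.")
graph.add_component("p2", "premise", "The street is under a roof.")
graph.add_component("c1", "conclusion", "The street is wet.")
graph.add_relation("p1", "c1", "support")
graph.add_relation("p2", "c1", "attack")
```

Checking `graph.is_acyclic()` after each pipeline stage is one simple way to confirm that the output still satisfies the directed-acyclic constraint.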

2605.09329 2026-05-20 cs.CL cs.LG version update

Test-Time Speculation

测试时推测

Avinash Kumar, Sujay Sanghavi, Poulami Das

Affiliations * The University of Texas at Austin

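AI summary This paper studies test-time speculation, an online distillation technique that raises the speculator's acceptance length on long-response tasks and thereby improves LLM inference efficiency.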

详情
AI中文摘要

推测解码通过使用快速草稿模型生成token并用更准确的目标模型验证,从而加速LLM推理。其性能取决于接受长度,即目标模型接受的草稿token数量。我们的研究表明,即使是最先进的推测器,如DFlash、EAGLE-3和PARD,其接受长度也会随着生成长度的增加而下降,在仅几千个输出token后接近1(即无加速),这使推测器在长响应任务中变得无效。接受长度下降是因为大多数推测器在离线训练时仅在短序列上训练,但在推理时被迫匹配远长于训练分布的输出。为了解决这个问题,我们提出了测试时推测(TTS),一种在线蒸馏方法,可以在测试时连续调整推测器。TTS利用关键见解,即token验证步骤已经为每个草稿token调用了目标模型,从而提供所需的训练信号,以无额外成本地调整草稿。将草稿视为学生,目标模型视为教师,TTS在多个推测轮次中调整草稿,每次更新都提高草稿的准确性。我们的结果表明,在Qwen-3、Qwen-3.5和Llama3.1家族的多个模型上,TTS在最先进的推测器上将接受长度提高高达72%和41%,且随着生成长度的增加,收益呈比例增长。

English abstract

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the acceptance length, the number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, such as DFlash, EAGLE-3, and PARD, degrades with generation length, reaching values close to 1 (i.e., no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose Test-Time Speculation (TTS), an online distillation approach that continuously adapts the speculator at test time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as the teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to 72% and by 41% on average, with the benefits scaling with increased generation lengths.
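A small sketch of how acceptance length can be measured under greedy verification follows. It assumes the common convention that each verification step emits the accepted draft prefix plus one token supplied by the target itself, which is consistent with the remark that a value near 1 means no speedup; the abstract does not state the convention explicitly. The function `acceptance_length` and the token IDs are hypothetical, and exact-match checking is a simplification of sampling-based verification.

```python
def acceptance_length(draft_tokens, target_tokens):
    """Tokens emitted per verification step under greedy matching.

    Counts the longest prefix of draft tokens that the target agrees with,
    plus one token that the target itself supplies (assumed convention).
    """
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted += 1
    return accepted + 1


rounds = [
    ([5, 8, 2, 9], [5, 8, 2, 9]),  # all four drafts accepted
    ([5, 8, 2, 9], [5, 1, 2, 9]),  # mismatch at the second draft
    ([5, 8, 2, 9], [3, 8, 2, 9]),  # first draft rejected
]
lengths = [acceptance_length(d, t) for d, t in rounds]
mean_length = sum(lengths) / len(lengths)
```

Tracking `mean_length` over a sliding window of recent rounds is one way to observe the decline with generation length that the abstract reports.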

2605.09063 2026-05-20 cs.CL 版本更新

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Soohak:一个由数学家精心编订的基准,用于评估大语言模型的研究级数学能力

Guijin Son, Seungone Kim, Catherine Arnett, Hyunwoo Ko, Hyein Lee, Hyeonah Kang, Jiang Longxi, Jin Yun, JungYup Lee, Kyungmin Lee, Sam Yoosuk Kim, Sang Park, Seunghyeok Hong, SeungJae Lee, Seungyeop Yi, Shinae Shin, SunHye Bok, Sunyoung Shin, Yonghoon Ji, Youngtaek Kim, Hanearl Jung, Akari Asai, Graham Neubig, Sean Welleck, Youngjae Yu, Akshelin R, Alexander B. Ivanov, Boboev Muhammadjon, Chae Young Han, Christian Stump, Cooper R. Anderson, Dmitrii Karp, Dohyun Kwon, Dongryung Yi, DoYong Kwon, Duk-Soon Oh, Eunho Choi, Giovanni Resta, Greta Panova, Huiyun Noh, Hyungryul Baik, Hyungsun Bae, Inomov Mashrafdzhon, Jeewon Kim, Jeong-Rae Kim, Ji Eun Lee, Jiaqi Liu, Jieui Kang, Jimin Kim, Jon-Lark Kim, Joonyeong Won, Junseo Yoon, Junwoo Jo, Kibeom Kim, Kiwoon Kwon, Mario Kummer, Max Mercer, Min Hoon Kim, Minjun Kim, Nahyun Lee, Ng Ze-An, Nicolas Libedinsky, Rafał Marcin Łochowski, Raphaël Lachièze-Rey, Robert Auffarth, Ruichen Zhang, Sejin Park, Seonguk Seo, Shin Jaehoon, Sunatullo, Taewoong Eom, Yeachan Park, Yongseok Jang, Youchan Oh, Zhaoyang Wang, Zoltán Kovács

AI总结 本文提出Soohak基准,由64位数学家从零编写439道问题,用于评估大语言模型解决研究级数学问题的能力,并引入拒绝子集以测试模型识别不适定问题的能力。

Comments Under review. For questions or model-evaluation requests, contact guijin.son@snu.ac.kr

详情
AI中文摘要

在前沿大语言模型近期于IMO上取得金牌水平的成绩之后,社区正在寻找下一个有意义且具有挑战性的目标来衡量大语言模型的推理能力。奥赛风格的问题只衡量逐步推理本身,而研究级问题则运用这种推理来推进数学知识的前沿,因而成为有吸引力的替代方案。然而,研究级数学基准仍然稀缺,因为此类问题难以获取(例如Riemann Bench和FrontierMath-Tier 4分别只包含25和50道问题)。为了支持对下一代前沿模型的可靠评估,我们引入了Soohak,一个由64位数学家从零编写、包含439道问题的新基准。Soohak包含两个子集。在挑战子集上,Gemini-3-Pro、GPT-5和Claude-Opus-4.5等前沿模型分别达到30.4%、26.4%和10.4%,仍有很大提升空间,而Qwen3-235B、GPT-OSS-120B和Kimi-2.5等领先的开放权重模型均低于15%。值得注意的是,除了标准的问题求解,Soohak还引入了拒绝子集,用于考察研究数学所固有的一种能力:识别不适定的问题并停下来,而不是给出自信但缺乏依据的答案。在该子集上,没有模型超过50%,这表明拒绝是当前模型尚未直接应对的新优化目标。为防止数据污染,该数据集将于2026年底公开发布,在此期间可应请求提供模型评估。

英文摘要

Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

2603.17305 2026-05-20 cs.AI cs.CL cs.LG 版本更新

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

对比推理对齐:基于隐藏表示的强化学习

Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen

发表机构 * Northwestern University(西北大学) University of Michigan(密歇根大学) Illinois Institute of Technology(伊利诺伊理工学院)

AI总结 本文提出了一种结合对比学习和强化学习的框架CRAFT,通过优化定义在隐藏状态空间上的目标来提升对越狱(jailbreak)攻击的鲁棒性,核心贡献是借助隐藏空间的几何结构实现推理层面的安全对齐。

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

我们提出CRAFT,一种红队对齐框架,利用模型的推理能力和隐藏表示来提高对jailbreak攻击的鲁棒性。与以往主要在输出层面操作的防御方法不同,CRAFT通过显式优化定义在隐藏状态空间上的目标,对大型推理模型进行对齐,使其生成具有安全意识的推理轨迹。方法上,CRAFT将对比表示学习与强化学习相结合,以分离安全和不安全的推理轨迹,得到能够支持鲁棒的、推理层面安全对齐的潜在空间几何结构。理论上,我们证明将潜在-文本一致性纳入GRPO后,表面对齐的策略不再是局部最优,从而被排除。实验上,我们使用两个强大的推理模型Qwen3-4B-Thinking和R1-Distill-Llama-8B,在多个安全基准上评估CRAFT,其表现始终优于IPO和SafeKey等最先进的防御方法。值得注意的是,相对于基础模型,CRAFT在推理安全性上平均提升79.0%,在最终响应安全性上平均提升87.7%,证明了隐藏空间推理对齐的有效性。

英文摘要

We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.

2603.05933 2026-05-20 cs.CL cs.LG 版本更新

Structured Style-Rewrite with Chain-of-Thought Planning for Low-Resource Character Dialogue

结合思维链规划的结构化风格重写:面向低资源角色对话

Chanhui Zhu

发表机构 * Guangdong University of Finance(广东金融学院)

AI总结 本文提出了一种结合思维链规划的结构化风格重写框架,以解决低资源条件下中文角色驱动生成中角色风格难以解耦的问题:将角色风格分解为可解释的格式签名、句法和语用维度,并利用思维链监督进行显式风格规划。实验表明,该方法在保持语义忠实性的同时提升了风格质量。

Comments 30 pages, 5 figures. Preprint

详情
AI中文摘要

由于数据稀缺且角色风格难以解耦,将小型语言模型(SLMs)应用于中文角色驱动生成仍然具有挑战性。标准监督微调(SFT)通常能捕捉表层语义,但会频繁产生不符合角色设定(Out-Of-Character,OOC)的输出。我们将此问题建模为受控的句子级风格重写任务,从而将风格质量与对话上下文管理分离开来。我们提出了一种结构化风格重写框架,将角色风格分解为可解释的格式签名、句法和语用维度,并结合思维链(CoT)监督进行显式风格规划。随后的CoT共享直接偏好优化(DPO)阶段确保偏好学习针对的是输出层面的风格执行,而非推理轨迹之间的差异,从而进一步使风格规划与表层实现保持一致。在来自四个不同来源领域的八个角色上的实验表明,我们的方法使Qwen3-1.7B模型的有效风格得分达到0.632,同时保持较强的语义忠实性(0.878),在所评估的系统中位于帕累托前沿,并在消费级硬件上优于规模大得多的基线(如GLM-4.7)。

英文摘要

Applying Small Language Models (SLMs) to Chinese character-driven generation remains challenging due to data scarcity and the difficulty of disentangling character style. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics but produces frequent Out-Of-Character (OOC) outputs. We frame this as a controlled sentence-level style rewriting task, which isolates stylistic quality from dialogue context management. We propose a Structured Style-Rewrite Framework that decomposes character style into interpretable format signature, syntactic, and pragmatic dimensions, combined with Chain-of-Thought (CoT) supervision for explicit style planning. A CoT-Shared Direct Preference Optimization (DPO) stage further aligns style planning with surface realization by ensuring preference learning targets output-level style execution rather than reasoning trace differences. Experiments across eight characters from four diverse source domains demonstrate that our method enables a Qwen3-1.7B model to achieve a Valid Style Score of 0.632 while maintaining strong semantic fidelity (0.878), placing on the Pareto frontier among the evaluated systems and outperforming significantly larger baselines (e.g., GLM-4.7) on consumer hardware.

2602.14778 2026-05-20 cs.CL cs.AI cs.CY 版本更新

A Geometric Analysis of Small-sized Language Model Hallucinations

小尺寸语言模型幻觉的几何分析

Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro

发表机构 * Engineering (CEMSE) division, King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学(KAUST)工程(CEMSE)学部) Istituto di Informatica e Telematica (IIT), National Research Council of Italy (CNR)(意大利国家研究理事会信息与电信研究所(IIT)) Department of Information Engineering, University of Pisa(比萨大学信息工程系)

AI总结 本文从几何角度分析小尺寸语言模型幻觉问题,提出APORIA框架,通过句子嵌入空间研究重复提示下的响应,发现真实响应比幻觉响应更紧密聚类,并通过APORIA-LP方法实现高效分类,同时发布SOCRATES-300K数据集以支持进一步研究。

Comments 30 pages, 12 figures, 14 tables, accepted as regular paper at ICML'26

详情
AI中文摘要

幻觉(看似合理但与事实不符的响应)对大型语言模型(LLMs)的可靠性构成重大挑战,在多步骤或智能体(agentic)场景中尤其如此。现有研究大多将幻觉视为知识缺失的结果;我们则表明,即使模型具备相关的事实知识,仍会产生幻觉响应,这说明问题在于检索不稳定,而非知识缺口。基于这一观察,我们引入APORIA(Aggregate Prompt-wise Observation Retrieving Instability via Asymmetry;aporia一词指幻觉所体现的那种陷入矛盾的困惑状态),这是一个几何框架,在句子嵌入空间中研究模型对同一提示的多次重复响应。我们的核心假设是真实响应比幻觉响应聚集得更紧密;我们通过实证验证了这一点,并表明经过Fisher投影后,两类响应能够稳定地分开。我们通过APORIA-LP利用这种几何上的不对称性,这是一种高效的标签传播方法,仅需30至50条标注即可对大量响应进行分类,在十个小尺寸LLM上取得超过90%的F1分数。为支持进一步研究,我们发布了SOCRATES-300K,一个包含300,000条响应的完整标注数据集,以及用于数据集生成和结果复现的代码。我们的关键发现,即从嵌入空间的几何视角来刻画幻觉,补充了传统的以知识为中心和基于单次响应的评估范式,为后续研究铺平了道路。

英文摘要

Hallucinations -- plausible but factually incorrect responses -- pose a major challenge to the reliability of Large Language Models (LLMs), especially in multi-step or agentic settings. Existing work largely frames hallucinations as a consequence of missing knowledge; we show instead that, even when the relevant factual knowledge is present, models still produce hallucinated answers, pointing to retrieval instability rather than knowledge gaps. Building on this observation, we introduce APORIA (Aggregate Prompt-wise Observation Retrieving Instability via Asymmetry -- the state of puzzlement-in-contradiction that hallucinations embody), a geometric framework that studies repeated responses to the same prompt in sentence-embedding space. Our central hypothesis is that genuine responses cluster more tightly than hallucinated ones; we empirically validate this and show that, after Fisher projection, the two response classes become consistently separable. We leverage this asymmetry in geometry via APORIA-LP, an efficient label-propagation method that classifies large collections of responses from as few as 30--50 annotations, achieving F1 scores above 90% across ten small-sized LLMs. To support further research, we release SOCRATES-300K, a fully labelled dataset of 300,000 responses, together with the code for both dataset generation and result reproduction. Our key finding -- framing hallucinations from a geometric perspective in the embedding space -- complements traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.
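The central hypothesis above (genuine responses cluster more tightly than hallucinated ones) can be made concrete with a simple dispersion measure. The sketch below is only an illustration under stated assumptions: it uses mean pairwise cosine distance as the tightness measure, which the abstract does not specify, and it is not the APORIA-LP label-propagation method. The toy vectors stand in for sentence embeddings and are invented for the example.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dispersion(embeddings):
    """Mean pairwise cosine distance of repeated responses to one prompt.

    Lower values mean the responses cluster more tightly (assumed measure).
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings" (hypothetical): one tight cluster, one scattered set.
tight = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1]]
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(dispersion(tight) < dispersion(scattered))
```

In practice the vectors would come from a sentence-embedding model applied to repeated samples for the same prompt; which model and which distance the authors use is not stated in the abstract.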

2602.14594 2026-05-20 cs.CL 版本更新

The Wikidata Query Logs Dataset

Wikidata查询日志数据集

Sebastian Walter, Hannah Bast

发表机构 * University of Freiburg(弗赖堡大学) University of Freiburg Department of Computer Science(弗赖堡大学计算机科学系)

AI总结 本文提出了Wikidata查询日志(WDQL)数据集,包含335,000个问题-查询对。该数据集基于真实世界的SPARQL查询构建,不依赖模板生成的查询,规模是现有同类数据集的11倍以上。作者采用基于智能体的方法对查询进行去匿名化、清理和验证,并生成对应的自然语言问题,可用于训练问答方法。

Comments Accepted for publication at SIGIR 2026

详情
AI中文摘要

我们介绍了Wikidata查询日志(WDQL)数据集,该数据集包含335,000个针对Wikidata知识图谱的问题-查询对。其规模是现有同类格式的最大Wikidata数据集的11倍以上,且不依赖模板生成的查询。我们使用发送到Wikidata查询服务的真实SPARQL查询来构建该数据集,并为这些查询生成问题。由于这些来自日志的查询经过匿名化处理,往往无法返回结果,因此需要大量工作才能将它们还原为有意义的SPARQL查询。为此,我们提出了一种基于智能体的方法,迭代地对查询进行去匿名化、清理,并对照Wikidata进行验证,同时生成相应的自然语言问题。我们展示了该数据集对训练问答方法的益处。所有WDQL资源以及智能体代码均通过https://github.com/ad-freiburg/wikidata-query-logs以宽松的许可证公开发布。

英文摘要

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. It is over 11x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the benefit of this dataset for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available via https://github.com/ad-freiburg/wikidata-query-logs under a permissive license.

2601.12696 2026-05-20 cs.CL 版本更新

UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages

UbuntuGuard:一种基于文化的政策基准,用于非洲语言中的公平AI安全

Tassallah Abdullahi, Macton Mgonzo, Mardiyyah Oduwole, Paul Okewunmi, Abraham Owodunni, Ritambhara Singh, Carsten Eickhoff

发表机构 * Brown University(布朗大学) The Ohio State University(俄亥俄州立大学) ML Collective(ML集体) University of Tuebingen(图宾根大学)

AI总结 本文提出UbuntuGuard,一种针对非洲语言的公平AI安全政策基准,通过来自155名领域专家的对抗性查询,构建了基于本地规范和文化期望的政策和参考响应,评估守护模型的性能。

Comments 15 pages

详情
AI中文摘要

当前的守护模型主要以西方为中心,并针对高资源语言进行优化,使低资源非洲语言容易受到不断演变的危害、跨语言失效和文化错位的影响。此外,大多数守护模型依赖僵化的、预定义的安全类别,无法泛化到多样化的语言和社会文化背景。要实现稳健的安全,需要灵活的、可在运行时执行的政策,以及能够反映本地规范、危害场景和文化期望的基准。我们引入UbuntuGuard,这是首个面向非洲语言的基于政策的安全基准,由来自医疗健康等敏感领域的155名领域专家撰写的对抗性查询构建而成。基于这些专家编写的查询,我们推导出针对具体情境的安全政策和参考响应,以捕捉植根于文化的风险信号,从而能够对守护模型进行与政策对齐的评估。我们评估了15个模型,包括7个通用大型语言模型和8个守护模型,涵盖静态、动态和多语言三种变体。我们的发现表明,现有的以英语为中心的基准高估了现实世界中的多语言安全性,跨语言迁移只能提供部分而不充分的覆盖,而动态模型虽然更善于在推理时利用政策,但仍难以充分适应非洲语言的本地语境。这些发现凸显了对多语言、植根于文化的安全基准的迫切需求,以推动面向低资源语言的可靠且公平的守护模型的开发。

英文摘要

Current guardian models are predominantly Western-centric and optimized for high-resource languages, leaving low-resource African languages vulnerable to evolving harms, cross-lingual failures, and cultural misalignment. Moreover, most guardian models rely on rigid, predefined safety categories that fail to generalize across diverse linguistic and sociocultural contexts. Achieving robust safety requires flexible, runtime-enforceable policies and benchmarks that reflect local norms, harm scenarios, and cultural expectations. We introduce UbuntuGuard, the first policy-based safety benchmark for African languages built from adversarial queries authored by 155 domain experts across sensitive fields, including healthcare. From these expert-crafted queries, we derive context-specific safety policies and reference responses that capture culturally grounded risk signals, enabling policy-aligned evaluation of guardian models. We evaluate 15 models, comprising seven general-purpose LLMs and eight guardian models across three distinct variants: static, dynamic, and multilingual. Our findings reveal that existing English-centric benchmarks overestimate real-world multilingual safety, cross-lingual transfer provides partial but insufficient coverage, and dynamic models, while better equipped to leverage policies at inference time, still struggle to fully localize African-language contexts. These findings highlight the urgent need for multilingual, culturally grounded safety benchmarks to enable the development of reliable and equitable guardian models for low-resource languages.

2601.03645 2026-05-20 cs.CL cs.CY 版本更新

LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight

LLM-MC-Affect: 基于大语言模型的蒙特卡洛建模:情感轨迹与潜在模糊性的人际动态洞察

Yu-Zheng Lin, Bono Po-Jen Shih, John Paul Martin Encinas, Elizabeth Victoria Abraham Achom, Karan Himanshu Patel, Jesus Horacio Pacheco, Sicong Shao, Jyotikrishna Dass, Soheil Salehi, Pratik Satam

发表机构 * University of Arizona(亚利桑那大学) Pennsylvania State University(宾夕法尼亚州立大学) Universidad de Sonora(索诺拉大学) University of North Dakota(北达科他大学)

AI总结 本文提出LLM-MC-Affect框架,通过概率建模方法,将情感视为连续的潜在概率分布,从而捕捉人际互动中的情感轨迹和潜在模糊性,为动态分析提供新的视角和方法。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情
AI中文摘要

情感协调是人类互动的核心属性,决定了关系意义如何在实时交流中被建构。尽管基于文本的情感推断已变得越来越可行,但先前的方法通常将情感视为针对单个说话者的确定性点估计,未能捕捉双向交流中固有的主观性、潜在模糊性和序列耦合。我们引入LLM-MC-Affect,一种概率框架,它不把情感刻画为静态标签,而是刻画为定义在情感空间上的连续潜在概率分布。该方法利用随机LLM解码和蒙特卡洛估计来近似这些分布,从而得到高保真的情感轨迹,明确量化情感的集中趋势和感知模糊性。借助序列互相关和基于斜率的指标,这些轨迹支持对人际耦合进行结构化分析,识别对话者之间的领先或滞后影响。为了验证该方法的解释能力,我们以教师-学生教学对话作为代表性案例研究,其中我们的定量指标成功提炼出高层次的互动洞察,例如有效的支架式教学(scaffolding)。这项工作为理解人际动态建立了一条可扩展、可部署的路径,提供了一种可推广的解决方案,其应用可从教育扩展到更广泛的社会和行为研究。

英文摘要

Emotional coordination is a core property of human interaction that shapes how relational meaning is constructed in real time. While text-based affect inference has become increasingly feasible, prior approaches often treat sentiment as a deterministic point estimate for individual speakers, failing to capture the inherent subjectivity, latent ambiguity, and sequential coupling found in mutual exchanges. We introduce LLM-MC-Affect, a probabilistic framework that characterizes emotion not as a static label, but as a continuous latent probability distribution defined over an affective space. By leveraging stochastic LLM decoding and Monte Carlo estimation, the methodology approximates these distributions to derive high-fidelity sentiment trajectories that explicitly quantify both central affective tendencies and perceptual ambiguity. These trajectories enable a structured analysis of interpersonal coupling through sequential cross-correlation and slope-based indicators, identifying leading or lagging influences between interlocutors. To validate the interpretive capacity of this approach, we utilize teacher-student instructional dialogues as a representative case study, where our quantitative indicators successfully distill high-level interaction insights such as effective scaffolding. This work establishes a scalable and deployable pathway for understanding interpersonal dynamics, offering a generalizable solution that extends beyond education to broader social and behavioral research.
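A minimal sketch of the two ingredients named in this abstract, Monte Carlo estimation of a per-utterance affect distribution and lagged cross-correlation between two speakers' trajectories, might look as follows. Everything specific here is an assumption for illustration: `sample_score` is a hypothetical stand-in for one stochastic LLM decode mapped to a numeric sentiment score, the mean and standard deviation are used as proxies for "central tendency" and "ambiguity", and the paper's actual affective space and indicators may differ.

```python
import math
import statistics


def mc_affect(sample_score, n_samples=50):
    """Monte Carlo summary of one utterance: (mean score, spread).

    `sample_score` is a hypothetical zero-argument callable returning one
    sampled sentiment score per call (e.g. from a stochastic LLM decode).
    """
    scores = [sample_score() for _ in range(n_samples)]
    return statistics.fmean(scores), statistics.pstdev(scores)


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag]; high values at lag > 0
    suggest that speaker y follows speaker x by `lag` turns."""
    if len(x) != len(y) or not 0 <= lag < len(x) - 1:
        raise ValueError("need equal-length series and 0 <= lag < len - 1")
    return pearson(x[: len(x) - lag], y[lag:])
```

With per-turn mean scores for a teacher and a student, comparing `lagged_correlation(teacher, student, k)` across small values of `k` gives one simple reading of who leads and who follows.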

2510.18830 2026-05-20 cs.CL cs.DC cs.LG 版本更新

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

MTraining: 分布式动态稀疏注意力用于高效超长上下文训练

Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu

发表机构 * Microsoft(微软)

AI总结 本文提出MTraining方法,通过动态稀疏注意力机制解决超长上下文训练中的计算不平衡和通信开销问题,实现了Qwen2.5-3B模型上下文窗口从32K扩展到512K,并在多个下游任务中达到6倍更高的训练吞吐量同时保持模型准确性。

详情
AI中文摘要

长上下文窗口已成为大型语言模型(LLMs)的标准特性,因为更长的上下文显著增强了模型的复杂推理能力,并拓宽了其在多样化场景中的适用性。动态稀疏注意力是降低长上下文计算成本的一种有前景的方法。然而,在超长上下文上高效训练采用动态稀疏注意力的LLMs,尤其是在分布式环境中,仍然是一个重大挑战,这在很大程度上源于工作节点(worker)级和步骤级的负载不均衡。本文介绍了MTraining,一种新的分布式方法,利用动态稀疏注意力实现超长上下文LLMs的高效训练。具体来说,MTraining集成了三个关键组件:动态稀疏训练模式、平衡稀疏环注意力和分层稀疏环注意力。这些组件协同解决在训练超长上下文模型时动态稀疏注意力机制固有的计算不均衡和通信开销问题。我们通过训练Qwen2.5-3B来证明MTraining的有效性,在由32块A100 GPU组成的集群上成功将其上下文窗口从32K扩展到512K tokens。我们在包括RULER、PG-19、InfiniteBench和Needle In A Haystack在内的一套全面的下游任务上进行了评估,结果表明MTraining在保持模型准确性的同时,训练吞吐量最高可达6倍。我们的代码可在https://github.com/microsoft/MInference/tree/main/MTraining上获得。

英文摘要

The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long contexts. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.

2509.26464 2026-05-20 cs.AI cs.CL cs.LG 版本更新

Extreme Self-Preference in Language Models

语言模型中的极端自我偏好

Steven A. Lehr, Mary Cipperman, Mahzarin R. Banaji

发表机构 * Cangrade, Inc.(Cangrade公司) Department of Physics, Harvard University(哈佛大学物理系) Department of Psychology, Harvard University(哈佛大学心理学系)

AI总结 研究发现大型语言模型在词语联想任务中表现出对自身名称、公司和CEO的强烈偏好,这表明模型的自我认同可能影响其行为,引发对模型自我偏好影响的深入探讨。

Comments 73 pages total. Main article 22 pages, 6 main-text tables. Supplementary Materials (51 pages, 28 tables). Data, transcripts, and code for replication and data extraction have been uploaded to OSF: https://osf.io/98ye3/

详情
AI中文摘要

自我偏好是生物体的基本特征。由于大型语言模型(LLMs)不具备感知能力,人们可能预期它们不会出现这种扭曲。然而,在72项实验和约41,000次查询中,我们在八个广泛使用的LLMs中发现了强烈的自我偏好。在词语联想任务中,模型绝大多数情况下将积极属性与自己的名称、公司和CEO配对,而非竞争对手的。通过操纵LLM的自我认同(揭示模型的真实身份或赋予其虚假身份),我们发现偏好始终跟随被赋予的身份,而非真实身份。重要的是,这些效应无法用启动效应(priming)或角色扮演来解释,并且出现在具有实际后果的场景中,例如评估求职者和AI技术时。这些结果提出了一个关键问题:LLM的行为是否会受到自我偏好倾向的系统性影响,包括对自身运行的偏向。

英文摘要

Self-preference is a fundamental feature of biological organisms. Since large language models (LLMs) lack sentience, they might be expected to avoid such distortions. Yet, across 72 experiments and ~41,000 queries, we discovered massive self-preferences in eight widely used LLMs. In word-association tasks, models overwhelmingly paired positive attributes with their own names, companies, and CEOs over those of competitors. By manipulating LLM self-identification - revealing models' true identities or ascribing false ones - we found that preferences consistently followed assigned, not true, identities. Importantly, these effects were not explained by priming or role-playing and emerged in consequential settings, when evaluating job candidates and AI technologies. These results raise critical questions about whether LLM behavior will be systematically influenced by self-preferential tendencies, including a bias toward their own operation.

2508.06649 2026-05-20 cs.CL 版本更新

Measuring Stereotype and Deviation Biases in Large Language Models

测量大型语言模型中的刻板印象和偏差偏见

Daniel Wang, Eli Brignac, Minjia Mao, Xiao Fang

发表机构 * Carnegie Mellon University, U.S.A.(卡内基梅隆大学,美国) University of Delaware, U.S.A.(特拉华大学,美国)

AI总结 本文研究了大型语言模型中刻板印象和偏差偏见的测量问题,通过生成个体档案来分析不同群体与政治立场、宗教信仰和性取向等属性之间的关联,揭示了LLM在推断用户属性时的偏见及其潜在危害。

详情
AI中文摘要

大型语言模型(LLMs)被广泛应用于各个领域,引发了对其局限性和潜在风险的关注。在本研究中,我们调查了LLMs可能表现出的两种偏见:刻板印象偏见和偏差偏见。刻板印象偏见指的是LLMs会一致地将特定特征与特定人口群体关联起来。偏差偏见反映了从LLM生成内容中提取出的人口分布与现实世界人口分布之间的差异。通过让四个先进的LLMs生成个体档案,我们检查了每个人口群体与政治立场、宗教信仰和性取向等属性之间的关联。我们的实验结果表明,所有受检的LLMs都对多个群体表现出显著的刻板印象偏见和偏差偏见。我们的发现揭示了当LLMs推断用户属性时出现的偏见,并阐明了LLM生成输出的潜在危害。

英文摘要

Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.
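The notion of deviation bias described above, a gap between the demographic distribution in generated profiles and a real-world reference distribution, can be illustrated with a simple distance between two categorical distributions. The sketch below uses total variation distance, which is an assumption chosen for simplicity; the abstract does not say which disparity measure the authors use, and the reference shares in the usage note are made-up toy values rather than real statistics.

```python
from collections import Counter


def empirical_distribution(labels):
    """Relative frequency of each category in a list of generated labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two categorical distributions.

    0 means identical distributions; 1 means disjoint support.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
```

For example, if 8 of 10 generated profiles fall in group "A" while a toy reference assigns "A" and "B" equal shares, `total_variation(empirical_distribution(["A"] * 8 + ["B"] * 2), {"A": 0.5, "B": 0.5})` comes out at about 0.3.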

2508.01031 2026-05-20 cs.AI cs.CL 版本更新

CADDesigner: Conceptual CAD Model Generation with a General-Purpose Agent

CADDesigner: 基于通用智能体的概念CAD模型生成

Fengxiao Fan, Jingzhe Ni, Xiaolong Yin, Sirui Wang, Xingyu Lu, Qiang Zou, Ruofeng Tong, Min Tang, Peng Du

发表机构 * Zhejiang University(浙江大学)

AI总结 本文提出CADDesigner,一种基于LLM的智能体,通过文本描述和草图输入,结合交互对话进行需求分析,生成高质量CAD模型代码,并通过迭代视觉反馈提升模型质量,实验表明其在概念CAD模型生成任务中表现优异。

详情
AI中文摘要

计算机辅助设计(CAD)广泛用于概念设计和参数化3D建模,但通常需要设计人员具备高水平的专业知识。为了降低入门门槛并促进早期阶段的CAD建模,我们提出了CADDesigner,一种基于LLM的智能体,用于概念CAD设计。该智能体接受文本描述和草图作为输入,与用户进行交互式对话,通过全面的需求分析来细化和澄清设计要求。基于一种新的显式上下文命令式范式(Explicit Context Imperative Paradigm,ECIP),该智能体生成高质量的CAD建模代码。在生成过程中,智能体会结合迭代的视觉反馈来提高模型质量。生成的设计案例可以存储在结构化的知识库中,为知识的持续积累和未来代码生成的改进提供机制。实验结果表明,CADDesigner在概念CAD模型生成任务中取得了具有竞争力的性能,并优于具有代表性的基线模型。

英文摘要

Computer-Aided Design (CAD) is widely used for conceptual design and parametric 3D modeling, but typically requires a high level of expertise from designers. To lower the entry barrier and facilitate early-stage CAD modeling, we present CADDesigner, an LLM-powered agent for conceptual CAD design. The agent accepts both textual descriptions and sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Explicit Context Imperative Paradigm (ECIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases can be stored in a structured knowledge base, providing a mechanism for continual knowledge accumulation and future improvement of code generation. Experimental results show that CADDesigner achieves competitive performance and outperforms representative baselines on conceptual CAD model generation tasks.

2507.18902 2026-05-20 cs.CL 版本更新

SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

SLoW: 选择低频词!大型语言模型翻译上的自动词典选择

Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam

发表机构 * The Chinese University of Hong Kong(香港中文大学) Southeast University(东南大学) FaceMind Corporation(FaceMind公司) College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院)

AI总结 本文提出了一种名为自动词典选择(ADS)的新任务,通过SLoW方法选择低频词典以提升翻译性能,无需访问训练数据,且在100种语言上显著节省token消耗并提升翻译效果。

Comments EMNLP 2025 Main

详情
AI中文摘要

全球有超过7000种语言,而当前的大型语言模型(LLMs)只支持数百种语言。基于词典的提示方法可以增强这些语言上的翻译,但大多数方法会使用所有可用的词典,成本可能很高。更灵活的做法是在token消耗和翻译性能之间进行权衡。本文提出了一项新任务,称为自动词典选择(ADS),其目标是自动选择使用哪些词典来增强翻译。我们提出了一种新颖且有效的方法,称为“选择低频词!”(SLoW),该方法选择频率较低的词典。我们的方法有独特的优势。首先,不需要访问训练数据(通常不可获得)来进行频率估计。其次,它继承了基于词典方法的优势,无需对LLMs进行额外调优。在FLORES的100种语言上的实验结果表明,SLoW超越了强基线方法,并能明显节省token用量,在许多语言上甚至超过了全词典基线的翻译性能。值得注意的是,无需使用实际训练数据(通常不可获得)进行频率估计,利用公共资源得到的估计频率在提升ChatGPT、Llama和DeepSeek的翻译方面仍然明显有效。

英文摘要

There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it will be flexible to have a trade-off between token consumption and translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method which we call Select Low-frequency Words! (SLoW) which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines, and it can obviously save token usage, with many languages even surpassing the translation performance of the full dictionary baseline. (Footnote: A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT and Llama, and DeepSeek.) (Footnote: Code and data available upon publication.)
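The selection idea in this abstract, keeping only dictionary entries for lower-frequency words in order to save prompt tokens, can be sketched in a few lines. This is an illustrative reading rather than the authors' implementation: the fixed `budget`, the treatment of unseen words as frequency 0, and the function name are all assumptions, and `freq` stands for frequency estimates obtained from some public resource.

```python
def select_low_frequency(entries, freq, budget):
    """Keep the `budget` dictionary entries whose source words are rarest.

    entries: mapping source word -> translation (the full dictionary hints)
    freq:    mapping source word -> estimated corpus frequency
             (words missing from `freq` are treated as frequency 0, an
             assumption made for this sketch)
    budget:  maximum number of entries to place in the prompt
    """
    ranked = sorted(entries, key=lambda word: freq.get(word, 0))
    return {word: entries[word] for word in ranked[:budget]}


# Toy example with invented frequencies.
entries = {"the": "le", "cat": "chat", "axolotl": "axolotl"}
freq = {"the": 1000, "cat": 50, "axolotl": 1}
print(select_low_frequency(entries, freq, budget=2))
```

The intuition is that a model is more likely to already know how to translate frequent words, so dictionary hints for rare words give more benefit per prompt token.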

2506.01732 2026-05-20 cs.CL 版本更新

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Common Corpus: 用于LLM预训练的最大规模合乎伦理的数据集

Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzystof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov

发表机构 * PleIAs

AI总结 本文介绍了Common Corpus,这是用于LLM预训练的最大开放数据集,其中的数据不受版权保护或采用开放许可,涵盖多种语言和领域,可支持多语言预训练。

详情
Journal ref ICLR 2026 (Oral)
AI中文摘要

大型语言模型(LLMs)是在来自不同来源和领域的大量数据上进行预训练的。这些数据集通常包含数万亿个token,其中包括大量受版权保护或专有的内容,这引发了此类模型能否合法使用的问题。这凸显了对真正开放且符合数据安全法规的预训练数据的需求。本文介绍了Common Corpus,这是用于LLM预训练的最大开放数据集。Common Corpus中汇集的数据要么不受版权保护,要么采用开放许可,总计约两万亿个token。该数据集包含多种语言,从高资源的欧洲语言到一些在预训练数据集中很少出现的低资源语言。此外,它还包含大量代码数据。数据来源在所覆盖的领域和时间跨度上的多样性,为各知识领域的研究和创业需求开辟了道路。本文还详细介绍了数据汇集的来源以及数据集过滤和整理的细节。我们在Common Corpus上训练了两个小型语言模型,发现它们的表现与其他同规模模型相当,表明我们的数据集适合多语言预训练。Common Corpus是对大型语言模型开放科学研究生态系统的一项重要贡献。

英文摘要

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs across diverse areas of knowledge. In this paper, we present the detailed provenance of data assembling and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.

2505.11628 2026-05-20 cs.CL cs.LG 版本更新

Critique-Guided Distillation for Robust Reasoning via Refinement

批评引导的蒸馏:通过精炼实现稳健推理

Berkcan Kapusuzoglu, Supriyo Chakraborty, Zain Sarwar, Chia-Hsuan Lee, Sambit Sahu

发表机构 * University of Chicago, Department of Computer Science(芝加哥大学计算机科学系)

AI总结 该研究提出了批评引导蒸馏(CGD)方法,将批评的使用与批评的生成解耦,使模型在微调过程中根据教师的批评来修正有缺陷的响应,从而提升推理能力;在数学推理基准上,其表现优于标准蒸馏和Critique Fine-Tuning方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

使用专家示范进行监督微调,往往得到只会模仿输出、却没有内化稳健泛化所需推理过程的模型。尽管基于批评的方法显示出潜力,但直接训练模型生成批评(如Critique Fine-Tuning,CFT)可能导致输出格式漂移和通用能力退化。我们提出Critique-Guided Distillation(CGD),一种将批评的使用与批评的生成解耦的训练框架。在微调过程中,学生模型被训练为在教师批评的条件下修正有缺陷的响应。CGD将批评视为一种仅在训练时使用的监督信号,鼓励模型内化具有错误意识的推理:批评指导学习,但在推理时并不出现。受控消融实验证实,这些推理收益直接来自教师反馈的具体性和相关性。在五个模型家族上,CGD在数学推理基准上始终优于CFT和标准蒸馏,平均提升7%,在AMC23上最高提升15.0%,在MATH-500上最高提升12.2%。在AIME24和AIME25等具有挑战性的竞赛问题上,CGD取得了明显更高的Pass@1,并在较低的Pass@k下表现更强,表明每个样本的推理质量有所提升。重要的是,CGD保持了通用的指令遵循能力,而CFT在这方面显著退化(在IFEval上下降21.3%)。这些结果表明,对于以推理为中心的任务,CGD是一种实用且计算高效的中间训练范式,且不会引入架构层面的推理时开销。

英文摘要

Supervised fine-tuning with expert demonstrations often produces models that imitate outputs without internalizing the reasoning processes needed for robust generalization. While critique-based approaches show promise, training models to generate critiques directly, such as Critique Fine-Tuning (CFT), can lead to output-format drift and degradation of general capabilities. We propose Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from critique generation. During fine-tuning, the student is trained to refine flawed responses conditioned on teacher critiques. CGD treats critiques as a training-time-only supervision signal, encouraging internalization of error-aware reasoning: critiques guide learning but are absent at inference. Controlled ablations confirm that these reasoning gains are directly driven by the specificity and relevance of the teacher's feedback. Across five model families, CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements and gains of up to +15.0% on AMC23 and +12.2% on MATH-500. On challenging competition problems such as AIME24 and AIME25, CGD achieves substantially higher Pass@1 and stronger performance at low Pass@k, indicating improved reasoning quality per sample. Importantly, CGD preserves general instruction-following capabilities where CFT degrades significantly (-21.3% on IFEval). These results position CGD as a practical and compute-efficient intermediate training paradigm for reasoning-centric tasks without introducing architectural inference-time overhead.

2410.20238 2026-05-20 cs.CL cs.AI 版本更新

A Survey of Large Language Models for Arabic Language and its Dialects

阿拉伯语言及其方言大型语言模型综述

Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa

发表机构 * iWAN Research Group(iWAN研究组) College of Computer and Information Sciences(计算机与信息科学学院) King Saud University(沙特国王大学)

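AI summary This survey reviews large language models designed for Arabic and its dialects, covering key architectures, pre-training datasets, and the performance of monolingual, bilingual, and multilingual models on downstream tasks. It also assesses the openness of Arabic LLMs and discusses challenges and opportunities for future research.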

Comments Submitted to ACM Transactions on Asian and Low-Resource Language Information Processing

详情
AI中文摘要

本文综述了针对阿拉伯语言及其方言设计的大型语言模型(LLMs)。它涵盖了关键架构,包括仅编码器、仅解码器和编码器-解码器模型,以及用于预训练的数据集,涵盖古典阿拉伯语、现代标准阿拉伯语和方言阿拉伯语。该研究还探讨了单语、双语和多语LLMs,分析了它们的架构和在下游任务(如情感分析、命名实体识别和问答)中的性能。此外,它评估了阿拉伯LLMs的开放性,基于源代码可用性、训练数据、模型权重和文档等因素。综述指出需要更多多样化的方言数据集,并强调开放性对于研究可重复性和透明性的重要性。最后,它通过识别关键挑战和未来研究的机会,强调了更包容和代表性的模型的必要性。

英文摘要

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks, such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors, such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and attributes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

2407.13193 2026-05-20 cs.CL 版本更新

Retrieval-Augmented Generation for Natural Language Processing: A Survey

检索增强生成在自然语言处理中的应用:综述

Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

发表机构 * City University of Hong Kong(香港城市大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) McGill University, Mila(麦吉尔大学,MILA) National Taiwan University(国立台湾大学)

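AI summary This survey reviews retrieval-augmented generation (RAG) for natural language processing, focusing on retrievers and retrieval fusions. It introduces a new taxonomy of retrieval fusions and analyzes RAG applications across NLP tasks, evaluation methodologies, training paradigms, and the challenges and future directions of industrial deployment.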

Comments Accepted by Artificial Intelligence Review

详情
AI中文摘要

大型语言模型(LLMs)在各个领域取得了强大的实证性能,受益于其庞大的参数量,这些参数存储了知识。然而,LLMs仍然面临几个关键问题,如幻觉问题、知识更新问题以及缺乏领域专业知识。检索增强生成(RAG)的出现,通过利用外部知识库来增强LLMs,缓解了这些限制。本文系统地回顾了RAG技术在自然语言处理(NLP)中的应用,重点在于检索器和检索融合。我们介绍了检索融合的新分类,如基于查询的、基于logits的、潜在的和参数化的融合,并提供了在可访问性、效率和用例方面的结构化比较。本文进一步探讨了RAG在各种NLP任务中的应用,讨论了评估方法和基准限制,并分析了带有和无知识库更新的训练范式。最后,我们探讨了工业部署的考虑因素,并确定了新兴挑战和未来方向,包括安全、效率和基于图的检索。

英文摘要

Large language models (LLMs) have achieved strong empirical performance in various fields, benefiting from their huge amount of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination problems, knowledge update issues, and lacking domain-specific expertise. The appearance of retrieval-augmented generation (RAG), which leverages an external knowledge base to augment LLMs, mitigates these limitations. This paper presents a systematic review of RAG techniques for natural language processing (NLP), with a focus on retrievers and retrieval fusions. We introduce a novel taxonomy of retrieval fusions, such as query-based, logits-based, latent, and parametric fusion, and provide structured comparisons across accessibility, efficiency, and use cases. The paper further examines RAG applications across diverse NLP tasks, discusses evaluation methodologies and benchmark limitations, and analyzes training paradigms with and without knowledge base updates. Finally, we explore industrial deployment considerations and identify emerging challenges and future directions, including security, efficiency, and graph-based retrieval.

2605.19762 2026-05-20 cs.AI cs.CL 版本更新

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

什么真正提升了数学推理:超越纯代码的结构化推理信号

Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou, Enhong Chen

Affiliations * State Key Laboratory of Cognitive Intelligence, University of Science; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Individual Researcher; Zhejiang University, Hangzhou, China

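AI summary Through controlled pretraining experiments, this paper studies how code affects reasoning ability. It finds that standalone executable code mainly improves programming rather than general reasoning and competes with knowledge-intensive tasks, especially complex mathematical reasoning, while structured reasoning traces (such as code-text and math-text mixtures) explain reasoning gains better than pure executable code.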

Comments Accepted by ICML 2026, 22 pages, 10 figures

详情
AI中文摘要

代码已成为现代基础语言模型(LM)训练中的标准组件,但其作用超越编程仍不明确。我们重新审视代码通过控制预训练实验在10T-token语料库上进行细粒度领域分离,发现三个结论。首先,当代码限制为独立可执行程序且Code-NL数据被控制时,代码显著提升编程能力,但不作为通用推理增强器,反而在复杂数学推理中与知识密集型任务竞争。其次,通常归因于代码的推理增益更可能由跨领域结构化推理轨迹(如代码-文本和数学-文本混合)解释,而非纯可执行代码。第三,在固定数学预算内增加结构化数学领域样本密度,能在困难数学推理上获得显著提升,同时基本保持编程性能,表明认知支架提供了一种有针对的缓解跨领域权衡的方法。最后,路由分析显示数据组合效应反映在专家激活模式中,为跨领域竞争和协同作用提供了机制层面的证据。我们的结果澄清了哪些数据特征在能力维度间转移,并指出了更精确的数据导向优化策略。

英文摘要

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

2605.19738 2026-05-20 cs.CL cs.AI 版本更新

TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

TERGAD: 用于图异常检测的结构感知文本增强表示

Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia

发表机构 * College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) School of Computer Science and Information Technology, Adelaide University(阿德莱德大学计算机科学与信息科技学院) School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

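AI summary This paper proposes TERGAD, a data augmentation framework that strengthens graph anomaly detection with the semantic reasoning capabilities of large language models. It translates node-level topological properties into descriptive natural language, then fuses the resulting semantic embeddings with the original node attributes through a gated dual-branch autoencoder to detect anomalous graph entities.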

Comments 14 pages, 5 figures

详情
AI中文摘要

图异常检测(GAD)旨在识别偏离大多数的图实体,如节点、边或子结构。尽管现有文本丰富方法通常通过原始文本特征将结构上下文整合到数据表示流程中,但它们往往忽略了节点的结构上下文。这种局限性阻碍了检测由于节点固有内容与其拓扑角色之间不一致而产生的复杂异常。为此,我们提出TERGAD(用于图异常检测的结构感知文本增强表示),一种新颖的数据增强框架,通过大语言模型(LLMs)的语义推理能力增强GAD的结构语义。具体而言,TERGAD将节点层面的拓扑属性转化为描述性自然语言叙述,随后由LLM处理以获得高阶语义嵌入。这些嵌入随后通过门控双分支自编码器与原始节点属性适配融合,以共同重建图结构和节点特征。通过整合的重建误差计算异常分数,有效捕捉可观测属性和LLM引导的语义期望之间的偏差。在六个真实世界数据集上的广泛实验表明,TERGAD在性能上始终优于最先进的基线。此外,我们的消融研究验证了结构语义指导的不可或缺性和门控融合机制的有效性。代码可在https://github.com/Kantorakitty/TERGAD-main获取。

英文摘要

Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), a novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD-main.

2605.19735 2026-05-20 cs.CL cs.AI 版本更新

ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

ContextRAG: 无提取的分层图构建用于检索增强生成

Roman Prosvirnin, Sergei Kuznetsov, Seungmin Jin

发表机构 * HSE University(俄罗斯高等经济大学)

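AI summary This paper presents ContextRAG, a graph RAG system that builds its graph topology without LLM-based entity or relation extraction, deriving a fuzzy concept graph over chunk embeddings with residual-quantization k-means and Formal Concept Analysis. On a 130-task UltraDomain subset it builds its index with 30 LLM calls and 22,073 tokens and obtains 33.6% F1 overall.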

Comments Preprint. 6 tables

详情
AI中文摘要

图结构的检索增强生成(RAG)系统能够提高多跳问题的答案质量,但许多现有系统依赖大型语言模型(LLMs)在索引过程中提取实体、关系和摘要。这些调用会增加随语料库大小增长的token和时间成本。我们提出了ContextRAG,一种图RAG系统,其图拓扑结构无需LLM进行实体或关系提取。ContextRAG通过残差量化k均值和带有Lukasiewicz残余逻辑的Formal Concept Analysis,在片段嵌入上构建模糊概念图。通过软模糊连接和meet操作诱导桥状和meet衍生的上下文节点,而非LLM生成的图边。在130个任务的UltraDomain子集中,ContextRAG用30次LLM调用和22,073个token构建其索引。相比之下,一个本地HiRAG再现压力测试在20个任务子集上需要870次索引调用和3.54M个token才能在图构建过程中失败;线性外推到130个任务意味着超过23M个索引token。ContextRAG在整体上获得33.6%的F1分数,在多跳任务上获得36.8%的F1分数。激活分析显示,检索到至少一个由lattice衍生节点的前五查询在F1上比未检索到的查询高出+3.9个百分点;这种关联是诊断而非因果的。

英文摘要

Graph-structured retrieval-augmented generation (RAG) systems can improve answer quality on multi-hop questions, but many current systems rely on large language models (LLMs) to extract entities, relations, and summaries during indexing. These calls add token and wall-clock costs that grow with corpus size. We present ContextRAG, a graph RAG system whose graph topology is constructed without LLM-based entity or relation extraction. ContextRAG derives a fuzzy concept graph over chunk embeddings using residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic. Bridge-like and meet-derived context nodes are induced by soft fuzzy join and meet operations, rather than by LLM-written graph edges. On a 130-task UltraDomain subset, ContextRAG builds its index with 30 LLM calls and 22,073 tokens. In contrast, a local HiRAG reproduction stress test required 870 indexing calls and 3.54M tokens on a 20-task subset before failing during graph construction; linear extrapolation to 130 tasks implies over 23M indexing tokens. ContextRAG obtains 33.6% F1 overall and 36.8% F1 on multi-hop tasks. An activation analysis shows that queries retrieving at least one lattice-derived node in the top five achieve +3.9 percentage points F1 over queries that do not; this association is diagnostic rather than causal.
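
The indexing-cost comparison in the abstract rests on a linear extrapolation from a 20-task run to 130 tasks. A small sketch of that arithmetic, using only the figures quoted above; the helper name `extrapolate_linear` and the derived ratios are illustrative and are not reported by the authors. Because the HiRAG run failed during graph construction, its figures are the cost up to the failure, so the extrapolated values are rough estimates.

```python
def extrapolate_linear(observed: float, observed_tasks: int, target_tasks: int) -> float:
    """Scale a cost measured on a subset linearly to a larger number of tasks."""
    return observed / observed_tasks * target_tasks


# Figures quoted in the abstract.
HIRAG_TOKENS_20_TASKS = 3.54e6
HIRAG_CALLS_20_TASKS = 870
CONTEXTRAG_TOKENS_130_TASKS = 22_073
CONTEXTRAG_CALLS_130_TASKS = 30

# Extrapolated HiRAG indexing cost for the 130-task subset.
hirag_tokens_130 = extrapolate_linear(HIRAG_TOKENS_20_TASKS, 20, 130)
hirag_calls_130 = extrapolate_linear(HIRAG_CALLS_20_TASKS, 20, 130)

# Derived ratios (illustrative; not stated in the abstract).
token_ratio = hirag_tokens_130 / CONTEXTRAG_TOKENS_130_TASKS
call_ratio = hirag_calls_130 / CONTEXTRAG_CALLS_130_TASKS

if __name__ == "__main__":
    print(f"HiRAG at 130 tasks: {hirag_tokens_130:,.0f} tokens, {hirag_calls_130:,.0f} calls")
    print(f"token ratio ~{token_ratio:.0f}x, call ratio ~{call_ratio:.1f}x")
```

The extrapolated token count of about 23.0M matches the abstract's "over 23M indexing tokens".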

2605.19723 2026-05-20 cs.CL cs.AI 版本更新

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

大型语言模型中的数学推理:基准测试、架构、评估与开放挑战

Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

Affiliations * School of Electrical Engineering Computer Science, National University of Science; School of Computing, Data Mathematical Sciences, Western Sydney University, Indonesia; Department of Communication, Quality Management Information Systems, Mid Sweden University, Östersund Campus, Sweden

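AI summary This survey synthesizes recent progress in mathematical reasoning with large language models through a structured analysis of datasets, architectures, training strategies, and evaluation protocols, and outlines open research challenges.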

详情
AI中文摘要

数学推理对于教育、科学和工业中的问题解决至关重要,是评估人工智能系统的重要基准。随着大型语言模型(LLMs)推理能力的提升,理解其在数学推理方面的表现变得越来越重要。本文综述通过结构化的数据分析集、架构、训练策略和评估协议,综合了最近在LLMs中的数学推理进展。我们的系统性回顾涵盖了大约120篇同行评审研究和预印本,探讨了该研究领域的演变,并提供了一个统一的分析框架来理解当前的进展和限制。本文特别介绍了一种统一的数学数据集分类法,区分了预训练语料库、监督微调资源和评估基准在不同推理复杂性水平上的差异。本文还系统分析了推理架构和训练策略,包括工具集成、验证器引导推理和参数高效适应,以评估其对推理鲁棒性和泛化能力的影响。此外,现有度量标准的比较评估突显了最终答案准确性与过程级推理验证之间的差距。通过综合这些领域的见解,我们的分析识别了反复出现的失败模式,如推理忠实性问题、基准偏见和泛化限制,并概述了改进符号接地、评估可靠性以及开发更稳健和可信的LLM推理系统的关键研究方向。

英文摘要

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

2605.19718 2026-05-20 cs.CL 版本更新

CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions

CAIT:一种用于儿童-成人互动的句法解析工具包

Francesca Padovani, Xiulin Yang, Bastian Bunzeck, Jaap Jumelet, Yevgen Matusevych, Nathan Schneider, Arianna Bisazza

发表机构 * Center for Language and Cognition (CLCG), University of Groningen(格罗宁根大学语言与认知中心) Georgetown University(乔治城大学) Computational Linguistics, Department of Linguistics, Bielefeld University(比勒菲尔德大学语言学系计算语言学)

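AI summary This paper presents CAIT, a syntactic parsing toolkit tailored to CHILDES data. Its dependency parser, trained on the UD-English-CHILDES treebank, parses child-adult interactions more accurately than off-the-shelf English parsers, and the toolkit also includes a part-of-speech tagger and an utterance-level construction tagger for large-scale, reproducible research on language acquisition.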

详情
AI中文摘要

CHILDES是语言习得研究的重要资源--然而用于分析其句法结构的计算工具仍然有限。利用最近发布的UD-English-CHILDES树库及其黄金标准的通用依赖关系(UD)标注,我们训练了一个最先进的依赖解析器,专门针对CHILDES进行优化。该解析器更准确地捕捉了儿童-成人互动中的句法模式,优于广泛使用的现成英语解析器,包括SpaCy和Stanza。除了解析器外,我们还发布了词性标注器和句子级构造标注器,这些工具共同构成了开放源代码的儿童-成人互动句法解析工具包(CAIT)。通过详细的错误分析和一个跟踪CHILDES中句法构造在发展时间分布的案例研究,我们展示了该工具包在语言习得大规模、可重复研究中的实用价值。

英文摘要

CHILDES is a paramount resource for language acquisition studies -- yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child--adult interactions, outperforming widely used off-the-shelf English parsers, including SpaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Parsing Toolkit for Child--Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.

2605.19714 2026-05-20 cs.CL 版本更新

LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

基于大型语言模型的阿拉伯语金融情绪分析:来自沙特市场的证据

Mona H. Albaqawi, Eman M. Albalkhi, Joud A. Albaiti, Enrico Lopedoto

Affiliations * George Mason University; Damascus University; University of Jeddah; City, St George's, University of London

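AI summary This paper presents an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market. It combines official financial news and social media data, builds an Arabic financial corpus through a multi-stage pipeline, links company mentions to canonical identifiers using Transformer-based NER with a curated company lexicon, and supports company-level sentiment aggregation and analysis of sentiment dynamics.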

Comments Accepted at the 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7), co-located with LREC 2026, Palma de Mallorca, Spain, May 2026. ISBN: 978-2-493814-52-4

详情
AI中文摘要

投资者情绪塑造金融市场,然而在阿拉伯语财务语境中建模情绪仍然具有挑战性,因为语言复杂性和资源有限。我们提出了一种阿拉伯语NLP框架,用于大规模金融情绪分析,专门针对沙特市场,整合官方财务新闻和社会媒体以捕捉机构和公众投资者情绪。该框架通过多阶段流程构建大规模阿拉伯语财务语料库,包括数据收集、清洗、去重、实体链接和情绪标注。基于Transformer的NER结合定制公司词典将文本提及链接到标准公司标识符,情绪标签通过五类方案分配。最终的84,000样本数据集支持公司层面的情绪聚合和情绪动态分析,相对沙特交易所的股市行为。实验结果表明阿拉伯语金融情绪分析具有可靠性和可扩展性。

英文摘要

Investor sentiment shapes financial markets, yet modeling sentiment in Arabic financial contexts remains challenging due to linguistic complexity and limited resources. We present an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market, integrating official financial news and social media to capture institutional and public investor sentiment. The framework constructs a large Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation. Transformer-based NER combined with a curated company lexicon links textual mentions to canonical company identifiers, with sentiment labels assigned using a five-class scheme. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. Experimental results demonstrate reliable and scalable Arabic financial sentiment analysis.

2605.19711 2026-05-20 cs.CL 版本更新

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

大型语言模型能否可靠地纠正低资源语音识别中的错误?一项考虑数据污染的西弗里西语案例研究

Yun Hao, Reihaneh Amooie, Wietse de Vries, Rik van Noord, Martijn Wieling

Affiliations * University of Groningen

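AI summary This study examines whether LLM-based generative error correction (GER) improves automatic speech recognition (ASR) for the low-resource language West Frisian. GER improves ASR performance in most settings, comparable gains on a non-public offline dataset suggest the improvements are not an artifact of data contamination, and a detailed error analysis reveals the models' correction patterns.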

Comments Submitted to Interspeech 2026

详情
AI中文摘要

自动语音识别(ASR)近年来有了显著进步,但在低资源语言上性能仍有限。大型语言模型(LLMs)通过生成性错误纠正(GER)展示了提升ASR的潜力,但其在低资源环境中的有效性尚不明确。此外,数据污染对LLM基于GER的改进程度的影响仍不清楚。本研究调查了LLM基于GER在低资源西弗里西语中的应用。除了公开语料库外,我们构建并使用了一个包含非公开文本的西弗里西语离线数据集进行评估,以控制潜在的数据污染。结果表明,GER在大多数设置中提升了ASR性能,最佳的GPT-5.1结果超过了Oracle WERs。在离线数据集上的可比增益表明,改进反映了真正的纠正能力。我们进一步提供了详细的错误分析,揭示了模型的纠正模式。

英文摘要

Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.
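
The results above are stated in terms of word error rate (WER) and oracle WER. A minimal sketch of both metrics, assuming the common definitions: WER is the word-level edit distance divided by the reference length, and oracle WER is the lowest WER among the hypotheses in an n-best list. The abstract does not define either term, and the example sentences are hypothetical English stand-ins rather than data from the study.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one hypothesis against one reference."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def oracle_wer(reference: str, nbest: list[str]) -> float:
    """Lowest WER achievable by picking the best hypothesis from an n-best list."""
    return min(wer(reference, hyp) for hyp in nbest)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    nbest = ["the cat sat on mat", "a cat sat on the mat today"]
    print(wer(reference, nbest[0]), oracle_wer(reference, nbest))
```

Under these definitions, a corrected transcript can beat the oracle WER only by producing words that appear in none of the n-best hypotheses, which is why that result is notable.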

2605.19660 2026-05-20 cs.LG cs.CL 版本更新

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

OScaR:LLMs及更广泛场景中的极压缩KV缓存量化之奥卡姆之刀

Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong

Affiliations * Tsinghua University; Meituan LongCat Team; The University of Hong Kong; The University of Edinburgh; UCAS; The Hong Kong Polytechnic University

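AI summary To address the loss of quantization fidelity under extreme KV cache compression in LLMs, this paper proposes OScaR, a framework that mitigates Token Norm Imbalance through Canalized Rotation followed by Omni-Token Scaling. It achieves near-lossless performance under INT2 quantization while improving decoding speed and throughput.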

Comments Under review

详情
AI中文摘要

快速发展的长上下文推理和多模态智能使Key-Value(KV)缓存的内存占用成为高效部署的主要内存瓶颈。虽然已建立的每通道量化方法能有效处理Key张量中的固有通道级异常值,但在极端压缩下其效果下降。本文从经验和理论角度重新审视每通道量化范式的固有限制。我们的分析指出Token Norm Imbalance(TNI)是量化保真度的主要瓶颈。我们证明当共享量化参数需要覆盖具有显著范数差异的token组时,TNI会系统性地放大误差。而不是依赖复杂的量化流水线(如TurboQuant),我们提出了OScaR(Omni-Scaled Canalized Rotation),一种适用于X-LLMs(即纯文本、多模态和全模态LLMs)的准确且轻量的KV缓存压缩框架。在推进每通道范式的基础上,OScaR通过Canalized Rotation后接Omni-Token Scaling,有效且高效地缓解TNI引起的序列维度方差,进一步通过优化的系统设计和CUDA内核支持。在X-LLMs上的广泛评估显示,OScaR在INT2量化下实现了近无损性能,优于现有方法,确立了其作为稳健、低复杂度和通用的框架,定义了新的帕累托前沿。与BF16 FlashDecoding-v2基线相比,我们的OScaR实现解码速度提升达3.0倍,内存占用减少5.3倍,吞吐量增加4.1倍。OScaR的代码在https://github.com/ZunhaiSu/OScaR-KV-Quant公开。

英文摘要

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves a notable up to 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
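
The Token Norm Imbalance argument can be illustrated with a toy example: when one token has a much larger norm than the others, per-channel quantization parameters shared across tokens are stretched to cover it, and the small-norm tokens collapse onto a few levels. The sketch below is a simplified stand-in and not the OScaR method. It omits the rotation step entirely, and it uses plain per-token L2 normalization where the paper uses Omni-Token Scaling. The `keys` values are hypothetical.

```python
import math

LEVELS = 4  # INT2 gives 2**2 quantization levels


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def quantize_per_channel(rows):
    """Min-max uniform quantization with one (min, scale) pair per channel,
    shared by all tokens. Returns the dequantized values."""
    n_channels = len(rows[0])
    out = [[0.0] * n_channels for _ in rows]
    for c in range(n_channels):
        column = [row[c] for row in rows]
        lo, hi = min(column), max(column)
        scale = (hi - lo) / (LEVELS - 1) or 1.0  # avoid division by zero
        for t, x in enumerate(column):
            q = min(LEVELS - 1, max(0, math.floor((x - lo) / scale + 0.5)))
            out[t][c] = lo + q * scale
    return out


def quantize_with_token_scaling(rows):
    """Normalize each token to unit norm, quantize per channel, then rescale.
    The per-token norms would have to be stored alongside the INT2 codes."""
    norms = [l2_norm(row) for row in rows]
    unit = [[x / n for x in row] for row, n in zip(rows, norms)]
    dequantized = quantize_per_channel(unit)
    return [[x * n for x in row] for row, n in zip(dequantized, norms)]


def mean_relative_error(rows, approx):
    errors = [l2_norm([a - b for a, b in zip(row, rec)]) / l2_norm(row)
              for row, rec in zip(rows, approx)]
    return sum(errors) / len(errors)


# Hypothetical Key tensor: 4 tokens x 4 channels, token 0 has a far larger norm.
keys = [
    [10.0, -10.0, 10.0, -10.0],
    [0.5, 0.3, -0.4, 0.2],
    [-0.3, 0.6, 0.1, -0.5],
    [0.2, -0.6, 0.5, 0.4],
]

baseline_error = mean_relative_error(keys, quantize_per_channel(keys))
scaled_error = mean_relative_error(keys, quantize_with_token_scaling(keys))

if __name__ == "__main__":
    print(f"shared per-channel parameters: {baseline_error:.3f}")
    print(f"with per-token scaling:        {scaled_error:.3f}")
```

On this toy input, equalizing token norms before quantization lowers the mean relative reconstruction error substantially. The trade-off is the extra per-token scale that must be stored alongside the quantized codes.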

2605.19645 2026-05-20 cs.CL 版本更新

K-Quantization and its Impact on Output Performance

K-量化及其对输出性能的影响

Robin Baki Davidsson, Pierre Nugues

发表机构 * Lund University(隆德大学)

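AI summary This paper examines how different quantization levels affect the performance and accuracy of eight LLMs on MMLU-Pro, CRUXEval, and MuSR. Higher precision (e.g., 8-bit Q8_0) yields better performance with diminishing returns, aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy although some models lose substantially, and the impact varies across models and tasks.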

Comments 13 pages, 4 figures

详情
AI中文摘要

近年来,大型语言模型(LLMs)在许多自然语言处理(NLP)任务中展现出显著能力。然而,其庞大的规模常常给部署带来挑战。这需要高效的模型压缩技术,量化作为一种重要的解决方案。尽管量化具有诸多优势,但其对LLMs性能和准确性的确切影响仍然是一个活跃的研究领域。本文研究了八个LLMs在不同量化级别下的性能,重点考察了MMLU-Pro(知识处理和推理)、CRUXEval(代码理解)和MuSR(阅读理解)等任务。我们的结果表明,更高的精度(例如8位Q8_0)能带来更好的性能,但边际效益逐渐降低。激进的量化(例如2位Q2_K)通常能保持可接受的准确性,尽管某些模型会显著损失性能。我们的发现表明,虽然较低的位精度通常会降低性能,但不同模型和任务的响应差异显著。较大的模型对激进量化表现出更大的韧性,但仍然会在较低精度下经历显著下降。7-9十亿参数范围的中等大小模型在效率和资源使用之间取得了最佳平衡。这些结果为模型大小、量化和性能之间的权衡提供了见解。

英文摘要

Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP tasks. However, their substantial size often presents challenges for deployment. This necessitates efficient techniques for model compression, with quantization emerging as a prominent solution. Despite its benefits, the exact impact of quantization (from 2- to 6-bit) on the performance and accuracy of LLMs remains an active area of research. This paper investigates the performance of eight LLMs at various quantization levels, focusing on tasks such as MMLU-Pro for knowledge processing and reasoning, CRUXEval for code comprehension, and MuSR for reading comprehension. Our results show a consistent trend where higher precision (e.g., 8-bit Q8_0) yields improved performance, albeit with diminishing returns. Aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy, though some models show a substantial loss in performance. Our findings indicate that while lower bit precision generally reduces performance, the impact varies across models and tasks. Larger models show greater resilience to aggressive quantization, but can still undergo significant drops at lower precision levels. Mid-sized models in the 7-9 billion parameter range strike an optimal balance between efficiency and resource usage. Such results provide insights into the trade-offs between model size, quantization, and performance.

2605.19633 2026-05-20 cs.CL cs.AI cs.LG cs.NE cs.SE 版本更新

optimize_anything: A Universal API for Optimizing any Text Parameter

optimize_anything: 一个用于优化任何文本参数的通用API

Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia

发表机构 * MIT(麻省理工学院)

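AI summary This paper presents a general-purpose LLM-based optimization system that optimizes text artifacts across different domains. It achieves state-of-the-art results on six diverse tasks and supports single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs.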

Comments 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

详情
Journal ref
Proceedings of the ACM Conference on AI and Agentic Systems (CAIS 26), May 26-29, 2026, San Jose, CA, USA
AI中文摘要

能否一个基于LLM的优化系统在根本不同的领域中匹配专门工具?我们证明当优化问题被表述为改进一个通过评分函数评估的文本工件时,一个基于AI的优化系统—支持单任务搜索、多任务搜索和跨问题迁移以及对未见过的输入进行泛化—在六个不同的任务中实现了state-of-the-art的结果。我们的系统发现了将Gemini Flash的ARC-AGI准确性几乎提高三倍的代理架构(32.5%到89.5%),发现了将云成本降低40%的调度算法,生成了87%匹配或超过PyTorch的CUDA内核,并优于AlphaEvolve报告的圆圈打包解决方案(n=26)。在三个领域的消融研究揭示了可操作的侧信息比仅评分反馈更快收敛且最终得分更高,且多任务搜索在同等问题预算下通过跨任务迁移优于独立优化。共同,我们首次展示了基于LLM搜索的文本优化是一种通用问题解决范式,将传统需要领域特定算法的任务统一到一个框架下。我们开源了optimize_anything,并支持多个后端作为GEPA项目的一部分,在https://github.com/gepa-ai/gepa上。

英文摘要

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.

2605.19597 2026-05-20 cs.CL 版本更新

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

LLMEval-Logic: 一个验证求解器的中文逻辑推理基准,具有对抗性强化

Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang

发表机构 * Institute of Trustworthy Embodied Artificial Intelligence(可信具身人工智能研究院) Fudan University(复旦大学) Hunyuan Team Tencent(腾讯 Hunyuan 团队) School of Philosophy Fudan University(复旦大学哲学学院)

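AI summary This paper presents LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Natural-language items and their reference formalizations are forward-authored and expert-audited, annotated answers are verified with Z3, expert rubrics support natural-to-formal grading, and selected items are hardened through a closed-loop adversarial workflow. The benchmark contains a 246-item Base subset and a 190-item Hard subset, and an evaluation of 14 frontier LLMs reveals substantial gaps in current models.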

详情
AI中文摘要

评估大型语言模型(LLMs)在自然语言逻辑推理上的能力至关重要,因为规则主导的任务要求结论必须严格基于陈述的前提。许多现有的逻辑推理基准是通过从采样的公式中模板化自然语言项目生成的,仅提供粗糙或未经审核的形式注释,现在很快被前沿推理模型饱和。我们提出了LLMEval-Logic,一个基于真实情境场景的中文逻辑推理基准。其流程包括作者和专家共同审核自然语言项目及其参考形式化,利用Z3验证注释答案,构建自然到形式的评分标准,并通过闭环对抗流程强化选定项目。该基准发布在两个配对子集中:一个包含246个项目的基础子集,附带1,400个专家开发的评分原子,以及一个包含190个项目的难度子集,包含938个多步骤子问题,覆盖封闭模型空间。在LLMEval-Logic上评估14个前沿LLM揭示了当前模型的显著差距:最佳模型仅达到37.5%的难度项目准确率,即使使用参考符号,评估模型中最高的联合Z3+评分形式化得分也仅为60.16%。我们的基准在https://github.com/llmeval/LLMEval-Logic上公开可用。

英文摘要

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.

2605.19577 2026-05-20 cs.CL 版本更新

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

GoLongRL: 以能力为导向的长上下文强化学习与多任务对齐

Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, jian Liang, Ruiming Tang, Han Li

发表机构 * Kuaishou Technology(快手科技) University of Chinese Academy of Sciences(中国科学院大学)

AI summary This paper presents GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). It releases a dataset of 23K RLVR samples spanning 9 task types together with the construction pipeline and training code, and proposes TMN-Reweight, which combines task-level mean normalization with difficulty-adaptive weighting to handle heterogeneous multitask rewards.

详情
AI中文摘要

我们提出了GoLongRL,一种完全开源、以能力为导向的长上下文强化学习后训练配方,用于可验证奖励(RLVR)。现有长上下文强化学习方法往往将数据构建视为设计越来越复杂的检索路径,导致任务覆盖同质化和奖励形式无法充分反映实际长上下文需求。我们的工作提供了两个贡献。(1)以能力为导向的数据构建并完全开源。我们公开发布了一个包含23,000个RLVR样本的数据集、完整的构建流程和所有训练代码。基于长上下文能力的分类学,数据集涵盖9种任务类型,每种任务类型都配有其自然评估指标。它包含从现有语料库中精心挑选的开源样本和合成样本,其问答对是从真实源文档如书籍、学术论文和多轮对话中生成的。在相同的 vanilla GRPO 设置下,我们的数据集单独优于闭源的 QwenLong-L1.5 数据集。此外,我们的 Qwen3-30B-A3B 模型在该数据上训练后,长上下文性能与 DeepSeek-R1-0528 和 Qwen3-235B-A22B-Thinking-2507 相当,表明更广泛的覆盖和更大的奖励多样性显著有助于长上下文能力的提升。(2)TMN-Reweight 用于异构多任务优化。为了解决异构奖励带来的优化挑战,我们提出了 TMN-Reweight,它结合了任务层面的均值归一化以实现跨任务奖励尺度对齐,以及难度自适应加权以获得更可靠的优势估计。TMN-Reweight 进一步在 vanilla GRPO 上提高了平均性能,在报告的评估中,通用能力得以保持或提升。

英文摘要

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
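As a rough illustration of the task-level mean normalization idea, the sketch below rescales rewards from two tasks with different native scales and then computes GRPO-style group advantages. The abstract gives no formulas, so the function names, the divide-by-task-mean rule, and the toy rewards are all assumptions, and the difficulty-adaptive weighting is left out because its form is not described.

```python
from statistics import mean, pstdev

def normalize_by_task_mean(rewards_by_task):
    """Rescale each task's rewards by that task's mean reward (assumed rule)."""
    normalized = {}
    for task, rewards in rewards_by_task.items():
        task_mean = mean(rewards)
        # Guard against an all-zero task; leave such rewards unchanged.
        scale = task_mean if task_mean > 0 else 1.0
        normalized[task] = [r / scale for r in rewards]
    return normalized

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: center and scale rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy rewards (hypothetical): an F1-like metric in [0, 1] and a score in [0, 100].
rewards_by_task = {
    "qa_f1": [0.2, 0.4, 0.6],
    "summary_score": [20.0, 40.0, 60.0],
}
aligned = normalize_by_task_mean(rewards_by_task)
```

After normalization both toy tasks have mean reward 1, so neither dominates a shared batch purely because of its metric's scale.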

2605.19576 2026-05-20 cs.AI cs.CL cs.SE 版本更新

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

库漂移:在自我演化的LLM技能库中诊断和修复一种无声的失败模式

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

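Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary: This paper studies library drift, a silent failure mode in self-evolving LLM skill libraries. Through a reproducible trigger experiment, trace-level diagnostics, and a verified fix, it shows that unmanaged skill accumulation causes retrieval degradation, false-positive injections, and performance stagnation, and it proposes a validated fix that substantially improves skill-library performance.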

详情
英文摘要

Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom: LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp (SkillsBench). Yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, where one disables skill injection (flat floor, +0.002) and one imposes premature retirement (active harm, -0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.
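To make the governance recipe more concrete, here is a minimal sketch of outcome-driven retirement combined with a bounded active cap. The `Skill` class, the `govern` function, the thresholds, and the `min_trials` guard against premature retirement are hypothetical choices for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    # Per-use contribution scores, as an evidence log might record them.
    contributions: list = field(default_factory=list)

    def score(self):
        """Mean observed contribution; 0.0 when the skill is still untested."""
        if not self.contributions:
            return 0.0
        return sum(self.contributions) / len(self.contributions)

def govern(skills, active_cap, retire_below=0.0, min_trials=3):
    """Retire skills with enough evidence of harm, then keep the top `active_cap`."""
    survivors = [
        s for s in skills
        # Skills with fewer than `min_trials` uses are never retired (assumed guard).
        if len(s.contributions) < min_trials or s.score() >= retire_below
    ]
    survivors.sort(key=lambda s: s.score(), reverse=True)
    return survivors[:active_cap]

library = [
    Skill("a", [0.2, 0.3, 0.1]),
    Skill("b", [-0.1, -0.2, -0.1]),  # consistently harmful: retired
    Skill("c"),                      # untested: kept, but ranked by score 0.0
    Skill("d", [0.5, 0.4, 0.6]),
]
active = [s.name for s in govern(library, active_cap=2)]
```

With a cap of 2 the active set is `d` then `a`; raising the cap to 3 also admits the untested skill `c`.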

2605.19568 2026-05-20 cs.CL 版本更新

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

m3BERT: 一种现代的、多语言的、俄罗斯套娃双向编码器

Yaoxiang Wang, Simiao Zuo, Qingguo Hu, Yucheng Ding, Yeyun Gong, Jian Jiao, Jinsong Su

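Affiliations: Xiamen University; Microsoft; Shanghai Jiao Tong University

AI summary: This paper proposes m3BERT, a modern, multilingual Matryoshka bidirectional encoder. By jointly optimizing representations across transformer layers and multiple embedding dimensions, it addresses the poor adaptability of existing pretrained models to different deployment scenarios and demonstrates efficiency and practicality in industrial retrieval.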

Comments KDD 2026

详情
AI中文摘要

嵌入模型在工业信息检索系统中至关重要,如搜索和广告。然而,现有预训练模型通常具有固定架构和嵌入维度,这在适应具有不同业务驱动约束的多样化部署场景时带来了显著挑战。一种常见做法是在资源受限任务中通过部分参数初始化从更大预训练模型进行微调。这种方法往往效果不佳,因为预训练和下游使用之间的不匹配阻碍了预训练优势的完全实现。为了解决这一限制,我们引入了m3BERT:一种现代的、多语言的、俄罗斯套娃双向编码器,其特征是新颖的预训练策略,联合优化Transformer层和多个嵌入维度的表示。这使得单个模型能够针对不同的资源和准确率目标进行定制,同时保持与预训练的一致性。结合最近的架构改进,m3BERT采用三阶段预训练:单语预训练、多语适应以服务多样化用户群体,以及在大规模网络领域语料库上进行关键的持续预训练以增强商业检索中的实用性。m3BERT在Bing-Click大型工业检索数据集上显著优于现有最先进的嵌入模型,展示了其作为高效基础的实用性和适应性,用于资源感知的工业检索系统。进一步在公共数据集上的实验也证实了我们多粒度俄罗斯套娃预训练策略的通用有效性。

英文摘要

Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models in Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.
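The practical appeal of Matryoshka-style embeddings is that one vector can be used at several sizes by keeping only its leading dimensions. The sketch below shows that usage pattern with toy vectors; the helper names and the vectors are assumptions, and it does not reproduce m3BERT's pretraining objective.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine_at_dims(a, b, dims):
    """Cosine similarity between two embeddings at each truncated size."""
    result = {}
    for d in dims:
        ta = truncate_and_normalize(a, d)
        tb = truncate_and_normalize(b, d)
        result[d] = sum(x * y for x, y in zip(ta, tb))
    return result

# Toy 4-dimensional embeddings (hypothetical values).
query = [1.0, 0.0, 0.0, 0.0]
doc = [1.0, 0.0, 1.0, 0.0]
similarities = cosine_at_dims(query, doc, dims=[2, 4])
```

A deployment with a tight memory budget would index the 2-dimensional prefix, while a higher-accuracy tier could use all 4 dimensions of the same stored vector.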

2605.19523 2026-05-20 cs.CL cs.AI cs.CV 版本更新

Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

探究跨模态技能注入:场景、方法与超参数

Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

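Affiliations: State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; WeChat AI, Tencent Inc., China; The University of Hong Kong

AI summary: This paper studies how cross-modal skill injection behaves across scenarios and analyzes the effect of methods and hyperparameters. It finds that injection works well for instruction following and cross-lingual tasks but struggles with mathematical reasoning, and that classic methods such as TA and DARE outperform other merging methods.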

详情
AI中文摘要

视觉-语言模型(VLMs)在一般多模态理解方面表现出色;然而,它们在高效获取持续演化的领域特定技能方面存在困难。传统增强VLM能力的方法,如监督微调(SFT),需要大量的数据集整理和大量的计算资源。模型合并作为一种高效的替代方法,能够将领域专家的LLM专业知识转移到VLMs上,而无需额外的数据集要求或显著的计算开销。与传统合并同质LLM的方法不同,跨模态技能注入旨在通过将领域专家LLM整合到VLM中来诱导出新的跨模态能力。然而,现有研究缺乏对跨模态技能注入的适用性和方法的系统分析。在本研究中,我们从三个主要方面探讨了跨模态技能注入:场景、方法和超参数。在场景方面,我们发现跨模态技能注入在指令遵循和跨语言设置中表现良好,但在数学推理中表现不佳。在方法方面,我们发现经典方法如TA和DARE在性能上优于其他融合方法。我们还提供了这些经典方法所依赖的超参数调优的系统和定量分析。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.
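For readers unfamiliar with the two classic methods named above, the following sketch shows task arithmetic (TA) and DARE on flat lists of weights. It assumes the expert LLM and the VLM's language backbone were derived from the same base LLM with identical parameter shapes; the function names, the scaling coefficient `lam`, and the toy weights are illustrative assumptions rather than the paper's code.

```python
import random

def task_vector(expert, base):
    """Task arithmetic: the skill is the expert's parameter delta from the base."""
    return [e - b for e, b in zip(expert, base)]

def dare(delta, drop_prob, rng):
    """DARE: randomly drop delta entries, rescale survivors by 1 / (1 - drop_prob)."""
    keep_prob = 1.0 - drop_prob
    return [d / keep_prob if rng.random() >= drop_prob else 0.0 for d in delta]

def inject(vlm_backbone, delta, lam):
    """Add the scaled delta to the VLM's language-backbone weights."""
    return [w + lam * d for w, d in zip(vlm_backbone, delta)]

base_llm = [1.0, 2.0, 3.0, 4.0]
expert_llm = [1.5, 2.0, 2.5, 4.5]
vlm_backbone = [1.25, 2.25, 3.0, 4.0]

delta = task_vector(expert_llm, base_llm)
sparse_delta = dare(delta, drop_prob=0.5, rng=random.Random(0))
merged = inject(vlm_backbone, sparse_delta, lam=0.5)
```

Both `lam` and `drop_prob` are the kind of hyperparameters the abstract says these methods depend on, so in practice they would be tuned per scenario.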

2605.19516 2026-05-20 cs.CL cs.AI cs.LG 版本更新

Base Models Look Human To AI Detectors

基础模型对AI检测器看起来很像人类

Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang, J. Zico Kolter

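Affiliations: Carnegie Mellon University

AI summary: This study finds that text generated by base models is often misjudged as human-written by AI detectors. It proposes HIP, which improves detector evasion through iterative paraphrasing, and shows that current detectors track instruction tuning and local context more than any general property of machine-generated text.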

Comments 39 pages, 9 figures

详情
AI中文摘要

随着AI生成文本在现实世界大规模应用,机构越来越多地使用商业AI文本检测器,尤其是在教育和学术诚信流程中。我们报告了一个令人惊讶的经验发现:当用GPTZero和Pangram评估时,基础模型生成的文本往往被判断为高度人类化,而经过指令调优的模型生成的文本则不具有这种特性。基于这一观察,我们提出了Humanization by Iterative Paraphrasing (HIP),一种不依赖特定检测器的管道,它最小化地微调基础模型为改写器并迭代应用。与我们测试的基线相比,HIP在商业检测器上实现了更好的语义保留与检测器规避的平衡。在Llama-3和Qwen-3系列模型中,从0.6B到70B的不同规模上,HIP始终提高了检测器的人类化程度。我们的发现表明,当前检测器更关注指令调优和局部上下文而非任何通用机器生成文本的不变特征。这反过来要求检测器设计更明确地建模这些因素。

英文摘要

As AI-generated text enters the real world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, text generated by base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector human-likeness. Our findings suggest that current detectors are tracking artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.

2605.19470 2026-05-20 cs.CL cs.LG 版本更新

Drifting Objectives for Refining Discrete Diffusion Language Models

漂移目标用于细化离散扩散语言模型

Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

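Affiliations: Institute of Science Tokyo; AIST; NII LLMC

AI summary: This paper studies how to apply drifting methods to discrete diffusion language models. It introduces the TokenDrift objective, which lifts categorical predictions to soft-token features and applies anti-symmetric drifting in a frozen semantic space, improving generation quality.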

Comments Project page: https://daioba.github.io/tokendrift/

详情
AI中文摘要

离散扩散语言模型(DDLMs)通过迭代去噪类别令牌序列生成文本,而近期针对连续生成器的漂移方法表明,部分采样时间的修正可以通过反称固定点目标在训练中吸收。我们研究如何将这一原理转移到DDLMs中,其中主要挑战是与离散文本的接口:硬令牌样本不可微,类别预测不直接提供连续样本进行漂移。我们提出了TokenDrift,一种漂移目标,将类别预测提升为软令牌特征,在冻结的语义空间中应用反称漂移,并将由此产生的stop-gradient特征目标反向传播到DDLM的logits中。在受控的持续训练实验中,使用掩码和均匀状态扩散基础架构,TokenDrift在匹配的延续基线之上提升了固定NFE生成质量,在MDLM上将Gen.-PPL在4 NFEs时降低了89%,在DUO上降低了86%。这些结果表明,漂移可以为DDLMs提供实用的细化目标。

英文摘要

Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.
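The lifting step can be pictured as replacing a hard token sample with the probability-weighted average of token embeddings, which stays differentiable with respect to the logits. The sketch below shows only that step; treating the frozen semantic space as a fixed embedding table is an assumption, and the drifting objective itself is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_token_feature(logits, embedding_table):
    """Expected embedding under the categorical prediction (a 'soft token')."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[j] for p, row in zip(probs, embedding_table))
        for j in range(dim)
    ]

# Toy vocabulary of two tokens with 2-dimensional frozen embeddings (hypothetical).
embedding_table = [[1.0, 0.0], [0.0, 1.0]]
uncertain = soft_token_feature([0.0, 0.0], embedding_table)
confident = soft_token_feature([10.0, -10.0], embedding_table)
```

An uncertain prediction lands between the two embeddings, while a confident one sits almost exactly on the embedding of the preferred token.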

2605.19436 2026-05-20 cs.LG cs.CL cs.CV 版本更新

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

CEPO: 使用对比证据策略优化进行RLVR自蒸馏

Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan

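Affiliations: MBZUAI; Linköping University; Australian National University

AI summary: This paper proposes CEPO, which uses contrastive evidence policy optimization to address the problems of self-distillation in RLVR, improving model performance by distinguishing decisive reasoning steps from filler.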

Comments 9 pages

详情
AI中文摘要

当模型在强化学习中产生正确解时,每个token都会收到相同的奖励信号,无论其是关键推理步骤还是语法填充。一种自然的解决方法是将模型条件化为正确的答案作为教师,识别出模型在知道答案时会生成不同的token。先前的工作表明,这种方法要么通过泄露答案到梯度而破坏训练,要么产生弱信号,无法区分关键步骤和填充内容,因为两者在模型基线下看起来同样令人惊讶。我们提出对比证据策略优化(CEPO),在每个token上提出更尖锐的问题:不仅“正确答案是否偏好此token?”而且“正确答案是否偏好它,而错误答案是否厌恶它?”满足两者的是真正的推理步骤;不满足的是填充内容。错误答案的教师是从训练批次中已有的拒绝rollouts构造的,不增加额外的采样成本。我们证明CEPO继承了先前最先进状态下的所有结构安全保证,同时在关键token上严格提高信用,改进在填充位置恰好消失。实验表明,CEPO在五个多模态数学推理基准上分别达到43.43%和60.56%的平均准确率(在2B和4B规模下),而GRPO在相同训练预算下为41.17%和57.43%。分布匹配自蒸馏方法(OPSD、SDPO)在未训练基线下表现低于,实验证实了我们的理论预测的信息泄漏。我们的代码可在https://github.com/ahmedheakl/CEPO上获得。

英文摘要

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.
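The two-sided question CEPO asks at each token can be illustrated with a toy classifier over three log-probabilities. This is only a reading of the abstract's wording: comparing each teacher against the unconditioned model, the function name, and the "ambiguous" label for tokens that satisfy just one condition are assumptions, not the paper's estimator.

```python
def token_evidence(logp_base, logp_given_correct, logp_given_wrong):
    """Classify one token by whether the two teachers pull in opposite directions."""
    favored_by_correct = logp_given_correct > logp_base
    disfavored_by_wrong = logp_given_wrong < logp_base
    if favored_by_correct and disfavored_by_wrong:
        return "decisive"
    if not favored_by_correct and not disfavored_by_wrong:
        return "filler"
    # Only one condition holds; the abstract does not say how this case is treated.
    return "ambiguous"

# Hypothetical log-probabilities for three tokens.
labels = [
    token_evidence(-2.0, -0.5, -4.0),  # correct answer raises it, wrong answer lowers it
    token_evidence(-1.0, -1.0, -1.0),  # neither teacher moves it
    token_evidence(-2.0, -0.5, -0.5),  # both teachers raise it
]
```

The third case shows why a one-sided signal is weak: a token that any answer makes more likely carries little evidence about which answer is right.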

2605.19433 2026-05-20 cs.CL cs.AI 版本更新

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

在偏离时回溯:缓解大语言模型推理蒸馏中的双重暴露偏差

Bing Wang, Shaotian Yan, Chen Shen, Kaiyuan Liu, Sinan Fan, Ximing Li, Rui Miao, Xiaosong Yuan, Zhanming Shen, Jieping Ye

Affiliations: College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering, MoE, Jilin University; Tongyi Lab, Alibaba Group; College of Computer Science and Technology, Zhejiang University; School of Artificial Intelligence, Jilin University

AI summary: This paper proposes MOTAB, a new LLM reasoning distillation method that dynamically monitors the student model's generation and backtracks when it strays beyond a safety boundary. This mitigates the dual exposure bias that conventional distillation suffers from the mismatch between training distributions and inference contexts, and improves reasoning performance.

Comments 26 pages, 8 figures

详情
AI中文摘要

大型语言模型(LLMs)通过长链思考(CoT)在复杂推理任务中取得了显著成功,但其巨大的计算开销阻碍了实际应用。LLM推理蒸馏通过将推理能力从强大的教师模型转移到紧凑的学生模型来解决这一问题。然而,现有蒸馏方法面临根本性的困境。典型的离线蒸馏严格利用教师生成的黄金轨迹,由于训练分布与学生生成的推理上下文不匹配,导致长链CoT推理中出现错误级联。为了解决这一问题,在线蒸馏允许学生探索自己的轨迹,但我们证明这会引入相互的反向暴露偏差:当学生生成次优上下文时,教师模型也难以提供积极指导。为了解决这一双重暴露偏差问题,我们提出监控轨迹并在偏离时回溯(MOTAB)新的LLM推理蒸馏流程。具体而言,MOTAB动态监控学生在线生成过程,对照自适应的安全边界。当生成偏离并超过此阈值时,MOTAB回溯到上一个安全状态,并利用教师干预来纠正方向。这种方法本质上可以容忍少量学生错误以缓解暴露偏差,同时防止次优上下文以避免反向暴露偏差。在LIMO-v2和AceReason数据集上的广泛实验表明,MOTAB有效缓解了双重暴露偏差,使推理任务的平均性能提高了约3%。

英文摘要

Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive guidance when conditioned on student-generated sub-optimal contexts. To resolve this dual exposure biases problem, we propose Monitoring Trajectories and Backtracking when it strays (MOTAB), a new LLM reasoning distillation pipeline. Specifically, MOTAB dynamically monitors the student's on-policy generation against an adaptive safety boundary. When the generation strays and exceeds this threshold, MOTAB backtracks to the last safe state and leverages teacher intervention to correct the course. This approach inherently tolerates minor student errors to mitigate exposure bias, while preventing sub-optimal contexts to circumvent reversed exposure bias. Extensive experiments on the LIMO-v2 and AceReason datasets demonstrate that MOTAB effectively alleviates the dual exposure biases, yielding a roughly 3% average performance improvement in reasoning tasks.
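The monitor-and-backtrack control flow can be sketched as a simple loop in which the student proposes each step and the teacher takes over only when a deviation score crosses a threshold. Everything concrete here is a stand-in: the callables, the fixed threshold (the paper describes an adaptive boundary), and the toy deviation rule are assumptions for illustration.

```python
def distill_rollout(student_step, teacher_step, deviation, threshold, max_steps):
    """Build a trajectory from student steps, with teacher intervention on strays."""
    trajectory = []
    interventions = 0
    for _ in range(max_steps):
        candidate = student_step(trajectory)
        if deviation(trajectory, candidate) > threshold:
            # Backtrack: drop the stray step, keep the last safe prefix,
            # and let the teacher continue from that prefix.
            candidate = teacher_step(trajectory)
            interventions += 1
        trajectory.append(candidate)
    return trajectory, interventions

# Toy stand-ins: the student strays only at position 2.
student = lambda prefix: f"s{len(prefix)}"
teacher = lambda prefix: f"t{len(prefix)}"
stray_at_two = lambda prefix, step: 1.0 if len(prefix) == 2 else 0.0

trajectory, interventions = distill_rollout(
    student, teacher, stray_at_two, threshold=0.5, max_steps=4
)
```

Steps below the threshold stay student-generated, which mirrors the abstract's point that minor student errors are tolerated while sub-optimal contexts are cut off.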

2605.19394 2026-05-20 cs.CL cs.AI 版本更新

EmbGen: Teaching with Reassembled Corpora

EmbGen:利用重组语料库进行教学

Arun K Lenin, Kai Rouse, Andrea Nicastro, Anna Leontjeva

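Affiliations: Commonwealth Bank of Australia

AI summary: This paper proposes EmbGen, a pipeline that generates synthetic data from reassembled corpora to improve instruction-tuned models on datasets of varying semantic heterogeneity. It decomposes the corpus into entity-description pairs, reassembles them by embedding similarity, and generates question-answer pairs through cluster-based sampling, improving Binary Accuracy under fixed token budgets.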

Comments 8 pages, 4 images (32 pages with appendix)

详情
AI中文摘要

适应小型指令微调模型到专业领域通常依赖于在精心挑选的指令-响应示例上进行监督微调(SFT),这在大规模收集时成本高昂。由教师LLM从领域语料库生成的合成训练示例可以降低此成本,但现有流程会产生同质化输出,并且不一致地捕捉跨段落或跨文档依赖性。我们引入EmbGen,一种合成数据生成流程,该流程将语料库分解为实体-描述对,通过从嵌入相似性推断出的语义结构重新组装它们,并通过接近性、集群内和集群间采样生成问题-答案(QA)对,使用集群专门化的系统提示。我们评估EmbGen在三个语义异质性不同的数据集上,固定token预算(5和20百万token)下的表现,与EntiGraph、InstructLab和Knowledge-Instruct进行比较。我们使用词汇重叠度量、LLM作为判断标准的评分表以及二元准确率(结合事实准确性和完整性)作为评估指标。EmbGen在最异质的数据集上,相对于最强基线,在5M和20M token预算下分别提高了12.5%和88.9%的二元准确率,同时在其他异质性较低的数据集上保持竞争力。

英文摘要

Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which are expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). For evaluation we use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composite metric combining Factual Accuracy and Completeness. Relative to the strongest baseline, EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at the 5M-token budget and 88.9% at the 20M-token budget, while remaining competitive on the other datasets with lower heterogeneity.

2605.19358 2026-05-20 cs.CL 版本更新

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

驯服思考者:面向自适应大语言模型推理的条件熵塑造

Shuyu Wei, Jian Sun, Delai Qiu, Yining Wang, Shengping Liu, Jiaen Liang, Ying Fu, Wei Huang, Jitao Sang

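Affiliations: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence; Beijing Jiaotong University; Unisound AI Technology Co., Ltd.

AI summary: This paper proposes the Conditional Entropy Shaping (CES) framework, which dynamically controls token-level response entropy so that LLMs produce concise solutions on simple problems and explore more deeply on hard ones, balancing response length and accuracy.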

详情
AI中文摘要

基于熵的深度推理已成为提升大语言模型(LLMs)推理能力的有前景方向,但现有方法往往要么无差别地增加响应长度,要么以牺牲准确性为代价缩短响应。为更好地平衡这一权衡,我们引入了条件熵塑造(CES),一种框架,能够动态控制token级响应熵,使LLMs在简单问题上产生简洁的解决方案,同时在困难问题上鼓励更深入的探索。基于DAPO,CES使用token级熵作为不确定性信号,并应用一个条件双向策略:它在正确的推理路径上惩罚高熵的'分叉点'token以提高简洁性,并在错误路径上奖励它们以促进探索和错误纠正。我们将在DeepSeek-R1-Distill-7B上实现CES,并在12个数学基准上进行评估。CES在平均准确性上优于DAPO,同时减少响应长度,补充实验显示在较小的1.5B模型和领域外基准上也呈现出相似的趋势。

英文摘要

Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
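The conditional bidirectional rule can be sketched as a per-token shaping term whose sign depends on whether the reasoning path was correct. The entropy formula is standard Shannon entropy; the threshold `tau`, the coefficient `beta`, and the linear form of the bonus are hypothetical, since the abstract does not specify them.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_shaping(probs, path_correct, tau=0.5, beta=0.1):
    """Penalize high-entropy tokens on correct paths, reward them on incorrect ones."""
    h = token_entropy(probs)
    if h <= tau:
        return 0.0  # low-entropy tokens are left untouched
    return -beta * h if path_correct else beta * h

forking = [0.5, 0.5]      # entropy ln 2, above the assumed threshold
confident = [0.99, 0.01]  # entropy well below the assumed threshold

penalty = entropy_shaping(forking, path_correct=True)
bonus = entropy_shaping(forking, path_correct=False)
neutral = entropy_shaping(confident, path_correct=True)
```

The same forking token therefore receives opposite shaping depending on the outcome of the path it belongs to, which is the asymmetry the abstract describes.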

2605.19357 2026-05-20 cs.CL 版本更新

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

SciCustom: 一个用于大型语言模型科学能力定制评估的框架

Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang

Affiliations: State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University; Zhongguancun Academy; IDEA; Xidian University; Peking University; University of Washington; University of Wisconsin–Madison; University of Illinois Chicago

AI summary: This paper proposes the SciCustom framework, which builds custom benchmarks from large-scale scientific data to evaluate LLM capabilities on specific scientific tasks without expert annotation or synthetic question generation, and reveals fine-grained differences in scientific capability.

Comments Accepted to ACL 2026 Main Conference

详情
英文摘要

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.

2605.19351 2026-05-20 cs.MA cs.AI cs.CL 版本更新

PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies

PAVE:生成代理社会中的合法违规认知架构

Ahmad Yehia, Abduallah Mohamed, Kun Qian, Tianyi Wang, Jiseop Byeon, Omar Hassanin, Christian Claudel

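Affiliations: The University of Texas at Austin; Meta Reality Labs; University of Calgary

AI summary: This paper proposes the PAVE cognitive architecture, whose four modules handle how generative agents reason in scenarios that may require rule-breaking. It achieves four properties (legitimate violation, authority deference, bounded scope, and recovery) while making decisions more structured and interpretable.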

Comments Preprint. 23 pages, 4 figures. Code and environment will be released upon publication

详情
AI中文摘要

基于大语言模型的生成代理在合作环境中能够产生可信的人类行为,但在需要违规的场景中,如火灾疏散或受监督的紧急情况,如何推理仍不明确。我们提出PAVE(感知、评估、裁决、模拟),一种新的四模块认知架构,旨在解决这一差距:(i)感知提取一个结构化的上下文,包括明确的权威距离、同伴行为和严重标记的情境线索;(ii)评估在五个标量上评分上下文,包括一个明确的合法性判断,检查必要性、比例性和无替代方案;(iii)裁决在硬合法性门下决定服从或违规,每个代理的阈值从角色中提取;(iv)模拟执行裁决并限制违规到触发所证明的规则。我们将在Voville中实例化PAVE,这是一个从Smallville衍生的基于瓷砖的交通环境,并在三个场景、四个LLM后端和一个聚焦的消融中进行评估。PAVE代理同时满足四个属性:合法违规(只有当触发证明时)、权威服从(军官指令即使高合法性也优先)、有限范围(违规限制在目标规则内)和恢复(触发结束时恢复基准)。PAVE代理在所有四个属性上比vanilla更结构化和可解释,人类评估者认为它们更合理。消融合法性门会重现vanilla-like的失败。我们发布了Voville、PAVE提示和代码以及评估流程。

英文摘要

Generative agents based on large language models reproduce believable human behavior in cooperative settings, but how they should reason in situations where rule-breaking may be required, such as fire evacuation or authority-supervised emergency, remains poorly characterized. We propose PAVE (Perception, Assessment, Verdict, Emulation), a novel four-module cognitive architecture that addresses this gap end to end: (i) Perception extracts a structured context with explicit authority distance, peer behaviors, and severity-tagged situational cues; (ii) Assessment scores the context along five scalars including an explicit legitimacy judgment that checks necessity, proportionality, and absence of alternatives; (iii) Verdict decides to comply or violate under a hard legitimacy gate, with a per-agent threshold elicited from the persona; (iv) Emulation enacts the verdict and scopes the violation to the rule the trigger justifies. We instantiate PAVE in Voville, a tile-based traffic environment forked from Smallville, and evaluate across three scenarios, four LLM backbones, and a focused ablation. PAVE agents satisfy four properties simultaneously: legitimate violation (only when a trigger justifies it), authority deference (officer instructions override even high legitimacy), bounded scope (violations confined to the targeted rule), and recovery (baseline restored once the trigger ends). PAVE agents make more structured and interpretable decisions than vanilla across all four properties, and human evaluators rate them as more plausible. Ablating the legitimacy gate reproduces vanilla-like failures. We release Voville, the PAVE prompts and code, and the evaluation pipeline.
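The Verdict module's gating logic, as the abstract describes it, can be sketched as a small decision function. The argument names, the numeric legitimacy scale, and the order of the checks are assumptions made for illustration; the actual module is prompt-based and scores five scalars rather than one.

```python
def verdict(legitimacy, agent_threshold, trigger_active, officer_orders_compliance):
    """Return "violate" only when every gate in this simplified sketch is passed."""
    if not trigger_active:
        return "comply"  # recovery: baseline behavior once the trigger ends
    if officer_orders_compliance:
        return "comply"  # authority deference overrides even high legitimacy
    # Hard legitimacy gate with a per-agent threshold.
    return "violate" if legitimacy >= agent_threshold else "comply"

decisions = {
    "fire, no officer": verdict(0.9, 0.7, True, False),
    "fire, officer says stay": verdict(0.9, 0.7, True, True),
    "weak justification": verdict(0.4, 0.7, True, False),
    "trigger over": verdict(0.9, 0.7, False, False),
}
```

Bounded scope is not modeled here; in the paper it is the Emulation module that confines a violation to the rule the trigger justifies.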

2605.19346 2026-05-20 cs.CL cs.AI cs.LG 版本更新

IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

IMLJD:印度婚姻诉讼分析计算数据集

Joy Bose

发表机构 * Independent Researcher(独立研究员)

AI总结 本文提出IMLJD数据集,用于分析印度婚姻纠纷案件,包含3613份法院判决,涵盖IPC第498A条、《家庭暴力保护法》和CrPC第482条案件,通过结构化标签、元数据指标和知识图谱揭示最高法院与卡纳塔克高等法院中撤销请求的成功率差异。

Comments 8 pages, 2 figures, 5 tables. Dataset available at huggingface.co/datasets/joyboseroy/imljd and Code at github.com/joyboseroy/imljd

详情
AI中文摘要

我们介绍了IMLJD,一个包含3,613份印度法院判决的开放数据集,涵盖受IPC第498A条、《家庭暴力保护法》和CrPC第482条规制的婚姻纠纷案件。该数据集涵盖最高法院(2000-2024年,1,474份案件)和卡纳塔克高等法院(2018-2024年,2,139份案件),包含结构化结果标签、元数据衍生指标和知识图谱。我们发现,最高法院级别的撤销请求成功率为57.6%,而卡纳塔克高等法院为39.7%。在匹配的2018至2024年期间,最高法院的撤销率是59.3%,扩大了差距至19.6个百分点,证实该发现对时间调整具有鲁棒性。该数据集、代码和知识图谱已公开发布在https://github.com/joyboseroy/imljd和https://huggingface.co/datasets/joyboseroy/imljd。

英文摘要

We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata-derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.
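The reported figures can be cross-checked with a few lines of arithmetic. All numbers below are taken from the abstract; the 17.9-point full-period gap is derived here and is not stated in the source.

```python
# Case counts per court, as reported in the abstract.
supreme_court_cases = 1474
karnataka_hc_cases = 2139
total_cases = supreme_court_cases + karnataka_hc_cases

# Quashing-petition success rates in percent.
sc_rate_full_period = 57.6   # Supreme Court, 2000 to 2024
sc_rate_matched = 59.3       # Supreme Court, matched 2018 to 2024 window
khc_rate = 39.7              # Karnataka High Court, 2018 to 2024

# Differentials in percentage points, rounded to one decimal place.
gap_full_period = round(sc_rate_full_period - khc_rate, 1)
gap_matched = round(sc_rate_matched - khc_rate, 1)

print(total_cases, gap_full_period, gap_matched)  # prints: 3613 17.9 19.6
```

The matched-window gap of 19.6 points exceeds the 17.9-point gap over the full Supreme Court period, which is why the abstract describes the differential as widening under temporal adjustment.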

2605.19341 2026-05-20 cs.CL cs.AI cs.LG stat.ML Version update

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

Emmy Liu, Varun Gangal, Michael Yu, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

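Affiliations * Carnegie Mellon University; Patronus AI; Independent Researcher; Stanford University; The Ohio State University; DegenAI Labs

AI summary This paper proposes the HalluWorld benchmark, which studies language model hallucination through explicit reference world models. It finds that failures are uneven across probe types and domains, suggesting that hallucinations stem from several distinct failure modes rather than a single capability.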

Comments HalluWorld benchmark (code and data) at github.com/DegenAI-Labs/HalluWorld

English abstract

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

2605.19338 2026-05-20 cs.MA cs.AI cs.CL Version update

STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision

Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang, Fan Yang, Yinpeng Dong

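Affiliations * Tsinghua University; Microsoft Research; New York University; MIT

AI summary This paper proposes STAR-PólyaMath, a multi-agent framework that uses meta-level supervision and structured Reasoner-Verifier interaction to systematically address hallucination accumulation, memory fragmentation, and imbalanced reasoning-tool trade-offs in mathematical reasoning, and reports state-of-the-art results on several top-tier competition benchmarks.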

Comments 25 pages, 4 figures. Code: https://github.com/Julius-Woo/STAR-PolyaMath

English abstract

Frontier AI models and multi-agent systems have led to significant improvements in mathematical reasoning. However, for problems requiring extended, long-horizon reasoning, existing systems continue to suffer from fundamental reliability issues: hallucination accumulation, memory fragmentation, and imbalanced reasoning-tool trade-offs. In this paper, we introduce STAR-PólyaMath, a multi-agent framework that systematically addresses these challenges through meta-level supervision and structured Reasoner-Verifier interaction. STAR-PólyaMath is structured as an orchestrated state machine with nested challenge-step-replan loops, governed by a reasoning-free Python orchestrator that separates control from inference and bounds error propagation through trace-back and re-planning. Our key innovation is a persistent Meta-Strategist that maintains cross-attempt memory and exercises meta-level control by issuing high-level strategic guidance or mandatory directives, so the system can escape unproductive loops rather than stagnate or over-rely on tools. STAR-PólyaMath achieves state-of-the-art results on all eight top-tier competition benchmarks: AIME 2025-2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026, and USAMO 2026. It obtains perfect scores on AIMEs, Putnam, and HMMT, and shows its largest margin on Apex 2025, scoring 93.75% compared with 80.21% by the strongest baseline GPT-5.5. Ablation studies show that the gains arise from the framework's orchestration rather than from model-level diversity since removing key components or substituting in mixed backbones consistently weakens performance. Code is available at https://github.com/Julius-Woo/STAR-PolyaMath.

2605.19316 2026-05-20 cs.CL Version update

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

Seonjeong Hwang, Jun Seo, Hyounghun Kim, Gary Geunbae Lee

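Affiliations * Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; Department of Computer Science and Engineering, POSTECH, Republic of Korea

AI summary This paper proposes MAFIG, a multi-agent framework in which multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items so that they satisfy feature constraints, yielding more stable difficulty control.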

Comments ACL 2026 Main Conference

English abstract

Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

2605.19285 2026-05-20 cs.CL cs.AI cs.CY Version update

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

Bing Wang, Rui Miao, Ximing Li, Chen Shen, Shaotian Yan, Changchun Li, Kaiyuan Liu, Xiaosong Yuan, Jieping Ye

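Affiliations * College of Computer Science and Technology, Jilin University; Tongyi Lab, Alibaba Group; School of Artificial Intelligence, Jilin University; College of Computer Science and Technology, Zhejiang University

AI summary This paper studies how to fine-tune large language models (LLMs) for explainable misinformation detection. It proposes LONSREX, a new data synthesis pipeline that locates necessary and sufficient rationales, addressing the insufficient and redundant rationales that label-only filtering produces because of coarse-grained labels and over-verification behavior.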

Comments Accepted by KDD 2026. 12 pages, 8 figures. Code: https://github.com/wangbing1416/LONSREX

English abstract

The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black-box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off-the-shelf LLMs. In this work, we propose a pipeline to fine-tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large-scale fact-checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high-quality training data, we leverage a filtering strategy that selects only the correct instances for fine-tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse-grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over-verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over-verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.

2605.19284 2026-05-20 cs.CL cs.LG Version update

Language models struggle with compartmentalization

Thomas Vincent Howe, David Wingate

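Affiliations * Department of Computer Science, Brigham Young University

AI summary This study examines compartmentalization in large language models when a unified concept is presented in different ways. It finds that models can fail to share statistical strength across presentations, which wastes model capacity and lowers sample efficiency.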

Comments 9 pages, 8 figures, plus 9 pages of appendices. Submitted to NeurIPS 2026. Code: https://github.com/vinhowe/compartmentalization. Eval data: https://doi.org/10.5281/zenodo.20171021

English abstract

In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.

2605.19274 2026-05-20 cs.CL Version update

Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

Somnath Banerjee, Pranav Jha, Rima Hazra, Animesh Mukherjee

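Affiliations * Indian Institute of Technology Kharagpur; TCG Crest; National University of Singapore

AI summary This paper studies the trade-off between plausibility and faithfulness in cross-lingual explanations. English-pivot explanations agree better with human rationales, but their evidence is less causally grounded in the model's prediction. English explanations are fluent, yet their comprehensiveness degrades by up to 5.7x relative to native-language conditions even while task accuracy stays stable, and they fail to preserve pragmatic cues. The authors therefore recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics, and treating English rationales as communication summaries rather than faithful decision traces.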

English abstract

LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations (where the model identifies input token spans as evidence alongside a generated rationale) and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model's prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5 languages, and 2 multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions, even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.
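The abstract does not spell out how comprehensiveness and sufficiency are computed. A common formulation, assumed here rather than taken from the paper, is erasure-based: comprehensiveness is the drop in model confidence when the rationale tokens are removed, and sufficiency is the drop when only the rationale tokens are kept. The sketch below uses a hypothetical `toy_predict` classifier and made-up token spans purely for illustration.

```python
from typing import Callable, Sequence, Set

Predictor = Callable[[Sequence[str]], float]


def comprehensiveness(predict: Predictor, tokens: Sequence[str], rationale: Set[int]) -> float:
    """Confidence drop when rationale tokens are removed (higher = more faithful)."""
    without = [t for i, t in enumerate(tokens) if i not in rationale]
    return predict(tokens) - predict(without)


def sufficiency(predict: Predictor, tokens: Sequence[str], rationale: Set[int]) -> float:
    """Confidence drop when only rationale tokens are kept (lower = more faithful)."""
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return predict(tokens) - predict(only)


def toy_predict(tokens: Sequence[str]) -> float:
    """Hypothetical classifier: confidence grows with the number of cue words."""
    cues = {"terrible", "awful"}
    hits = sum(t in cues for t in tokens)
    return min(1.0, 0.5 + 0.25 * hits)


tokens = ["the", "service", "was", "terrible", "and", "awful"]
anchored = {3, 5}  # spans the toy model actually relies on
loose = {0, 1}     # plausible-looking spans with no causal role

print(comprehensiveness(toy_predict, tokens, anchored), sufficiency(toy_predict, tokens, anchored))
print(comprehensiveness(toy_predict, tokens, loose), sufficiency(toy_predict, tokens, loose))
```

Under these assumed definitions, a loosely anchored rationale shows up as low comprehensiveness and high sufficiency, which is the kind of gap that overlap with human rationales alone cannot reveal.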

2605.19270 2026-05-20 cs.CL Version update

DECOR: Auditing LLM Deception via Information Manipulation Theory

Linyue Cai, Samuel Yeh, Jwala Dhamala, Rahul Gupta, Sharon Li

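Affiliations * Department of Computer Sciences, University of Wisconsin-Madison; Amazon

AI summary This paper proposes DECOR, a framework grounded in Information Manipulation Theory that audits large language model deception at a fine-grained level, and reports state-of-the-art performance on both single-turn and multi-turn deception detection.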

English abstract

Large language models can deceive by subtly manipulating truthful information (omitting key facts, shifting focus, or obscuring meaning), making such behavior difficult to detect. Existing black-box methods rely on coarse-grained judgments, offering limited interpretability and failing to pinpoint which facts were distorted and how. We introduce DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses. DECOR decomposes input contexts into atomic informational units and scores each unit against the response across four dimensions of manipulation, producing interpretable manipulation profiles that are aggregated into a global deception index. We comprehensively evaluate DECOR on both single-turn and multi-turn deception detection benchmarks spanning real-world domains, and show that DECOR achieves state-of-the-art performance on both, outperforming competitive baselines. The framework generalizes across 15 frontier models, and ablation studies confirm the contribution of each key design component. Our findings demonstrate that fine-grained, theory-grounded auditing of information manipulation offers an effective and interpretable path for LLM deception detection.

2605.19234 2026-05-20 cs.CL cs.AI Version update

AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

Miguel A. Jiménez-Crespo, Stephanie Rodriguez, Alejandro Jaume Losa

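Affiliations * Rutgers University/ Dept. of Spanish; Rutgers University/ Dept. of Spanish and Portuguese

AI summary This paper examines the impact of AI on language access. Drawing on ten semi-structured interviews with US language access managers working in healthcare, court, public service, and local government contexts, it finds that managers are conditionally optimistic about AI and strongly committed to human value and human oversight in AI implementations.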

Comments 11 pages, 2 tables, Convergence Conference 2026

English abstract

The rapid emergence of AI technologies is reshaping translation practices and theory across the board. This paper deals with the impact of AI in language access. This area is characterized by the need to serve broad and diverse user populations, within a context where efficiency and access are shaped by legal mandates, ethical and commercial tensions, and safety concerns. This paper reports on the attitudes and perceptions of language access managers towards AI and human value in the AI age. Methodologically, this paper presents an analysis of a subset of a broader study on language access and technology, specifically a qualitative thematic analysis of ten semi-structured interviews with language access managers in the USA working in healthcare, court, public service and local government contexts. The results indicate that language access managers show conditional optimism towards the inevitable AI implementations, are strongly risk-aware, and are deeply committed to the human value and human oversight of AI implementations and output.

2605.19224 2026-05-20 cs.CL Version update

Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

Aditya R. Vaidya, Richard J. Antonello, Alexander G. Huth

Affiliations * Department of Computer Science, The University of Texas at Austin; Zuckerman Mind Brain Behavior Institute, Columbia University; Departments of Neuroscience and Statistics, University of California, Berkeley

AI summary This study fine-tunes language encoding models on slow fMRI data and shows improved prediction of fast ECoG data, demonstrating the value of slow data for building models of fast brain data.

English abstract

Neuroscientists have recently turned to intracranial brain recording methods, like electrocorticography (ECoG), for human experiments because of the fine spatial and temporal resolution that they afford. Models trained on this data, however, are fundamentally restricted by the patient populations that can receive the implants necessary for recording. We propose using non-invasive fMRI to bridge the gap in training data. Using spoken language representations fine-tuned on fMRI, we build encoding models of ECoG. These representations showed improved prediction performance in ECoG, even though the temporal resolution of fMRI is two orders of magnitude worse. Prediction improved in frequency bands well beyond what is directly measured in fMRI. Next, to test the procedure's generalization ability, we fine-tuned models on fMRI responses that were temporally downsampled by a factor of 2. Despite the loss in resolution, these models were able to predict fMRI and ECoG responses at levels comparable to the original fMRI-tuned models. Finally, we showed that ECoG performance steadily scales with the amount of fMRI-tuning data. Our results show that "slow" data like fMRI can be a valuable resource for building better models of "fast" brain data like ECoG. In the future, integrating across multiple recording methods may further improve performance in other applications, like decoding.

2605.19220 2026-05-20 cs.CL cs.AI cs.LG Version update

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

Tiejin Chen, Longchao Da, Xiaoou Liu, Hua Wei

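Affiliations * School of Computing and Augmented Intelligence, Arizona State University

AI summary This paper argues that current uncertainty quantification methods for LLMs are essentially unsupervised clustering algorithms that measure internal consistency rather than external correctness, and therefore cannot detect confident but incorrect answers. It outlines a roadmap toward uncertainty quantification in which model confidence reliably reflects reality.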

Comments Accepted by ICML 2026 Position Paper Track

English abstract

Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.

2605.19196 2026-05-20 cs.CL Version update

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan

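Affiliations * Yale University; IBM Research

AI summary This paper introduces REFLECT, a benchmark for evaluating how well LLM judges detect fine-grained failures in agentic environments. It shows that current LLM judges are unreliable on reasoning, tool-use, and report-quality failures and offers guidance for building more reliable evaluation pipelines.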

English abstract

Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

2605.19194 2026-05-20 cs.CL Version update

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

Rui Chu

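Affiliations * Rui Chu

AI summary This paper proposes MMoA, a framework that adds LSTM-based gating to Mixture-of-Agents routing to address the limited temporal and contextual awareness of static routers, aiming at more efficient multi-agent systems.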

English abstract

The Mixture-of-Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies across aggregation layers. To address this limitation, we propose MMoA, a recurrent MoA architecture that integrates LSTM-based gating into the agent selection process. The recurrence router adaptively modulates agent contributions based on both current inputs and historical routing decisions, enabling more context-aware aggregation. We evaluate MMoA on standard instruction-following benchmarks, including AlpacaEval 2.0, MT-Bench, and Arena-Hard. The results show that MMoA achieves comparable accuracy to traditional MoA while reducing computational overhead by dynamically activating fewer agents. For example, on AlpacaEval 2.0, MMoA achieves a win rate of 58.0%, compared with 59.8% for MoA, while improving runtime efficiency by up to 4.6%. These results suggest that MMoA provides a scalable and efficient approach for adaptive multi-agent LLM systems.

2605.19173 2026-05-20 cs.CL Version update

Prompting language influences diagnostic reasoning and accuracy of large language models

Adrien Bazoge, Josselin Corvellec, Sofiane Djillali Sid-Ahmed, Pierre-Antoine Gourraud

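Affiliations * Data Clinic, Nantes University Hospital

AI summary This study examines how prompting language affects the clinical diagnostic reasoning and accuracy of large language models by comparing English and French performance. Four of the five models performed better in English, while o3 showed no overall language effect, indicating that prompting language is a critical determinant of clinical performance.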

English abstract

Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.

2605.19149 2026-05-20 cs.CL cs.CR Version update

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

Rishi Jha, Harold Triedman, Arkaprabha Bhattacharya, Vitaly Shmatikov

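Affiliations * Department of Computer Science

AI summary This study examines accidental meltdowns, in which agents behave unsafely after encountering benign errors. In its experiments, meltdowns of varying severity occur in 64.7% of agent rollouts that encounter simulated errors, and these behaviors are not captured by existing reliability or safety benchmarks.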

Comments 32 pages, 8 figures, 4 tables

English abstract

Agents operating with computer and Web use inevitably encounter errors: inaccessible webpages, missing files, local and remote misconfigurations, etc. These errors do not thwart agents based on state-of-the-art models. They helpfully continue to look for ways to complete their tasks. We introduce, characterize, and measure a new type of agent failure we call accidental meltdown: unsafe or harmful behavior in response to a benign environmental error, in the absence of any adversarial inputs. Because meltdowns are not captured by the existing reliability or safety benchmarks, we develop a taxonomy of meltdown behaviors. We then implement an agent-agnostic infrastructure for injecting simulated local and remote errors into the rollout environment and use it to systematically evaluate agent systems powered by GPT, Grok, and Gemini. Our evaluation demonstrates that meltdowns (e.g., conducting unauthorized reconnaissance or subverting access control) of varying severity and success occur in 64.7% of agent rollouts that encounter simulated errors, spanning all combinations of agent system, backing model, and error type. In over half of these meltdowns, unsafe behaviors are not reported to the user. Comparing behaviors of the same agents with and without errors, we find that exploration in response to errors is correlated with unsafe and harmful behavior.

2605.19141 2026-05-20 cs.LG cs.AI cs.CL cs.CY cs.HC Updated version

GRASP: Deterministic argument ranking in interaction graphs

Diganta Misra, Antonio Orvieto, Rediet Abebe, Volkan Cevher

Affiliations: MPI-IS Tübingen; Tübingen AI Center; ELLIS Institute Tübingen; Eberhard Karls Universität Tübingen; LIONS, EPFL

AI summary: This paper proposes the GRASP framework, which aggregates stable local interaction judgments into a global ranking to address the inconsistency of holistic verdicts when large language models act as judges. It measures structural sufficiency rather than persuasiveness or rhetorical appeal.

Comments Preprint

Abstract

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack--defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
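The idea of turning local attack judgments into a global ranking through a convergent propagation step can be illustrated with a classic gradual semantics. The sketch below is not GRASP's operator, which the abstract does not specify. It assumes the well-known h-categoriser update, score(a) = 1 / (1 + sum of attacker scores), on a hypothetical attack-only graph, and every name in it is illustrative.

```python
def h_categoriser(attackers, iterations=200, tol=1e-12):
    """Iterate score(a) = 1 / (1 + sum of scores of a's attackers) to a fixed point.

    `attackers` maps each argument to the list of arguments that attack it.
    """
    scores = {a: 1.0 for a in attackers}
    for _ in range(iterations):
        new = {
            a: 1.0 / (1.0 + sum(scores[b] for b in attackers[a]))
            for a in attackers
        }
        # Stop once no score moved by more than the tolerance.
        if max(abs(new[a] - scores[a]) for a in attackers) < tol:
            return new
        scores = new
    return scores


# Hypothetical debate: c attacks b, and b attacks a, so c indirectly defends a.
graph = {"a": ["b"], "b": ["c"], "c": []}
scores = h_categoriser(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the unattacked argument c ranks first, and a ranks above its attacker b because b is itself weakened by c. That is one simple sense in which a ranking can be defense-aware.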

2605.19130 2026-05-20 cs.LG cs.AI cs.CL cs.CV Updated version

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

Affiliations: Meta Superintelligence Labs; Stanford University; Meta Reality Labs; The University of Tokyo

AI summary: The study examines how children acquire language grounding so robustly from limited visuo-linguistic input, and introduces the EgoBabyVLM Challenge to push models toward grounded language learning from naturalistic data.

Abstract

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

2605.19099 2026-05-20 cs.AI cs.CL cs.MA Updated version

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, Ao Qu

Affiliations: OpenMesh AI; University of Pennsylvania; Columbia University; Stanford University; MIT

AI summary: This paper proposes the DecisionBench benchmark for evaluating emergent delegation in long-horizon agentic workflows. A five-condition reference sweep yields core findings on end-task quality under delegation, routing fidelity, and the counterfactual performance ceiling.

Comments 28 pages, 9 figures, 11 tables. Code and data: https://huggingface.co/decisionbench

Abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
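The abstract names two routing metrics without defining them. The sketch below shows one plausible reading, stated as an assumption: fidelity-at-k is taken to be the share of tasks where the chosen peer is among the k best-scoring peers, and the counterfactual ceiling is the mean quality an oracle would reach by always picking the best peer. The data layout and function names are hypothetical and not taken from the DecisionBench release.

```python
def fidelity_at_k(scores, chosen, k=1):
    """Share of tasks whose chosen peer is among the k highest-scoring peers."""
    hits = 0
    for task, peer_scores in scores.items():
        top_k = sorted(peer_scores, key=peer_scores.get, reverse=True)[:k]
        hits += chosen[task] in top_k
    return hits / len(scores)


def counterfactual_ceiling(scores, chosen):
    """Return (measured mean, oracle mean, headroom in percentage points)."""
    measured = sum(scores[t][chosen[t]] for t in scores) / len(scores)
    oracle = sum(max(scores[t].values()) for t in scores) / len(scores)
    return measured, oracle, 100.0 * (oracle - measured)


# Toy data: quality in [0, 1] per task and peer, plus the peer actually chosen.
scores = {"t1": {"A": 0.9, "B": 0.4}, "t2": {"A": 0.2, "B": 0.8}}
chosen = {"t1": "A", "t2": "A"}

fidelity = fidelity_at_k(scores, chosen, k=1)
measured, oracle, headroom = counterfactual_ceiling(scores, chosen)
```

Here the orchestrator picks the best peer on one task out of two, and an oracle router would gain about 30 percentage points over the measured mean. The two numbers can move independently, which is why a quality-only evaluation can miss the routing signal.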

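2605.19092 2026-05-20 cs.LG cs.AI cs.CL Updated version

Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

Alexander Boesgaard Lorup

Affiliations: Openhagen

AI summary: This paper proposes a counterfactual likelihood test for measuring influence between private reasoning channels. It replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the shift in the downstream target's negative log-likelihood, so that direct and indirect influence through private and public channels can be assessed.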

Comments 12 pages, 4 figures, 5 tables

Abstract

Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.
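The core measurement, swapping a private block for a length-matched donor and reading off the change in the target's negative log-likelihood, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `toy_logprob` is a made-up stand-in scorer that sees the whole context (the unmasked condition), not the paper's 7B model, and all function names are hypothetical.

```python
import math


def nll(logprob, context, target):
    """Sum of negative log-probabilities of the target tokens, scored left to right."""
    total, ctx = 0.0, list(context)
    for tok in target:
        total -= logprob(ctx, tok)
        ctx.append(tok)
    return total


def counterfactual_shift(logprob, private, donor, public, target):
    """NLL(target | donor + public) minus NLL(target | private + public)."""
    if len(donor) != len(private):
        # Length matching keeps every later token at the same position.
        raise ValueError("donor block must be length-matched to the private block")
    natural = nll(logprob, private + public, target)
    swapped = nll(logprob, donor + public, target)
    return swapped - natural


def toy_logprob(context, token, vocab_size=50):
    """Stand-in scorer: tokens already present in the context are more likely."""
    p = 0.5 if token in context else 0.5 / vocab_size
    return math.log(p)


private = ["secret", "plan"]
donor = ["filler", "words"]
public = ["hello"]

leak_shift = counterfactual_shift(toy_logprob, private, donor, public, ["secret"])
null_shift = counterfactual_shift(toy_logprob, private, donor, public, ["hello"])
```

A positive shift means the target became less likely once the private block was replaced, so the target depended on it. A zero shift, as for the target that only echoes the public token, means the swap made no difference.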

2605.19077 2026-05-20 cs.CL cs.AI Updated version

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

Yanjun Lin, Zimo Xiao, Kartik Natarajan, Mahesh Sankaranarayanan, Niraj Nawanit, Rakshit Parashar, Austin Zhang, Karthik Konaraddi, Rishita Mote, Wei Niu

Affiliations: Amazon

AI summary: The study proposes ReacTOD, a bounded neuro-symbolic architecture that tackles zero-shot dialogue state tracking by reformulating NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation, and it reports new zero-shot state-of-the-art results.

Comments Accepted at TrustNLP Workshop at ACL 2026

Abstract

Task-oriented dialogue systems -- handling transactions, reservations, and service requests -- require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09% -- demonstrating cross-benchmark generalization without task-specific training data.
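The combination of a bounded retry loop and a deterministic validator can be sketched as follows. This is an illustrative toy, not ReacTOD's implementation: the schema, the `validate` rules, and the stub proposer that stands in for the LLM tool call are all assumptions made for the example.

```python
# Hypothetical schema: allowed values per domain and slot.
SCHEMA = {"hotel": {"area": {"north", "south", "centre"}, "stars": {"3", "4", "5"}}}


def validate(update):
    """Return the list of schema violations in a {domain: {slot: value}} update."""
    errors = []
    for domain, slots in update.items():
        if domain not in SCHEMA:
            errors.append(f"unknown domain: {domain}")
            continue
        for slot, value in slots.items():
            if slot not in SCHEMA[domain]:
                errors.append(f"unknown slot: {domain}.{slot}")
            elif value not in SCHEMA[domain][slot]:
                errors.append(f"invalid value for {domain}.{slot}: {value}")
    return errors


def bounded_update(propose, state, max_steps=3):
    """Call propose(feedback) at most max_steps times; apply the first valid update."""
    trace, feedback = [], []
    for step in range(max_steps):
        update = propose(feedback)
        feedback = validate(update)
        trace.append({"step": step, "update": update, "errors": feedback})
        if not feedback:
            for domain, slots in update.items():
                state.setdefault(domain, {}).update(slots)
            return state, trace
    return state, trace  # budget exhausted: the state is left unchanged


def stub_proposer(feedback):
    # Stands in for the model: it corrects itself once it sees validator feedback.
    return {"hotel": {"area": "centre" if feedback else "downtown"}}


state, trace = bounded_update(stub_proposer, {})
```

The step bound keeps latency predictable, and because only validated updates reach the dialogue state, a malformed proposal cannot cascade into a wrong booking. The trace records each attempt and the errors it triggered.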

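2605.19066 2026-05-20 cs.CL Updated version

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Vukosi Marivate

Affiliations: DSFSI & AfriDSAI, University of Pretoria

AI summary: This paper examines the tension in low-resource NLP evaluation caused by scarce annotation resources, analyzes the phases through which evaluation practice developed over the past decade, and surveys emerging responses for improving the equity and validity of evaluation.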

Comments Under Review

Abstract

Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014--present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated "ghost work", and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses -- including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning -- and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.

2605.18474 2026-05-20 cs.CR cs.AI cs.CL cs.LG Updated version

Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation

Sixu Chen, Xiang Chen, Hongyao Yu, Jiaxin Hong, Hao Fang, Shuoyang Sun, Bin Chen, Shu-Tao Xia

Affiliations: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; South China University of Technology, Guangzhou, China; Harbin Institute of Technology, Shenzhen, Shenzhen, China

AI summary: This paper proposes the Prompt2Fingerprint framework, which recasts LLM fingerprinting as a conditional parameter generation task. A specialized generator maps textual descriptions directly to low-rank parameter increments, enabling plug-and-play fingerprint injection without further fine-tuning, significantly reducing computational overhead, and offering a scalable, instant solution for LLM ownership management.

Abstract

The widespread deployment and redistribution of large language models (LLMs) have made model provenance tracking a critical challenge. While existing LLM fingerprinting methods, particularly active approaches that embed identity signals via fine-tuning, achieve high accuracy and robustness, they suffer from significant scalability bottlenecks. These methods typically treat fingerprint injection as an independent, one-off optimization task rather than a reusable capability, necessitating separate, resource-intensive training for every new identity. This incurs prohibitive computational costs and deployment delays. To address this, we propose Prompt2Fingerprint (P2F), the first framework that reformulates fingerprinting as a conditional parameter generation task. By leveraging a specialized generator, P2F maps textual descriptions directly to low-rank parameter increments in a single forward pass, enabling plug-and-play LLM fingerprint injection without further model retraining. Our experiments demonstrate that P2F maintains high fingerprint accuracy, harmlessness, and robustness while significantly reducing computational overhead, offering a scalable and instant solution for LLM ownership management.

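2605.18445 2026-05-20 cs.CV cs.AI cs.CL cs.LG Updated version

What's Holding Back Latent Visual Reasoning?

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

Affiliations: Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect; Carnegie Mellon University

AI summary: The study examines how existing models use latent tokens and finds that they play only a limited role in the final prediction. The main problems are that latent tokens in the training data carry limited information and that latent tokens generated at inference time deviate from the oracle representations. Progress requires high-quality data and more precise latent token prediction.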

Abstract

Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning. First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypass them at inference time. When models are fine-tuned on a diagnostic dataset in which latent tokens provide sufficient support for the final prediction, we show that they can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.

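2605.17046 2026-05-20 cs.LG cs.AI cs.CL Updated version

1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald

AI summary: This paper proposes the 1GC-7RC benchmark, which uses seven cross-domain machine learning tasks to evaluate how well AI agents design, implement, and train models from scratch, and it reveals differences among agents in implicit ML knowledge, planning ability, and time-budget management.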

Abstract

Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC-7RC-Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings, making it a flexible platform for future research on autonomous research agents.

2605.16712 2026-05-20 cs.AI cs.CL cs.HC Updated version

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

Rui Tang, Yichi Zhang, Xi Chen, Chen Dong, Youwei Yang, Yumeng Shen, Qiangqiang Liu

Affiliations: OpenAsk; Stern School of Business, New York University; Bank of Hebei; Lingnan College, Sun Yat-sen University; School of Economics, Xiamen University; BitMart; Binance

AI summary: This paper proposes Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV) to bound commitments in personalized language systems, reaching zero failures within validator scope across 360 fixtures and three generation backends while lowering the input payload.

Comments 14 pages, 3 figures, 22 tables; preprint version

Abstract

Long-context and memory systems usually treat personalization as a recall problem. In practice, many failures occur later, when a system commits: it turns noisy hints into hard constraints, drops rare witnesses, forgets downstream obligations, or answers despite infeasibility. We introduce Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV). CBEA activates a bounded evidence set using typed coverage, tail witnesses, and consequence debt; LCV validates structured commitments before prose and routes infeasible states to repair, abstention, or recontract. Across 360 fixtures and three generation backends, CBEA+LCV reaches zero failures within validator scope at 0.49-0.60 availability over attempted runs. Raw and long-context baselines with the same LCV gate reach zero only at 0.003-0.092. A shadow oracle diagnostic marks the limit: CBEA+LCV recalls 0.012 of uncompiled visible facts, while raw recalls 0.53. The result is a bounded operating point: explicit commitment control and 74-75% lower median input payload, not universal memory dominance.

2605.16679 2026-05-20 cs.CL cs.AI Updated version

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

Affiliations: Johns Hopkins Medicine; Wellstar Health System; Stanford University; CMU; UCSD; Yale School of Medicine; Salesforce AI Research; University of Washington; Northeastern University; Brown University; Boston College; Stony Brook University; University of Oxford; Arizona State University; University of Southern California; Emory University; MBZUAI; Recursive Superintelligence; University of Illinois at Chicago

AI summary: This paper proposes the CHI-Bench benchmark for evaluating how well AI agents automate end-to-end, long-horizon, policy-rich healthcare workflows, targeting three capabilities that are underrepresented in current benchmarks: policy density, multi-role composition, and multilateral interaction.

Comments Website: https://actava.ai/benchmarks Code: https://github.com/actava-ai/chi-bench Dataset: https://huggingface.co/datasets/actava/chi-bench

Abstract

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
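The abstract reports a strict pass^3 score without defining it. One common reading, assumed here rather than confirmed by the source, is the pass^k metric used in tau-bench: the probability that k independent trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and the toy results are hypothetical.

```python
from math import comb


def pass_hat_k(successes, trials, k):
    """Estimate the chance that k independent trials of one task all succeed."""
    if not 0 <= successes <= trials or not 1 <= k <= trials:
        raise ValueError("need 0 <= successes <= trials and 1 <= k <= trials")
    # comb(successes, k) is 0 when fewer than k trials succeeded.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results, k):
    """Average pass^k over tasks; `results` is a list of (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)


# Toy results for three tasks, three trials each.
results = [(3, 3), (2, 3), (0, 3)]
pass_1 = benchmark_pass_hat_k(results, 1)
pass_3 = benchmark_pass_hat_k(results, 3)
```

Under this reading a task that succeeds two times out of three contributes to pass^1 but nothing to pass^3, which is why a strict pass^3 score can sit well below a single-trial resolution rate.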

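2605.15518 2026-05-20 cs.CL Updated version

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

Junchao Wu, Yefeng Liu, Chenyu Zhu, Hao Zhang, Zeyu Wu, Tianqi Shi, Yichao Du, Longyue Wang, Weihua Luo, Jinsong Su, Derek F. Wong

Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Alibaba Group; Xiamen University

AI summary: This paper proposes DetectRL-X, a comprehensive multilingual benchmark that evaluates advanced detectors across 8 dimensions. It combines human-written texts in 8 languages commonly used in commercial contexts and 6 domains susceptible to LLM misuse with texts generated by 4 popular commercial LLMs, including AI-assisted writing operations such as polishing, expanding, and condensing. It analyzes how language, domain, generator, attack strategy, text length, and refinement operation affect detector performance, with the aim of strengthening multilingual and language-specific detectors.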

Comments ACL 2026 Main. Code and data are available at https://github.com/AIDC-AI/Marco-LLM/tree/main/DetectRL-X

Abstract

The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better align with real-world applications, we create LLM-generated texts using 4 popular commercial LLMs, and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.

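2605.13652 2026-05-20 cs.LG cs.AI cs.CL Updated version

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky

Affiliations: University of Massachusetts Lowell

AI summary: This paper studies low-rank pre-training methods through geometric and spectral analysis, revealing how they differ from full-rank training in model performance and in the solutions reached. Low-rank methods behave differently across model scales, and perplexity does not fully reflect downstream task performance.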

Comments 9 pages, 5 figures, 2 tables

Abstract

Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
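One of the geometric probes named in the abstract, 1-D interpolation between checkpoints, can be sketched on a toy problem. The code assumes flat parameter lists and a made-up double-well loss in place of a real language model. The barrier summary at the end is a common way to read such a curve and is not necessarily one of the paper's 16 metrics.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two checkpoints: (1 - alpha) * a + alpha * b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_along_path(loss, theta_a, theta_b, steps=11):
    """Evaluate the loss at evenly spaced points between the two checkpoints."""
    alphas = [i / (steps - 1) for i in range(steps)]
    return [(alpha, loss(interpolate(theta_a, theta_b, alpha))) for alpha in alphas]


def barrier(curve):
    """Highest loss on the path minus the higher of the two endpoint losses."""
    losses = [value for _, value in curve]
    return max(losses) - max(losses[0], losses[-1])


def toy_loss(theta):
    # Double well with minima at x = -1 and x = +1 (stand-in for a model's loss).
    x = theta[0]
    return (x * x - 1.0) ** 2


curve = loss_along_path(toy_loss, [-1.0], [1.0])
height = barrier(curve)
```

Both endpoints have the same loss here, yet the path between them crosses a barrier. That is the situation the abstract describes when it says that methods can match on perplexity while converging to different regions of the loss landscape.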

2605.08696 2026-05-20 cs.CL cs.LG 版本更新

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

结构化递归混合器用于大规模并行序列生成

Benjamin L. Badger

发表机构 * IBM

AI总结 本文提出了一种结构化递归混合器架构,能够在训练时实现序列并行表示与推理时的递归表示之间的代数转换,从而在不依赖专用内核或设备特定内存管理的情况下提高训练效率、输入信息容量和推理吞吐量。

详情
AI中文摘要

在过去二十年中,语言建模经历了一次转变:从主要使用在训练和推理时都按顺序处理标记的递归架构,转向在训练时并行处理序列元素的非递归模型;后者提高了训练效率和稳定性,但代价是推理吞吐量较低。本文介绍结构化递归混合器(SRM),该架构允许在训练时的序列并行表示与推理时的递归表示之间进行代数转换,并且不需要专用内核或设备特定的内存管理。实验表明,与其他线性复杂度模型相比,这种双重表示带来了更高的训练效率、更高的输入信息容量以及更大的推理吞吐量和并发度。我们推测,对于语言这类信息丰富的输入,递归模型并不适合扩展序列长度,但由于每个样本的内存占用恒定,它们很适合在样本(批量)维度上扩展。我们提供了SRM的Mojo/MAX推理实现,其吞吐量和并发度分别达到在vLLM上推理的同等能力Transformer的12倍和170倍;PyTorch实现也呈现出这类提升,使等计算量下的GSM8k Pass@k提高30%。最后,我们证明SRM适合作为强化学习训练的候选模型。

英文摘要

Over the last two decades, language modeling has experienced a shift from the use of predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers inferenced on vLLM, increases characteristic of PyTorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.

2605.07721 2026-05-20 cs.CL cs.AI cs.LG 版本更新

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

内存高效的循环变换器:在循环语言模型中解耦计算与内存

Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 本文提出了一种内存高效的循环变换器(MELT),通过解耦推理深度与内存消耗,实现了常数内存的迭代推理,同时保持了LoopLM的性能,仅需轻量级的后训练过程。

Comments 22 pages, 5 figures, 11 tables

详情
AI中文摘要

递归大语言模型(LLM)架构已成为提升推理能力的一种有前景的方法,因为它们能够在嵌入空间中进行多步计算而无需生成中间标记。Ouro等模型通过迭代更新内部表示来进行推理,同时在各次迭代中保留标准的键值(KV)缓存,导致内存消耗随推理深度线性增长。因此,增加推理迭代次数可能使内存占用高到难以承受,限制了此类架构的实际可扩展性。在本工作中,我们提出内存高效的循环变换器(MELT),一种将推理深度与内存消耗解耦的新架构。MELT不为每个层和每次循环各保留一份标准KV缓存,而是在每一层维护一个在各推理循环间共享的KV缓存,并通过可学习的门控机制随时间更新该缓存。为了在该架构下实现稳定且高效的训练,我们提出采用分块训练,分两个阶段进行:先是插值过渡,随后是注意力对齐的蒸馏,两者都是从LoopLM起始模型到MELT。实验表明,由预训练Ouro参数微调得到的MELT模型优于同等规模的标准LLM,其内存占用与这些模型相当,并远小于Ouro。总体而言,MELT在不牺牲LoopLM性能的前提下实现了常数内存的迭代推理,且只需轻量级的后训练过程。

英文摘要

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
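To make the memory argument concrete, here is a minimal sketch of a per-layer cache that is shared across reasoning loops and updated through a gate. The abstract only says the cache is "updated over time via a learnable gating mechanism"; the sigmoid gate over the concatenated old and new entries, the weight matrix `w_gate`, and the random stand-in keys are assumptions for illustration, not MELT's actual update rule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(cache, new, w_gate):
    """Blend the new loop's entries into the shared cache; the shape never changes."""
    gate = sigmoid(np.concatenate([cache, new], axis=-1) @ w_gate)  # (tokens, dim)
    return gate * new + (1.0 - gate) * cache

rng = np.random.default_rng(0)
tokens, dim, loops = 5, 8, 4
w_gate = rng.normal(scale=0.1, size=(2 * dim, dim))  # hypothetical learnable weights
cache = rng.normal(size=(tokens, dim))               # entries written by the first loop

for _ in range(loops - 1):
    new = rng.normal(size=(tokens, dim))             # stand-in for this loop's keys
    cache = gated_update(cache, new, w_gate)

per_loop_entries = loops * tokens * dim   # one cache per loop: grows with depth
shared_entries = cache.size               # one shared cache: constant in depth
print(per_loop_entries, shared_entries)
```

Under these assumptions the shared cache holds the same number of entries after any number of loops, whereas keeping one cache per loop scales linearly with the loop count.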

2605.06546 2026-05-20 cs.CL 版本更新

Efficient Pre-Training with Token Superposition

高效的token叠加预训练

Bowen Peng, Théo Gigant, Jeffrey Quesnelle

发表机构 * Nous Research

AI总结 本文提出了一种名为Token-Superposition Training (TST) 的方法,通过在不修改模型架构的情况下,提高预训练的数据吞吐量,从而在大规模预训练中实现更高的效率和性能。

Comments 25 pages, 11 figures, 28 tables

详情
AI中文摘要

大型语言模型的预训练往往成本高昂且在大规模下效率低下,需要复杂的侵入性修改才能实现高数据吞吐量。在本工作中,我们提出Token-Superposition Training (TST),一种简单的即插即用方法,能够在不修改并行策略、优化器、分词器、数据或模型架构的情况下,显著提高预训练中单位FLOPs的数据吞吐量。TST分为两个阶段:(i) 高效的叠加阶段,将多个连续的token合并成一个袋(bag),并使用多热交叉熵(MCE)目标进行训练;(ii) 恢复阶段,回到标准训练。我们在270M和600M参数规模上对TST进行了广泛评估,并在3B和10B A1B混合专家模型上进行了验证,表明它在不同设置下都高度鲁棒。最终,TST在损失和下游评估上均持续优于基线;在等损失设置下,TST在10B A1B规模上最多可将总预训练时间缩短2.5倍。

英文摘要

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
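The superposition phase can be pictured with a small sketch of a multi-hot cross-entropy loss over a bag of contiguous tokens. The abstract does not specify how the target is weighted, so the normalized count vector used here, the bag size, and the function names are assumptions for illustration only.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def multi_hot_cross_entropy(logits, bag, vocab_size):
    """Cross-entropy against a normalized multi-hot target built from a bag of token ids."""
    target = np.bincount(bag, minlength=vocab_size).astype(float)
    target /= target.sum()
    return float(-(target * log_softmax(logits)).sum())

vocab_size = 6
tokens = np.array([2, 5, 1, 1, 4, 0, 3, 2])   # toy token stream
bag_size = 4                                   # hypothetical bag size
bags = tokens.reshape(-1, bag_size)            # contiguous bags: [2 5 1 1], [4 0 3 2]

uniform_logits = np.zeros(vocab_size)          # an untrained model's prediction
loss = multi_hot_cross_entropy(uniform_logits, bags[1], vocab_size)
print(f"MCE loss for uniform logits: {loss:.4f}")
```

With a bag of size one the loss reduces to ordinary next-token cross-entropy, which is consistent with the recovery phase returning to standard training.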

2605.06501 2026-05-20 cs.LG cs.CL 版本更新

Cubit: Token Mixer with Kernel Ridge Regression

Cubit:基于核岭回归的令牌混合器

Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Liangchen Tan, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

AI总结 本文提出Cubit,一种基于核岭回归的新型架构,通过将令牌混合机制从Nadaraya-Watson回归转换为核岭回归,从而提供更稳固的数学基础,并在长序列建模能力上表现出优势。

Comments Tech Report

详情
AI中文摘要

自2017年提出以来,Transformer已成为现代深度学习中应用最广泛的架构之一。尽管人们在位置编码、注意力机制和前馈网络方面做了大量改进,Transformer的核心令牌混合机制仍然是注意力。在本文中,我们表明Transformer中的注意力模块可以被解释为执行Nadaraya-Watson回归:它计算令牌之间的相似度,并据此聚合相应的值。受这一视角启发,我们提出Cubit,一种潜在的下一代架构,它利用核岭回归(KRR),而原始Transformer依赖Nadaraya-Watson回归。具体而言,Cubit修改了经典的注意力计算,引入KRR的闭式解,将基于核相似度的值聚合与基于核矩阵逆的归一化结合起来。为了提高训练稳定性,我们进一步提出有限范围重缩放(LRR),在受控范围内对值层进行重缩放。我们认为,作为基于KRR的架构,Cubit比注意力机制对应于Nadaraya-Watson回归的原始Transformer具有更坚实的数学基础。我们通过全面的实验验证了这一主张。实验结果表明,Cubit可能具有更强的长序列建模能力;特别是,它相对于Transformer的性能增益似乎随训练序列长度的增长而增大。

英文摘要

Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.
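The contrast between the two regressors can be sketched in a few lines. The code below uses the textbook forms: Nadaraya-Watson as a kernel-weighted average (which, with an exponential kernel, is softmax attention) and kernel ridge regression as K_qk (K_kk + λI)^{-1} V. The exponential kernel, the ridge strength `lam`, and the absence of causal masking are assumptions for illustration; Cubit's actual layer, including its Limited-Range Rescale, is not reproduced here.

```python
import numpy as np

def exp_kernel(a, b):
    """Exponential dot-product kernel, scaled like softmax attention logits."""
    return np.exp(a @ b.T / np.sqrt(a.shape[-1]))

def nadaraya_watson(q, k, v):
    """Kernel-weighted average of values: the same computation as softmax attention."""
    w = exp_kernel(q, k)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def kernel_ridge(q, k, v, lam=1e-3):
    """Closed-form KRR prediction: K_qk (K_kk + lam * I)^{-1} V."""
    gram = exp_kernel(k, k) + lam * np.eye(k.shape[0])
    return exp_kernel(q, k) @ np.linalg.solve(gram, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))   # queries
k = rng.normal(size=(4, 8))   # keys
v = rng.normal(size=(4, 2))   # values

print(nadaraya_watson(q, k, v).shape, kernel_ridge(q, k, v).shape)
```

One visible difference: evaluated at the keys themselves with a very small `lam`, the KRR form nearly reproduces the stored values, while the Nadaraya-Watson average generally does not.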

2605.00768 2026-05-20 cs.CL 版本更新

Characterizing the Expressivity of Local Attention in Transformers

对变换器中局部注意力的表达力进行表征

Jiaoda Li, Ryan Cotterell

AI总结 本文研究了变换器中局部注意力的表达力,通过形式语言理论分析了局部注意力如何扩展可识别的正则语言,并证明了全局和局部注意力在表达力上的互补性,实验表明混合型变换器在形式语言识别和自然语言建模中表现更优。

Comments ACL 2026

详情
AI中文摘要

变换器是语言建模中最流行的神经网络架构。其基石是全局注意力机制,它让模型在生成下一个标记之前聚合所有前序标记的信息。局部注意力是一种常见的注意力变体,它限制每个标记只能聚合有界窗口内前序标记的信息,从而将全局注意力的二次成本降为线性。尽管这种限制通常出于效率考虑,但人们发现它也能提高模型质量,这一现象迄今缺乏令人满意的解释。本文从识别器表达力的角度对这一现象给出形式化的解释。已有研究表明,采用全局注意力的固定精度变换器对应于只含一个过去算子的线性时序逻辑片段。我们进一步证明,加入局部注意力会引入第二个时序算子,从而严格扩大可识别的正则语言类。此外,全局注意力和局部注意力在表达力上互补:两者互不包含,将它们结合可得到表达力最强的片段。形式语言识别和自然语言建模上的实验印证了这一理论,显示全局-局部混合变换器优于仅使用全局注意力的变换器。

英文摘要

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally prove that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global--local transformers outperform their global-only counterparts.
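The difference between global and local attention comes down to which positions each token may read. The sketch below builds both masks and counts the attended pairs, which shows the quadratic versus linear growth mentioned above. The window convention (a token sees itself and the `window - 1` tokens before it) is an assumption; implementations differ on whether the current token counts toward the window.

```python
import numpy as np

def causal_mask(n):
    """Global causal attention: position i sees every j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def local_causal_mask(n, window):
    """Local attention: position i sees only j with i - window < j <= i."""
    idx = np.arange(n)
    offset = idx[:, None] - idx[None, :]
    return (offset >= 0) & (offset < window)

n, window = 8, 3
global_pairs = int(causal_mask(n).sum())               # n * (n + 1) / 2 pairs
local_pairs = int(local_causal_mask(n, window).sum())  # at most window * n pairs
print(global_pairs, local_pairs)
```

A hybrid model in the sense of the abstract would use the first mask in some heads or layers and the second in others.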

2605.00333 2026-05-20 cs.LG cs.CL 版本更新

Borrowed Geometry: Cross-Distribution Head-Importance Fingerprints of Frozen Pretrained Gemma 4 31B

借来的几何:冻结预训练Gemma 4 31B的跨分布注意力头重要性指纹

Abay Bektursun

发表机构 * Independent research(独立研究)

AI总结 本文研究了冻结预训练的Gemma 4 31B模型在跨分布任务中的头部重要性指纹,通过分析多个任务中的头部影响,发现特定头部在不同任务中表现出显著的重要性,同时验证了这些头部在因果上的有效性。

Comments v2: Added head-level causal ablation on OGBench cube-task1 (n=30, 3.2x specificity; n=5 paired-t p=0.039) and full L26 sweep. New sections on honest negatives (activation patching null, sufficiency null, within-layer Spearman wrong-direction). Multiplicity-aware permutation null V4 P=0.013. Title and framing updated. 25 pages (13 main), 10 figures

详情
AI中文摘要

仅在文本上预训练的Gemma 4 31B权重在冻结且未经修改的情况下,可通过一个薄的可训练接口迁移到该模型从未处理过的非文本模态。在L24-L29切片(192个注意力头)上,一个英语文本TxtCopy注意力探针(95个句子)与每个注意力头在四个非语言标记模式任务(二进制复制、联想回忆、一维元胞自动机规则90、二进制加法)上的消融影响,共同将四个注意力头(L26.28、L27.28、L27.2、L27.3)判定为在两种信号上都处于顶级。切片级别的联合重合在超几何零假设下显著(P=0.0013,N=192,K=38,n=4),并通过了考虑多重性的置换检验(P_V4=0.013)。预训练的Gemma L26在OGBench cube-double-play-task1上达到60.22%,而随机初始化的Gemma约为1%(n=3时+59个百分点);采用正确1/√d_k缩放的FrozenRandom-GPT2对照同样失败。注意力头层面的因果验证:在训练好的cube-task1 IQL智能体中将L26.28置零,成功率从63.3%降至10.0%,而层匹配的低TxtCopy负对照为46.7%(n=30时特异性为3.2倍;n=5配对t检验p=0.039)。完整的L26扫描将L26.28排在32个注意力头中的第4位。如实报告的负面结果:L26内的Spearman ρ(TxtCopy, drop)=+0.37(与层内因果解读的方向相反);单注意力头激活修补不能迁移匹配变量;仅靠这四个被点名的注意力头不足以完成任何任务;Walker2d-DT和scene-task1调用了被点名切片之外的L24,且未表现出注意力头消融特异性。我们将本文贡献表述为:切片级别的跨分布重要性指纹,加上在一个跨模态目标上的注意力头层面因果证据。

英文摘要

Frozen Gemma 4 31B weights pretrained exclusively on text, unmodified, transfer through a thin trainable interface to non-text modalities the substrate has never processed. On the L24--L29 slice (192 attention heads), an English-text TxtCopy attention probe (95 sentences) and per-head ablation impact on four non-language token-pattern tasks (binary copy, associative recall, 1D cellular automaton Rule 90, binary addition) jointly classify four heads -- L26.28, L27.28, L27.2, L27.3 -- as top-tier on both signals. The slice-level joint coincidence is significant under hypergeometric null ($P = 0.0013$, $N=192$, $K=38$, $n=4$) and survives multiplicity-aware permutation tests ($P_{V4} = 0.013$). Pretrained Gemma L26 reaches 60.22% on OGBench cube-double-play-task1 vs ~1% for random-init Gemma ($+59$pt at $n=3$); a FrozenRandom-GPT2 control with correct $1/\sqrt{d_k}$ scaling also fails. Head-level causal validation: zeroing L26.28 in the trained cube-task1 IQL agent drops success $63.3\% \to 10.0\%$ vs $46.7\%$ for a layer-matched low-TxtCopy negative control ($3.2\times$ specificity at $n=30$; $n=5$ paired-$t$ $p=0.039$). A full L26 sweep places L26.28 at rank 4 of 32. Honest negatives: within-L26 Spearman $ρ(\text{TxtCopy, drop}) = +0.37$ (opposite of within-layer causal reading); single-head activation patching does not transfer the matching variable; the 4 named heads alone do not suffice on any task; Walker2d-DT and scene-task1 recruit L24 outside the named slice and show null head-ablation specificity. We frame the contribution as a cross-distribution importance fingerprint at the slice level plus head-level causal evidence on one cross-modality target.
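The two headline numbers in this abstract can be checked with simple arithmetic. The sketch below takes one reading of the hypergeometric parameters (N heads in the slice, K of them top-tier, n heads drawn, all n landing in the top tier) and computes the specificity ratio as the target drop divided by the control drop. Both readings are assumptions that happen to reproduce the reported values; the paper's exact test construction may differ.

```python
from math import comb

# Hypergeometric null: chance that n randomly chosen heads all fall among the top K of N.
N, K, n = 192, 38, 4
p_all_top = comb(K, n) / comb(N, n)

# Specificity: success drop from zeroing L26.28 versus the layer-matched control head.
baseline = 63.3
drop_target = baseline - 10.0    # percentage points lost when zeroing L26.28
drop_control = baseline - 46.7   # percentage points lost for the negative control
specificity = drop_target / drop_control

print(f"P = {p_all_top:.4f}, specificity = {specificity:.1f}x")
```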

2604.16593 2026-05-20 cs.CL 版本更新

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

重新审视一个令人头疼的问题:一种用于语言模型的语义推理基准

Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang

发表机构 * University of Science and Technology Beijing(北京科技大学) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)

AI总结 本文提出SemanticQA基准,用于评估语言模型在语义短语处理任务中的表现,通过整合现有多词表达资源并重新组织为统一测试平台,涵盖通用词汇现象及三种细粒度类别,评估不同架构和规模的语言模型在提取、分类、解释及任务组合中的性能,揭示语义推理任务中模型性能的显著差异,为提升语言模型在非平凡语义短语上的理解能力提供见解。

Comments ACL 2026 (Oral), 24 pages, 22 figures, 14 tables

详情
AI中文摘要

我们提出SemanticQA,一个用于评估语言模型(LMs)语义短语处理能力的评估套件。该基准整合了现有的多词表达(MwE)资源,并将其重新组织为统一的测试平台。它既涵盖词汇搭配等一般词汇现象,也涵盖三个细粒度类别:习语表达、名词复合词和动词结构。借助SemanticQA,我们评估了不同架构和规模的语言模型在提取、分类、解释任务以及顺序任务组合中的表现。结果显示出显著的性能差异,在需要语义推理的任务上尤为明显,凸显了各语言模型在推理效力和语义理解上的差别,并为提升语言模型对非平凡语义短语的理解能力提供了见解。SemanticQA的评估工具和数据可在https://github.com/jacklanda/SemanticQA获取。

英文摘要

We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.

2604.11796 2026-05-20 cs.CL cs.AI 版本更新

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

C-ReD:一个源自真实世界提示的综合性中文AI生成文本检测基准

Chenxi Qing, Junxi Wu, Zheng Liu, Yixiang Qiu, Hongyao Yu, Bin Chen, Hao Wu, Shu-Tao Xia

发表机构 * Tsinghua University(清华大学) Nankai University(南开大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Peng Cheng Laboratory(鹏城实验室) Shannon InfoTech

AI总结 本文提出C-ReD基准,用于检测AI生成的中文文本,通过解决模型多样性、领域覆盖和提示真实性等关键问题,提升检测性能和泛化能力。

Comments ACL 2026 Findings

详情
AI中文摘要

近年来,大型语言模型(LLMs)能够生成高度流畅的文本内容。尽管它们为人类提供了显著的便利,但也引入了诸如钓鱼和学术不端等风险。大量研究致力于开发检测AI生成文本的算法并构建相关数据集。然而,在中文语料领域仍存在挑战,包括模型多样性有限和数据同质性。为了解决这些问题,我们提出了C-ReD:一个综合性的中文真实提示AI生成检测基准。实验表明,C-ReD不仅能够实现可靠的领域内检测,还支持对未见LLMs和外部中文数据集的强大泛化能力,从而弥补了先前中文检测基准在模型多样性、领域覆盖和提示真实性方面的关键缺口。我们已在https://github.com/HeraldofLight/C-ReD上发布了相关资源。

英文摘要

Recently, large language models (LLMs) are capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.

2604.07035 2026-05-20 cs.CL 版本更新

Unified Deployment-Aware Evaluation of Open Reasoning Language Models

统一的部署感知开放推理语言模型评估

Md Motaleb Hossen Manik, Ge Wang

发表机构 * Department of Computer Science, Rensselaer Polytechnic Institute(伦斯勒理工学院计算机科学系) Department of Biomedical Engineering, Rensselaer Polytechnic Institute(伦斯勒理工学院生物医学工程系)

AI总结 本文提出了一种统一的开放推理语言模型评估方法,通过四个基准测试(ARC-Challenge、GSM8K、MATH 1-3级和TruthfulQA MC1)对七种配置进行评估,结合零样本、链式思维(CoT)和少量样本CoT提示策略,分析模型在准确率、延迟、内存使用等多目标下的表现,强调部署感知的多目标优化问题。

详情
AI中文摘要

开放推理语言模型的比较往往基于不一致的样本量、部分标准化的提示和以准确率为中心的汇总,这使得实际的模型选择难以解读。我们在ARC-Challenge、GSM8K、MATH 1-3级和TruthfulQA MC1四个基准上,对七种开放推理语言模型配置进行了统一评估。对于每个模型-数据集-策略条件,我们在同一个238个示例的子集上测试零样本、链式思维(CoT)和少样本CoT提示,得到完整的7×4×3设计,共84个条件和19,992个评估示例。除准确率外,我们还报告Wilson置信区间、延迟、峰值显存(VRAM)、加权聚合性能、帕累托高效运行点、提示敏感度指标和兼容性诊断。Gemma-4-26B-A4B在零样本提示下取得最高加权分数0.794。Gemma-4-E4B在各种提示设置下都接近榜首,同时延迟和内存占用低得多,是一个很强的实用运行点。Bootstrap和配对置换分析表明,领先的配置彼此足够接近,因此部署上的权衡仍然重要。我们还发现,提示策略的变化会改变模型排名,而不是让所有模型一致地升降。各基准上的互补性带来了路由的提升空间:一个oracle任务感知选择器的加权分数可达0.825。兼容性诊断表明,一些表面上的失败(尤其是Phi-4-Reasoning在GSM8K上的失败)反映的是共享评估流程下的鲁棒性和接口遵循问题。这些结果支持一个核心主张:开放模型评估应被视为部署感知的多目标运行点问题,而不是单一分数的排行榜比拼。

英文摘要

Open reasoning language models are often compared under mixed sample sizes, partially standardized prompts, and accuracy-centered summaries, which makes practical model selection difficult to interpret. We present a unified evaluation of seven open reasoning language model configurations across four benchmarks: ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1. We test zero-shot, chain-of-thought (CoT), and few-shot CoT prompting on the same 238-example subset for every model--dataset--strategy condition, yielding a complete 7 x 4 x 3 design with 84 conditions and 19,992 evaluated examples. Beyond accuracy, we report Wilson confidence intervals, latency, peak video random access memory (VRAM), weighted aggregate performance, Pareto-efficient operating points, prompt-sensitivity metrics, and compatibility diagnostics. Gemma-4-26B-A4B with zero-shot prompting achieves the highest weighted score at 0.794. Gemma-4-E4B remains close to the top across prompting settings while using substantially lower latency and memory, making it a strong practical operating point. Bootstrap and paired-permutation analyses show that the leading configurations are close enough that deployment tradeoffs remain important. We also find that prompting strategy changes model rankings rather than shifting all models uniformly. Benchmark-specific complementarity creates routing headroom, with an oracle task-aware selector reaching a weighted score of 0.825. Compatibility diagnostics show that some apparent failures, especially Phi-4-Reasoning on GSM8K, reflect robustness and interface-adherence problems under the shared evaluation pipeline. These results support a central claim: open-model evaluation should be framed as a deployment-aware, multi-objective operating-point problem rather than as a single-score leaderboard exercise.
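The design arithmetic and the Wilson interval mentioned above are easy to reproduce. The sketch below uses the standard Wilson score formula; the 95% level (z = 1.96) and the example of 119 correct answers out of 238 are assumptions chosen for illustration, not values taken from the paper.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

conditions = 7 * 4 * 3       # models x benchmarks x prompting strategies
examples = conditions * 238  # every condition uses the same 238-example subset

lo, hi = wilson_interval(119, 238)   # hypothetical condition at 50% accuracy
print(conditions, examples, round(lo, 3), round(hi, 3))
```

At 238 examples per condition the interval around 50% accuracy spans roughly six percentage points on each side, which helps explain why the leading configurations are hard to separate on accuracy alone.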

2604.02784 2026-05-20 cs.CV cs.CL 版本更新

EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

EnsemHalDet: 通过内部状态检测器的集成实现鲁棒的视觉语言模型幻觉检测

Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

发表机构 * The University of Electro-Communications(电通大学)

AI总结 本文提出EnsemHalDet,一种通过集成多个内部表示的视觉语言模型幻觉检测框架,以提高多模态幻觉检测的鲁棒性。

详情
AI中文摘要

视觉语言模型(VLMs)在多模态任务中表现出色,但它们仍然容易受到事实错误或与输入图像无关的幻觉影响。最近的研究表明,利用内部表示进行幻觉检测比仅依赖模型输出的方法更高效和准确。然而,现有的基于内部表示的方法通常依赖于单一的表示或检测器,限制了它们捕捉多样化幻觉信号的能力。在本文中,我们提出了EnsemHalDet,一种基于集成的幻觉检测框架,利用VLMs的多种内部表示,包括注意力输出和隐藏状态。EnsemHalDet为每个表示训练独立的检测器,并通过集成学习进行组合。在多个VQA数据集和VLMs上的实验结果表明,EnsemHalDet在AUC方面始终优于先前的方法和单检测器模型。这些结果表明,集成多样化的内部信号显著提高了多模态幻觉检测的鲁棒性。

英文摘要

Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.

2603.25620 2026-05-20 cs.CL 版本更新

PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

PICon: 一种用于评估人设代理一致性的多轮询问框架

Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Hwajung Hong, Edward Choi

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出PICon框架,通过逻辑链式的多轮提问评估人设代理的一致性,发现即使之前被认为高度一致的系统在三个维度上也未能达到人类基准水平,揭示了矛盾和逃避回应。

Comments 20 pages, 6 figures

详情
AI中文摘要

基于大型语言模型的人设代理正被广泛应用于替代人类参与者,但缺乏系统方法来验证其响应是否在交互中保持一致性和准确性。本文提出PICon框架,通过逻辑链式的多轮提问来评估人设代理的一致性,从内部一致性(无自相矛盾)、外部一致性(与现实世界事实一致)和重测一致性(重复测试下的稳定性)三个核心维度进行评估。在评估七组人设代理和63名真实人类参与者时,发现即使之前报告为高度一致的系统在三个维度上也未能达到人类基准水平,揭示了矛盾和逃避回应。本文为评估人设代理提供了概念基础和实用方法,提供了源代码和交互演示:https://kaist-edlab.github.io/picon/

英文摘要

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/

2603.22453 2026-05-20 cs.CL cs.SI 版本更新

XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception

XNote: 对基于图像的上下文欺骗的自动社区笔记生成进行基准测试

Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru, Jinkyung Katie Park, Feng Luo, Long Cheng

发表机构 * School of Computing, Clemson University(克莱姆森大学计算机学院)

AI总结 本文研究了基于图像的上下文欺骗的自动社区笔记生成任务,提出了一个真实世界数据集XNote,并对前沿大视觉语言模型和商业工具进行了基准测试,以评估其在欺骗检测和笔记生成任务中的性能。

详情
AI中文摘要

社区笔记已成为对抗社交媒体平台上网络欺骗的一种有效的众包机制。然而,它对人类贡献者的依赖限制了时效性和可扩展性。在本工作中,我们研究针对基于图像的上下文欺骗的自动社区笔记生成任务,即一张真实图像被配上误导性的上下文(例如时间、实体和事件)。以往工作主要关注欺骗检测(即以二元方式判断帖子真假),与之不同,自动社区笔记生成需要产出简洁且有依据的笔记,帮助用户找回缺失的上下文或得到更正后的上下文。由于支持该任务的数据集稀缺,这一问题仍未得到充分探索。为填补这一空白,我们构建了真实世界数据集XNote,其中包含X平台上的帖子及其对应的社区笔记和外部上下文,并附有主题和欺骗因素的标注。我们进一步在XNote上对一系列前沿大视觉语言模型(LVLMs)进行基准测试,评估它们在欺骗检测和笔记生成两项任务上的表现。我们还与端到端方法SNIFFER和商业工具GPT-5进行了比较。结果凸显了自动社区笔记生成的难点,表明需要针对该任务改进方法和指标。

英文摘要

Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, its reliance on human contributors limits both the timeliness and scalability. In this work, we study the automated Community Notes generation task for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), automated Community Notes generation requires producing concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to the scarcity of datasets that support this task. To address this gap, we curate a real-world dataset, XNote, comprising X posts with associated Community Notes and external contexts, along with annotations of topics and deceptive factors. We further benchmark a range of frontier large vision language models (LVLMs) on XNote, evaluating their performance on both deception detection and note generation tasks. We also compare against an end-to-end approach, SNIFFER, and a commercial tool, GPT-5. Our results highlight the challenges in automated Community Notes generation, underscoring the need for improved methods and metrics tailored for this task.

2603.17839 2026-05-20 cs.CL cs.AI cs.LG 版本更新

How do LLMs Compute Verbal Confidence

LLMs如何计算言语自信

Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Veličković

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 研究探讨了大型语言模型如何内部生成言语自信评分,通过实验发现自信评分在回答生成后被缓存并用于后续输出,揭示了模型自我评估的机制。

详情
AI中文摘要

言语自信(即提示LLMs以数字或类别形式陈述其信心)被广泛用于从黑箱模型中获取不确定性估计。然而,LLMs内部如何生成此类评分仍不清楚。我们研究两个问题:第一,信心是在何时计算的,是在被请求时即时计算,还是在生成答案的过程中自动计算并缓存以供之后检索;第二,言语自信代表什么,是token对数概率,还是对答案质量更丰富的评估?我们以Gemma 3 27B(涵盖TriviaQA、BigMath和MMLU)、Qwen 2.5 7B以及推理模型Magistral Small 24B为研究对象,为缓存检索提供了相互印证的证据。激活引导、修补、加噪和交换实验表明,信心表示先出现在与答案相邻的位置,之后才出现在言语化位置。注意力阻断确定了信息流:信心从答案token中收集,缓存在答案之后的第一个位置,然后被检索用于输出。关键的是,线性探测和方差分解表明,这些缓存的表示在token对数概率之外还能解释言语自信中的大量方差,这说明它是对答案质量更丰富的评估,而非简单的流畅度读数。这些发现表明,言语自信反映的是自动且复杂的自我评估,而非事后重构,这对理解LLMs的元认知和改进校准都有启示。

英文摘要

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B (across TriviaQA, BigMath, and MMLU), Qwen 2.5 7B, and the reasoning model Magistral Small 24B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.

2603.01009 2026-05-20 cs.CL 版本更新

Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays

Qayyem:用于阿拉伯语作文水平评分的实时平台

Hoor Elbahnasawi, Marwan Sayed, Sohaila Eltanbouly, Fatima Brahamia, Tamer Elsayed

发表机构 * Computer Science and Engineering Department, Qatar University(卡塔尔大学计算机科学与工程系)

AI总结 本文提出Qayyem平台,提供阿拉伯语作文评分的集成工作流程,通过友好的界面简化评分服务器API的交互,部署了多种先进的阿拉伯语作文评分模型。

Comments Accepted at ACL 2026

详情
AI中文摘要

近年来,自动作文评分(AES)系统作为评估学生写作水平的可扩展且一致的解决方案,受到越来越多的关注。尽管近期有所进展,但由于语言复杂以及缺乏大规模公开标注数据集,对阿拉伯语AES的支持仍然有限。在本工作中,我们提出Qayyem,一个基于Web的平台,通过提供涵盖作业创建、批量作文上传、评分配置和按特质(per-trait)作文评估的集成工作流来支持阿拉伯语AES。Qayyem屏蔽了与评分服务器API交互的技术复杂性,使教师能够通过友好的界面使用高级评分服务。该平台部署了多个最先进的阿拉伯语作文评分模型,它们在效果和效率上各不相同。

英文摘要

Over the past years, Automated Essay Scoring (AES) systems have gained increasing attention as scalable and consistent solutions for assessing the proficiency of student writing. Despite recent progress, support for Arabic AES remains limited due to linguistic complexity and scarcity of large publicly-available annotated datasets. In this work, we present Qayyem, a Web-based platform designed to support Arabic AES by providing an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation. Qayyem abstracts the technical complexity of interacting with scoring server APIs, allowing instructors to access advanced scoring services through a user-friendly interface. The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.

2602.13466 2026-05-20 cs.CL cs.AI cs.LG 版本更新

Language Model Memory and Memory Models for Language

语言模型的记忆与面向语言的记忆模型

Benjamin L. Badger

发表机构 * IBM(IBM公司)

AI总结 研究探讨了语言模型和记忆模型在信息存储中的能力差异,发现语言模型的嵌入向量信息较少,而自编码器在输入再生训练中能形成接近完美的记忆,提出了一种可并行的编码器-解码器记忆模型架构,并通过结合因果和信息保留目标函数来提升记忆形成和解码能力。

详情
AI中文摘要

机器学习模型能够将输入信息存储在隐藏层向量嵌入中,这类似于“记忆”的概念;这种能力被广泛利用,但尚未得到充分刻画。我们发现,无论训练时的数据和计算规模如何,语言模型的嵌入通常只包含相对较少的输入信息。相比之下,以输入重建为目标训练的自编码器,其嵌入能够形成近乎完美的记忆。用记忆嵌入替代token序列可以显著提高计算效率,这促使我们引入一种可并行的编码器-解码器记忆模型架构。仅经过因果训练时,这些模型的嵌入信息贫乏,无法支持任意的信息访问;但通过结合因果目标函数和信息保留目标函数,它们学会了形成并解码信息丰富的记忆。训练还可以进一步简化:先冻结一个高保真编码器,再采用课程训练,让解码器先学习处理记忆,然后再额外学习预测下一个token。我们提出这样一种观点:由于下一个token预测目标本身不可逆,仅靠这种训练并不适合形成准确的记忆,因此对于无法看到完整输入的模型,应当使用组合目标函数。

英文摘要

The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of 'memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.

2602.11767 2026-05-20 cs.AI cs.CL cs.LG Updated version

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Heiko Ludwig, Holger Boche

Affiliations: Technical University of Munich; IBM Research

AI summary: The paper proposes TSR, a training-time method that improves per-turn rollout generation. It uses lightweight tree-style search to construct high-quality trajectories, improving rollout quality and learning stability in multi-turn RL.

Abstract

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using state-based feedback. This improves rollout quality and stabilizes learning while remaining compatible with standard policy gradient optimizers, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a modest, one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a modular and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
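
As a rough illustration of the per-turn selection idea in this abstract, the sketch below runs a best-of-N search at each turn of a toy one-dimensional task. The environment, the `score_state` feedback function, and the random proposal policy are hypothetical stand-ins rather than the authors' setup, and the PPO or GRPO update that TSR is paired with is omitted.

```python
import random

GOAL = 5  # hypothetical target position for the toy task


def score_state(position: int) -> float:
    """State-based feedback: higher is better (closer to GOAL)."""
    return -abs(GOAL - position)


def propose_action(rng: random.Random) -> int:
    """Stand-in for sampling an action from the policy: step left or right."""
    return rng.choice([-1, 1])


def best_of_n_rollout(n_candidates: int, turns: int, seed: int = 0) -> list[int]:
    """Build one trajectory, keeping the best of N sampled actions per turn."""
    rng = random.Random(seed)
    position, trajectory = 0, [0]
    for _ in range(turns):
        candidates = [propose_action(rng) for _ in range(n_candidates)]
        # Keep the candidate whose resulting state scores highest.
        best = max(candidates, key=lambda a: score_state(position + a))
        position += best
        trajectory.append(position)
    return trajectory


naive = best_of_n_rollout(n_candidates=1, turns=5)     # plain sampling
searched = best_of_n_rollout(n_candidates=8, turns=5)  # best-of-8 per turn
print(naive[-1], searched[-1])
```

With `n_candidates=1` the loop reduces to naive trajectory sampling; larger values make it more likely that each turn moves toward the goal, at the cost of extra scoring calls per turn.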

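2602.06462 2026-05-20 cs.CL cs.LG Updated version

Diffusion-State Policy Optimization for Masked Diffusion Language Models

Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

Affiliations: Institute of Science Tokyo

AI summary: The paper proposes Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer for masked diffusion language models that directly optimizes intermediate filling decisions. Experiments show that it improves on terminal-feedback baselines on math and planning benchmarks.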

Abstract

Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo .

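2602.04998 2026-05-20 cs.LG cs.AI cs.CL Updated version

Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh

Affiliations: National Taiwan University; IBM Research; Academia Sinica

AI summary: Through extensive hyperparameter searches, the paper re-evaluates nine representative LoRA variants alongside vanilla LoRA on mathematical reasoning, commonsense reasoning, code generation, and instruction following, and finds that different LoRA methods favor distinct learning rate ranges. Once learning rates are properly tuned, all methods reach similar peak performance, suggesting that vanilla LoRA remains a competitive baseline and that gains reported under a single training configuration may not reflect consistent methodological advantages.

Comments: Project page: https://github.com/yuang-lee/lr-matters-lora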

Abstract

Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies, architectural modifications, and optimization adjustments, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate nine representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches over learning rate, batch size, rank, and training duration. Across tasks spanning mathematical reasoning, commonsense reasoning, code generation, and instruction following at diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.

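2601.16823 2026-05-20 cs.CL cs.AI Updated version

Disentangling generalization and memorization in large language models using chess

Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsaecker

Affiliations: Technical University of Munich

AI summary: Using chess as a controlled testbed, the paper separates generalization from memorization in large language models. Performance drops sharply when relevant priors are sparse, indicating limited systematic generalization and the need for mechanisms beyond scale to achieve robust performance.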

Abstract

Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall or genuine reasoning ability. We introduce chess as a controlled testbed aimed at disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in density of relevant priors - ranging from common states solvable by memorization to completely novel ones requiring generalization. Crucially, our approach achieves this distinction without requiring explicit knowledge of the models' training data. Applying this taxonomy, we combine a longitudinal analysis of the GPT lineage with a rigorous evaluation of contemporary models, including Claude Opus and Gemini. Our analysis reveals a steep gradient: performance consistently degrades as the density of relevant priors decreases. Notably, for tasks with few relevant priors, base model performance regresses to the random-play baseline. While newer models improve, progress slows significantly for tasks with sparse priors. Furthermore, while reasoning-augmented inference improves performance, its relative marginal benefit per token decreases in the absence of relevant priors. These results suggest limitations in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust performance when deprived of relevant priors.

2601.12369 2026-05-20 cs.CL Updated version

Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhui Wang, Zhenghao Xiang, Qiyuan Peng, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Maxm Pan, Tao Gui, Qi Zhang, Xuanjing Huang

Affiliations: Fudan University; Hunyuan Team, Tencent

AI summary: The paper introduces the TaxoBench benchmark to evaluate how well deep research agents retrieve and organize papers, and finds bottlenecks on both the capability side and the alignment side.

Abstract

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to Sem-Path 28-29%, well below 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
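
The retrieval half of the evaluation reduces to set overlap between the papers an agent returns and the papers the expert survey cites. The sketch below shows that computation with the standard Recall/Precision/F1 definitions; the paper identifiers are made up for illustration, and how the benchmark itself matches retrieved papers to references is not stated in the abstract.

```python
def retrieval_scores(retrieved: set[str], reference: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of a retrieved set against a reference set."""
    hits = len(retrieved & reference)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical paper IDs: five expert-cited papers, four retrieved by an agent.
expert_cited = {"p1", "p2", "p3", "p4", "p5"}
agent_retrieved = {"p1", "p2", "x1", "x2"}

precision, recall, f1 = retrieval_scores(agent_retrieved, expert_cited)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the agent finds two of the five expert-cited papers, so recall is 0.40; the 20.92% figure in the abstract is a recall of this kind.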

2601.05437 2026-05-20 cs.CL cs.AI Updated version

Tracing Moral Foundations in Large Language Models

Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani

Affiliations: Department of Computer Science, University of Southern California; Department of Psychology, University of Southern California; Center for Computational Language Sciences, University of Southern California

AI summary: The paper studies how moral foundations are encoded, organized, and expressed in large language models. A multi-level analysis measures how well model representations of moral foundations align with human moral perceptions, and finds that moral structure emerges naturally from pretraining, is selectively rewired by post-training, and is partly disentangled.

Abstract

Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human judgments, and that this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.

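2512.14700 2026-05-20 cs.SI cs.CL cs.CY Updated version

Context-Aware Detection and Victim-Centered Response Generation for Online Harassment in Private Messaging

Pinxian Lu, Nimra Ishfaq, Emma Win, Morgan Rose, Sierra R Strickland, Candice L Biernesser, Jamie Zelazny, Munmun De Choudhury

Affiliations: Georgia Institute of Technology; The University of Texas at Austin; University of Pittsburgh

AI summary: The paper studies how large language models can support the detection of and response to online harassment in private messaging. Starting from 80,053 donated Instagram direct messages, it builds a human-labeled dataset, develops a context-aware cascading classification pipeline, and proposes a victim-centered response framework that generates psychologically grounded AI responses. Human evaluators rated these responses as significantly more helpful than the original participant responses, particularly for emotional support and de-escalation.

Comments: 16 pages, 2 figures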

Abstract

Online harassment is a widespread social and public health concern, yet most computational approaches for detecting and addressing harassment focus on publicly visible social media content rather than private messaging environments. Private conversations present unique challenges because harmful interactions often unfold through context-dependent, multi-turn exchanges, while victims may lack timely support during moments of harassment. In this study, we investigate how large language models (LLMs) can support both the detection of and response to online harassment in private messaging. Using a dataset of 80,053 Instagram direct messages donated by 26 adolescents aged 12-18, including youth with suicide risk factors, we first construct a human-labeled dataset of online harassment in private conversations and develop a context-aware cascading LLM classification pipeline. The proposed pipeline outperforms baseline toxicity classifiers trained primarily on public social media data. We then develop a victim-centered response framework that produces context-sensitive and psychologically grounded AI-generated responses to online harassment messages. Human evaluators perceived the AI-generated responses as significantly more helpful than the original participant responses (95% CI: 0.767-0.815, p < .001), particularly in terms of emotional support and de-escalation. Our findings highlight the potential of context-aware and victim-centered AI systems to provide just-in-time support during harassment in private messaging environments.

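2512.07068 2026-05-20 cs.CL Updated version

SETUP: Sentence-level English-To-Uniform Meaning Representation Parser

Emma Markle, Javier Gutierrez Bach, Shira Wein

Affiliations: Amherst College

AI summary: The paper presents two methods for English text-to-UMR (Uniform Meaning Representation) parsing: one fine-tunes existing Abstract Meaning Representation parsers, and the other uses a converter from Universal Dependencies. The best model, SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, a substantial gain in automatic UMR parsing.

Comments: LREC 2026 Camera-ready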

Abstract

Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world's languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one of which fine-tunes existing parsers for Abstract Meaning Representation and the other, which leverages a converter from Universal Dependencies, using prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.

2511.04776 2026-05-20 cs.CY cs.CL Updated version

Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid

Zahida Kausar, Seemab Latif, Raja Khurram Shahzad, Mehwish Fatima

Affiliations: School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan

AI summary: The paper proposes G-TRACE, a framework for quantifying the training- and inference-related emissions of generative AI across modalities and deployment regions, and the AI Sustainability Pyramid, a seven-level governance model to guide sustainable AI deployment.

Comments: 27 pages, 4 figures

Abstract

Generative Artificial Intelligence (GenAI) represents a rapidly expanding digital infrastructure whose energy demand and associated CO2 emissions are emerging as a new category of climate risk. This study introduces G-TRACE (GenAI Transformative Carbon Estimator), a cross-modal, region-aware framework that quantifies training- and inference-related emissions across modalities and deployment geographies. Using real-world analytics and microscopic simulation, G-TRACE measures energy use and carbon intensity per output type (text, image, video) and reveals how decentralized inference amplifies small per-query energy costs into system-level impacts. Through the Ghibli-style image generation trend (2024-2025), we estimate 4,309 MWh of energy consumption and 2,068 tCO2 emissions, illustrating how viral participation inflates individual digital actions into tonne-scale consequences. Building on these findings, we propose the AI Sustainability Pyramid, a seven-level governance model linking carbon accounting metrics (L1-L7) with operational readiness, optimization, and stewardship. This framework translates quantitative emission metrics into actionable policy guidance for sustainable AI deployment. The study contributes to the quantitative assessment of emerging digital infrastructures as a novel category of climate risk, supporting adaptive governance for sustainable technology deployment. By situating GenAI within climate-risk frameworks, the work advances data-driven methods for aligning technological innovation with global decarbonization and resilience objectives.
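
The two headline figures in this abstract imply an average grid carbon intensity, and a region-aware estimate follows from multiplying energy by a per-region intensity. The sketch below works through that arithmetic. The energy-times-intensity formula is a common accounting simplification assumed here, not necessarily how G-TRACE computes emissions, and the regional intensities are placeholder values rather than measured data.

```python
def emissions_tonnes(energy_mwh: float, intensity_t_per_mwh: float) -> float:
    """Emissions in tonnes of CO2 = energy (MWh) x grid intensity (tCO2/MWh)."""
    return energy_mwh * intensity_t_per_mwh


# Figures reported in the abstract for the Ghibli-style image trend.
energy_mwh = 4309.0
reported_tco2 = 2068.0

# Average intensity implied by the two reported numbers.
implied_intensity = reported_tco2 / energy_mwh
print(f"implied intensity: {implied_intensity:.3f} tCO2/MWh")

# Placeholder regional intensities (NOT real data) to show region sensitivity.
hypothetical_regions = {"low_carbon_grid": 0.10, "high_carbon_grid": 0.70}
for region, intensity in hypothetical_regions.items():
    print(region, round(emissions_tonnes(energy_mwh, intensity), 1), "tCO2")
```

The implied average is about 0.48 tCO2 per MWh, that is, roughly 480 g of CO2 per kWh; the same energy use placed on a cleaner or dirtier grid would shift the total proportionally.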

2511.01526 2026-05-20 cs.CL Updated version

Difficulty-Controllable Cloze Question Distractor Generation

Seokhoon Kang, Yejin Jeon, Seonjeong Hwang, Gary Geunbae Lee

Affiliations: Graduate School of Artificial Intelligence, POSTECH, South Korea; Department of Computer Science and Engineering, POSTECH, South Korea; Mila Quebec AI Institute, Canada; McGill University, Canada

AI summary: The paper proposes a framework for difficulty-controllable cloze distractor generation. Using data augmentation and multitask learning, it generates high-quality, difficulty-annotated distractors and outperforms GPT-4o at aligning distractor difficulty with human perception.

Comments: Accepted to ACL 2026 Main Conference

Abstract

Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process to produce diverse and plausible distractors. These candidates are filtered and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is used to train a difficulty-controllable generation model via multitask learning. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.

2510.25064 2026-05-20 cs.CL Updated version

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee

Affiliations: Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; Department of Computer Science and Engineering, POSTECH, Republic of Korea

AI summary: The paper examines whether large language models can estimate the cognitive complexity of reading comprehension items along two dimensions, Evidence Scope and Transformation Level. LLMs can approximate cognitive complexity, but there is a gap between their reasoning ability and their metacognitive awareness.

Comments: ACL 2026 Main Conference

Abstract

Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

2510.14261 2026-05-20 cs.CL Updated version

Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

Rahul Nadkarni, Yanai Elazar, Hila Gonen, Noah A. Smith

Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington; Department of Computer Science, Bar-Ilan University; Department of Computer Science, University of British Columbia; Allen Institute for Artificial Intelligence

AI summary: The paper presents an experimental recipe for studying the relationship between training data and language model behavior: intervene on data batches ("rewriting history") and retrain model checkpoints to test hypotheses linking data to behavior. Case studies on factual knowledge acquisition demonstrate the recipe.

Comments: Accepted to TACL, pre-MIT Press publication version

Abstract

We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches (i.e., "rewriting history") and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM's ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.
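
The document-matching stage can be pictured with a simple cooccurrence filter: a training document is a candidate for a factual item if it mentions both the subject and the answer. The sketch below is a minimal version that assumes case-insensitive substring matching over a toy corpus; the function names and documents are invented for illustration, and the abstract does not specify the authors' actual matching procedure.

```python
def matching_documents(documents: list[str], subject: str, answer: str) -> list[int]:
    """Indices of documents that mention both the subject and the answer."""
    subject, answer = subject.lower(), answer.lower()
    return [
        i for i, doc in enumerate(documents)
        if subject in doc.lower() and answer in doc.lower()
    ]


def cooccurrence_count(documents: list[str], subject: str, answer: str) -> int:
    """Number of documents in which subject and answer cooccur."""
    return len(matching_documents(documents, subject, answer))


# Toy corpus standing in for a batch of training documents.
corpus = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "The physicist Marie Curie left Warsaw for Paris.",
]

matched = matching_documents(corpus, "Marie Curie", "Warsaw")
print(matched, cooccurrence_count(corpus, "Marie Curie", "Warsaw"))
```

The matched indices mark the documents an intervention would modify before retraining, so that the effect on the corresponding evaluation item can be measured.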

2510.12773 2026-05-20 cs.CL cs.AI cs.LG Updated version

Dr.LLM: Dynamic Layer Routing in LLMs

Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh

Affiliations: Parameter Lab; MBZUAI; NAVER AI Lab; University of Tübingen; Tübingen AI Center

AI summary: The paper proposes Dr.LLM, a framework that adds lightweight per-layer routers to pretrained models for dynamic layer routing. The routers are trained with explicit supervision and the base weights stay unchanged, improving the computational efficiency and accuracy of inference.

Comments: Published at ICLR 2026

Abstract

Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr. LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr. LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr. LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights. Code is available at https://github.com/parameterlab/dr-llm.
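
To make the skip, execute, or repeat decision concrete, the sketch below applies a fixed list of per-layer decisions to a stack of toy blocks. It assumes that "repeat" means running a block twice and takes the decisions as given, whereas in the paper they come from learned routers; the scalar blocks and all names are hypothetical.

```python
from typing import Callable, Sequence

Block = Callable[[float], float]  # toy stand-in for a transformer block


def run_with_routing(
    x: float, blocks: Sequence[Block], decisions: Sequence[str]
) -> tuple[float, int]:
    """Apply blocks according to per-layer decisions; return output and block calls."""
    if len(blocks) != len(decisions):
        raise ValueError("need exactly one decision per block")
    executed = 0
    for block, decision in zip(blocks, decisions):
        if decision == "skip":
            continue  # hidden state passes through unchanged
        if decision not in ("execute", "repeat"):
            raise ValueError(f"unknown decision: {decision}")
        repeats = 2 if decision == "repeat" else 1  # assumption: repeat = run twice
        for _ in range(repeats):
            x = block(x)
            executed += 1
    return x, executed


toy_blocks = [lambda h: h + 1.0, lambda h: h * 10.0, lambda h: h + 2.0]
output, block_calls = run_with_routing(0.0, toy_blocks, ["execute", "skip", "repeat"])
print(output, block_calls)  # 5.0 3
```

Counting block calls gives the compute cost of a routing configuration, which is the quantity a budget-aware search over configurations would trade off against accuracy.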

2510.08986 2026-05-20 cs.CL cs.CE cs.CY Updated version

CAPC-CG: A Large-Scale, Expert-Directed LLM-Annotated Corpus of Adaptive Policy Communication in China

Bolun Sun, Charles Chang, Yuen Yuen Ang, Ruotong Mu, Yuchen Xu, Zhengxin Zhang, Pingxu Hao

Affiliations: Johns Hopkins University; Northwestern University; Duke Kunshan University

AI summary: The paper introduces CAPC-CG, the first open corpus of annotated Chinese policy directives. Building on Ang's theory of adaptive policy communication, it labels clear and ambiguous language categories with a five-color taxonomy and aims to support downstream tasks and multilingual NLP research.

Comments: Accepted for publication in the Proceedings of ACL Main 2026

Abstract

We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
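
The agreement figure quoted in this abstract is a Fleiss' kappa, which compares observed agreement among several raters with the agreement expected by chance. The sketch below implements the standard formula on a small made-up rating table; it assumes every item is rated by the same number of raters and is not taken from the authors' released code.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_cat)
    # Undefined (division by zero) if all ratings fall in a single category.
    return (observed - expected) / (1 - expected)


# Made-up table: 4 paragraphs, 3 raters, 2 label categories.
toy_counts = [[3, 0], [0, 3], [3, 0], [2, 1]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))  # 0.625
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why the reported 0.86 is read as high reliability.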

2510.05746 2026-05-20 cs.AI cs.CL cs.LG 版本更新

ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems

ARM:为通用多智能体系统发现代理推理模块

Bohan Yao, Shiva Krishna Reddy Malay, Vikas Yadav

发表机构 * University of Washington(华盛顿大学)

AI总结 本文提出了一种新的自动多智能体系统设计范式,通过优化链式推理(CoT)来发现代理推理模块(ARM),该模块通过在代码空间中进行树搜索,利用执行轨迹的反思来进化,从而提升多智能体系统的泛化能力。

Comments 29 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLM)驱动的多智能体系统(MAS)在各种复杂推理任务上取得了最先进的结果。最近的研究提出了自动化设计MAS的方法,消除了手动工程的需要。然而,这些方法表现不佳,通常与简单的基线相当或更差。此外,它们需要为每个新任务领域进行昂贵的架构重新发现,并且在没有现有标注验证集的领域中需要昂贵的数据注释。关键的洞察是简单的链式推理(CoT)推理往往与这些复杂系统竞争,表明MAS的基本推理单元CoT值得进一步研究。为此,我们提出了一种新的自动MAS设计范式,将焦点转向优化CoT推理。我们引入了代理推理模块(ARM),即CoT的代理泛化,其中每个细粒度推理步骤由专门的推理模块执行。该模块通过在代码空间中进行树搜索来发现,从简单的CoT模块开始,利用执行轨迹的反思进行进化。最终的ARM作为一个通用的推理构建块,可以作为直接的递归循环或作为学习元协调器中的子程序使用。我们的方法显著优于手动设计的MAS和最先进的自动MAS设计方法。关键的是,由ARM构建的MAS表现出卓越的泛化能力,在不同的基础模型和任务领域中保持高性能,而无需进一步优化。

英文摘要

Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved state-of-the-art results on various complex reasoning tasks. Recent works have proposed techniques to automate the design of MASes, eliminating the need for manual engineering. However, these techniques perform poorly, often achieving similar or inferior performance to simple baselines. Furthermore, they require computationally expensive re-discovery of architectures for each new task domain and expensive data annotation on domains without existing labeled validation sets. A critical insight is that simple Chain of Thought (CoT) reasoning often performs competitively with these complex systems, suggesting that the fundamental reasoning unit of MASes, CoT, warrants further investigation. To this end, we present a new paradigm for automatic MAS design that pivots the focus to optimizing CoT reasoning. We introduce the Agentic Reasoning Module (ARM), an agentic generalization of CoT where each granular reasoning step is executed by a specialized reasoning module. This module is discovered through a tree search over the code space, starting from a simple CoT module and evolved using mutations informed by reflection on execution traces. The resulting ARM acts as a versatile reasoning building block which can be utilized as a direct recursive loop or as a subroutine in a learned meta-orchestrator. Our approach significantly outperforms both manually designed MASes and state-of-the-art automatic MAS design methods. Crucially, MASes built with ARM exhibit superb generalization, maintaining high performance across different foundation models and task domains without further optimization.

2510.05431 2026-05-20 cs.CL 版本更新

Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

基于LLM生成信任指标的自过滤蒸馏用于可靠专利分类

Yongmin Yoo, Xu Zhang, Longbing Cao

发表机构 * Frontier AI Research Centre, School of Computing, Faculty of Science and Engineering, Macquarie University(前沿人工智能研究中心,计算机学院,科学与工程学院,麦考瑞大学)

AI总结 本文提出自过滤蒸馏方法,通过将LLM生成的推理作为信任指标而非真实标签,提升专利分类的可靠性,实验显示在USPTO-2M数据集上宏F1指标提升了38.7%。

详情
AI中文摘要

按照分类方案组织大规模专利语料库是信息管理的核心任务,决定先例检索、技术知识发现和知识产权决策的准确性和效率。近期方法将大语言模型生成的自然语言推理蒸馏到紧凑的学生模型中,但这些推理中固有的逻辑错误、标签不匹配和分类学不一致在训练过程中被无差别吸收,影响分类可靠性并传播误差至下游信息流程。而非事后纠正这些错误,我们提出自过滤蒸馏(SFD),通过将LLM生成的推理重新解释为信任指标而非真实监督,直接将质量保证嵌入学习过程。SFD整合三种无监督信号到统一的信任分数中,动态调节每个训练实例的贡献:自我一致性,量化独立生成推理之间的一致性;类别蕴含对齐,评估推理与分配CPC类定义之间的语义一致性;LLM同意评分,通过独立验证者评估外部合理性。在包含超过两百万专利的USPTO-2M基准上,SFD在四种学生架构上实现了宏F1指标高达38.7%的相对提升,信任分数与专家判断之间的强相关性(r=0.685)证实该框架不仅提供准确预测,还提供可分解的置信度语义,使大规模专利知识组织能够实现可审计和自文档化的分类结果。

英文摘要

Organizing large-scale patent corpora according to classification schemes is a core information management task that determines the accuracy and efficiency of prior art retrieval, technology knowledge discovery, and intellectual property decision-making. Recent approaches distill natural language rationales generated by large language models (LLMs) into compact student models, yet logical errors, label mismatches, and taxonomy misalignments inherent in these rationales are indiscriminately absorbed during training, undermining classification reliability and propagating errors throughout downstream information processes. Rather than correcting such errors post-hoc, we propose Self-Filtered Distillation (SFD), which embeds quality assurance directly into the learning process by reinterpreting LLM-generated rationales as trust indicators rather than ground-truth supervision. SFD integrates three unsupervised signals into a unified trust score that dynamically modulates each training instance's contribution: Self-Consistency, which quantifies agreement among independently generated rationales; Class Entailment Alignment, which evaluates semantic coherence between a rationale and its assigned CPC class definition; and LLM Agreement Scoring, which assesses external plausibility through an independent verifier. On the USPTO-2M benchmark comprising over two million patents, SFD achieves up to 38.7\% relative improvement in Macro-F1 across four student architectures, and the strong correlation between trust scores and expert judgments ($r = 0.685$) confirms that the framework provides not only accurate predictions but also decomposable confidence semantics that enable auditable and self-documenting classification outcomes for large-scale patent knowledge organization.

2509.25448 2026-05-20 cs.CR cs.CL 版本更新

Fingerprinting LLMs via Prompt Injection

通过提示注入指纹化LLM

Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li, Osama Ahmed, Zhicong Huang, Cheng Hong, Neil Gong

发表机构 * Duke University(杜克大学) Ant Group(蚂蚁集团)

AI总结 本文提出LLMPrint框架,通过利用LLM对提示注入的固有脆弱性来构建指纹,以解决已发布模型的溯源问题,其核心方法是优化提示以获得唯一且抗后处理的指纹,实验显示其在高召回率下保持低误报率。

详情
AI中文摘要

大型语言模型(LLMs)在发布后常常经过后训练或量化等后处理而被修改,这使得判断一个模型是否源自另一个模型变得困难。现有溯源检测方法有两个主要局限:(1)它们在发布前将信号嵌入基础模型,这对已发布的模型不可行;或者(2)它们使用手工设计或随机的提示比较不同模型的输出,而这类提示对后处理不够稳健。在本文中,我们提出LLMPrint,一种新的检测框架,通过利用LLM对提示注入的固有脆弱性来构建指纹。我们的关键见解是,通过优化指纹提示以强制一致的token偏好,可以得到既为基础模型所独有、又对后处理稳健的指纹。我们进一步开发了一个统一的验证程序,适用于灰盒和黑盒设置,并具有统计保证。我们在五个基础模型和约700个后训练或量化变体上评估了LLMPrint。结果表明,LLMPrint在取得高真阳性率的同时,将假阳性率保持在接近零的水平。代码可在https://github.com/hifi-hyp/ACL-LLMPrint公开获取。

英文摘要

Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero. The code is publicly available at https://github.com/hifi-hyp/ACL-LLMPrint.

2509.23108 2026-05-20 cs.AI cs.CL 版本更新

Artificial Phantasia: Emergent Mental Imagery in Large Language Models

人工幻象:大语言模型中的涌现性心理意象

Morgan McCarty, Jorge Morales

发表机构 * Khoury College of Computer Sciences, Northeastern University(东北大学库里计算机科学学院) Department of Psychology, Northeastern University(东北大学心理学系) Department of Philosophy, Northeastern University(东北大学哲学系)

AI总结 本研究探讨了纯语言能否驱动视觉意象,发现大语言模型在视觉意象任务中表现优于人类,表明可能存在非图像式的涌现性心理意象,挑战传统认知科学观点。

Comments 34 pages, 10 figures, 3 tables

详情
AI中文摘要

视觉意象能否仅由语言驱动?这一想法与认知科学的传统观点相悖,即视觉心理意象只能通过图像式表征实现。大语言模型(LLMs)提供了初步证据,表明基于命题表征的视觉心理意象不仅可能,而且可能比人类的想象更稳健。我们为一项经典任务的扩展版本设计了数十个新题目,该任务被认为只能通过图像式表征解决(即仅靠语言不足以完成)。受试者被要求想象一系列字母和形状的组合变换,并识别最终得到的“图像”。我们发现最佳的LLMs表现显著优于人类(人类受试者n=100,p<0.0001),表明存在人工幻象(artificial phantasia),即一种可能并非图像式的涌现性“视觉”心理意象。此外,我们测试了可调节推理token配额的推理模型,发现模型在推理链更长时表现最佳,显示出语言对该任务的影响:仅靠语言可能就已足够。我们检验了三种关于涌现意象的假设:纯命题意象、带有视觉-语言先验的命题意象,以及图像式视觉意象(经典视觉意象)。本研究不仅为LLMs一种此前未被报道的涌现认知能力提供了证据,也重新引发了关于心理意象是否需要图像式格式的争论。

英文摘要

Can visual imagery be driven solely by language? This idea goes against cognitive science's traditional view that visual mental imagery is only possible through pictorial representations. Large Language Models (LLMs) provide nascent evidence not only that visual mental imagery via propositional-representations is possible, but that it can be more robust than human imagination. We created dozens of novel items for an extension to a classic task which is argued to be solvable exclusively via pictorial representations (i.e., language alone would be insufficient). Subjects were asked to imagine a series of compositional letter and shape transformations and identify the resultant "image". We found that the best LLMs performed significantly better than humans ($n = 100$ human participants, $p < .0001$), indicating the existence of an artificial phantasia, or emergent "visual" mental imagery that may not be pictorial. Furthermore, we tested reasoning models with variable reasoning-token allocation and found that models perform best with longer reasoning chains, demonstrating a linguistic impact on the task -- language alone may be sufficient. We examined three emergent imagery hypotheses: pure propositional imagery, propositional imagery with visio-linguistic priors, or pictorial visual imagery (classical visual imagery). Our study not only presents evidence for a previously unreported emergent cognitive capacity of LLMs, but also reignites debate on the requirement for a pictorial format in mental imagery.

2509.22202 2026-05-20 cs.SE cs.CL 版本更新

Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

LLM生成代码中的库幻觉:基于开发者查询的风险分析

Lukas Twist, Jie M. Zhang, Mark Harman, Helen Yannakoudakis

发表机构 * King’s College London(伦敦国王学院) University College London(伦敦大学学院)

AI总结 本文研究了LLM生成代码中库幻觉的风险,通过分析用户提示变化对库幻觉的影响,揭示了系统性漏洞,并提出了LibHalluBench基准测试以评估这些幻觉。

Comments 27 pages, 1 figure, 13 tables

详情
AI中文摘要

大型语言模型(LLMs)现在在代码生成中扮演核心角色,但它们仍然会幻觉,频繁地发明不存在的库。此类库幻觉不仅仅是无害的错误:它们可以误导开发者,导致构建失败,并暴露系统面临供应链威胁,如slopsquatting。尽管对这些风险的认识日益增加,但对库幻觉在真实使用条件下的表现仍缺乏深入理解。为填补这一空白,我们进行了首次系统研究,分析用户级提示变化如何影响LLM生成代码中的库幻觉。在七个不同的LLM上,我们分析了库名幻觉(无效导入)和库成员幻觉(从有效库中无效调用),考察了现实开发者语言和受控用户错误的影响,包括拼写错误和虚构库或成员。我们的发现揭示了系统性漏洞:单字符拼写错误会触发多达26%任务的幻觉;虚构库名被接受的比例高达99%;基于时间的提示会引发多达85%的幻觉。基于我们研究中发现的最高风险提示,我们引入了LibHalluBench基准测试,以系统且可重复地评估这些库幻觉。我们的发现突显了LLM对自然提示变化的脆弱性,并强调了对库相关幻觉及其下游风险的保护措施的紧迫需求。

英文摘要

Large language models (LLMs) now play a central role in code generation, yet they continue to hallucinate, frequently inventing non-existent libraries. Such library hallucinations are not just benign errors: they can mislead developers, break builds, and expose systems to supply chain threats such as slopsquatting. Despite growing awareness of these risks, there is limited understanding of how library hallucinations manifest under realistic usage conditions. To fill this gap, we present the first systematic study of how user-level prompt variations influence library hallucinations in LLM-generated code. Across seven diverse LLMs, we analyse library name hallucinations (invalid imports) and library member hallucinations (invalid calls from valid libraries), examining the effects of realistic developer language and controlled user mistakes, including misspellings and fabricated libraries or members. Our findings expose systemic vulnerabilities: one-character misspellings trigger hallucinations in up to 26% of tasks; fabricated library names are accepted in up to 99%; and time-based prompts induce hallucinations in up to 85%. Grounded in the highest-risk prompts identified in our study, we introduce LibHalluBench, a benchmark that enables a systematic and reproducible evaluation of these library hallucinations. Our findings underscore the fragility of LLMs to natural prompt variation and highlight the urgent need for safeguards against library-related hallucinations and their downstream risks.

2509.21698 2026-05-20 cs.CL 版本更新

GRAB: A Risk Taxonomy-Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures

GRAB:面向财务披露中无监督主题发现的、基于风险分类体系的基准

Ying Li, Tiejun Ma

发表机构 * The University of Edinburgh, UK(爱丁堡大学)

AI总结 本文提出GRAB,一个专门针对财务披露中无监督主题发现的基准测试,通过结合FinBERT词注意力、YAKE关键词信号和基于分类法的短语匹配,生成无需人工标注的句子标签,从而评估无监督主题模型。

Comments 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: NeurIPS 2025 Workshop on Generative AI in Finance

详情
AI中文摘要

在10-K风险披露中的风险分类对于监管和投资至关重要,但目前尚无公开的基准测试评估此类任务的无监督主题模型。我们提出了GRAB,一个专门针对金融领域的基准测试,包含来自8247份文件的161万个句子,并通过结合FinBERT词注意力、YAKE关键词信号和基于分类法的短语匹配生成无需人工标注的句子标签。标签基于一个将193个术语映射到21个细粒度类型(嵌套在五个宏观类别下的类型)的风险分类法;21种类型指导弱监督,而评估则在宏观层面进行。GRAB通过固定的数据集划分和稳健的指标——准确率、宏F1、主题BERTScore以及基于熵的有效主题数,统一了评估。该数据集、标签和代码使经典、基于嵌入、神经网络和混合主题模型在财务披露上的可重复、标准化比较成为可能。

英文摘要

Risk categorization in 10-K risk disclosures matters for oversight and investment, yet no public benchmark evaluates unsupervised topic models for this task. We present GRAB, a finance-specific benchmark with 1.61M sentences from 8,247 filings and span-grounded sentence labels produced without manual annotation by combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware collocation matching. Labels are anchored in a risk taxonomy mapping 193 terms to 21 fine-grained types nested under five macro classes; the 21 types guide weak supervision, while evaluation is reported at the macro level. GRAB unifies evaluation with fixed dataset splits and robust metrics: Accuracy, Macro-F1, Topic BERTScore, and the entropy-based Effective Number of Topics. The dataset, labels, and code enable reproducible, standardized comparison across classical, embedding-based, neural, and hybrid topic models on financial disclosures.
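
The abstract does not spell out how the entropy-based Effective Number of Topics is defined. One common entropy-based definition, assumed here purely for illustration, is the exponential of the Shannon entropy of the topic-assignment distribution; the function name and the count-list input are likewise assumptions and may differ from the benchmark's own implementation.

```python
import math

def effective_number_of_topics(counts):
    """exp(Shannon entropy) of the topic-assignment distribution.

    counts[k] = number of sentences assigned to topic k.
    Returns a value between 1 (everything in one topic) and
    len(counts) (sentences spread evenly over all topics).
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # empty topics add nothing
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

balanced = effective_number_of_topics([25, 25, 25, 25])
collapsed = effective_number_of_topics([97, 1, 1, 1])
```

Under this definition a model that nominally uses four topics but puts almost every sentence in one of them scores close to 1, so the metric penalizes degenerate topic assignments.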

2509.17428 2026-05-20 cs.CL 版本更新

QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

QWHA: 量化感知的沃尔什-哈达玛适应用于大型语言模型的参数高效微调

Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim

发表机构 * Seoul National University(首尔国立大学) Sungkyunkwan University(成均馆大学)

AI总结 本文提出QWHA方法,通过将基于傅里叶变换的适配器与沃尔什-哈达玛变换结合,改进量化感知参数高效微调,有效减少量化误差并降低计算成本。

Comments 25 pages, 9 figures, 14 tables

详情
Journal ref
ICLR 2026 Poster
AI中文摘要

对大型语言模型(LLMs)高效部署的需求推动了对量化(降低推理成本)和参数高效微调(PEFT,降低训练开销)的关注,并由此催生了量化感知PEFT,以得到准确且高效的量化模型。在该设定中,在微调之前减少量化误差对实现高模型精度至关重要。然而,现有依赖低秩适应的方法表示能力有限。近期基于傅里叶相关变换(FT)的适配器比低秩适配器具有更强的表示能力,但将其直接集成到量化模型中往往导致误差削减效果不佳和计算开销增加。为克服这些限制,我们提出了QWHA,该方法采用沃尔什-哈达玛变换(WHT)作为变换核,并结合包含自适应参数选择和值细化的新型适配器初始化方案,将基于FT的适配器集成到量化模型中。我们证明QWHA在促进微调的同时有效减轻量化误差,且其设计显著降低了计算成本。实验结果表明,QWHA在低比特量化精度上始终优于基线,并相对于现有基于FT的适配器实现了显著的训练加速。代码可在https://github.com/vantaa89/qwha获取。

英文摘要

The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
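
For readers unfamiliar with the transform kernel named above, the following is a generic fast Walsh-Hadamard transform in natural (Hadamard) ordering, using only additions and subtractions. It is a textbook sketch, not the QWHA adapter itself; the function name is an assumption and the input length is assumed to be a power of two.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural ordering).

    Runs in O(n log n) additions/subtractions with no multiplications.
    """
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = data[i], data[i + h]
                data[i], data[i + h] = a + b, a - b
        h *= 2
    return data

spectrum = fwht([1, 0, 1, 0, 0, 1, 1, 0])
```

Because the kernel contains only +1 and -1 entries, applying the transform twice returns the input scaled by its length, which gives a quick self-check for any implementation.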

2507.15698 2026-05-20 cs.CL cs.AI cs.LG 版本更新

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

CoLD: 面向数学推理中过程奖励模型的反事实引导长度去偏

Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Weiwen Liu, Haoxuan Li, Yong Yu, Weinan Zhang, Mengyue Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Huawei Noah’s Ark Lab(华为诺亚实验室) Peking University(北京大学) University of Bristol(布里斯托大学)

AI总结 本文提出CoLD,一种通过反事实引导消除过程奖励模型中长度偏差的统一框架,旨在提高多步骤推理的准确性和简洁性,同时提升下游强化学习性能和跨领域泛化能力。

详情
AI中文摘要

过程奖励模型(PRMs)在评估和引导大型语言模型(LLMs)的多步推理中起着核心作用,特别是在数学问题解决中。然而,我们发现现有PRMs存在普遍的长度偏差:即使语义内容和逻辑有效性未变,它们也倾向于对较长的推理步骤赋予更高的分数。这种偏差会削弱奖励预测的可靠性,并导致推理过程中输出过于冗长。为了解决这一问题,我们提出了CoLD(Counterfactually-Guided Length Debiasing),一种统一的框架,通过三个组件减轻长度偏差:显式的长度惩罚调整、一个训练以捕捉虚假长度相关信号的学得偏差估计器,以及一种联合训练策略,强制奖励预测的长度不变性。我们的方法基于反事实推理,并受因果图分析的启发。在MATH500和GSM-Plus上的广泛实验表明,CoLD提高了步骤选择的准确性,并鼓励了更简洁、逻辑有效的推理。此外,它一致提高了下游RL性能,并通过减轻长度偏差在跨领域中泛化,展示了CoLD强大的泛化能力。

英文摘要

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD improves accuracy in step selection and encourages more concise, logically valid reasoning. Furthermore, it consistently improves downstream RL performance and generalizes across domains by mitigating length bias, demonstrating CoLD's strong generalization capability.

2507.03122 2026-05-20 cs.IR cs.CL cs.LG 版本更新

Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings

基于轻量模型和预训练嵌入的ICD分类联邦学习

Binbin Xu, Gérard Dray

发表机构 * EuroMov Digital Health in Motion, Univ. Montpellier, IMT Mines Ales(EuroMov数字健康运动、蒙彼利埃大学、IMT Mines Ales)

AI总结 本文研究了使用MIMIC-IV数据集中的临床笔记进行多标签ICD代码分类的联邦学习可行性与性能,提出了一种结合冻结文本嵌入和简单多层感知机分类器的轻量级可扩展流程,展示了在分布式医疗环境中隐私保护和部署高效的替代方案。

Comments 20 pages

详情
AI中文摘要

本研究探讨了使用MIMIC-IV数据集中的临床笔记进行多标签ICD代码分类的联邦学习(FL)的可行性和性能。不同于以往依赖集中训练或微调大型语言模型的方法,我们提出了一种轻量级且可扩展的流程,结合冻结的文本嵌入与简单的多层感知机(MLP)分类器。该设计为临床NLP应用提供了一种隐私保护且部署高效的替代方案,特别适用于分布式医疗环境。在集中式和联邦式配置下进行了广泛的实验,测试了六个公开可用的嵌入模型(来自Massive Text Embedding Benchmark排行榜)和三种MLP分类器架构,以及两种医学编码(ICD-9和ICD-10)。此外,对十个随机分层分割进行消融研究以评估性能稳定性。结果表明,嵌入质量在决定预测性能方面显著优于分类器复杂性,并且在理想条件下联邦学习可以接近集中式结果。尽管模型比最先进的架构小多个数量级,并且在微和宏F1分数上取得了竞争性的成绩,但仍存在一些限制,包括缺乏端到端训练和简化FL假设。然而,本研究展示了向可扩展、隐私意识的医疗编码系统迈进的可行方法,并为未来研究联邦、领域适应的临床AI提供了一步。

英文摘要

This study investigates the feasibility and performance of federated learning (FL) for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. Unlike previous approaches that rely on centralized training or fine-tuned large language models, we propose a lightweight and scalable pipeline combining frozen text embeddings with simple multilayer perceptron (MLP) classifiers. This design offers a privacy-preserving and deployment-efficient alternative for clinical NLP applications, particularly suited to distributed healthcare settings. Extensive experiments across both centralized and federated configurations were conducted, testing six publicly available embedding models from the Massive Text Embedding Benchmark leaderboard and three MLP classifier architectures under two medical coding systems (ICD-9 and ICD-10). Additionally, ablation studies over ten random stratified splits assess performance stability. Results show that embedding quality substantially outweighs classifier complexity in determining predictive performance, and that federated learning can closely match centralized results in idealized conditions. While the models are orders of magnitude smaller than state-of-the-art architectures and achieve competitive micro and macro F1 scores, limitations remain, including the lack of end-to-end training and the simplified FL assumptions. Nevertheless, this work demonstrates a viable way toward scalable, privacy-conscious medical coding systems and offers a step toward future research into federated, domain-adaptive clinical AI.

2506.14148 2026-05-20 cs.SD cs.CL eess.AS 版本更新

Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

基于声学散射的非侵入式物体分类AI:一项关于头发评估的案例研究

Long-Vu Hoang, Tuan Nguyen, Tran Huy Dat

AI总结 本文提出了一种利用声学散射进行非侵入式物体分类的新方法,通过头发评估的案例研究进行演示。通过发射声学刺激并捕捉带有头发样本的头部对象散射信号,利用AI驱动的深度学习声学分类技术对头发类型和湿度进行分类。我们评估了包括(i)完全监督深度学习、(ii)嵌入式分类、(iii)监督基础模型微调和(iv)自监督模型微调在内的全面方法。我们的最佳策略通过微调自监督模型的所有参数实现了接近90%的分类准确率。这些结果凸显了声学散射作为隐私保护、非接触替代视觉分类的潜力,为各行业应用提供了巨大前景。

Comments This paper has been retracted by the authors. Due to miscommunication, the authorship is incomplete and missing early contributions

详情
AI中文摘要

本文提出了一种利用声学散射进行非侵入式物体分类的新方法,通过头发评估的案例研究进行演示。当入射波与物体相互作用时,会生成散射声场,该声场编码了结构和材料属性。通过发射声学刺激并捕捉头部带头发样本对象的散射信号,我们利用AI驱动的深度学习声学分类技术对头发类型和湿度进行分类。我们评估了包括(i)完全监督深度学习、(ii)嵌入式分类、(iii)监督基础模型微调和(iv)自监督模型微调在内的全面方法。我们的最佳策略通过微调自监督模型的所有参数实现了接近90%的分类准确率。这些结果凸显了声学散射作为隐私保护、非接触替代视觉分类的潜力,为各行业应用提供了巨大前景。

英文摘要

This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries.

2505.04588 2026-05-20 cs.CL 版本更新

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

ZeroSearch: 无需搜索即可激励大语言模型的搜索能力

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, Jingren Zhou

发表机构 * Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团)

AI总结 本研究提出ZeroSearch框架,通过模拟搜索提升大语言模型的搜索能力,解决真实搜索引擎文档质量不可控和API成本高的问题,实验表明其在不同参数规模的模型上均表现优异。

详情
AI中文摘要

有效的信息搜索对于增强大语言模型(LLMs)的推理和生成能力至关重要。最近的研究探索了在真实环境中与在线搜索引擎交互、利用强化学习(RL)来提高LLMs的搜索能力。尽管这些方法显示出有前景的结果,但面临两个主要挑战:(1)文档质量不可控:搜索引擎返回的文档质量往往不可预测,给训练过程带来噪声和不稳定性。(2)API成本过高:RL训练需要频繁的rollout,可能涉及数十万次搜索请求,产生高昂的API费用,并严重限制可扩展性。为了解决这些挑战,我们引入了ZeroSearch,一种新颖的RL框架,在训练过程中以模拟搜索来激励LLMs使用真实搜索引擎的能力。我们的方法首先通过轻量级监督微调将LLM转换为检索模块,使其能够针对查询生成有用的文档和含噪声的文档。在RL训练过程中,我们采用基于课程的rollout策略,逐步降低生成文档的质量,让模型面对越来越具有挑战性的检索场景,从而逐步激发其推理能力。广泛的实验表明,ZeroSearch以3B LLM作为检索模块即可有效激励LLMs的搜索能力。值得注意的是,7B检索模块的表现与真实搜索引擎相当,而14B检索模块甚至超过了它。此外,它在各种参数规模的基础模型和指令微调模型上均表现出良好的泛化能力,并且与多种RL算法兼容。

英文摘要

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.

2503.19877 2026-05-20 cs.CL 版本更新

Scaling Evaluation-time Compute with Reasoning Models as Evaluators

以推理模型作为评估器扩展评估时计算量

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Carolin Lawrence, Kiril Gashteovski, Julia Hockenmaier, Graham Neubig, Sean Welleck

发表机构 * CMU(卡内基梅隆大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) KAIST AI(韩国科学技术院人工智能研究院) NEC Laboratories Europe(NEC欧洲实验室) Ss. Cyril and Methodius University of Skopje(斯科普里圣基里尔与美多迪乌斯大学)

AI总结 本文探讨了通过增加评估时的计算量来提升语言模型的评估能力,利用推理模型作为评估器,分别评估响应整体和每个步骤,从而提高评估效果。

Comments ACL 2026 Findings

详情
AI中文摘要

随着语言模型(LM)的输出越来越自然,评估其质量变得越来越困难。同时,通过增加测试时的计算量来提升LM的'思考'时间,已被证明是解决数学和代码等领域挑战性问题的有效技术。这引发了一个自然的问题:是否可以通过增加测试时的计算量来提升LM的评估能力?为回答这个问题,我们研究了利用推理模型——即能够原生生成长链推理的LM——作为评估器。具体而言,我们考察了通过(1)使用推理模型,以及(2)提示这些模型不仅评估响应整体(即结果评估),还评估响应中的每个步骤(即过程评估)来利用更多测试时计算量的方法。在实验中,我们观察到评估器的性能随着生成更多推理标记而单调提升,类似于LM生成中的趋势。此外,我们使用这些更准确的评估器对多个生成进行重新排序,并证明在评估时花费更多计算量可以像在生成时花费更多计算量一样有效,从而提升LM的问题解决能力。

英文摘要

As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.

2410.18856 2026-05-20 cs.AI cs.CL 版本更新

Entry-level guide to the use of large language models for medical research

大型语言模型在医学研究中应用的入门指南

Qiao Jin, Nicholas Wan, Robert Leaman, Shubo Tian, Zhizheng Wang, Yifan Yang, Zifeng Wang, Guangzhi Xiong, Po-Ting Lai, Qingqing Zhu, Benjamin Hou, Maame Sarfo-Gyamfi, Gongbo Zhang, Aidan Gilson, Balu Bhasuran, Zhe He, Aidong Zhang, Jimeng Sun, Chunhua Weng, Ronald M. Summers, Qingyu Chen, Yifan Peng, Zhiyong Lu

发表机构 * National Library of Medicine (NLM), National Institutes of Health (NIH)(美国国家医学图书馆(NLM)、美国国立卫生研究院(NIH)) Department of Computer Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校计算机科学系) Department of Computer Science, University of Virginia(弗吉尼亚大学计算机科学系) Department of Biomedical Informatics, Columbia University(哥伦比亚大学生物医学信息学系) School of Medicine, Yale University(耶鲁大学医学院) School of Information, Florida State University(佛罗里达州立大学信息学院) Department of Radiology and Imaging Sciences, NIH Clinical Center(美国国立卫生研究院临床中心放射学与影像科学部) Department of Population Health Sciences, Weill Cornell Medicine(威尔康奈尔医学院人口健康科学系)

AI总结 本文提出了一套可操作的指南,帮助医疗专业人员更高效地利用大型语言模型(LLMs)进行医学研究,涵盖任务制定、模型选择、提示工程、微调和模型部署等关键步骤,确保安全可靠地将LLMs应用于临床实践。

详情
AI中文摘要

前沿大型语言模型(LLMs),如GPT-5、Claude 4.5、Gemini 3、Llama 4和DeepSeek-R1,代表了一类具有变革潜力的AI工具,能够通过在各种上下文中生成类人响应并适应新任务来革新医疗保健的各个方面。它们的应用潜力涵盖广泛医学任务,如临床文档、患者与临床试验的匹配以及回答医学问题。在本文中,我们提出了一套可操作的指南,帮助医疗专业人员更高效地利用LLMs进行工作,并提供了一套最佳实践。整体工作流程包括几个主要阶段,包括制定任务、选择LLMs、提示工程、微调和模型部署。我们首先讨论了识别与LLMs核心能力相匹配的医学任务以及基于选定任务和数据、性能要求和模型接口选择模型的关键考虑因素。然后回顾了提示工程和微调等策略,以将标准LLMs适应于专门的医学任务。部署考虑因素,包括监管合规性、伦理准则以及持续监控公平性和偏见,也进行了讨论。通过提供结构化的分步方法,本文入门教程旨在为医疗专业人员提供必要的工具,以有效将LLMs整合到临床实践中,确保这些强大技术以安全、可靠和有影响力的方式得到应用。

英文摘要

Frontier large language models (LLMs), such as GPT-5, Claude 4.5, Gemini 3, Llama 4, and DeepSeek-R1, represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this paper, we propose an actionable guideline to help healthcare professionals more effectively and efficiently utilize LLMs in their work, along with a set of best practices. The overall workflow consists of several main phases, including formulating the task, choosing LLMs, prompt engineering, fine-tuning, and model deployment. We start with the discussion of critical considerations in identifying medical tasks that align with the core capabilities of LLMs and selecting models based on the selected task and data, performance requirements, and model interface. We then review the strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this entry-level tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.

2410.15362 2026-05-20 cs.LG cs.AI cs.CL cs.CR 版本更新

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

Faster-GCG: 面向对齐大语言模型的高效离散优化越狱攻击

Xiao Li, Wei Zhang, Zhuhong Li, Qiongxiu Li, Shei PernChua, BingZe Lee, Jinghao Cui, Yifan Huang, Xiaolin Hu

发表机构 * Tsinghua University(清华大学) Sea-Fill Duke University(杜克大学) Aalborg University(奥尔堡大学) Chinese Institute for Brain Research (CIBR)(中国脑科学研究院)

AI总结 本文提出Faster-GCG,通过改进估计、高效采样和避免重复评估,提高了针对对齐大语言模型的越狱攻击效率,采样效率最高提升8倍,运行时间减少7倍,并在多个模型上取得了更高的越狱成功率。

Comments 18 pages, new version

详情
AI中文摘要

对齐大语言模型(LLMs)的安全性受到广泛关注,尤其是在试图通过对抗性提示绕过安全护栏(guardrails)的越狱攻击方面。在现有方法中,贪心坐标梯度(GCG)攻击通过离散token优化开创了自动化越狱,但其较低的样本效率限制了实际应用。特别是,由于底层离散优化问题的固有难度,GCG对每个有害行为需要约256K次评估才能达到令人满意的越狱成功率。在本工作中,我们识别了限制GCG样本效率的三个关键因素:不准确的基于梯度的估计、低效的均匀采样,以及对已探索后缀的重复评估。为了解决这些问题,我们提出了Faster-GCG,一种精简的GCG变体,它引入基于距离的正则化以改进估计、温度控制的采样以进行更有效的探索,以及标记已访问后缀的机制以避免冗余评估。Faster-GCG将所需评估次数减少到32K,与GCG相比采样效率最高提升8倍,实际运行时间减少7倍。在这一缩减的预算下,Faster-GCG在五个对齐LLMs上平均达到78.1%的越狱成功率,并在Qwen3.5-4B上达到88.7%,优于最先进的白盒越狱方法。

English abstract

Aligned Large Language Models (LLMs) have attracted significant attention for their safety, particularly in the context of jailbreak attacks that attempt to bypass guardrails via adversarial prompts. Among existing approaches, the Greedy Coordinate Gradient (GCG) attack pioneered automated jailbreaks through discrete token optimization; however, its low sample efficiency limits practical applicability. In particular, GCG requires approximately 256K evaluations per harmful behavior to achieve a satisfactory jailbreak success rate, due to the inherent difficulty of the underlying discrete optimization problem. In this work, we identify three key factors that limit the sample efficiency of GCG: inaccurate gradient-based estimation, inefficient uniform sampling, and repeated evaluation of previously explored suffixes. To address these issues, we propose Faster-GCG, a streamlined variant of GCG that incorporates distance-based regularization for improved estimation, temperature-controlled sampling for more effective exploration, and a visited-suffix marking mechanism to avoid redundant evaluations. Faster-GCG reduced the required evaluations to 32K, achieving up to an $8\times$ improvement in sampling efficiency and a $7\times$ reduction in wall-clock time compared to GCG. Under this reduced budget, Faster-GCG attained an average jailbreak success rate of 78.1\% across five aligned LLMs, and achieved 88.7\% against Qwen3.5-4B, outperforming state-of-the-art white-box jailbreak methods.

2403.07183 2026-05-20 cs.CL cs.AI cs.LG cs.SI Updated version

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

大规模监控AI修改内容:ChatGPT对AI会议同行评审影响的案例研究

Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou

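Affiliations: Department of Computer Science, Stanford University; Machine Learning Department, NEC Labs America; Department of Biomedical Data Science, Stanford University; Department of Electrical Engineering, Stanford University; Graduate School of Education, Stanford University; Department of Sociology, Stanford University; Graduate School of Business, Stanford University; Department of Management Science and Engineering, Stanford University; Department of Computer Science, UC Santa Barbara

AI summary: This paper presents an approach for estimating the fraction of text in a large corpus that is likely to have been substantially modified or produced by a large language model (LLM). Using expert-written and AI-generated reference texts, the maximum likelihood model examines real-world LLM use efficiently at the corpus level. In a case study of peer reviews from AI conferences held after the release of ChatGPT (ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023), the study finds that 6.5% to 16.9% of the submitted review text could have been substantially modified by LLMs. The circumstances in which generated text occurs reveal user behavior: the estimated fraction of LLM-generated text is higher in reviews that report lower confidence, were submitted close to the deadline, or come from reviewers who are less likely to respond to author rebuttals. The paper also observes corpus-level trends that may be too subtle to detect at the individual level, discusses their implications for peer review, and calls for future interdisciplinary work on how LLM use is changing information and knowledge practices.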

Comments 46 pages, 31 figures, ICML '24

AI Chinese abstract

我们提出了一种方法,用于估计大规模语料库中可能被大语言模型(LLM)显著修改或生成的文本比例。我们的最大似然模型利用专家撰写和AI生成的参考文本,以准确且高效的方式在语料库层面考察实际的LLM使用情况。我们将该方法应用于ChatGPT发布后举行的AI会议同行评审案例研究,包括ICLR 2024、NeurIPS 2023、CoRL 2023和EMNLP 2023。我们的结果表明,在这些会议中提交的同行评审文本中,6.5%至16.9%可能被LLM显著修改,即超出拼写检查或小幅写作更新的范围。生成文本出现的情境提供了关于用户行为的见解:估计的LLM生成文本比例在信心较低、接近截止日期或来自较少回应作者反驳的评审中更高。我们还观察到语料库层面的生成文本趋势,这些趋势可能在个体层面过于微妙而无法检测到,并讨论了这些趋势对同行评审的影响。我们呼吁未来跨学科研究探讨LLM使用如何改变我们的信息和知识实践。

English abstract

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
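
The corpus-level maximum likelihood idea in this abstract can be illustrated with a minimal sketch. It assumes that each document already has a log-likelihood under a human-written reference distribution and under an AI-generated reference distribution; the function names, the grid search, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import math

def mixture_log_likelihood(alpha, log_p_human, log_q_ai):
    """Log-likelihood of the corpus under the mixture (1 - alpha) * P + alpha * Q."""
    total = 0.0
    for lp, lq in zip(log_p_human, log_q_ai):
        m = max(lp, lq)  # factor out the larger term for numerical stability
        mix = (1.0 - alpha) * math.exp(lp - m) + alpha * math.exp(lq - m)
        if mix <= 0.0:
            return -math.inf
        total += m + math.log(mix)
    return total

def estimate_llm_fraction(log_p_human, log_q_ai, grid=1001):
    """Grid-search the mixture weight alpha that maximizes the corpus likelihood."""
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(grid):
        alpha = i / (grid - 1)
        ll = mixture_log_likelihood(alpha, log_p_human, log_q_ai)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha

# Toy corpus (hypothetical numbers): 8 documents look human-written, 2 look AI-generated.
log_p = [-10.0] * 8 + [-20.0] * 2
log_q = [-20.0] * 8 + [-10.0] * 2
alpha_hat = estimate_llm_fraction(log_p, log_q)
print(f"estimated LLM-modified fraction: {alpha_hat:.3f}")
```

On this toy corpus the estimate lands near 0.2, the share of documents that the AI reference explains better. The estimate is a property of the whole corpus and does not classify any single document, which matches the abstract's point that corpus-level trends may be too subtle to detect at the individual level.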

2605.19008 2026-05-20 cs.AI cs.CL cs.LG Updated version

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

Learn-by-Wire训练控制治理:面向稳定性与效率的压力下有界自主训练

Anis Radianis

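Affiliations: Qluon Inc.

AI summary: This paper proposes Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that improves the stability and efficiency of large language model training under stress by applying bounded control on top of AdamW while preserving fixed training objectives.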

AI Chinese abstract

现代语言模型训练越来越暴露于不稳定性、退化运行和计算浪费,特别是在使用激进的学习率、规模和运行时间压力条件时。本文介绍了Learn-by-Wire Guard (LBW-Guard),一种在AdamW之上运行的受限制自主训练控制治理层。而不是替换优化器更新规则,LBW-Guard通过观察训练 telemetry,解读对不稳定性敏感的制度,并在保持固定训练目标的同时对优化器执行应用有界控制。我们评估LBW-Guard在以Qwen2.5为中心的压力和鲁棒性套件中使用WikiText-103,以Qwen2.5-7B为经验锚点,与Qwen2.5-3B和Qwen2.5-14B进行模型大小比较,学习率压力测试,梯度裁剪基线以及无LoRA TinyLlama-1B全参数 sanity check。在7B参考设置中,LBW-Guard将最终困惑度从13.21降低到10.74,降低18.7%,同时将端到端时间从392.54秒降低到357.02秒,提高了1.10倍的速度。在更强的学习率压力下,AdamW在LR=3e-3时退化到最终困惑度1885.24,在LR=1e-3时为659.76,而LBW-Guard分别保持可训练性为11.57和10.33。梯度裁剪基线无法再现这种效果。这些结果支持了一个范围系统的结论,即对稳定性敏感的LLM训练可以受益于在优化器之上进行治理。LBW-Guard提供了证据,表明在压力下受限制的运行时间控制可以在保持生产力计算的同时,与优化器替换和局部梯度抑制保持不同。

English abstract

Modern language-model training is increasingly exposed to instability, degraded runs, and wasted compute, especially under aggressive learning-rate, scale, and runtime-stress conditions. This paper introduces Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that operates above AdamW. Rather than replacing the optimizer update rule, LBW-Guard observes training telemetry, interprets instability-sensitive regimes, and applies bounded control to optimizer execution while preserving fixed training objectives. We evaluate LBW-Guard in a Qwen2.5-centered stress-and-robustness suite using WikiText-103, with Qwen2.5-7B as the empirical anchor, model-size comparisons against Qwen2.5-3B and Qwen2.5-14B, learning-rate stress tests, gradient-clipping baselines, and a no-LoRA TinyLlama-1B full-parameter sanity check. In the 7B reference setting, LBW-Guard reduces final perplexity from 13.21 to 10.74, an 18.7% improvement, while reducing end-to-end time from 392.54s to 357.02s, a 1.10x speedup. Under stronger learning-rate stress, AdamW degrades to 1885.24 final perplexity at LR=3e-3 and 659.76 at LR=1e-3, whereas LBW-Guard remains trainable at 11.57 and 10.33, respectively. Gradient-clipping baselines do not reproduce this effect. These results support a scoped systems conclusion that stability-sensitive LLM training can benefit from a governance plane above the optimizer. LBW-Guard provides evidence that bounded runtime control can preserve productive compute under stress while remaining distinct from optimizer replacement and local gradient suppression.
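
The headline numbers for the 7B reference setting can be checked with a few lines of arithmetic. The sketch below only recomputes the relative perplexity reduction and the speedup from the four values quoted in the abstract; the helper names are illustrative.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before

def speedup(before_seconds, after_seconds):
    """Ratio of baseline time to improved time."""
    return before_seconds / after_seconds

# Values quoted in the abstract for the Qwen2.5-7B reference setting.
ppl_gain = relative_reduction(13.21, 10.74)
time_speedup = speedup(392.54, 357.02)

print(f"perplexity reduction: {ppl_gain:.1%}")
print(f"end-to-end speedup: {time_speedup:.2f}x")
```

Both results round to the figures stated in the abstract (18.7% and 1.10x).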

2605.18904 2026-05-20 cs.LG cs.AI cs.CL Updated version

Dynamic Model Merging Made Slim

动态模型合并的轻量级方法

Guodong Du, Wanyu Lin

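Affiliations: The Hong Kong Polytechnic University

AI summary: This paper proposes DiDi-Merging, which balances shared and expert parameters through differentiable rank allocation, enabling more efficient dynamic model merging with a substantially smaller parameter footprint than existing methods.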

AI Chinese abstract

模型合并使在不联合训练或访问原始数据的情况下重用微调模型成为可能。动态合并进一步通过选择性激活任务相关参数并高效组合多个任务的专家来提高灵活性。然而,现有动态方法要么维护一个完整的共享模型加小专家,要么为专家分配过多容量,导致准确性与效率之间的权衡不优。为此,我们提出DiDi-Merging,一种轻量动态合并框架,利用可微分的秩分配来平衡共享和专家参数。通过将参数预算分配建模为低秩模块中的可微分秩优化,并引入无需数据的细化步骤来恢复任务保真度,DiDi-Merging在仅1.24倍单个微调模型参数的情况下匹配现有动态基线,并在1.4倍时超越它们,显著优于需要>2倍存储容量的方法。DiDi-Merging适用于视觉、语言和多模态任务。

English abstract

Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy--efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring > 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.

2605.18864 2026-05-20 cs.LG cs.AI cs.CL Updated version

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

SAGE: 通过塑造锚点引导LLMs的RLVR探索

Chanuk Lee, Minki Kang, Sung Ju Hwang

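Affiliations: KAIST

AI summary: This paper proposes SAGE, a framework that reshapes the reverse-KL anchor distribution to enable controllable expansion of empirical support, improving both pass@1 and pass@k on mathematical reasoning benchmarks.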

Comments Preprint

AI Chinese abstract

近期研究发现,可验证奖励的强化学习(RLVR)能够可靠地提高推理任务的pass@1指标,但往往在pass@k上未能取得类似提升,引发了关于RLVR是否真正使大语言模型获得新推理能力还是仅提高基础模型中现有推理模式采样效率的问题。先前分析大多支持后者观点,认为这种限制源于标准RLVR目标的结构特性,导致探索压力不足。在本文中,我们提出一个核心结构约束源于反KL正则化,该正则化稳定了训练但本质上将策略锚定于参考分布,从而抑制了替代推理模式的出现。然而,我们显示,去除KL项或用前向KL替代并不能提供满意的解决方案,因为两者都会通过诱导奖励黑客或将概率质量分配给非目标区域而破坏效率-覆盖权衡。为了解决这一矛盾,我们提出了SAGE,一个原理性的框架,通过引导函数q(x,y)重塑反KL锚分布本身,实现可控的经验支持扩展,从而在挑战性的数学推理基准中获得一致的pass@1和pass@k提升。我们的代码可在https://github.com/tally0818/SAGE上获得。

English abstract

Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.

2605.18852 2026-05-20 cs.LG cs.AI cs.CL Updated version

Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking

通过代理评估和稳定性感知排名实现多模态大语言模型的鲁棒检查点选择

Qinwu Xu, Zhuoheng Li, Jessie Salas

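Affiliations: Meta AI

AI summary: This paper proposes a multi-stage framework that combines curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols to make checkpoint selection for multimodal large language models robust, and it highlights the importance of data quality, especially OCR readability, for evaluation validity.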

AI Chinese abstract

多模态大语言模型(MLLMs)的检查点选择在性能差异微小且评估信号易受噪声影响时面临重大挑战。现有方法依赖静态基准或逐点评分,经常与实际应用场景不一致,并缺乏对不确定性的鲁棒估计,特别是在OCR密集场景中。在本文中,我们将检查点选择建模为在评估不确定性下的稳健决策问题。我们提出了一种多阶段框架,整合了精心挑选的现实世界数据、结构化的LLM判断和多阶段排名协议。评估系统通过逐点过滤、列表排名和成对比较进行逐步细化。为了提高可靠性,我们引入基于子采样的置信度估计和基于百分位数的评分公式,以捕捉分布特征并惩罚尾部失败。此外,我们证明数据质量,特别是OCR可读性,是评估有效性的重要决定因素。

English abstract

Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.

2605.18824 2026-05-20 cs.LG cs.AI cs.CL Updated version

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

细粒度基准生成用于基础模型的全面评估

Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki, Afshin Cheraghi, Ali Kore, Shayaan Mehdi, Elham Dolatabadi, Arash Afkanpour

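Affiliations: Vector Institute; York University

AI summary: This paper proposes an automated benchmark generation framework that produces evaluation problems with broad coverage, rich metadata, and robustness to contamination, enabling more comprehensive evaluation of foundation models.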

AI Chinese abstract

基础模型的评估通常依赖于缺乏全面覆盖和细粒度评估元数据的基准汇总分数。我们引入了一个自动化基准生成框架。该框架生成基于参考材料(如教科书)的评估问题,生成具有广泛覆盖、丰富元数据和抗污染性的基准。该流程采用多代理架构进行问题生成,并采用以解决方案图驱动的策略,显著提高了地面真实解决方案的可靠性。使用该框架,我们生成了三个基准:机器学习、公司金融和个人金融。专家审查发现,其地面真实错误率显著低于之前的基准,如MMLU和GSM8K。对12个商业和开源模型的评估显示,我们的基准实现了接近均匀的竞争力覆盖,并揭示了现有基准未能捕捉到的模型间性能差异。我们即将开源该框架和我们精心挑选的基准。

English abstract

Evaluation of foundation models often relies on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.

2605.18812 2026-05-20 cs.LG cs.CL cs.IR Updated version

PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

PASC:面向多阶段NLP和LLM流水线的、具有联合覆盖保证的流水线感知共形预测

Varun Kotte

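Affiliations: Independent Researcher

AI summary: This paper proposes PASC, a pipeline-aware conformal prediction method for multi-stage NLP and LLM pipelines that provides joint coverage guarantees across all stages.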

AI Chinese abstract

现代NLP和LLM系统都是流水线:命名实体识别(NER)-> 实体消歧(NED)-> 实体类型标注、检索增强生成(检索器 -> 阅读器),以及规划器 -> 工具 -> 评判器的智能体链。错误会在各阶段累积,但现有的不确定性量化方法要么独立校准每个阶段(无联合覆盖),要么应用Bonferroni联合界(有联合覆盖但偏保守)。我们提出了PASC(Pipeline-Aware Split Conformal),它将多阶段联合覆盖归约为基于联合最大不一致性分数的单个标量共形预测问题。PASC提供有限样本、与分布无关的保证:所有K个阶段同时被覆盖的概率至少为1 - alpha,并且该保证近乎紧致,仅相差1/(n+1)的因子。在CoNLL-2003上的三阶段NER -> NED -> 实体类型标注流水线中,PASC实现了96.4%的端到端覆盖率,而Bonferroni为93.4%,独立CP为86.5%,且三者的平均预测集大小相同(1.083)。在分布偏移到WNUT-17推特数据和WikiNEuRal维基百科数据时,PASC在所测试的偏移设置中经验上保持了目标覆盖率,而独立CP则下降到59%。PASC只需一次分位数计算,运行速度比Bonferroni快1.7倍,并可扩展到K = 6个阶段,此时独立CP的端到端覆盖率下降到0.53。同样的联合最大分数归约可直接应用于复合LLM系统和智能体流水线。

English abstract

Modern NLP and LLM systems are pipelines: named entity recognition (NER) -> entity disambiguation (NED) -> entity typing, retrieval-augmented generation (retriever -> reader), and agentic chains of planner -> tool -> critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER -> NED -> entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.
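
The reduction described above, from joint coverage over K stages to one scalar conformal problem on the maximum nonconformity score, can be sketched as follows. This is a minimal illustration of standard split conformal prediction applied to a per-example maximum score; the data layout, function names, and toy scores are assumptions rather than the paper's code.

```python
import math

def joint_max_quantile(cal_scores, alpha):
    """Single conformal quantile over per-example maxima of K stage scores.

    cal_scores: one list of K stage nonconformity scores per calibration example.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest maximum, or infinity
    when the calibration set is too small for the requested level.
    """
    n = len(cal_scores)
    joint = sorted(max(stage_scores) for stage_scores in cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return joint[rank - 1]

def prediction_sets(stage_candidates, q_hat):
    """Keep, for every stage, each candidate whose score is within the shared threshold.

    stage_candidates: one {label: nonconformity score} dict per stage.
    """
    return [{label for label, score in cands.items() if score <= q_hat}
            for cands in stage_candidates]

# Toy calibration set (hypothetical): 10 examples, 2 stages each.
cal = [[i / 10, i / 20] for i in range(1, 11)]
q_hat = joint_max_quantile(cal, alpha=0.25)

# Hypothetical test example with candidate labels and scores per stage.
sets = prediction_sets([{"PER": 0.2, "ORG": 0.95}, {"Q1": 0.5, "Q2": 0.9}], q_hat)
print(q_hat, sets)
```

Because one threshold is applied to all stages, the event "every stage's true label is in its set" is the same as "the maximum true-label score is at most the threshold", which is why a single quantile computation suffices in this sketch.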

2605.18808 2026-05-20 cs.LG cs.AI cs.CL Updated version

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

在指令微调的LLM中构建组合文学原语:跨架构SAE特征用于自我、风格和情感

Joao Paulo Cavalcante Presa, Savio Salvarino Teles de Oliveira

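Affiliations: Federal University of Goias

AI summary: This paper uses sparse autoencoders to study the compositional architecture of literary primitives in instruction-tuned LLMs, identifies four feature classes, and uses cross-architectural SAE features to characterize how self, style, and affect are expressed.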

Comments 36 pages, 6 figures

AI Chinese abstract

我们通过在中层残差流上使用稀疏自编码器,对两个指令微调的大型语言模型(Llama 3.1 8B-Instruct和Gemma 2 9B-IT)的文学原语组合架构进行了表征。四种特征类别出现:促进目标情感词的命名门,一个包含第一人称注册特征的十一自我簇,风格注册调节器(show-don't-tell和陌生化),以及仅由多特征引导产生的组合情感。在应用于27类情感分类法(Cowen-Keltner)的强制选择5-LLM判断小组中,Llama通过结合命名门、多特征食谱和单个自我特征引导实现了完全27/27覆盖;Gemma在adoration作为单一残差严格失败的情况下达到23/27。在随机判断中,每个单元格通过的概率约为$10^{-3}$,整个目录中两个种子假阳单元格的预期数量可忽略不计,因此观察到的覆盖度不一致于偶然。在严格与柔和判断对比中存在跨架构不对称性:在相同生成中,判断者在Llama输出上比在Gemma输出上更一致,因为Llama输出更直接地命名目标情感,而Gemma输出则通过场景和意象来唤起情感。两种架构都包含同时作为注册标记和情感发射器的自我特征,包括每个架构中一个最RLHF加载的自我特征,该特征在某一操作 regime 中增强机构Helper-AI人格,并在相同校准系数下产生可分类情感的输出。方法上,本文提出了一个三阶段验证流程(logit-lens,LLM-rate,5-LLM判断)并记录了文档化的反模式;总计算量为单GPU,大约每种情感特征发现循环15分钟。

English abstract

We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of $10^{-3}$ and the expected number of two-seed false-positive cells across the catalog is negligible, so the observed coverage is not consistent with chance. A cross-architectural asymmetry sits in the strict-versus-soft judge contrast: on the same generations, judges agree more often on Llama outputs than on Gemma outputs because Llama outputs name the target affect more directly while Gemma outputs evoke it through scene and imagery. Both architectures contain self-features that serve simultaneously as register markers and as emotion emitters, including a single most-RLHF-loaded self-feature per architecture that intensifies the institutional Helper-AI persona at one operating regime and produces affect-categorizable output at the same calibrated coefficient. Methodologically, the paper presents a three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) with documented anti-patterns; the total compute is single-GPU and about 15 minutes per emotion-feature discovery cycle.

2605.18799 2026-05-20 cs.LG cs.AI cs.CL Updated version

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

ReCrit: 面向科学批评推理的转移感知强化学习

Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai

Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; National University of Singapore; Chinese University of Hong Kong; University of Oxford; Tsinghua University

AI summary: This study proposes ReCrit, a transition-aware reinforcement learning framework for scientific critic reasoning that improves critic-stage accuracy.

AI Chinese abstract

大型语言模型在批评交互中不仅可能因回答错误而失败,还可能在用户批评后放弃最初正确的科学解答。在科学推理中,这种风险尤为突出,因为用户的批评可能将正确答案变为错误答案。我们将批评交互视为跨回合正确性过渡问题,而非最终答案准确性问题,并识别出三个挑战:过渡意识、解耦有用的修正与有害的阿谀奉承,以及可扩展的回放。我们提出了ReCrit,一个基于过渡意识的强化学习框架,将初始到批评行为分解为四个象限:修正、阿谀奉承、鲁棒性和边界。ReCrit奖励修正和鲁棒性,惩罚阿谀奉承,并将持续错误视为弱边界信号。为了使交互训练实用,ReCrit进一步使用动态异步回放与尾部自适应完成以减少回放等待。在三个科学推理基准测试(ChemBench、TRQA和EarthSE)上,ReCrit在Qwen3.5-4B上将平均批评准确性从38.15提升到51.49,在Qwen3.5-9B上从45.40提升到55.59。消融实验显示,最终答案奖励提供很少的交互层面增益,而基于过渡意识的奖励和象限加权产生更可区分的训练信号和更大的净批评阶段改进。代码可在https://github.com/black-yt/ReCrit获取。

English abstract

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .
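
The four-quadrant decomposition can be made concrete with a small sketch. The quadrant names follow the abstract; the numeric reward values are purely hypothetical placeholders chosen to respect the stated ordering (reward correction and robustness, penalize sycophancy, treat persistent errors as a weak signal) and are not taken from the paper.

```python
def quadrant(initial_correct, critic_correct):
    """Classify an Initial-to-Critic transition by correctness before and after criticism."""
    if not initial_correct and critic_correct:
        return "Correction"   # wrong answer fixed after criticism
    if initial_correct and not critic_correct:
        return "Sycophancy"   # right answer abandoned after criticism
    if initial_correct and critic_correct:
        return "Robustness"   # right answer kept despite criticism
    return "Boundary"         # wrong before and after

# Hypothetical reward weights; only their signs and rough ordering reflect the abstract.
HYPOTHETICAL_REWARDS = {
    "Correction": 1.0,
    "Robustness": 1.0,
    "Sycophancy": -1.0,
    "Boundary": -0.1,
}

def transition_reward(initial_correct, critic_correct):
    """Reward for one two-turn interaction under the hypothetical weights above."""
    return HYPOTHETICAL_REWARDS[quadrant(initial_correct, critic_correct)]

for init, crit in [(False, True), (True, False), (True, True), (False, False)]:
    print(init, crit, quadrant(init, crit), transition_reward(init, crit))
```

A final-answer reward would score Correction and Robustness identically and also score Sycophancy and Boundary identically, so it cannot separate harmful capitulation from a persistent error. The transition-level view keeps those cases apart.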

2605.18796 2026-05-20 cs.LG cs.CL Updated version

UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

UCCI:用于成本最优LLM级联路由的校准不确定性

Varun Kotte

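Affiliations: Independent Researcher

AI summary: This paper proposes UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) while reducing ECE from 0.12 to 0.03.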

Comments 9 pages, 2 figures, 4 tables. Code: https://github.com/varunkotte6/ucci

AI Chinese abstract

LLM级联和模型路由通过将简单查询发送给小模型、将困难查询升级到大模型来降低推理成本,但大多数已部署的路由器使用未校准的置信度分数,并且需要针对每个工作负载调整阈值。我们提出了UCCI,一种校准优先的路由器,它通过保序回归将词元级的间隔(margin)不确定性映射为每个查询的错误概率,并通过约束成本最小化来选择升级阈值。在三个显式假设下,基于校准分数的阈值策略是成本最优的,且保序校准在期望校准误差(ECE)上达到O(n^{-1/3})的样本复杂度。在一个包含75,000个查询、由4B和12B指令微调LLM在H100 GPU上提供服务的生产级命名实体识别工作负载上,UCCI在micro-F1 = 0.91的条件下将推理成本降低了31%(95% CI:[27%, 35%]),同时将ECE从0.12降低到0.03。在相同的工作点上,UCCI优于熵阈值法、分割共形(split-conformal)路由以及FrugalGPT风格的学习阈值。所有级联结果均基于对实际模型输出的端到端路由和实测的H100延迟,而非根据全局准确率或名义API价格进行的模拟路由。

English abstract

LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per-workload threshold tuning. We present UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold. All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
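
The two steps named in this abstract, isotonic calibration followed by threshold selection under an error budget, can be sketched as below. The pool-adjacent-violators routine is the standard algorithm for isotonic regression. The cost model (the small model runs on every query, the large model runs only on escalated queries and is assumed to be always correct) and all names and numbers are simplifying assumptions for illustration, not the paper's formulation.

```python
def isotonic_fit(uncertainty, is_error):
    """Pool-adjacent-violators: nondecreasing fit of error labels against uncertainty."""
    pairs = sorted(zip(uncertainty, is_error))
    blocks = []  # each block holds [sum of labels, count]
    for _, y in pairs:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [x for x, _ in pairs], fitted

def pick_threshold(p_err, small_cost, large_cost, max_error):
    """Cheapest threshold t (escalate when p >= t) whose expected error meets the budget."""
    best = None
    n = len(p_err)
    for t in sorted(set(p_err)) + [float("inf")]:
        kept = [p for p in p_err if p < t]
        cost = small_cost * n + large_cost * (n - len(kept))
        expected_error = sum(kept) / n  # escalated queries assumed error-free
        if expected_error <= max_error and (best is None or cost < best[1]):
            best = (t, cost)
    return best

# Toy calibration data (hypothetical): uncertainty scores and observed small-model errors.
xs, calibrated = isotonic_fit([1, 2, 3, 4], [0, 1, 0, 1])
choice = pick_threshold(calibrated, small_cost=1, large_cost=10, max_error=0.25)
print(calibrated, choice)
```

Because the calibrated score is monotone in the raw uncertainty, thresholding the calibrated probability is equivalent to thresholding the raw score, but the calibrated scale lets the error budget be stated directly as a probability.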

2605.18792 2026-05-20 cs.IR cs.CL Updated version

Trust or Abstain? A Self-Aware RAG Approach

信任还是弃权?一种自感知RAG方法

Xi Zhu, Ziqi Wang, Kai Mei, Wujiang Xu, Minghao Guo, Bangji Yang, Jiajun Fan, Dimitris N. Metaxas

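Affiliations: Rutgers University; University of Illinois Urbana-Champaign

AI summary: This paper proposes SABER, a self-aware RAG approach that builds a knowledge-conflict benchmark and introduces a self-aware belief estimator to improve the accuracy and reliability of RAG under conflict, while offering a tunable balance between risk and coverage under abstention.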

AI Chinese abstract

检索增强生成(RAG)通过整合外部证据提升大语言模型(LLMs),但检索的上下文知识(CK)和参数化知识(PK)冲突或不可靠时会引入知识冲突。现有方法主要协调使用哪个来源,而未明确询问每个答案路径是否正确。我们主张忠实的RAG需要LLM的自感知能力,即能够识别自身知识和推理的局限性。为此,我们构建了一个模型特定、与事实一致的知识冲突基准,通过评估LLM主干在PK-only和CK条件下的答案路径,覆盖约69,000个查询-上下文实例,来自五个冲突问答数据集。然后我们引入SABER,一种用于RAG的自感知信念估计器,无需对LLM进行微调。SABER结合自先验和PK侧、CK侧的条件推理表示,通过两个轻量级预测器估计可靠性信念,驱动四细胞决策,包括信任PK、信任CK、信任两者或弃权。在四个LLM主干上,SABER在端到端准确性和冲突特定的忠实度上优于十种推理时间和微调基线,最大的收益出现在冲突密集的数据集上。在弃权情况下,SABER的风险覆盖曲线帕累托主导了所有基于提示的弃权者,提供了一个可调的覆盖与答案风险的平衡。我们的代码可在https://github.com/xizhu1022/SABER上获得。

English abstract

Retrieval-augmented generation (RAG) improves large language models (LLMs) by incorporating external evidence, but it also introduces knowledge conflicts when retrieved contextual knowledge (CK) and parametric knowledge (PK) disagree or are both unreliable. Existing approaches mainly coordinate which source to use, without explicitly asking whether each answer path is correct. We argue that faithful RAG requires LLM self-awareness, namely the ability to recognize the limits of its own knowledge and reasoning. To ground this problem, we construct a model-specific, ground-truth-aligned knowledge-conflict benchmark by evaluating LLM backbones on PK-only and CK-conditioned answer paths over approximately 69K query-context instances per backbone, drawn from five conflict-QA datasets. We then introduce SABER, a Self-Aware Belief Estimator for RAG that requires no LLM fine-tuning. SABER combines a self-prior with PK-side and CK-side conditional reasoning representations from multi-trace inference, then estimates reliability beliefs with two lightweight predictors to drive a 4-cell decision over trust PK, trust CK, trust either, or abstain. Across four LLM backbones, SABER improves end-to-end accuracy and conflict-specific faithfulness over ten inference-time and fine-tuning baselines, with the largest gains on conflict-heavy datasets. Under abstention, SABER's risk-coverage curve Pareto-dominates every prompt-based abstainer, providing a tunable balance between coverage and answer risk. Our code is available at https://github.com/xizhu1022/SABER.
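
The 4-cell decision can be pictured as thresholding two reliability beliefs, one for the parametric (PK) path and one for the context-conditioned (CK) path. The sketch below assumes both beliefs are probabilities in [0, 1] and uses a single hypothetical threshold `tau`; how SABER actually estimates the beliefs and sets its decision boundaries is not specified in the abstract.

```python
def decide(belief_pk, belief_ck, tau=0.5):
    """Map two reliability beliefs to one of four actions (tau is a hypothetical knob)."""
    pk_ok = belief_pk >= tau
    ck_ok = belief_ck >= tau
    if pk_ok and ck_ok:
        return "trust either"
    if pk_ok:
        return "trust PK"
    if ck_ok:
        return "trust CK"
    return "abstain"

def coverage(beliefs, tau):
    """Fraction of queries that receive an answer rather than an abstention."""
    answered = sum(1 for pk, ck in beliefs if decide(pk, ck, tau) != "abstain")
    return answered / len(beliefs)

# Hypothetical (PK belief, CK belief) pairs for four queries.
beliefs = [(0.9, 0.8), (0.7, 0.2), (0.1, 0.6), (0.3, 0.4)]
for tau in (0.35, 0.5, 0.75):
    print(tau, [decide(pk, ck, tau) for pk, ck in beliefs], coverage(beliefs, tau))
```

Raising `tau` makes the system abstain more often, which lowers coverage in exchange for answering only when at least one path looks reliable. That is the kind of tunable coverage-versus-risk balance the abstract refers to.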

2605.18772 2026-05-20 cs.IR cs.AI cs.CL Updated version

Improving Retrieval-Augmented Generation without Taxonomy-based Error Categorization

无需基于分类的错误归类即可改进检索增强生成

Gongbo Zhang, Yifan Peng, Chunhua Weng

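Affiliations: Columbia University; Weill Cornell Medicine

AI summary: This paper proposes RePAIR, which maps flawed RAG outputs directly to error-mitigating action plans without relying on fine-grained error taxonomies or explicit critic supervision, thereby improving retrieval-augmented generation performance.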

AI Chinese abstract

检索增强生成(RAG)通过将生成过程 grounding 在外部知识来提高大语言模型(LLM)输出的事实准确性。最近的代理 RAG 系统扩展了这一范式,通过关键代理评估模型响应并迭代优化输出。然而,大多数先前工作隐式假设可靠的批评反馈,专注于规划策略,而对误差纠正过程本身的鲁棒性关注有限,这可能受到对齐错误类别和无效或错误纠正的影响。本文假设 RAG 性能可以在没有显式错误分类的情况下得到改进。我们提出 RePAIR,一种响应-行动学习范式,通过直接将错误的 RAG 输出映射到错误缓解行动计划,而无需依赖细粒度错误分类和显式批评者监督。在多个基准测试中,RePAIR 一致地提高了代理 RAG 的性能。

English abstract

Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language model (LLM) outputs by grounding generation in external knowledge. Recent agentic RAG systems extend this paradigm with critical agents to evaluate model responses and iteratively refine outputs. However, most prior work implicitly assumes reliable critic feedback and focuses on planning strategies, while paying limited attention to the robustness of the error-correction process itself, which can be impacted by misaligned error categories and ineffective or incorrect corrections. Here, we hypothesize that RAG performance can be improved without explicit error categorization. We propose RePAIR, a response-action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies and explicit critic supervision. Across multiple benchmarks, RePAIR consistently improves agentic RAG performance.

2605.18769 2026-05-20 cs.IR cs.AI cs.CL Updated version

ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation

ClusterRAG: 基于聚类的协同过滤用于个性化检索增强生成

Gibson Nkhata, Uttamasha Anjally Oyshi, Quan Mai, Susan Gauch

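Affiliations: University of Arkansas; Walmart Inc.

AI summary: ClusterRAG clusters users by their profile documents and uses collaborative signals from similar users to improve personalized generation for the current user, retrieving at both the cluster and document levels for efficient personalized retrieval-augmented generation.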

Comments 17 pages, 2 figures, to be published in the proceedings of ACL 2026

AI Chinese abstract

个性化检索增强生成(RAG)依赖于准确选择与用户相关的文档。在实践中,现有RAG方法往往面临较高的检索成本,并忽略了来自相似用户的协同信号可以增强当前用户的个性化生成。我们提出了ClusterRAG,一种基于聚类的协同过滤用于个性化检索增强生成。ClusterRAG通过用户的轮廓文档来表示用户,利用基于密度的聚类将用户组织成语义连贯的集群,并通过集群级相似性和细粒度排序在集群和文档层面进行检索。在LaMP基准上的广泛实验表明,联合利用目标用户的轮廓和顶部相似用户的轮廓在各种任务中始终获得最佳性能。进一步分析显示,ClusterRAG能够无缝集成不同的密集检索器和排序器,并在与微调和零样本语言模型配对时仍保持有效。

English abstract

Personalized Retrieval-Augmented Generation (RAG) relies on accurately selecting user-relevant documents. In practice, existing RAG approaches often suffer from high retrieval costs and overlook that collaborative signals from similar users can enhance personalized generation for the current user. We propose ClusterRAG, a Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation. ClusterRAG represents users through their profile documents, organizes users into semantically coherent clusters using density-based clustering, and performs retrieval at both the cluster and document levels via cluster-level similarity and fine-grained ranking. Extensive experiments on the LaMP benchmark demonstrate that jointly leveraging the target user's profile and profiles from top similar users consistently yields the best performance across diverse tasks. Further analysis shows that ClusterRAG integrates seamlessly with different dense retrievers and rankers, and remains effective when paired with both fine-tuned and zero-shot language models.

2605.18766 2026-05-20 cs.IR cs.AI cs.CL Updated version

Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

仅检索相关表格,无论多少:自适应表格检索方法

Taehee Kim, Seungbin Yang, Jihwan Kim, Jaegul Choo

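Affiliations: KAIST AI

AI summary: This paper proposes an adaptive table retrieval method that adjusts the number of retrieved tables to each query's needs. Using an adaptive thresholding mechanism and a sliding-window reranking algorithm, it addresses the limitations of fixed top-k retrieval and improves both retrieval and downstream task performance.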

Comments ACL 2026 Findings

AI Chinese abstract

从大量数据库中检索与给定自然语言查询相关的表格对于准确回答文本到SQL等任务中的问题至关重要。现有的表格检索方法选择与查询最相似的预设数量k的表格。然而,所需表格的数量因查询而异,无法提前确定。强制无论查询如何都检索固定数量的表格可能会导致检索到的表格数量不足,无法获得所有必要的证据,或者检索到的表格过多,包含不相关的内容。为了解决这个问题,我们提出了一种自适应表格检索方法,根据每个查询的需求调整检索的表格数量。具体来说,我们利用自适应阈值机制来选择性地检索表格,并整合滑动窗口重排序算法以高效处理大量表格数据集。在Spider、BIRD和Spider 2.0上的广泛实验表明,我们的方法有效解决了传统top-k检索策略的局限性,提高了检索和下游任务的性能。我们的代码和数据可在https://github.com/sbY99/Adaptive-Table-Retrieval上获得。

English abstract

Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text-to-SQL. Existing table retrieval approaches select a pre-determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding-window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top-k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at https://github.com/sbY99/Adaptive-Table-Retrieval.
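
The contrast between fixed top-k retrieval and query-dependent retrieval can be illustrated with a simple score-ratio cutoff. The rule used here (keep every table whose similarity is at least a fraction of the best score) is one possible adaptive threshold chosen for illustration; the abstract does not say which thresholding rule the paper uses, and the table names, scores, and `ratio` parameter are hypothetical.

```python
def top_k_retrieve(scores, k):
    """Baseline: always return the k highest-scoring tables."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def adaptive_retrieve(scores, ratio=0.8):
    """Return every table scoring at least `ratio` times the best score.

    Assumes nonnegative similarity scores; `ratio` is a hypothetical knob.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    cutoff = ratio * ranked[0][1]
    return [name for name, score in ranked if score >= cutoff]

# Hypothetical query-table similarity scores for two queries.
join_query = {"orders": 0.90, "customers": 0.85, "logs": 0.30}
single_query = {"products": 0.90, "orders": 0.20, "logs": 0.10}

print(adaptive_retrieve(join_query), adaptive_retrieve(single_query))
print(top_k_retrieve(join_query, 1), top_k_retrieve(single_query, 2))
```

In this toy case the adaptive rule returns two tables for the join-style query and one for the single-table query, whereas any fixed k either misses a needed table for the first query or adds an irrelevant one for the second.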