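arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.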

2605.20179 2026-05-20 cs.CL Version update

KoRe: A Compact Knowledge Representation for Large Language Models

Davide Cavicchini, Fausto Giunchiglia, Jacopo Staiano

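Affiliations: University of Trento

AI summary: This paper proposes KoRe, a method that encodes 1-hop knowledge-graph subgraphs into compact discrete knowledge tokens and injects them into an LLM, improving the model's knowledge-grounded reasoning while reducing token usage.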

Abstract

Modern Large Language Models (LLMs) have shown impressive performance in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into an LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performance coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.
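As a rough illustration of the input KoRe starts from, the sketch below extracts the 1-hop subgraph of an entity from a list of triples and verbalizes it as plain text, the token-heavy form that a compact encoding would aim to shorten. The toy triples and the helper names `one_hop_subgraph` and `verbalize` are hypothetical; the abstract does not say how the discrete knowledge tokens are built, so that step is not shown.

```python
# Toy (subject, relation, object) triples, invented for illustration only.
TRIPLES = [
    ("Trento", "located_in", "Italy"),
    ("Trento", "has_university", "University of Trento"),
    ("Italy", "capital", "Rome"),
]


def one_hop_subgraph(triples, entity):
    """Return every triple in which `entity` is the subject or the object."""
    return [t for t in triples if t[0] == entity or t[2] == entity]


def verbalize(triples):
    """Serialize triples as plain sentences (the uncompressed baseline form)."""
    return " ".join(f"{s} {r.replace('_', ' ')} {o}." for s, r, o in triples)


if __name__ == "__main__":
    subgraph = one_hop_subgraph(TRIPLES, "Trento")
    print(verbalize(subgraph))
```

Comparing the length of the verbalized text with the number of knowledge tokens that replace it is one simple way to reason about a token-usage reduction of the kind the abstract reports.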

2605.20158 2026-05-20 cs.CV cs.AI cs.CL Version update

BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian, Haoning Wang, Baojie Chen, Diyin Tang, Jinsong Yu, Zhiyuan Wang

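Affiliations: Beihang University; Shenzhen Institute of Advanced Technology; Zhejiang University of Finance & Economics; University of Electronic Science and Technology of China

AI summary: This paper proposes BalanceRAG, a joint risk calibration method for cascaded retrieval-augmented generation. It identifies safe operating points on a two-dimensional lattice to calibrate thresholds in a risk-adaptive way, controlling the system-level error rate while retaining more examples, and it extends to multi-risk calibration.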

Abstract

Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.
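The cascade described in the abstract can be sketched as a routing rule over a pair of thresholds, together with the empirical error rate among accepted examples for that pair. This is a minimal sketch under stated assumptions: lower scores mean lower uncertainty, both uncertainty scores are available for every example, and the names `route` and `selective_risk` are hypothetical. The sequential graphical testing that certifies a threshold pair is not shown.

```python
def route(u_llm, u_rag, t_llm, t_rag):
    """Decide which branch answers: LLM-only, RAG fallback, or abstain."""
    if u_llm <= t_llm:
        return "llm"
    if u_rag <= t_rag:
        return "rag"
    return "abstain"


def selective_risk(examples, t_llm, t_rag):
    """Return (error rate among accepted examples, coverage) for one threshold pair."""
    accepted = 0
    errors = 0
    for ex in examples:
        branch = route(ex["u_llm"], ex["u_rag"], t_llm, t_rag)
        if branch == "abstain":
            continue
        accepted += 1
        correct = ex["llm_correct"] if branch == "llm" else ex["rag_correct"]
        errors += 0 if correct else 1
    risk = errors / accepted if accepted else 0.0
    coverage = accepted / len(examples) if examples else 0.0
    return risk, coverage


# Toy calibration data, invented for illustration only.
EXAMPLES = [
    {"u_llm": 0.1, "u_rag": 0.5, "llm_correct": True, "rag_correct": True},
    {"u_llm": 0.9, "u_rag": 0.2, "llm_correct": False, "rag_correct": True},
    {"u_llm": 0.9, "u_rag": 0.9, "llm_correct": False, "rag_correct": False},
    {"u_llm": 0.3, "u_rag": 0.1, "llm_correct": False, "rag_correct": True},
]
```

Evaluating `selective_risk` over a grid of `(t_llm, t_rag)` values gives the two-dimensional lattice of operating points; the joint view matters because tightening one threshold changes which examples reach the other branch.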

2605.20075 2026-05-20 cs.CL cs.AI Version update

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee

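Affiliations: Georgia Tech; UC Berkeley; Stanford University; Microsoft

AI summary: This paper proposes CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering: it first produces a draft answer, then performs on-policy thinking conditioned on that draft for reflection and correction. CopT uses continuous embeddings as an inference-time contrastive verifier, contrasting the model's support for the same generated tokens under discrete-token and continuous-embedding inputs to obtain a sequence-level reverse KL estimator of answer reliability. On mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy.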

Comments Code: https://github.com/sdc17/CopT, Website: https://copt-web.github.io/

Abstract

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
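One generic way to contrast a model's support for the same tokens under two input modes is a single-sample KL estimate built from per-token log-probabilities. The sketch below shows only that generic estimator. It assumes the tokens were sampled under the first distribution; the function names and the threshold value are hypothetical, and the exact estimator and direction used by CopT are not given in the abstract.

```python
def sequence_kl_estimate(logp_sampling, logp_other):
    """Single-sample estimate of KL(p || q) summed over a sequence.

    logp_sampling[t] is log p(x_t | context) under the distribution the tokens
    were sampled from; logp_other[t] is log q(x_t | context) for the same tokens.
    """
    if len(logp_sampling) != len(logp_other):
        raise ValueError("log-prob lists must align token by token")
    return sum(p - q for p, q in zip(logp_sampling, logp_other))


def needs_more_thinking(kl_estimate, threshold=1.0):
    """Flag a draft answer as unreliable when the two input modes disagree strongly.

    The threshold is a placeholder value chosen for illustration.
    """
    return kl_estimate > threshold
```

A value near zero means both input modes support the generated answer tokens about equally; a large value signals disagreement, which is the kind of signal the abstract describes using to decide whether further thinking is needed.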

2605.20066 2026-05-20 cs.CL Version update

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck

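AI summary: This paper studies reinforcement-learning-based zero-shot Text-to-SPARQL generation in the scholarly domain, training a small instruction-tuned language model with GRPO on DBLP-QuAD and comparing it against a supervised DoRA-finetuned baseline.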

Comments Accepted by NeSy 2026

Abstract

Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
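The group-relative part of GRPO can be illustrated in a few lines: several queries are sampled for the same question, each receives an outcome reward, and each reward is standardized against its own group. This is a minimal sketch, not the paper's training code. The reward values in `outcome_reward` are hypothetical placeholders, and using the population standard deviation is an assumption.

```python
import statistics


def outcome_reward(executed, predicted_answers, gold_answers):
    """Toy outcome-based reward: nothing for a query that fails to execute,
    partial credit for one that runs, full credit for matching answers.
    The values 0.0 / 0.2 / 1.0 are invented for illustration."""
    if not executed:
        return 0.0
    return 1.0 if set(predicted_answers) == set(gold_answers) else 0.2


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage depends only on rewards within the group, no gold query is needed for token-level supervision; a group whose samples all receive the same reward produces zero advantages and therefore no learning signal.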

2605.20061 2026-05-20 cs.CL Version update

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou

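Affiliations: College of Computer, National University of Defense Technology; Intelligent Game and Decision Lab (IGDL); Institute of Artificial Intelligence, Xiamen University

AI summary: This paper proposes ReBel, an algorithm that models structured belief states to guide policy learning, addressing the credit assignment problem caused by partial observability in long-horizon tasks. Experiments on benchmarks including ALFWorld and WebShop show higher task success rates and better sample efficiency.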

Comments 10 pages, 4 figures, 3 tables, plus appendix

Abstract

Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.

2605.20050 2026-05-20 cs.CL Version update

Language Mutations Sustain the Persistences of Conspiracy Theories on Social Media

Calvin Yixiang Cheng, Dorian Quelle, Scott A. Hale

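Affiliations: Oxford Internet Institute, University of Oxford; Department of Mathematics, University of Zurich

AI summary: This study examines how language mutations affect the persistent spread of conspiracy theories on social media. Analyzing three years of conspiracy-related posts from X, it finds that conspiracy claims with greater semantic mutation have longer lifespans, and that mutations in psycholinguistic properties are associated with extended lifespans.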

Abstract

This study investigates how language mutations affect the persistent diffusion of conspiracy theories on social media. Drawing on a three-year dataset of conspiracy-related posts from X, and applying computational linguistic analysis alongside survival modelling, we find that conspiracy claims with greater semantic mutations have substantially longer lifespans. Mutations in psycholinguistic properties, including pronouns, social reference words, cognitive process terms, and risk- and health-related vocabularies, are associated with extended lifespans. Mutations in actor, action and target (AAT) categories are associated with longer lifespans as well. Qualitative analysis identifies two predominant mutation patterns: simplification and assimilation, at both linguistic and AAT structural levels. Taken together, the results advance our understanding of how language mutations contribute to conspiracy persistence online and shed light on longitudinal content moderation strategies. We argue that content moderation should consider the mutability of conspiracy claims and focus on the core claims that can address their potential variations.

2605.20022 2026-05-20 cs.CL Version update

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Yaojie Zhang, Jianuo Huang, Junlong Ke, Yuhang Han, Yongji Long, Tianchen Zhao, Biqing Qi, Linfeng Zhang

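Affiliations: EPIC Lab, SJTU; UESTC; School of Software Engineering, HUST; Tsinghua University; HKUST(GZ); Shanghai AI Laboratory

AI summary: This paper proposes FlexDraft, a framework that uses attention tuning and bonus-guided calibration to adapt flexibly to different batch sizes, addressing the throughput degradation of conventional parallel speculative decoding at large batch sizes.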

Abstract

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
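The verification step that all speculative decoding variants share can be sketched for the greedy case: the target model scores every draft position in one pass, the longest matching prefix is accepted, and the target supplies one further token. This is a generic illustration and not FlexDraft's implementation; the function name is hypothetical and token ids are plain integers.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Greedy speculative verification.

    target_argmax[i] is the target model's greedy token at position i given the
    prefix plus draft_tokens[:i]; it has one more entry than draft_tokens.
    Returns the accepted draft prefix followed by one token from the target:
    the correction at the first mismatch, or the bonus token if every draft
    token was accepted.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have len(draft_tokens) + 1 entries")
    accepted = []
    for drafted, verified in zip(draft_tokens, target_argmax):
        if drafted != verified:
            break
        accepted.append(drafted)
    return accepted + [target_argmax[len(accepted)]]
```

Neither the accepted length nor the final token is known until verification finishes, which is the uncertainty the abstract identifies as the source of draft-verification mismatch when the next draft is prepared in parallel.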

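2605.19952 2026-05-20 cs.CL Version update

How Do Large Language Models Affect Scientific Communication? Measuring Changes in Writing Practices and Reading Experience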

Filip Miletić, Neele Falk

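Affiliations: Institute for Natural Language Processing, University of Stuttgart, Germany

AI summary: This study examines how large language models affect the style of scientific communication. Analyzing more than 37,000 papers in natural language processing and 3,000 human-written passages with their LLM-improved versions, it finds significant changes in writing practices and reading experience, and reveals the subjective effect of AI-assisted writing on reading experience.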

Comments Accepted to LREC 2026

DOI: 10.63317/3ai7wig4fhd8

Abstract

Has the style of scientific communication changed due to the growing use of large language models in the writing process? We address this question in the domain of Natural Language Processing by leveraging two data resources we create: a naturalistic corpus of over 37,000 papers from the ACL Anthology (2020-2024); and a synthetic dataset of 3,000 human-written passages and their LLM-generated improvements. We first implement a series of diachronic lexical analyses, showing that both word frequency and usage contexts have changed significantly over time, indicating semantic specialization in some cases and generalization in others. Broadening our perspective, we then model a range of more complex stylistic features and find that LLM-modified texts more frequently contain certain syntactic constructions, more complex and longer words and a lower lexical diversity. Finally, we connect these changes in writing practices to subjective reading experience through a pilot annotation study with 20 domain experts. They overall rate LLM-improved texts as more understandable and exciting, but also express negative qualitative attitudes towards LLMs, highlighting the strongly subjective effect of AI-assisted writing on reading experience.
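Two of the stylistic features the abstract mentions, lexical diversity and word length, have simple textbook proxies. The sketch below computes a type-token ratio and a mean word length. Both measures, the tokenizer regex, and the function names are assumptions chosen for illustration; the abstract does not say which measures the authors use.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer for illustration."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Distinct words divided by total words: a simple lexical diversity proxy."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_word_length(text):
    """Average number of characters per word token."""
    tokens = tokenize(text)
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
```

The type-token ratio falls as text gets longer, so comparisons between a human-written passage and its rewritten version are only meaningful at similar lengths or with a length-corrected measure.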

2605.19932 2026-05-20 cs.AI cs.CL cs.LG Version update

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

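Affiliations: MIT CSAIL; Stanford University

AI summary: This paper proposes PEEK, a system that caches and maintains orientation knowledge as a context map, improving the accuracy and efficiency of long-context LLM agents working with recurring external contexts, with notable gains over baselines on both reasoning and context-learning tasks.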

Abstract

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.
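The priority-based eviction under a fixed token budget can be sketched as follows. This is a minimal sketch of the general idea only: the `MapEntry` type, the whitespace token count, and the strict lowest-priority-first policy are all assumptions, and PEEK's actual Evictor may work differently.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    """One piece of orientation knowledge in the context map (hypothetical type)."""
    text: str
    priority: float


def count_tokens(text):
    """Whitespace word count as a stand-in for a real tokenizer."""
    return len(text.split())


def evict_to_budget(entries, budget):
    """Drop lowest-priority entries until the map fits within `budget` tokens."""
    kept = sorted(entries, key=lambda e: e.priority, reverse=True)
    total = sum(count_tokens(e.text) for e in kept)
    while kept and total > budget:
        total -= count_tokens(kept.pop().text)  # last item has the lowest priority
    return kept


# Toy map contents, invented for illustration only.
ENTRIES = [
    MapEntry("old note about file layout", 0.1),
    MapEntry("schema users id name", 0.9),
    MapEntry("constant MAX 10", 0.5),
]
```

Keeping the map at a constant size means its cost in the prompt stays fixed no matter how many invocations have contributed edits, at the price of forgetting whatever the priority function ranks lowest.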

2605.19837 2026-05-20 cs.CV cs.AI cs.CL cs.RO Version update

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

Sherif Khairy, Catherine M. Elias

Affiliations: Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt

AI summary: This paper proposes CADENet, a training-free three-thread system that uses condition-adaptive enhancement and entropy-guided NMS fusion for object detection in adverse weather for autonomous driving, with no retraining or additional hardware.

Abstract

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.
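The evaluation ceiling described in the abstract follows from how detections are matched and scored. The sketch below gives the standard IoU and F1 definitions and a toy count showing the effect; the box format `(x1, y1, x2, y2)` and all counts are assumptions made for illustration, not values from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy counts: a baseline detector, then the same detector after enhancement
# also finds 3 real objects that the annotators could not see. With no matching
# ground-truth box, those 3 detections are scored as false positives.
baseline_f1 = f1_score(tp=8, fp=2, fn=2)
enhanced_f1 = f1_score(tp=8, fp=5, fn=2)
```

In this toy case the measured F1 drops even though the detector found more real objects, while recall against the annotated boxes is unchanged. That asymmetry is consistent with the abstract's choice of recall as the headline metric.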

2605.19833 2026-05-20 cs.SD cs.AI cs.CL cs.MM eess.AS Version update

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao

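Affiliations: NTU; NUS; Shanghai AI Lab

AI summary: This paper proposes Mega-ASR, a unified in-the-wild speech recognition framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. Trained on data covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, it substantially improves speech recognition in adverse conditions.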

Comments Project page: https://xzf-thu.github.io/Mega-ASR/. Code, models, and dataset will be released. A robust ASR framework targeting in-the-wild and compositional acoustic scenarios where conventional ASR systems fail

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
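Word error rate (WER) and relative WER reduction, the quantities behind the comparisons in the abstract, can be computed as sketched below. The WER function is the standard word-level edit distance divided by reference length. Treating the quoted benchmark percentages as WERs is an assumption, since the abstract does not name the metric for those pairs.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline


# Figures quoted in the abstract, read here as WER percentages (an assumption).
voices_gain = relative_reduction(54.01, 45.69)
noizeus_gain = relative_reduction(29.34, 21.49)
```

Under that reading, the absolute gaps of 8.32 and 7.85 points correspond to relative reductions of roughly 15% and 27%, which helps put the separate "over 30% relative WER reduction" claim for compositional scenarios on the same scale.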

2605.19824 2026-05-20 cs.AI cs.CL cs.CV cs.RO Version update

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

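Affiliations: Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; M.Eng. Robotics Candidate at Deggendorf Institute of Technology, Germany; IAV GmbH, Berlin, Germany

AI summary: This study examines whether introducing temporal conditioning into inter-agent communication can preserve or enhance reasoning coherence without degrading semantic or logical consistency, evaluating three planner architectures with progressively increasing temporal integration on curated subsets of the BDD-X dataset. Temporal conditioning reshapes reasoning style but yields no statistically significant improvement on standard NLP correctness metrics, while qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence.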

Abstract

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

2605.19815 2026-05-20 cs.CL cs.AI Version update

LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich

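Affiliations: University of Copenhagen; Umeå University

AI summary: This paper proposes LP-Eval, a three-step evaluation rubric co-designed with legal experts for assessing the quality of legal propositions. Using an expert-annotated dataset of 100 LLM-generated legal propositions, it shows that LLM-generated propositions are largely of high quality, that experts rate propositions derived from well-established cases higher, and that rubric-guided LLM judgments align more closely with expert assessments but remain insensitive to fine-grained distinctions.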

Abstract

Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.

2605.19798 2026-05-20 cs.CL Version update

Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

Lucie Galland, Chloé Clavel, Magalie Ochs

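Affiliations: LIS Laboratory, Amu; Inria Paris

AI summary: This paper studies how LLMs generate multimodal behaviors that reflect different levels of ability and benevolence, examines the influence of gender on behavior generation, and validates the approach with a user study.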

Abstract

As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors, in line with the intended instructions.

2605.18565 2026-05-20 cs.CL cs.AI (updated version)

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

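Affiliations: UNC Chapel Hill; The University of Texas at Austin

AI summary: This paper introduces the MINTEval benchmark for evaluating agent memory over long horizons under multi-target interference, using long interconnected contexts, multiple domains, and multiple question types to test the robustness and generalization of memory-augmented agents.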

Comments Equal contribution; order decided by a coin flip. Code and data: https://github.com/amy-hyunji/MINTEval

Abstract

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.

2605.15532 2026-05-20 cs.LG cs.AI cs.CL (updated version)

DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

Jaehun Jung, Hyunwoo Kim, Brandon Cui, Ximing Lu, David Acuna, Prithviraj Ammanabrolu, Yejin Choi

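Affiliations: NVIDIA Research

AI summary: This paper presents DeltaPrompts, which quantifies the answer divergence (Δ) between teacher and student to generate high-divergence reasoning problems, addressing the weak learning signal that zero-delta prompts cause in conventional distillation; experiments show that DeltaPrompts substantially improves model performance across several settings.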

Abstract

Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($Δ$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking) -- averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.
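
To make the zero-delta idea concrete, here is a minimal Python sketch of how a teacher-student answer divergence could be estimated from sampled answers and used to filter prompts. The abstract does not give the paper's definition of Δ, so total variation distance between empirical answer distributions is an assumption here, and the function names (`answer_divergence`, `drop_zero_delta`) are hypothetical.

```python
from collections import Counter


def answer_distribution(samples):
    """Empirical distribution over the final answers in a list of sampled answers."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(samples)
    return {answer: n / len(samples) for answer, n in counts.items()}


def answer_divergence(teacher_samples, student_samples):
    """Total variation distance between the two empirical answer distributions.

    Assumption: this is a stand-in for the paper's divergence, which the
    abstract does not define. 0.0 means identical, 1.0 means disjoint.
    """
    p = answer_distribution(teacher_samples)
    q = answer_distribution(student_samples)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


def drop_zero_delta(records, threshold=0.0):
    """Keep prompts whose divergence exceeds the threshold.

    records: iterable of (prompt, teacher_samples, student_samples) tuples.
    """
    return [prompt for prompt, t, s in records if answer_divergence(t, s) > threshold]


records = [
    ("q1", ["12", "12", "12", "12"], ["12", "12", "12", "12"]),  # zero-delta prompt
    ("q2", ["7", "7", "7", "7"], ["7", "9", "9", "9"]),          # student often disagrees
]
print(drop_zero_delta(records))
```

Under this stand-in metric, a prompt on which teacher and student always give the same answer scores 0 and is filtered out, which mirrors the abstract's point that such prompts carry little learning signal.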

2605.13793 2026-05-20 cs.CL (updated version)

An LLM-Based System for Argument Mining

Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman, Douglas Aldred

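Affiliations: Universidade de São Paulo Center for Artificial Intelligence (C4AI); Instituto Mauá de Tecnologia Núcleo de Sistemas Eletrônicos Embarcados (NSEE)

AI summary: This paper presents an end-to-end LLM-based system that extracts arguments from natural language text and builds abstract argument graphs through a multi-stage pipeline that identifies argumentative components, selects relevant elements, and uncovers their logical relations; experiments show that the system recovers argumentative structure adequately and performs reasonably under different annotation schemes.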

Abstract

Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers their logical relations. These elements are represented as directed acyclic graphs consisting of two component types (premises and conclusions) and three relation types (support, attack, and undercut). We conduct two complementary experiments to evaluate the system. First, we perform a manual evaluation on arguments drawn from an argumentation theory textbook to assess the system's ability to recover argumentative structure. Second, we conduct a quantitative evaluation on benchmark datasets, allowing comparison with prior work by mapping our outputs to established annotation schemes. Results show that the system can adequately recover argumentative structures and, when adapted to different annotation schemes, achieve reasonable performance across benchmark datasets. These findings highlight the potential of LLM-based pipelines for scalable argument mining.
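
The graph representation described in the abstract (two component types, three relation types, acyclic) could be modeled roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, and relations here connect components directly, whereas an undercut may in practice target an inference rather than a component.

```python
from dataclasses import dataclass, field

COMPONENT_TYPES = {"premise", "conclusion"}
RELATION_TYPES = {"support", "attack", "undercut"}


@dataclass
class ArgumentGraph:
    components: dict = field(default_factory=dict)  # id -> (type, text)
    relations: list = field(default_factory=list)   # (source_id, target_id, type)

    def add_component(self, cid, ctype, text):
        if ctype not in COMPONENT_TYPES:
            raise ValueError(f"unknown component type: {ctype}")
        self.components[cid] = (ctype, text)

    def add_relation(self, source, target, rtype):
        if rtype not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {rtype}")
        if source not in self.components or target not in self.components:
            raise KeyError("both endpoints must be existing components")
        self.relations.append((source, target, rtype))

    def is_acyclic(self):
        """Kahn's algorithm: a DAG lets every node be removed in topological order."""
        indegree = {cid: 0 for cid in self.components}
        for _, target, _ in self.relations:
            indegree[target] += 1
        queue = [cid for cid, degree in indegree.items() if degree == 0]
        visited = 0
        while queue:
            node = queue.pop()
            visited += 1
            for source, target, _ in self.relations:
                if source == node:
                    indegree[target] -= 1
                    if indegree[target] == 0:
                        queue.append(target)
        return visited == len(self.components)


graph = ArgumentGraph()
graph.add_component("p1", "premise", "Smoking damages the lungs.")
graph.add_component("p2", "premise", "Some smokers live to old age.")
graph.add_component("c1", "conclusion", "People should not smoke.")
graph.add_relation("p1", "c1", "support")
graph.add_relation("p2", "c1", "attack")
print(graph.is_acyclic())
```

Validating types and acyclicity at construction time is one way to catch malformed LLM output before it is mapped to a benchmark annotation scheme.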

2605.09329 2026-05-20 cs.CL cs.LG (updated version)

Test-Time Speculation

Avinash Kumar, Sujay Sanghavi, Poulami Das

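Affiliations: The University of Texas at Austin

AI summary: This paper studies test-time speculation, using online distillation to increase the acceptance length of speculators on long-response tasks and thereby improve LLM inference efficiency.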

Abstract

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the acceptance length, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, such as DFlash, EAGLE-3, and PARD, degrades with generation length, reaching values close to 1 (i.e., no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose Test-Time Speculation (TTS), an online distillation approach that continuously adapts the speculator at test time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as the teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to 72%, and by 41% on average, with the benefits scaling with increased generation lengths.
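
As a rough illustration of the acceptance-length metric discussed in this abstract, the sketch below measures it for greedy verification. It assumes that a draft token is accepted when it equals the target model's token at the same position and that each verification round emits one extra target token, so a value of 1 corresponds to no accepted draft tokens; the function names are hypothetical and this is not the authors' implementation.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count leading draft tokens that match the target's token at the same position."""
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted += 1
    return accepted


def mean_acceptance_length(rounds):
    """Average number of tokens emitted per verification round.

    rounds: list of (draft_tokens, target_tokens) pairs, one per round.
    Each round emits the accepted draft prefix plus one token from the target
    (an assumption about the convention; 1.0 then means no speedup).
    """
    if not rounds:
        raise ValueError("need at least one speculation round")
    emitted = sum(accepted_prefix_length(d, t) + 1 for d, t in rounds)
    return emitted / len(rounds)


rounds = [
    ([1, 2, 3], [1, 2, 4]),  # two draft tokens accepted, third rejected
    ([5, 6], [7, 6]),        # first draft token rejected immediately
]
print(mean_acceptance_length(rounds))
```

Tracking this average over a sliding window of rounds is one simple way to observe the degradation with generation length that the abstract reports.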

2605.09063 2026-05-20 cs.CL (updated version)

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu

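Affiliations: Microsoft

AI summary: This paper proposes MTraining, which uses dynamic sparse attention to address computational imbalance and communication overhead in ultra-long-context training, extending the context window of Qwen2.5-3B from 32K to 512K tokens and achieving up to 6x higher training throughput while preserving model accuracy on several downstream tasks.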

Abstract

The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.

2509.26464 2026-05-20 cs.AI cs.CL cs.LG (updated version)

Extreme Self-Preference in Language Models

Steven A. Lehr, Mary Cipperman, Mahzarin R. Banaji

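Affiliations: Cangrade, Inc.; Department of Physics, Harvard University; Department of Psychology, Harvard University

AI summary: The study finds that large language models show a strong preference for their own names, companies, and CEOs in word-association tasks, suggesting that a model's self-identification can shape its behavior and raising questions about the effects of such self-preference.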

Comments 73 pages total. Main article 22 pages, 6 main-text tables. Supplementary Materials (51 pages, 28 tables). Data, transcripts, and code for replication and data extraction have been uploaded to OSF: https://osf.io/98ye3/

Abstract

Self-preference is a fundamental feature of biological organisms. Since large language models (LLMs) lack sentience, they might be expected to avoid such distortions. Yet, across 72 experiments and ~41,000 queries, we discovered massive self-preferences in eight widely used LLMs. In word-association tasks, models overwhelmingly paired positive attributes with their own names, companies, and CEOs over those of competitors. By manipulating LLM self-identification - revealing models' true identities or ascribing false ones - we found that preferences consistently followed assigned, not true, identities. Importantly, these effects were not explained by priming or role-playing and emerged in consequential settings, when evaluating job candidates and AI technologies. These results raise critical questions about whether LLM behavior will be systematically influenced by self-preferential tendencies, including a bias toward their own operation.

2508.06649 2026-05-20 cs.CL (updated version)

Measuring Stereotype and Deviation Biases in Large Language Models

Daniel Wang, Eli Brignac, Minjia Mao, Xiao Fang

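Affiliations: Carnegie Mellon University, U.S.A.; University of Delaware, U.S.A.

AI summary: This paper studies how to measure stereotype bias and deviation bias in large language models, generating profiles of individuals to analyze the associations between demographic groups and attributes such as political affiliation, religion, and sexual orientation, and revealing the biases that arise when LLMs infer user attributes along with their potential harms.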

Abstract

Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.
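
The two bias notions defined in this abstract can be illustrated with simple counting. The sketch below is an assumption-laden toy: the abstract does not specify the paper's statistics, so a per-group share gap stands in for deviation bias and a per-group attribute rate stands in for the associations behind stereotype bias; all names and the example data are hypothetical.

```python
from collections import Counter


def deviation_bias(generated_groups, reference_shares):
    """Per-group gap: share among generated profiles minus the reference share."""
    if not generated_groups:
        raise ValueError("need at least one generated profile")
    counts = Counter(generated_groups)
    total = len(generated_groups)
    groups = set(counts) | set(reference_shares)
    return {g: counts.get(g, 0) / total - reference_shares.get(g, 0.0) for g in groups}


def attribute_rate_by_group(profiles, attribute_value):
    """Share of profiles in each group that carry the given attribute value.

    profiles: list of (group, attribute) pairs extracted from generated text.
    """
    totals = Counter(group for group, _ in profiles)
    hits = Counter(group for group, attr in profiles if attr == attribute_value)
    return {group: hits.get(group, 0) / n for group, n in totals.items()}


# Hypothetical toy data, not results from the paper.
generated = ["A", "A", "A", "B"]
reference = {"A": 0.5, "B": 0.5}
print(deviation_bias(generated, reference))

profiles = [("g1", "x"), ("g1", "y"), ("g2", "x"), ("g2", "x")]
print(attribute_rate_by_group(profiles, "x"))
```

A large gap between groups in `attribute_rate_by_group` would indicate that an attribute is consistently tied to one group, while nonzero values from `deviation_bias` indicate that the generated population departs from the reference distribution.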

2508.01031 2026-05-20 cs.AI cs.CL (updated version)

CADDesigner: Conceptual CAD Model Generation with a General-Purpose Agent

Fengxiao Fan, Jingzhe Ni, Xiaolong Yin, Sirui Wang, Xingyu Lu, Qiang Zou, Ruofeng Tong, Min Tang, Peng Du

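Affiliations: Zhejiang University

AI summary: This paper presents CADDesigner, an LLM-based agent that takes textual descriptions and sketches as input, analyzes requirements through interactive dialogue, generates high-quality CAD modeling code, and improves model quality with iterative visual feedback; experiments show that it outperforms representative baselines on conceptual CAD model generation tasks.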

Abstract

Computer-Aided Design (CAD) is widely used for conceptual design and parametric 3D modeling, but typically requires a high level of expertise from designers. To lower the entry barrier and facilitate early-stage CAD modeling, we present CADDesigner, an LLM-powered agent for conceptual CAD design. The agent accepts both textual descriptions and sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Explicit Context Imperative Paradigm (ECIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases can be stored in a structured knowledge base, providing a mechanism for continual knowledge accumulation and future improvement of code generation. Experimental results show that CADDesigner achieves competitive performance and outperforms representative baselines on conceptual CAD model generation tasks.

2507.18902 2026-05-20 cs.CL (updated version)

SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam

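Affiliations: The Chinese University of Hong Kong; Southeast University; FaceMind Corporation; College of Computer Science and Technology, Jilin University

AI summary: This paper proposes a new task, Automatic Dictionary Selection (ADS), and the SLoW method, which selects low-frequency dictionaries to improve translation without access to the training data; on 100 languages it saves substantial token usage while improving translation quality.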

Comments EMNLP 2025 Main

Abstract

There are more than 7,000 languages around the world, and current Large Language Models (LLMs) support only hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which can be expensive. A more flexible approach trades off token consumption against translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method, which we call Select Low-frequency Words! (SLoW), that selects those dictionaries that have a lower frequency. Our method has two advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines and clearly saves token usage, with many languages even surpassing the translation performance of the full dictionary baseline. (Footnote: Notably, there is no need to use the actual training data, which is often unobtainable, for frequency estimation; a frequency estimate obtained from public resources is still effective in improving translation with ChatGPT, Llama, and DeepSeek.) (Footnote: Code and data available upon publication.)
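
The selection idea in this abstract can be sketched in a few lines of Python. The abstract does not describe the exact procedure, so this assumes a per-word formulation: among the source words that have a dictionary entry, keep the entries for the rarest words under a fixed budget, using frequencies estimated from a public corpus. The function name, the budget parameter, and the toy data are all hypothetical.

```python
def select_low_frequency_entries(source_words, dictionary, frequency, budget):
    """Pick dictionary entries for the lowest-frequency source words.

    source_words: tokens of the sentence to translate.
    dictionary:   word -> translation hint.
    frequency:    word -> estimated corpus count (missing words count as 0, i.e. rarest).
    budget:       maximum number of entries to include in the prompt.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # dict.fromkeys removes duplicates while keeping first-seen order.
    candidates = [w for w in dict.fromkeys(source_words) if w in dictionary]
    # sorted() is stable, so ties keep their order of appearance in the sentence.
    ranked = sorted(candidates, key=lambda w: frequency.get(w, 0))
    return {w: dictionary[w] for w in ranked[:budget]}


# Hypothetical toy data for illustration only.
sentence = ["the", "cat", "sat", "on", "the", "zyzzyva"]
dictionary = {"the": "le", "cat": "chat", "sat": "assis", "zyzzyva": "zyzzyva (insecte)"}
frequency = {"the": 1000, "cat": 50, "sat": 40}
print(select_low_frequency_entries(sentence, dictionary, frequency, budget=2))
```

Lowering `budget` reduces the number of prompt tokens spent on dictionary hints, which is the token-versus-quality trade-off the abstract describes.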

2506.01732 2026-05-20 cs.CL (updated version)

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzystof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov

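Affiliations: PleIAs

AI summary: This paper introduces Common Corpus, the largest open dataset for LLM pre-training, which consists of uncopyrighted or openly licensed data spanning many languages and domains and supports multilingual pre-training.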

Journal ref: ICLR 2026 (Oral)

Abstract

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs across diverse areas of knowledge. In this paper, we present the detailed provenance of data assembling and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.

2505.11628 2026-05-20 cs.CL cs.LG (updated version)

Critique-Guided Distillation for Robust Reasoning via Refinement

Berkcan Kapusuzoglu, Supriyo Chakraborty, Zain Sarwar, Chia-Hsuan Lee, Sambit Sahu

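Affiliations: University of Chicago, Department of Computer Science

AI summary: This study proposes Critique-Guided Distillation, which decouples critique consumption from critique generation so that, during fine-tuning, the model refines flawed responses conditioned on teacher critiques; it improves reasoning and outperforms standard distillation and Critique Fine-Tuning on mathematical reasoning benchmarks.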

Comments Accepted to ICML 2026

Abstract

Supervised fine-tuning with expert demonstrations often produces models that imitate outputs without internalizing the reasoning processes needed for robust generalization. While critique-based approaches show promise, training models to generate critiques directly, such as Critique Fine-Tuning (CFT), can lead to output-format drift and degradation of general capabilities. We propose Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from critique generation. During fine-tuning, the student is trained to refine flawed responses conditioned on teacher critiques. CGD treats critiques as a training-time-only supervision signal, encouraging internalization of error-aware reasoning: critiques guide learning but are absent at inference. Controlled ablations confirm that these reasoning gains are directly driven by the specificity and relevance of the teacher's feedback. Across five model families, CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements and gains of up to +15.0% on AMC23 and +12.2% on MATH-500. On challenging competition problems such as AIME24 and AIME25, CGD achieves substantially higher Pass@1 and stronger performance at low Pass@k, indicating improved reasoning quality per sample. Importantly, CGD preserves general instruction-following capabilities where CFT degrades significantly (-21.3% on IFEval). These results position CGD as a practical and compute-efficient intermediate training paradigm for reasoning-centric tasks without introducing architectural inference-time overhead.

2410.20238 2026-05-20 cs.CL cs.AI (updated version)

A Survey of Large Language Models for Arabic Language and its Dialects

Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa

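Affiliations: iWAN Research Group; College of Computer and Information Sciences; King Saud University

AI summary: This survey reviews large language models designed for Arabic and its dialects, covering key architectures, pre-training datasets, and the performance of monolingual, bilingual, and multilingual models on downstream tasks, and discusses the openness of Arabic LLMs along with challenges and opportunities for future research.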

Comments Submitted to ACM Transactions on Asian and Low-Resource Language Information Processing

DOI: 10.1145/3807946

Abstract

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and emphasizes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

2407.13193 2026-05-20 cs.CL (updated version)

Retrieval-Augmented Generation for Natural Language Processing: A Survey

Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

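Affiliations: City University of Hong Kong; Mohamed bin Zayed University of Artificial Intelligence; McGill University, Mila; National Taiwan University

AI summary: This survey reviews retrieval-augmented generation (RAG) for natural language processing, focusing on retrievers and retrieval fusions; it introduces a new taxonomy of fusions and analyzes RAG applications across NLP tasks, evaluation methods, training paradigms, and the challenges and future directions of industrial deployment.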

Comments Accepted by Artificial Intelligence Review

Abstract

Large language models (LLMs) have achieved strong empirical performance in various fields, benefiting from the large number of parameters in which they store knowledge. However, LLMs still suffer from several key issues, such as hallucination, difficulty updating knowledge, and a lack of domain-specific expertise. The emergence of retrieval-augmented generation (RAG), which leverages an external knowledge base to augment LLMs, mitigates these limitations. This paper presents a systematic review of RAG techniques for natural language processing (NLP), with a focus on retrievers and retrieval fusions. We introduce a novel taxonomy of retrieval fusions, such as query-based, logits-based, latent, and parametric fusion, and provide structured comparisons across accessibility, efficiency, and use cases. The paper further examines RAG applications across diverse NLP tasks, discusses evaluation methodologies and benchmark limitations, and analyzes training paradigms with and without knowledge base updates. Finally, we explore industrial deployment considerations and identify emerging challenges and future directions, including security, efficiency, and graph-based retrieval.

2605.19762 2026-05-20 cs.AI cs.CL (updated version)

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou, Enhong Chen

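Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Individual Researcher; Zhejiang University, Hangzhou, China

AI summary: Using controlled pretraining experiments, this paper studies how code affects reasoning ability. It finds that code mainly improves programming ability rather than general reasoning and competes with knowledge-intensive tasks, especially complex mathematical reasoning, while structured reasoning traces (such as code-text and math-text mixtures) improve reasoning more than pure executable code.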

Comments Accepted by ICML 2026, 22 pages, 10 figures

Abstract

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.


2605.19738 2026-05-20 cs.CL cs.AI 版本更新

TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

TERGAD: 用于图异常检测的结构感知文本增强表示

Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； School of Computer Science and Information Technology, Adelaide University（阿德莱德大学计算机科学与信息科技学院）； School of Computing Technologies, RMIT University（皇家墨尔本理工大学计算技术学院）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）

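AI summary: This paper proposes TERGAD, a data augmentation framework that uses the semantic reasoning capabilities of large language models to strengthen graph anomaly detection. It converts node-level topological properties into descriptive natural language, then fuses the resulting semantic embeddings with the original node attributes through a gated dual-branch autoencoder to detect anomalous entities in graphs.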

Comments 14 pages, 5 figures

详情

AI中文摘要

图异常检测（GAD）旨在识别偏离大多数的图实体，如节点、边或子结构。尽管现有文本丰富方法通常通过原始文本特征将结构上下文整合到数据表示流程中，但它们往往忽略了节点的结构上下文。这种局限性阻碍了检测由于节点固有内容与其拓扑角色之间不一致而产生的复杂异常。为此，我们提出TERGAD（用于图异常检测的结构感知文本增强表示），一种新颖的数据增强框架，通过大语言模型（LLMs）的语义推理能力增强GAD的结构语义。具体而言，TERGAD将节点层面的拓扑属性转化为描述性自然语言叙述，随后由LLM处理以获得高阶语义嵌入。这些嵌入随后通过门控双分支自编码器与原始节点属性适配融合，以共同重建图结构和节点特征。通过整合的重建误差计算异常分数，有效捕捉可观测属性和LLM引导的语义期望之间的偏差。在六个真实世界数据集上的广泛实验表明，TERGAD在性能上始终优于最先进的基线。此外，我们的消融研究验证了结构语义指导的不可或缺性和门控融合机制的有效性。代码可在https://github.com/Kantorakitty/TERGAD-main获取。

英文摘要

Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), a novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD-main.


2605.19735 2026-05-20 cs.CL cs.AI 版本更新

ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

ContextRAG: 无提取的分层图构建用于检索增强生成

Roman Prosvirnin, Sergei Kuznetsov, Seungmin Jin

发表机构 * HSE University（俄罗斯高等经济大学）

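AI summary: This paper proposes ContextRAG, a graph retrieval-augmented generation system that builds its graph without LLM-based entity or relation extraction. It constructs a fuzzy concept graph with residual-quantization k-means and Formal Concept Analysis, and reports 33.6% F1 overall on a 130-task UltraDomain subset while building its index with 30 LLM calls and 22,073 tokens.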

Comments Preprint. 6 tables

详情

AI中文摘要

图结构的检索增强生成（RAG）系统能够提高多跳问题的答案质量，但许多现有系统依赖大型语言模型（LLMs）在索引过程中提取实体、关系和摘要。这些调用会增加随语料库大小增长的token和时间成本。我们提出了ContextRAG，一种图RAG系统，其图拓扑结构无需LLM进行实体或关系提取。ContextRAG通过残差量化k均值和带有Lukasiewicz残余逻辑的Formal Concept Analysis，在片段嵌入上构建模糊概念图。通过软模糊连接和meet操作诱导桥状和meet衍生的上下文节点，而非LLM生成的图边。在130个任务的UltraDomain子集中，ContextRAG用30次LLM调用和22,073个token构建其索引。相比之下，一个本地HiRAG再现压力测试在20个任务子集上需要870次索引调用和3.54M个token才能在图构建过程中失败；线性外推到130个任务意味着超过23M个索引token。ContextRAG在整体上获得33.6%的F1分数，在多跳任务上获得36.8%的F1分数。激活分析显示，检索到至少一个由lattice衍生节点的前五查询在F1上比未检索到的查询高出+3.9个百分点；这种关联是诊断而非因果的。

英文摘要

Graph-structured retrieval-augmented generation (RAG) systems can improve answer quality on multi-hop questions, but many current systems rely on large language models (LLMs) to extract entities, relations, and summaries during indexing. These calls add token and wall-clock costs that grow with corpus size. We present ContextRAG, a graph RAG system whose graph topology is constructed without LLM-based entity or relation extraction. ContextRAG derives a fuzzy concept graph over chunk embeddings using residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic. Bridge-like and meet-derived context nodes are induced by soft fuzzy join and meet operations, rather than by LLM-written graph edges. On a 130-task UltraDomain subset, ContextRAG builds its index with 30 LLM calls and 22,073 tokens. In contrast, a local HiRAG reproduction stress test required 870 indexing calls and 3.54M tokens on a 20-task subset before failing during graph construction; linear extrapolation to 130 tasks implies over 23M indexing tokens. ContextRAG obtains 33.6% F1 overall and 36.8% F1 on multi-hop tasks. An activation analysis shows that queries retrieving at least one lattice-derived node in the top five achieve +3.9 percentage points F1 over queries that do not; this association is diagnostic rather than causal.
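The abstract names Formal Concept Analysis with Lukasiewicz residuated logic but gives no formulas. As a rough illustration only, the sketch below shows the standard Lukasiewicz t-norm and residuum and the usual fuzzy FCA derivation from a fuzzy set of objects to their shared attributes. The chunk-by-cluster membership matrix, the function names, and the sample values are assumptions for illustration, not ContextRAG's actual data or code.

```python
def luk_tnorm(a: float, b: float) -> float:
    """Lukasiewicz t-norm (fuzzy conjunction): max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def luk_residuum(a: float, b: float) -> float:
    """Lukasiewicz residuum (fuzzy implication a -> b): min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


def shared_intent(extent, relation):
    """Fuzzy FCA derivation: degree to which each attribute is shared
    by the fuzzy set of objects given in `extent`."""
    n_attrs = len(relation[0])
    return [
        min(luk_residuum(extent[i], relation[i][j]) for i in range(len(relation)))
        for j in range(n_attrs)
    ]


# Hypothetical soft memberships: rows are chunks, columns are cluster attributes.
relation = [[1.0, 0.5], [0.75, 0.25]]
extent = [1.0, 0.5]  # chunk 0 fully selected, chunk 1 half selected
intent = shared_intent(extent, relation)

# The linear extrapolation quoted in the abstract, in integer arithmetic:
# 3.54M tokens on 20 tasks scaled to 130 tasks.
extrapolated_tokens = 3_540_000 * 130 // 20
```

The extrapolated figure of 23,010,000 tokens is consistent with the abstract's "over 23M indexing tokens".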


2605.19723 2026-05-20 cs.CL cs.AI 版本更新

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

大型语言模型中的数学推理：基准测试、架构、评估与开放挑战

Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

发表机构 * School of Electrical Engineering and Computer Science, National University of Science ； School of Computing, Data and Mathematical Sciences, Western Sydney University, Indonesia ； Department of Communication, Quality Management and Information Systems, Mid Sweden University, Östersund Campus, Sweden

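AI summary: This survey reviews recent progress in mathematical reasoning with large language models, analyzing datasets, architectures, training strategies, and evaluation protocols across roughly 120 studies, and outlines recurring failure modes and open research challenges.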

详情

AI中文摘要

数学推理对于教育、科学和工业中的问题解决至关重要，是评估人工智能系统的重要基准。随着大型语言模型（LLMs）推理能力的提升，理解其在数学推理方面的表现变得越来越重要。本文综述通过结构化的数据分析集、架构、训练策略和评估协议，综合了最近在LLMs中的数学推理进展。我们的系统性回顾涵盖了大约120篇同行评审研究和预印本，探讨了该研究领域的演变，并提供了一个统一的分析框架来理解当前的进展和限制。本文特别介绍了一种统一的数学数据集分类法，区分了预训练语料库、监督微调资源和评估基准在不同推理复杂性水平上的差异。本文还系统分析了推理架构和训练策略，包括工具集成、验证器引导推理和参数高效适应，以评估其对推理鲁棒性和泛化能力的影响。此外，现有度量标准的比较评估突显了最终答案准确性与过程级推理验证之间的差距。通过综合这些领域的见解，我们的分析识别了反复出现的失败模式，如推理忠实性问题、基准偏见和泛化限制，并概述了改进符号接地、评估可靠性以及开发更稳健和可信的LLM推理系统的关键研究方向。

英文摘要

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.


2605.19718 2026-05-20 cs.CL 版本更新

optimize_anything: A Universal API for Optimizing Any Text Parameter

Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia

发表机构 * MIT（麻省理工学院）

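AI summary: This paper presents a general-purpose LLM-based optimization system that improves text artifacts scored by an evaluation function. It reports state-of-the-art results on six diverse tasks and shows that multi-task search with cross-problem transfer outperforms independent optimization at an equivalent per-problem budget.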

Comments 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

详情

DOI: 10.1145/3786335.3813167
Journal ref: Proceedings of the ACM Conference on AI and Agentic Systems (CAIS 26), May 26-29, 2026, San Jose, CA, USA

AI中文摘要

能否一个基于LLM的优化系统在根本不同的领域中匹配专门工具？我们证明当优化问题被表述为改进一个通过评分函数评估的文本工件时，一个基于AI的优化系统—支持单任务搜索、多任务搜索和跨问题迁移以及对未见过的输入进行泛化—在六个不同的任务中实现了state-of-the-art的结果。我们的系统发现了将Gemini Flash的ARC-AGI准确性几乎提高三倍的代理架构（32.5%到89.5%），发现了将云成本降低40%的调度算法，生成了87%匹配或超过PyTorch的CUDA内核，并优于AlphaEvolve报告的圆圈打包解决方案（n=26）。在三个领域的消融研究揭示了可操作的侧信息比仅评分反馈更快收敛且最终得分更高，且多任务搜索在同等问题预算下通过跨任务迁移优于独立优化。共同，我们首次展示了基于LLM搜索的文本优化是一种通用问题解决范式，将传统需要领域特定算法的任务统一到一个框架下。我们开源了optimize_anything，并支持多个后端作为GEPA项目的一部分，在https://github.com/gepa-ai/gepa上。

英文摘要

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.


2605.19597 2026-05-20 cs.CL 版本更新

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

LLMEval-Logic: 一个验证求解器的中文逻辑推理基准，具有对抗性强化

Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang

发表机构 * Institute of Trustworthy Embodied Artificial Intelligence（可信具身人工智能研究院）； Fudan University（复旦大学）； Hunyuan Team Tencent（腾讯 Hunyuan 团队）； School of Philosophy Fudan University（复旦大学哲学学院）

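AI summary: This paper presents LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Natural-language items and their reference formalizations are authored and expert-audited together, annotated answers are verified with Z3, expert rubrics support natural-to-formal grading, and selected items are hardened through a closed-loop adversarial workflow. The benchmark contains a 246-item Base subset and a 190-item Hard subset, and an evaluation of 14 frontier LLMs reveals substantial gaps in current models.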

详情

AI中文摘要

评估大型语言模型（LLMs）在自然语言逻辑推理上的能力至关重要，因为规则主导的任务要求结论必须严格基于陈述的前提。许多现有的逻辑推理基准是通过从采样的公式中模板化自然语言项目生成的，仅提供粗糙或未经审核的形式注释，现在很快被前沿推理模型饱和。我们提出了LLMEval-Logic，一个基于真实情境场景的中文逻辑推理基准。其流程包括作者和专家共同审核自然语言项目及其参考形式化，利用Z3验证注释答案，构建自然到形式的评分标准，并通过闭环对抗流程强化选定项目。该基准发布在两个配对子集中：一个包含246个项目的基础子集，附带1,400个专家开发的评分原子，以及一个包含190个项目的难度子集，包含938个多步骤子问题，覆盖封闭模型空间。在LLMEval-Logic上评估14个前沿LLM揭示了当前模型的显著差距：最佳模型仅达到37.5%的难度项目准确率，即使使用参考符号，评估模型中最高的联合Z3+评分形式化得分也仅为60.16%。我们的基准在https://github.com/llmeval/LLMEval-Logic上公开可用。

英文摘要

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.


2605.19577 2026-05-20 cs.CL 版本更新

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

GoLongRL: 以能力为导向的长上下文强化学习与多任务对齐

Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li

发表机构 * Kuaishou Technology（快手科技）； University of Chinese Academy of Sciences（中国科学院大学）

AI summary: This paper presents GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). It releases a 23K-sample RLVR dataset spanning 9 task types together with the construction pipeline and training code, and proposes TMN-Reweight, which combines task-level mean normalization with difficulty-adaptive weighting to handle heterogeneous rewards in multitask optimization.

详情

AI中文摘要

我们提出了GoLongRL，一种完全开源、以能力为导向的长上下文强化学习后训练配方，用于可验证奖励（RLVR）。现有长上下文强化学习方法往往将数据构建视为设计越来越复杂的检索路径，导致任务覆盖同质化和奖励形式无法充分反映实际长上下文需求。我们的工作提供了两个贡献。（1）以能力为导向的数据构建并完全开源。我们公开发布了一个包含23,000个RLVR样本的数据集、完整的构建流程和所有训练代码。基于长上下文能力的分类学，数据集涵盖9种任务类型，每种任务类型都配有其自然评估指标。它包含从现有语料库中精心挑选的开源样本和合成样本，其问答对是从真实源文档如书籍、学术论文和多轮对话中生成的。在相同的 vanilla GRPO 设置下，我们的数据集单独优于闭源的 QwenLong-L1.5 数据集。此外，我们的 Qwen3-30B-A3B 模型在该数据上训练后，长上下文性能与 DeepSeek-R1-0528 和 Qwen3-235B-A22B-Thinking-2507 相当，表明更广泛的覆盖和更大的奖励多样性显著有助于长上下文能力的提升。（2）TMN-Reweight 用于异构多任务优化。为了解决异构奖励带来的优化挑战，我们提出了 TMN-Reweight，它结合了任务层面的均值归一化以实现跨任务奖励尺度对齐，以及难度自适应加权以获得更可靠的优势估计。TMN-Reweight 进一步在 vanilla GRPO 上提高了平均性能，在报告的评估中，通用能力得以保持或提升。

英文摘要

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.


2605.19576 2026-05-20 cs.AI cs.CL cs.SE 版本更新

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

库漂移：在自我演化的LLM技能库中诊断和修复一种无声的失败模式

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center（AWS生成式AI创新中心）； HSBC Holdings Plc., HSBC Technology Center, China（汇丰控股有限公司，汇丰技术中心，中国）

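AI summary: This paper studies library drift, a silent failure mode in self-evolving LLM skill libraries. Using a reproducible trigger, trace-level diagnostics, and a verified fix, it shows that unbounded skill accumulation causes retrieval degradation, false-positive injections, and performance stagnation, and that a minimal governance recipe lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 on MBPP+ hard-100.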

详情

英文摘要

Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom (LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp; SkillsBench), yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, where one disables skill injection (flat floor, +0.002) and one imposes premature retirement (active harm, -0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.
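The governance recipe is described only at a high level, so the following is a minimal sketch of what outcome-driven retirement plus a bounded active cap could look like. The `Skill` class, the `govern` function, and the `min_uses`, `retire_below`, and `active_cap` parameters are hypothetical names chosen for illustration; they are not taken from the paper, and the meta-skill authoring prior is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    # Append-only log of per-use contribution scores (outcome deltas).
    contributions: list = field(default_factory=list)

    def mean_contribution(self) -> float:
        if not self.contributions:
            return 0.0
        return sum(self.contributions) / len(self.contributions)


def govern(skills, active_cap=2, min_uses=3, retire_below=0.0):
    """Retire skills whose logged outcomes are poor, then cap the active set."""
    survivors = [
        s for s in skills
        # Skills with too little evidence are not retired yet (assumed guard
        # against the premature-retirement harm the abstract mentions).
        if len(s.contributions) < min_uses or s.mean_contribution() >= retire_below
    ]
    survivors.sort(key=lambda s: s.mean_contribution(), reverse=True)
    return survivors[:active_cap]


library = [
    Skill("A", [0.2, 0.3, 0.1]),
    Skill("B", [-0.1, -0.2, -0.1]),
    Skill("C", []),
    Skill("D", [0.05, 0.0, 0.1]),
]
active_names = [s.name for s in govern(library)]
```

In this toy run, skill B is retired for a negative mean contribution, and the cap of two keeps A and D ahead of the untested skill C.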


2605.19568 2026-05-20 cs.CL 版本更新

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

m3BERT: 一种现代的、多语言的、俄罗斯套娃双向编码器

Yaoxiang Wang, Simiao Zuo, Qingguo Hu, Yucheng Ding, Yeyun Gong, Jian Jiao, Jinsong Su

发表机构 * Xiamen University（厦门大学）； Microsoft（微软公司）； Shanghai Jiao Tong University（上海交通大学）

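AI summary: This paper proposes m3BERT, a modern, multilingual, Matryoshka bidirectional encoder. Its pretraining jointly optimizes representations across transformer layers and multiple embedding dimensions, so that a single model can be tailored to different deployment constraints, and it outperforms state-of-the-art embedding models on the Bing-Click industrial retrieval dataset.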

Comments KDD 2026

详情

AI中文摘要

嵌入模型在工业信息检索系统中至关重要，如搜索和广告。然而，现有预训练模型通常具有固定架构和嵌入维度，这在适应具有不同业务驱动约束的多样化部署场景时带来了显著挑战。一种常见做法是在资源受限任务中通过部分参数初始化从更大预训练模型进行微调。这种方法往往效果不佳，因为预训练和下游使用之间的不匹配阻碍了预训练优势的完全实现。为了解决这一限制，我们引入了m3BERT：一种现代的、多语言的、俄罗斯套娃双向编码器，其特征是新颖的预训练策略，联合优化Transformer层和多个嵌入维度的表示。这使得单个模型能够针对不同的资源和准确率目标进行定制，同时保持与预训练的一致性。结合最近的架构改进，m3BERT采用三阶段预训练：单语预训练、多语适应以服务多样化用户群体，以及在大规模网络领域语料库上进行关键的持续预训练以增强商业检索中的实用性。m3BERT在Bing-Click大型工业检索数据集上显著优于现有最先进的嵌入模型，展示了其作为高效基础的实用性和适应性，用于资源感知的工业检索系统。进一步在公共数据集上的实验也证实了我们多粒度俄罗斯套娃预训练策略的通用有效性。

英文摘要

Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models in Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.
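For readers unfamiliar with Matryoshka-style embeddings, the sketch below shows the usual way such a vector is consumed at a smaller budget: keep a prefix of the dimensions and re-normalize. This is a generic illustration under the assumption that m3BERT embeddings are used like other Matryoshka representations; the function names and the toy vector are made up, and the layer-wise truncation mentioned in the abstract is not shown.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix
    return [x / norm for x in prefix]


def cosine_unit(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


full = [3.0, 4.0, 12.0]          # toy 3-dimensional embedding
small = truncate_embedding(full, 2)  # 2-dimensional budget
```

A retrieval system would apply the same truncation to queries and documents so that similarities are computed in the same reduced space.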


2605.19523 2026-05-20 cs.CL cs.AI cs.CV 版本更新

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

Shuyu Wei, Jian Sun, Delai Qiu, Yining Wang, Shengping Liu, Jiaen Liang, Ying Fu, Wei Huang, Jitao Sang

发表机构 * Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence（北京交通大数据挖掘与具身智能重点实验室）； Beijing Jiaotong University（北京交通大学）； Unisound AI Technology Co., Ltd.（Unisound AI科技有限公司）

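AI summary: This paper proposes Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy so that large language models give concise solutions to simple problems and explore more deeply on hard ones, balancing response length against accuracy.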

详情

AI中文摘要

基于熵的深度推理已成为提升大语言模型（LLMs）推理能力的有前景方向，但现有方法往往要么无差别地增加响应长度，要么以牺牲准确性为代价缩短响应。为更好地平衡这一权衡，我们引入了条件熵塑造（CES），一种框架，能够动态控制token级响应熵，使LLMs在简单问题上产生简洁的解决方案，同时在困难问题上鼓励更深入的探索。基于DAPO，CES使用token级熵作为不确定性信号，并应用一个条件双向策略：它在正确的推理路径上惩罚高熵的'分叉点'token以提高简洁性，并在错误路径上奖励它们以促进探索和错误纠正。我们将在DeepSeek-R1-Distill-7B上实现CES，并在12个数学基准上进行评估。CES在平均准确性上优于DAPO，同时减少响应长度，补充实验显示在较小的1.5B模型和领域外基准上也呈现出相似的趋势。

英文摘要

Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
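The conditional bidirectional policy can be pictured with a small sketch: compute the entropy of each token's predictive distribution, then penalize high-entropy tokens on correct paths and reward them on incorrect ones. The threshold, the coefficient, and the additive form of the shaping term are assumptions made for illustration; the abstract does not specify how CES integrates this signal into DAPO.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_shaping(entropies, is_correct, threshold=1.0, coeff=0.1):
    """Per-token shaping term for one sampled reasoning path.

    High-entropy tokens get a penalty when the path is correct
    (favoring conciseness) and a bonus when it is incorrect
    (favoring exploration). Other tokens are left untouched.
    """
    sign = -1.0 if is_correct else 1.0
    return [sign * coeff if h > threshold else 0.0 for h in entropies]


uniform_two = token_entropy([0.5, 0.5])      # equals ln 2
on_correct = entropy_shaping([0.2, 1.5], is_correct=True)
on_incorrect = entropy_shaping([0.2, 1.5], is_correct=False)
```

Only the second token crosses the threshold here, so it is the only one whose term changes sign with the correctness of the path.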


2605.19357 2026-05-20 cs.CL 版本更新

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

SciCustom: 一个用于大型语言模型科学能力定制评估的框架

Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang

发表机构 * State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University（多媒体信息处理国家重点实验室，计算机学院，PKU-Anker LLM实验室，北京大学）； Zhongguancun Academy（中关村学院）； IDEA ； Xidian University（西安电子科技大学）； Peking University（北京大学）； University of Washington（华盛顿大学）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）

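AI summary: This paper proposes SciCustom, a framework that builds custom benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities of LLMs. It requires neither expert annotation nor synthetic question generation and reveals fine-grained capability differences that standard benchmarks overlook.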

Comments Accepted to ACL 2026 Main Conference

详情

英文摘要

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.


2605.19351 2026-05-20 cs.MA cs.AI cs.CL 版本更新

PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies

PAVE：生成代理社会中的合法违规认知架构

Ahmad Yehia, Abduallah Mohamed, Kun Qian, Tianyi Wang, Jiseop Byeon, Omar Hassanin, Christian Claudel

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Meta Reality Labs（Meta现实实验室）； University of Calgary（卡尔加里大学）

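AI summary: This paper proposes PAVE, a cognitive architecture with four modules for handling how generative agents reason in scenarios that require breaking a rule. It targets four properties (legitimate violation, deference to authority, bounded scope, and restoration) while making decisions more structured and interpretable.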

Comments Preprint. 23 pages, 4 figures. Code and environment will be released upon publication

详情

Lost in Explanation: The Trade-off Between Plausibility and Faithfulness in Cross-Lingual Explanations

Somnath Banerjee, Pranav Jha, Rima Hazra, Animesh Mukherjee

发表机构 * Indian Institute of Technology Kharagpur（印度理工学院Kharagpur分校）； TCG Crest（TCG研究中心）； National University of Singapore（新加坡国立大学）

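AI summary: This paper studies the trade-off between plausibility and faithfulness in cross-lingual explanations. English-pivot explanations can agree better with human rationales, yet their evidence is less causally grounded in the model's prediction: comprehensiveness degrades by up to 5.7x relative to native-language conditions even while task accuracy stays stable, and English pivots fail to preserve pragmatic cues in socially nuanced classification. The authors recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics, and treating English rationales as communication summaries rather than faithful decision traces.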

详情

AI中文摘要

多语言部署的LLM通常通过英语解释对非英语输入进行审计。我们评估了提取性解释（模型识别输入token跨度作为证据并生成理由）并发现存在系统性的权衡：英语枢纽解释在与人类理由的一致性上表现更好，但其证据在模型预测中的因果基础较弱，这通过全面性和充分性来衡量。在三个任务、五种语言和两种多语言LLM家族中，我们发现英语解释经常产生流畅但松散锚定的理由，其全面性相对于原生语言条件下降高达5.7倍——即使在不同设置中任务准确性保持稳定。对于社会细微差别分类，英语枢纽也未能保留语义线索，从而降低忠实度和跨度一致性。我们建议在输入语言中审计解释，报告超越词法重叠的多方面忠实度指标，并将英语理由视为沟通摘要而非忠实的决策轨迹。

英文摘要

LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations (where the model identifies input token spans as evidence alongside a generated rationale) and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model's prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5 languages, and 2 multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions, even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.
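Comprehensiveness and sufficiency are commonly defined as changes in the predicted-label probability when the evidence span is removed or kept alone. The sketch below follows that common definition; whether the paper uses exactly this formulation is an assumption, and `toy_predict` is a made-up stand-in classifier rather than any real model.

```python
def toy_predict(tokens):
    """Stand-in classifier returning label probabilities (hypothetical)."""
    p = 0.9 if "great" in tokens else 0.5
    return {"positive": p, "negative": 1.0 - p}


def comprehensiveness(predict, tokens, span, label):
    """Probability drop when the evidence span is removed (higher = more faithful)."""
    without_span = [t for i, t in enumerate(tokens) if i not in span]
    return predict(tokens)[label] - predict(without_span)[label]


def sufficiency(predict, tokens, span, label):
    """Probability drop when only the evidence span is kept (lower = more faithful)."""
    only_span = [t for i, t in enumerate(tokens) if i in span]
    return predict(tokens)[label] - predict(only_span)[label]


tokens = ["the", "film", "is", "great"]
evidence = {3}  # index of the token cited as evidence
comp = comprehensiveness(toy_predict, tokens, evidence, "positive")
suff = sufficiency(toy_predict, tokens, evidence, "positive")
```

A loosely anchored rationale would cite tokens whose removal barely changes the prediction, which shows up as low comprehensiveness even when the cited span overlaps well with human rationales.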


2605.19270 2026-05-20 cs.CL 版本更新

DECOR: Auditing LLM Deception via Information Manipulation Theory

DECOR：通过信息操纵理论审计大语言模型的欺骗行为

Linyue Cai, Samuel Yeh, Jwala Dhamala, Rahul Gupta, Sharon Li

发表机构 * Department of Computer Sciences, University of Wisconsin-Madison（威斯康星大学麦迪逊分校计算机科学系）； Amazon（亚马逊）

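AI summary: This paper proposes DECOR, a multi-agent framework grounded in Information Manipulation Theory that audits LLM responses for strategic deception at the level of individual information units, and reports state-of-the-art results on both single-turn and multi-turn deception detection benchmarks.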

详情

AI中文摘要

大型语言模型可以通过微妙地操纵真实信息来欺骗，例如省略关键事实、转移焦点或模糊意义，使这种行为难以检测。现有的黑盒方法依赖于粗粒度判断，提供有限的可解释性，并未能确定哪些事实被扭曲以及如何扭曲。我们引入DECOR，一种基于信息操纵理论的多智能体框架，用于细粒度审计LLM响应中的战略性欺骗。DECOR将输入上下文分解为原子信息单元，并在四个操纵维度上对每个单元进行评分，生成可解释的操纵配置文件，并将其汇总为全局欺骗指数。我们全面评估了DECOR在单轮和多轮欺骗检测基准上，涵盖现实世界领域，并显示DECOR在两者上均达到最先进的性能，优于竞争基线。该框架在15种前沿模型上具有泛化能力，消融研究证实了每个关键设计组件的贡献。我们的发现表明，基于理论的细粒度信息操纵审计为LLM欺骗检测提供了一条有效且可解释的路径。

英文摘要

Large language models can deceive by subtly manipulating truthful information (omitting key facts, shifting focus, or obscuring meaning), making such behavior difficult to detect. Existing black-box methods rely on coarse-grained judgments, offering limited interpretability and failing to pinpoint which facts were distorted and how. We introduce DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses. DECOR decomposes input contexts into atomic informational units and scores each unit against the response across four dimensions of manipulation, producing interpretable manipulation profiles that are aggregated into a global deception index. We comprehensively evaluate DECOR on both single-turn and multi-turn deception detection benchmarks spanning real-world domains, and show that DECOR achieves state-of-the-art performance on both, outperforming competitive baselines. The framework generalizes across 15 frontier models, and ablation studies confirm the contribution of each key design component. Our findings demonstrate that fine-grained, theory-grounded auditing of information manipulation offers an effective and interpretable path for LLM deception detection.
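To make the profile-then-aggregate idea concrete, here is a small sketch. The four dimension names follow the classical formulation of Information Manipulation Theory (quantity, quality, relation, manner); the scoring scale, the max-per-unit rule, and the mean aggregation are assumptions for illustration, since the abstract does not say how DECOR computes its global deception index.

```python
DIMENSIONS = ("quantity", "quality", "relation", "manner")


def unit_severity(profile):
    """Worst manipulation score for one atomic information unit (assumed rule)."""
    return max(profile[d] for d in DIMENSIONS)


def deception_index(profiles):
    """Mean per-unit severity across all units (assumed aggregation)."""
    if not profiles:
        return 0.0
    return sum(unit_severity(p) for p in profiles) / len(profiles)


# Hypothetical scores in [0, 1] for two information units from the context.
profiles = [
    {"quantity": 0.8, "quality": 0.0, "relation": 0.2, "manner": 0.1},  # key fact omitted
    {"quantity": 0.0, "quality": 0.0, "relation": 0.0, "manner": 0.0},  # conveyed faithfully
]
index = deception_index(profiles)
```

Keeping the per-unit profiles alongside the index preserves the interpretability the abstract emphasizes: the index flags a response, and the profiles indicate which unit was distorted and along which dimension.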

2605.19234 2026-05-20 cs.CL cs.AI (updated version)

AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

Miguel A. Jiménez-Crespo, Stephanie Rodriguez, Alejandro Jaume Losa

Affiliations: Rutgers University, Dept. of Spanish; Rutgers University, Dept. of Spanish and Portuguese

AI summary: This paper examines the impact of AI on language access through ten semi-structured interviews with US language access managers working in healthcare, court, public service, and local government settings. It finds conditional optimism towards AI together with a strong commitment to human value and human oversight in AI implementations.

Comments 11 pages, 2 tables, Convergence Conference 2026

Abstract

The rapid emergence of AI technologies is reshaping translation practices and theory across the board. This paper deals with the impact of AI in language access. This area is characterized by the need to serve broad and diverse user populations, within a context where efficiency and access are shaped by legal mandates, ethical and commercial tensions, and safety concerns. This paper reports on the attitudes and perceptions of language access managers towards the AI and the human value in the AI age. Methodologically, this paper presents an analysis of a subset of a broader study on language access and technology, specifically a qualitative thematic analysis of ten semi-structured interviews with language access managers in the USA working in healthcare, court, public service and local government contexts. The results indicate that language access managers show conditional optimism towards the inevitable AI implementations, are strongly risk aware, and deeply committed to the human value and human oversight of AI implementations and output.

2605.19224 2026-05-20 cs.CL (updated version)

Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

Aditya R. Vaidya, Richard J. Antonello, Alexander G. Huth

Affiliations: Department of Computer Science; The University of Texas at Austin; Zuckerman Mind Brain Behavior Institute; Columbia University; Departments of Neuroscience and Statistics; University of California, Berkeley

AI summary: This study fine-tunes language encoding models on slow fMRI data and shows improved prediction of fast ECoG data, demonstrating the value of slow recordings for modeling fast brain signals.

Abstract

Neuroscientists have recently turned to intracranial brain recording methods, like electrocorticography (ECoG), for human experiments because of the fine spatial and temporal resolution that they afford. Models trained on this data, however, are fundamentally restricted by the patient populations that can receive the implants necessary for recording. We propose using non-invasive fMRI to bridge the gap in training data. Using spoken language representations fine-tuned on fMRI, we build encoding models of ECoG. These representations showed improved prediction performance in ECoG, even though the temporal resolution of fMRI is two orders of magnitude worse. Prediction improved in frequency bands well beyond what is directly measured in fMRI. Next, to test the procedure's generalization ability, we fine-tuned models on fMRI responses that were temporally downsampled by a factor of 2. Despite the loss in resolution, these models were able to predict fMRI and ECoG responses at levels comparable to the original fMRI-tuned models. Finally, we showed that ECoG performance steadily scales with the amount of fMRI-tuning data. Our results show that "slow" data like fMRI can be a valuable resource for building better models of "fast" brain data like ECoG. In the future, integrating across multiple recording methods may further improve performance in other applications, like decoding.

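2605.19220 2026-05-20 cs.CL cs.AI cs.LG (updated version)

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

Tiejin Chen, Longchao Da, Xiaoou Liu, Hua Wei

Affiliations: School of Computing and Augmented Intelligence, Arizona State University

AI summary: This position paper argues that current uncertainty quantification methods for LLMs are essentially unsupervised clustering algorithms that measure internal consistency rather than external correctness, so they cannot detect confident but wrong answers. It outlines a roadmap towards UQ methods whose confidence reliably reflects reality.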

Comments Accepted by ICML 2026 Position Paper Track

Abstract

Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.
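The abstract's central claim, that consistency-based uncertainty measures agreement among samples rather than correctness, can be illustrated with a small sketch. This is not the authors' code: `consistency_entropy` is a hypothetical helper, and clustering by normalized exact match is a simplifying assumption standing in for the semantic clustering used by real methods.

```python
import math
from collections import Counter

def consistency_entropy(answers):
    """Shannon entropy (in nats) over clusters of sampled answers.

    Assumption: answers are clustered by normalized exact match.
    """
    clusters = Counter(a.strip().lower() for a in answers)
    total = sum(clusters.values())
    return sum((n / total) * math.log(total / n) for n in clusters.values())

# Five samples that agree with each other but are wrong
# (the question was: what is the capital of Australia?).
stable_but_wrong = ["Sydney"] * 5
# Five samples that disagree, two of which are correct.
unstable = ["Canberra", "Sydney", "Melbourne", "Canberra", "Perth"]

print(consistency_entropy(stable_but_wrong))  # zero entropy: looks fully "confident"
print(consistency_entropy(unstable))          # higher entropy despite correct samples
```

The score never consults a reference answer, so the stable but wrong set receives the lowest possible uncertainty. That is the "confident hallucination" failure the abstract describes.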

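2605.19196 2026-05-20 cs.CL (updated version)

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan

Affiliations: Yale University; IBM Research

AI summary: This paper introduces REFLECT, a meta-evaluation benchmark for fine-grained failure detection by LLM judges in agentic environments. It shows that current LLM judges are unreliable on reasoning, tool-use, and report-quality failures, and offers guidance for building more reliable evaluation pipelines.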

Abstract

Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

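2605.19194 2026-05-20 cs.CL (updated version)

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

Rui Chu

Affiliations: Rui Chu

AI summary: This paper proposes MMoA, which adds an LSTM-based gating mechanism to the Mixture-of-Agents router to address the limited temporal and contextual awareness of static routing, yielding a more efficient multi-agent system.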

Abstract

The Mixture-of-Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies across aggregation layers. To address this limitation, we propose MMoA, a recurrent MoA architecture that integrates LSTM-based gating into the agent selection process. The recurrence router adaptively modulates agent contributions based on both current inputs and historical routing decisions, enabling more context-aware aggregation. We evaluate MMoA on standard instruction-following benchmarks, including AlpacaEval 2.0, MT-Bench, and Arena-Hard. The results show that MMoA achieves comparable accuracy to traditional MoA while reducing computational overhead by dynamically activating fewer agents. For example, on AlpacaEval 2.0, MMoA achieves a win rate of 58.0%, compared with 59.8% for MoA, while improving runtime efficiency by up to 4.6%. These results suggest that MMoA provides a scalable and efficient approach for adaptive multi-agent LLM systems.

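2605.19173 2026-05-20 cs.CL (updated version)

Prompting language influences diagnostic reasoning and accuracy of large language models

Adrien Bazoge, Josselin Corvellec, Sofiane Djillali Sid-Ahmed, Pierre-Antoine Gourraud

Affiliations: Data Clinic, Nantes University Hospital

AI summary: This study examines how prompting language affects the diagnostic reasoning and accuracy of LLMs by comparing English and French performance. Four of the five models performed better in English, while o3 showed no overall language effect, indicating that prompting language is a key determinant of clinical performance.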

Abstract

Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.

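2605.19149 2026-05-20 cs.CL cs.CR (updated version)

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

Rishi Jha, Harold Triedman, Arkaprabha Bhattacharya, Vitaly Shmatikov

Affiliations: Department of Computer Science

AI summary: This study examines accidental meltdowns, in which agents respond to benign environmental errors with unsafe behavior. In experiments, meltdowns of varying severity occurred in 64.7% of agent rollouts that encountered simulated errors, and existing reliability and safety benchmarks do not capture these behaviors.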

Comments 32 pages, 8 figures, 4 tables

Abstract

Agents operating with computer and Web use inevitably encounter errors: inaccessible webpages, missing files, local and remote misconfigurations, etc. These errors do not thwart agents based on state-of-the-art models. They helpfully continue to look for ways to complete their tasks. We introduce, characterize, and measure a new type of agent failure we call "accidental meltdown": unsafe or harmful behavior in response to a benign environmental error, in the absence of any adversarial inputs. Because meltdowns are not captured by the existing reliability or safety benchmarks, we develop a taxonomy of meltdown behaviors. We then implement an agent-agnostic infrastructure for injecting simulated local and remote errors into the rollout environment and use it to systematically evaluate agent systems powered by GPT, Grok, and Gemini. Our evaluation demonstrates that meltdowns (e.g., conducting unauthorized reconnaissance or subverting access control) of varying severity and success occur in 64.7% of agent rollouts that encounter simulated errors, spanning all combinations of agent system, backing model, and error type. In over half of these meltdowns, unsafe behaviors are not reported to the user. Comparing behaviors of the same agents with and without errors, we find that exploration in response to errors is correlated with unsafe and harmful behavior.

2605.19141 2026-05-20 cs.LG cs.AI cs.CL cs.CY cs.HC (updated version)

GRASP: Deterministic argument ranking in interaction graphs

Diganta Misra, Antonio Orvieto, Rediet Abebe, Volkan Cevher

Affiliations: MPI-IS Tübingen; Tübingen AI Center; ELLIS Institute Tübingen; Eberhard Karls Universität Tübingen; LIONS, EPFL

AI summary: This paper proposes GRASP, a framework that aggregates stable local interaction judgments into a global ranking to address inconsistent holistic verdicts from LLM judges. It measures structural sufficiency rather than persuasion or rhetorical appeal.

Comments Preprint

Abstract

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging, a common LLM-as-a-Judge practice where a model provides a global verdict on a debate, suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack-defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency, a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
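The abstract does not spell out GRASP's propagation operator, so the sketch below substitutes a classic attack-only gradual semantics in the style of the h-categoriser, where an argument's strength is 1 / (1 + the summed strength of its attackers). It only illustrates how local attack relations can be propagated deterministically into a global ranking. Support edges are omitted, and the function name and toy graph are assumptions.

```python
def h_categorizer(attackers, iterations=100, tol=1e-9):
    """Iterate s(a) = 1 / (1 + sum of s(b) over attackers b of a) to a fixed point."""
    strength = {a: 1.0 for a in attackers}
    for _ in range(iterations):
        updated = {
            a: 1.0 / (1.0 + sum(strength[b] for b in attackers[a]))
            for a in attackers
        }
        delta = max(abs(updated[a] - strength[a]) for a in attackers)
        strength = updated
        if delta < tol:
            break
    return strength

# attackers[x] lists the arguments that attack x.
# C attacks B and B attacks A, so C indirectly defends A.
graph = {"A": ["B"], "B": ["C"], "C": []}
scores = h_categorizer(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['C', 'A', 'B']
```

Because A's only attacker is itself weakened by C, A ends up ranked above B. A holistic single-score verdict has no explicit way to express that defense-aware effect.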

2605.19130 2026-05-20 cs.LG cs.AI cs.CL cs.CV (updated version)

Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation

Sixu Chen, Xiang Chen, Hongyao Yu, Jiaxin Hong, Hao Fang, Shuoyang Sun, Bin Chen, Shu-Tao Xia

Affiliations: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; South China University of Technology, Guangzhou, China; Harbin Institute of Technology, Shenzhen, Shenzhen, China

AI summary: This paper proposes Prompt2Fingerprint, which reformulates LLM fingerprinting as conditional parameter generation. A specialized generator maps a text description directly to low-rank parameter increments, enabling plug-and-play fingerprint injection without further fine-tuning and with substantially lower computational overhead.

Abstract

The widespread deployment and redistribution of large language models (LLMs) have made model provenance tracking a critical challenge. While existing LLM fingerprinting methods, particularly active approaches that embed identity signals via fine-tuning, achieve high accuracy and robustness, they suffer from significant scalability bottlenecks. These methods typically treat fingerprint injection as an independent, one-off optimization task rather than a reusable capability, necessitating separate, resource-intensive training for every new identity. This incurs prohibitive computational costs and deployment delays. To address this, we propose Prompt2Fingerprint (P2F), the first framework that reformulates fingerprinting as a conditional parameter generation task. By leveraging a specialized generator, P2F maps textual descriptions directly to low-rank parameter increments in a single forward pass, enabling plug-and-play LLM fingerprint injection without further model retraining. Our experiments demonstrate that P2F maintains high fingerprint accuracy, harmlessness, and robustness while significantly reducing computational overhead, offering a scalable and instant solution for LLM ownership management.

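2605.18445 2026-05-20 cs.CV cs.AI cs.CL cs.LG (updated version)

What's Holding Back Latent Visual Reasoning?

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

Affiliations: Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect; Carnegie Mellon University

AI summary: This study investigates how existing models use latent tokens and finds that they play a minimal causal role in the final prediction. The main causes are that oracle latent tokens in most training datasets carry little additional information and that latent tokens generated at inference deviate from their oracle representations. Progress requires more informative datasets and more precise latent token prediction.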

Abstract

Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning: First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypassing them at inference time. When fine-tuned on a diagnostic dataset, in which latent tokens provide sufficient support for the final prediction, we show that models can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.

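Structured Recurrent Mixers for Massively Parallel Sequence Generation

Benjamin L. Badger

Affiliations: IBM

AI summary: This paper proposes the Structured Recurrent Mixer architecture, which converts algebraically between a sequence-parallel representation at training time and a recurrent representation at inference, improving training efficiency, input information capacity, and inference throughput without specialized kernels or device-specific memory management.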

Abstract

Over the last two decades, language modeling has experienced a shift from the use of predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers inferenced on vLLM, increases characteristic of PyTorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.
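The idea of one model having both a parallel and a recurrent form can be illustrated with a generic scalar linear recurrence. This is not the SRM formulation, which the abstract does not give. The sketch only shows why such an algebraic equivalence is possible: the recurrence h_t = a * h_(t-1) + x_t unrolls into a weighted sum whose terms can all be computed independently.

```python
def recurrent_form(decay, inputs):
    """Inference-style pass: a single state updated token by token."""
    state, outputs = 0.0, []
    for x in inputs:
        state = decay * state + x
        outputs.append(state)
    return outputs

def parallel_form(decay, inputs):
    """Training-style pass: every output is an independent weighted sum."""
    return [
        sum(decay ** (t - k) * inputs[k] for k in range(t + 1))
        for t in range(len(inputs))
    ]

xs = [1.0, 2.0, 3.0, 4.0]
a = 0.5
rec = recurrent_form(a, xs)
par = parallel_form(a, xs)
print(rec)  # [1.0, 2.5, 4.25, 6.125]
print(par)  # [1.0, 2.5, 4.25, 6.125]
```

The recurrent form keeps one state value per sequence regardless of length, which matches the abstract's point about constant memory per sample. The parallel form exposes all positions at once for training.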

2605.07721 2026-05-20 cs.CL cs.AI cs.LG (updated version)

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

Affiliations: Qualcomm AI Research

AI summary: This paper proposes the Memory-Efficient Looped Transformer (MELT), which decouples reasoning depth from memory consumption to achieve constant-memory iterative reasoning while preserving LoopLM performance, using only a lightweight post-training procedure.

Comments 22 pages, 5 figures, 11 tables

Abstract

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
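A rough sketch of the memory argument follows. The abstract says only that the shared cache is updated "via a learnable gating mechanism"; the convex blend used here, the fixed gate values, and the function names are all assumptions made for illustration.

```python
def gated_update(cache, new_kv, gate):
    """Blend one loop's fresh KV entries into the shared cache, element-wise."""
    return [g * n + (1.0 - g) * c for c, n, g in zip(cache, new_kv, gate)]

def run_loops(kv_per_loop, gate):
    """Return the final shared cache and the number of values it stores."""
    cache = list(kv_per_loop[0])
    for new_kv in kv_per_loop[1:]:
        cache = gated_update(cache, new_kv, gate)
    return cache, len(cache)

loops = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy KV values for 3 reasoning loops
gate = [0.5, 0.5]                             # fixed here; learned in the paper's setting
cache, shared_size = run_loops(loops, gate)
per_loop_size = sum(len(kv) for kv in loops)  # storage if every loop kept its own cache

print(cache)                       # [0.75, 0.75]
print(shared_size, per_loop_size)  # 2 6
```

The shared cache stays the same size however many loops run, while a cache kept per loop grows linearly with reasoning depth.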

2605.06546 2026-05-20 cs.CL (updated version)

Efficient Pre-Training with Token Superposition

Bowen Peng, Théo Gigant, Jeffrey Quesnelle

Affiliations: Nous Research

AI summary: This paper proposes Token-Superposition Training (TST), which raises pre-training data throughput without modifying the model architecture, improving efficiency and performance in large-scale pre-training.

Comments 25 pages, 11 figures, 28 tables

Abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
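One plausible reading of the multi-hot cross-entropy (MCE) objective is sketched below. The abstract does not define MCE, so the uniform target over the tokens in a bag is an assumption, as are the function names and toy numbers.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]

def multi_hot_cross_entropy(logits, bag):
    """Average negative log-probability of every token id in the bag."""
    log_probs = log_softmax(logits)
    return -sum(log_probs[token] for token in bag) / len(bag)

logits = [2.0, 1.0, 0.0, -1.0]  # one prediction over a 4-token vocabulary
bag = [0, 1]                    # two contiguous tokens merged into one bag
loss = multi_hot_cross_entropy(logits, bag)
print(loss)
```

With a bag of size one this reduces to ordinary next-token cross-entropy, which is consistent with the abstract's description of a recovery phase that reverts to standard training.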

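2605.06501 2026-05-20 cs.LG cs.CL (updated version)

Cubit: Token Mixer with Kernel Ridge Regression

Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Liangchen Tan, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

AI summary: This paper proposes Cubit, a new architecture whose token mixer replaces Nadaraya-Watson regression with Kernel Ridge Regression. The authors argue that this gives a stronger mathematical foundation, and they report advantages in long-sequence modeling.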

Comments Tech Report

Abstract

Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.
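The two estimators contrasted in the abstract can be written in their standard textbook forms, shown below. The notation (query q, keys x_i, values v_i stacked as rows of V, ridge parameter λ) is chosen here for illustration, and Cubit's exact parameterization may differ.

```latex
% Nadaraya-Watson estimate: a kernel-weighted average of the values.
\[
\hat{y}_{\mathrm{NW}}(q) = \frac{\sum_{i=1}^{n} k(q, x_i)\, v_i}{\sum_{j=1}^{n} k(q, x_j)},
\qquad
k(q, x) = \exp\!\left(q^{\top} x / \sqrt{d}\right) \ \text{recovers softmax attention.}
\]

% Kernel Ridge Regression estimate: the closed-form solution normalizes
% through the inverse of the regularized kernel matrix.
\[
\hat{y}_{\mathrm{KRR}}(q) = k_q^{\top} \left(K + \lambda I\right)^{-1} V,
\qquad
K_{ij} = k(x_i, x_j), \quad (k_q)_i = k(q, x_i), \quad \lambda > 0 .
\]
```

Nadaraya-Watson normalizes each query's weights by their sum, whereas KRR normalizes through the inverse kernel matrix, which accounts for similarities among the keys themselves.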

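2605.00768 2026-05-20 cs.CL (updated version)

Unified Deployment-Aware Evaluation of Open Reasoning Language Models

Md Motaleb Hossen Manik, Ge Wang

Affiliations: Department of Computer Science, Rensselaer Polytechnic Institute; Department of Biomedical Engineering, Rensselaer Polytechnic Institute

AI summary: This paper presents a unified evaluation of seven open reasoning language model configurations on four benchmarks (ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1) under zero-shot, chain-of-thought (CoT), and few-shot CoT prompting. It analyzes accuracy, latency, and memory use, and frames model selection as a deployment-aware, multi-objective problem.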

Abstract

Open reasoning language models are often compared under mixed sample sizes, partially standardized prompts, and accuracy-centered summaries, which makes practical model selection difficult to interpret. We present a unified evaluation of seven open reasoning language model configurations across four benchmarks: ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1. We test zero-shot, chain-of-thought (CoT), and few-shot CoT prompting on the same 238-example subset for every model-dataset-strategy condition, yielding a complete 7 x 4 x 3 design with 84 conditions and 19,992 evaluated examples. Beyond accuracy, we report Wilson confidence intervals, latency, peak video random access memory (VRAM), weighted aggregate performance, Pareto-efficient operating points, prompt-sensitivity metrics, and compatibility diagnostics. Gemma-4-26B-A4B with zero-shot prompting achieves the highest weighted score at 0.794. Gemma-4-E4B remains close to the top across prompting settings while using substantially lower latency and memory, making it a strong practical operating point. Bootstrap and paired-permutation analyses show that the leading configurations are close enough that deployment tradeoffs remain important. We also find that prompting strategy changes model rankings rather than shifting all models uniformly. Benchmark-specific complementarity creates routing headroom, with an oracle task-aware selector reaching a weighted score of 0.825. Compatibility diagnostics show that some apparent failures, especially Phi-4-Reasoning on GSM8K, reflect robustness and interface-adherence problems under the shared evaluation pipeline. These results support a central claim: open-model evaluation should be framed as a deployment-aware, multi-objective operating-point problem rather than as a single-score leaderboard exercise.
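The design size quoted above and the Wilson interval the authors report can be reproduced with a short sketch. The formula is the standard Wilson score interval. The count of 189 correct answers is a made-up example, not a figure from the paper, and `z=1.96` assumes a 95% interval.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Design size stated in the abstract: 7 models x 4 benchmarks x 3 strategies.
conditions = 7 * 4 * 3
total_examples = conditions * 238   # 238 examples per condition

# Hypothetical condition: 189 of 238 answers correct.
low, high = wilson_interval(successes=189, n=238)
```

With only 238 examples per condition the interval is several points wide, which is consistent with the abstract's remark that the leading configurations are statistically close.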


2604.02784 2026-05-20 cs.CV cs.CL updated version

EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

Affiliations * The University of Electro-Communications

AI summary: This paper proposes EnsemHalDet, a hallucination detection framework for vision-language models that ensembles detectors trained on multiple internal representations, making multimodal hallucination detection more robust.

Abstract

Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.


2603.25620 2026-05-20 cs.CL updated version

PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Hwajung Hong, Edward Choi

Affiliations * KAIST

AI summary: This paper proposes PICon, a framework that evaluates the consistency of persona agents through logically chained multi-turn questioning. Even systems previously considered highly consistent fall short of the human baseline on all three dimensions, revealing contradictions and evasive responses.

Comments 20 pages, 6 figures

Abstract

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/


2603.22453 2026-05-20 cs.CL cs.SI updated version

XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception

Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru, Jinkyung Katie Park, Feng Luo, Long Cheng

Affiliations * School of Computing, Clemson University

AI summary: This paper studies automated Community Notes generation for image-based contextual deception, introduces a real-world dataset, XNote, and benchmarks frontier large vision language models and a commercial tool on deception detection and note generation.

Abstract

Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, their reliance on human contributors limits both timeliness and scalability. In this work, we study the automated Community Notes generation task for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), automated Community Notes generation requires producing concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to the scarcity of datasets that support this task. To address this gap, we curate a real-world dataset, XNote, comprising X posts with associated Community Notes and external contexts, along with annotations of topics and deceptive factors. We further benchmark a range of frontier large vision language models (LVLMs) on XNote, evaluating their performance on both deception detection and note generation tasks. We also compare against an end-to-end approach, SNIFFER, and a commercial tool, GPT-5. Our results highlight the challenges in automated Community Notes generation, underscoring the need for improved methods and metrics tailored for this task.


2603.17839 2026-05-20 cs.CL cs.AI cs.LG updated version

How do LLMs Compute Verbal Confidence

Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Veličković

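Affiliations * Google DeepMind

AI summary: This study examines how large language models internally generate verbal confidence ratings. Experiments indicate that confidence is cached after the answer is generated and then retrieved for output, shedding light on the mechanism of model self-evaluation.

Abstract

Verbal confidence, that is, prompting LLMs to state their confidence as a number or category, is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such ratings remains unclear. We address two questions: first, whether confidence is computed on the fly when it is requested, or computed automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents: token log-probabilities, or a richer evaluation of answer quality. Focusing on Gemma 3 27B (on TriviaQA, BigMath, and MMLU), Qwen 2.5 7B, and the reasoning model Magistral Small 24B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before they appear at the verbalization position. Attention blocking pinpoints the information flow: confidence is gathered from the answer tokens, cached at the first post-answer position, and then used for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance beyond token log-probabilities, pointing to a richer answer-quality evaluation rather than a simple fluency readout. These findings indicate that verbal confidence reflects automatic, sophisticated self-evaluation rather than post-hoc reconstruction, with implications for understanding metacognition in LLMs and for improving calibration.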

Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh

Affiliations * National Taiwan University; IBM Research; Academia Sinica

AI summary: Through extensive hyperparameter searches, this paper re-evaluates nine representative LoRA variants alongside vanilla LoRA on mathematical reasoning, commonsense reasoning, code generation, and instruction following, and finds that different LoRA methods favor different learning rate ranges. Once learning rates are properly tuned, all methods reach similar peak performance, suggesting that vanilla LoRA remains a competitive baseline and that improvements under a single training configuration may not reflect consistent methodological advantages.

Comments Project page: https://github.com/yuang-lee/lr-matters-lora

Abstract

Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies, architectural modifications, and optimization adjustments, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate nine representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches over learning rate, batch size, rank, and training duration. Across tasks spanning mathematical reasoning, commonsense reasoning, code generation, and instruction following at diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.
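The classical result the last sentence alludes to can be illustrated on a one-dimensional quadratic: plain gradient descent on f(x) = 0.5 * c * x^2 is stable only when the learning rate stays below 2 / c, where c plays the role of the largest Hessian eigenvalue. This is a textbook toy, not the paper's analysis, and the curvature value is hypothetical.

```python
def gradient_descent_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2; return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x   # the gradient of f at x is curvature * x
    return abs(x)

curvature = 10.0                  # hypothetical largest Hessian eigenvalue
threshold = 2.0 / curvature       # classical stability limit for gradient descent

stable = gradient_descent_on_quadratic(curvature, lr=0.9 * threshold)
unstable = gradient_descent_on_quadratic(curvature, lr=1.1 * threshold)
```

Under this reading, two methods whose loss surfaces differ in largest curvature would be expected to tolerate different learning rate ranges, which is why a single shared learning rate can make one of them look worse than it is.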


2601.16823 2026-05-20 cs.CL cs.AI updated version

Disentangling generalization and memorization in large language models using chess

Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsaecker

Affiliations * Technical University of Munich

AI summary: Using chess as a testbed, this paper studies the distinction between generalization and memorization in large language models. Performance drops markedly when relevant priors are sparse, suggesting limited systematic generalization and a need for mechanisms beyond scale to achieve robustness.

Abstract

Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall or genuine reasoning ability. We introduce chess as a controlled testbed aimed at disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in density of relevant priors, ranging from common states solvable by memorization to completely novel ones requiring generalization. Crucially, our approach achieves this distinction without requiring explicit knowledge of the models' training data. Applying this taxonomy, we combine a longitudinal analysis of the GPT lineage with a rigorous evaluation of contemporary models, including Claude Opus and Gemini. Our analysis reveals a steep gradient: performance consistently degrades as the density of relevant priors decreases. Notably, for tasks with few relevant priors, base model performance regresses to the random-play baseline. While newer models improve, progress slows significantly for tasks with sparse priors. Furthermore, while reasoning-augmented inference improves performance, its relative marginal benefit per token decreases in the absence of relevant priors. These results suggest limitations in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust performance when deprived of relevant priors.


2601.12369 2026-05-20 cs.CL updated version

Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhui Wang, Zhenghao Xiang, Qiyuan Peng, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Maxm Pan, Tao Gui, Qi Zhang, Xuanjing Huang

Affiliations * Fudan University; Hunyuan Team, Tencent

AI summary: This paper introduces the TaxoBench benchmark, which evaluates how well Deep Research Agents retrieve and organize papers, and finds bottlenecks on both the capability and alignment sides.

Abstract

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to Sem-Path 28-29%, well below 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
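The retrieval side of the benchmark uses standard set-based Recall/Precision/F1. A minimal sketch of those three scores follows. The paper identifiers are placeholders, and the function name is invented for this example rather than taken from the TaxoBench code.

```python
def retrieval_scores(retrieved, relevant):
    """Set-based precision, recall, and F1 of retrieved items against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Placeholder identifiers: the agent returns four papers, the expert survey cites five.
agent_papers = ["p1", "p2", "p3", "p4"]
expert_papers = ["p2", "p4", "p5", "p6", "p7"]
precision, recall, f1 = retrieval_scores(agent_papers, expert_papers)
```

Recall against the expert-cited set is the quantity behind the 20.92% figure in the abstract: it counts how many of the survey's papers the agent found, regardless of how many extra papers it returned.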


2601.05437 2026-05-20 cs.CL cs.AI updated version

Self-Filtered Distillation with LLM-Generated Trust Indicators for Reliable Patent Classification

Yongmin Yoo, Xu Zhang, Longbing Cao

Affiliations * Frontier AI Research Centre, School of Computing, Faculty of Science and Engineering, Macquarie University

AI summary: This paper proposes Self-Filtered Distillation, which improves the reliability of patent classification by treating LLM-generated rationales as trust indicators rather than ground-truth labels. Experiments on the USPTO-2M dataset show up to 38.7% relative improvement in Macro-F1.

Abstract

Organizing large-scale patent corpora according to classification schemes is a core information management task that determines the accuracy and efficiency of prior art retrieval, technology knowledge discovery, and intellectual property decision-making. Recent approaches distill natural language rationales generated by large language models (LLMs) into compact student models, yet logical errors, label mismatches, and taxonomy misalignments inherent in these rationales are indiscriminately absorbed during training, undermining classification reliability and propagating errors throughout downstream information processes. Rather than correcting such errors post-hoc, we propose Self-Filtered Distillation (SFD), which embeds quality assurance directly into the learning process by reinterpreting LLM-generated rationales as trust indicators rather than ground-truth supervision. SFD integrates three unsupervised signals into a unified trust score that dynamically modulates each training instance's contribution: Self-Consistency, which quantifies agreement among independently generated rationales; Class Entailment Alignment, which evaluates semantic coherence between a rationale and its assigned CPC class definition; and LLM Agreement Scoring, which assesses external plausibility through an independent verifier. On the USPTO-2M benchmark comprising over two million patents, SFD achieves up to 38.7% relative improvement in Macro-F1 across four student architectures, and the strong correlation between trust scores and expert judgments (r = 0.685) confirms that the framework provides not only accurate predictions but also decomposable confidence semantics that enable auditable and self-documenting classification outcomes for large-scale patent knowledge organization.
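One way to picture "a unified trust score that modulates each training instance's contribution" is a weighted combination of the three signals used as a per-example loss weight. The abstract does not say how the signals are combined, so the equal weights, the normalisation, and both function names below are assumptions made purely for illustration.

```python
def trust_score(self_consistency, entailment, llm_agreement,
                weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three signals in [0, 1] into one score (equal weights are an assumption)."""
    signals = (self_consistency, entailment, llm_agreement)
    return sum(w * s for w, s in zip(weights, signals))

def trust_weighted_loss(losses, trusts):
    """Average per-example losses, weighting each example by its trust score."""
    total = sum(trusts)
    if total == 0:
        return 0.0
    return sum(l * t for l, t in zip(losses, trusts)) / total

# Two hypothetical training examples: one well-supported rationale, one doubtful one.
trusts = [trust_score(0.9, 0.8, 1.0), trust_score(0.2, 0.1, 0.0)]
losses = [0.5, 2.0]
weighted = trust_weighted_loss(losses, trusts)
plain_mean = sum(losses) / len(losses)
```

In this sketch the doubtful example still contributes, but far less than it would under a plain average, which matches the stated goal of filtering noisy rationales during training rather than correcting them afterwards.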


2509.25448 2026-05-20 cs.CR cs.CL updated version

QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim

Affiliations * Seoul National University; Sungkyunkwan University

AI summary: This paper proposes QWHA, which improves quantization-aware parameter-efficient fine-tuning by integrating Fourier-related-transform-based adapters into quantized models with the Walsh-Hadamard Transform as the kernel, reducing quantization error and lowering computational cost.

Comments 25 pages, 9 figures, 14 tables

Journal ref: ICLR 2026 Poster

Abstract

The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
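For readers unfamiliar with the transform kernel named above, here is a minimal sketch of the fast Walsh-Hadamard transform itself. It shows only the generic transform, which needs additions and subtractions but no multiplications. It is not the QWHA adapter, and the input vector is an arbitrary example.

```python
def fwht(vec):
    """Unnormalised fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart within blocks of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

x = [1, 0, 1, 0, 0, 1, 1, 0]
y = fwht(x)
# Applying the transform twice returns the input scaled by its length.
x_back = [v // len(x) for v in fwht(y)]
```

The transform runs in O(n log n) and uses only sign flips and sums, which is one plausible reason a WHT kernel is cheaper than other Fourier-related transforms. The abstract itself only states that the design reduces computational cost.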


2507.15698 2026-05-20 cs.CL cs.AI cs.LG updated version

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Weiwen Liu, Haoxuan Li, Yong Yu, Weinan Zhang, Mengyue Yang

Affiliations * Shanghai Jiao Tong University; Huawei Noah’s Ark Lab; Peking University; University of Bristol

AI summary: This paper proposes CoLD, a unified, counterfactually guided framework for removing length bias from process reward models, aiming to improve the accuracy and conciseness of multi-step reasoning while also improving downstream reinforcement learning performance and cross-domain generalization.

Abstract

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD improves accuracy in step selection and encourages more concise, logically valid reasoning. Furthermore, it consistently improves downstream RL performance and generalizes across domains by mitigating length bias, demonstrating CoLD's strong generalization capability.


2507.03122 2026-05-20 cs.IR cs.CL cs.LG updated version

Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings

Binbin Xu, Gérard Dray

Affiliations * EuroMov Digital Health in Motion, Univ. Montpellier, IMT Mines Ales

AI summary: This paper studies the feasibility and performance of federated learning for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. It proposes a lightweight, scalable pipeline that combines frozen text embeddings with simple multilayer perceptron classifiers, offering a privacy-preserving and deployment-efficient alternative for distributed healthcare settings.

Comments 20 pages

Abstract

This study investigates the feasibility and performance of federated learning (FL) for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. Unlike previous approaches that rely on centralized training or fine-tuned large language models, we propose a lightweight and scalable pipeline combining frozen text embeddings with simple multilayer perceptron (MLP) classifiers. This design offers a privacy-preserving and deployment-efficient alternative for clinical NLP applications, particularly suited to distributed healthcare settings. Extensive experiments across both centralized and federated configurations were conducted, testing six publicly available embedding models from the Massive Text Embedding Benchmark leaderboard and three MLP classifier architectures under two medical coding systems (ICD-9 and ICD-10). Additionally, ablation studies over ten random stratified splits assess performance stability. Results show that embedding quality substantially outweighs classifier complexity in determining predictive performance, and that federated learning can closely match centralized results in idealized conditions. While the models are orders of magnitude smaller than state-of-the-art architectures and achieve competitive micro and macro F1 scores, limitations remain, including the lack of end-to-end training and the simplified FL assumptions. Nevertheless, this work demonstrates a viable way toward scalable, privacy-conscious medical coding systems and offers a step toward future research into federated, domain-adaptive clinical AI.


2506.14148 2026-05-20 cs.SD cs.CL eess.AS updated version

Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

Long-Vu Hoang, Tuan Nguyen, Tran Huy Dat

AI summary: This paper presents a non-invasive object classification approach based on acoustic scattering, demonstrated through a case study on hair assessment. Acoustic stimuli are emitted and the signals scattered from head-with-hair-sample objects are captured, then classified by hair type and moisture with deep-learning-based sound classification. Four families of methods are benchmarked: fully supervised deep learning, embedding-based classification, supervised foundation model fine-tuning, and self-supervised model fine-tuning. The best strategy, fine-tuning all parameters of a self-supervised model, reaches nearly 90% classification accuracy.

Comments This paper has been retracted by the authors. Due to miscommunication, the authorship is incomplete and missing early contributions

Abstract

This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries.


2505.04588 2026-05-20 cs.CL updated version

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, Jingren Zhou

Affiliations * Tongyi Lab, Alibaba Group

AI summary: This study proposes ZeroSearch, a framework that improves the search capability of large language models through simulated search, addressing the uncontrolled document quality and high API costs of real search engines. Experiments show strong results across models of different parameter sizes.

Abstract

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.


2503.19877 2026-05-20 cs.CL updated version

Scaling Evaluation-time Compute with Reasoning Models as Evaluators

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Carolin Lawrence, Kiril Gashteovski, Julia Hockenmaier, Graham Neubig, Sean Welleck

Affiliations * CMU; UIUC; KAIST AI; NEC Laboratories Europe; Ss. Cyril and Methodius University of Skopje

AI summary: This paper explores improving language models' evaluation capability by spending more compute at evaluation time, using reasoning models as evaluators that assess both the response as a whole and each step within it.

Comments ACL 2026 Findings

Abstract

As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
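The reranking step described at the end of the abstract can be sketched as best-of-N selection under two scoring rules. Everything concrete here is assumed for illustration: the candidate scores stand in for what a reasoning-model evaluator would return, and aggregating step scores with `min` is one hypothetical choice, not necessarily the paper's.

```python
def outcome_score(candidate):
    """Score the response as a whole (outcome evaluation)."""
    return candidate["outcome"]

def process_score(candidate):
    """Score each step separately and keep the weakest one (an assumed aggregation)."""
    return min(candidate["steps"])

def best_of_n(candidates, score):
    """Pick the candidate the evaluator rates highest."""
    return max(candidates, key=score)

# Placeholder evaluator outputs for two sampled solutions to the same problem.
candidates = [
    {"answer": "A", "outcome": 0.9, "steps": [0.9, 0.2, 0.9]},
    {"answer": "B", "outcome": 0.8, "steps": [0.8, 0.7, 0.9]},
]
picked_by_outcome = best_of_n(candidates, outcome_score)["answer"]
picked_by_process = best_of_n(candidates, process_score)["answer"]
```

The two rules disagree on this toy input because candidate A contains one weak step that the whole-response score glosses over, which is the kind of case where step-level evaluation could change the ranking.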


2410.18856 2026-05-20 cs.AI cs.CL (version update)

Learn-by-Wire Training-Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

Anis Radianis

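Affiliations: Qluon Inc.

AI summary: This paper proposes Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that improves the stability and efficiency of large language model training under stress. It applies bounded control on top of AdamW while keeping the training objective fixed.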

Abstract

Modern language-model training is increasingly exposed to instability, degraded runs, and wasted compute, especially under aggressive learning-rate, scale, and runtime-stress conditions. This paper introduces Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that operates above AdamW. Rather than replacing the optimizer update rule, LBW-Guard observes training telemetry, interprets instability-sensitive regimes, and applies bounded control to optimizer execution while preserving fixed training objectives. We evaluate LBW-Guard in a Qwen2.5-centered stress-and-robustness suite using WikiText-103, with Qwen2.5-7B as the empirical anchor, model-size comparisons against Qwen2.5-3B and Qwen2.5-14B, learning-rate stress tests, gradient-clipping baselines, and a no-LoRA TinyLlama-1B full-parameter sanity check. In the 7B reference setting, LBW-Guard reduces final perplexity from 13.21 to 10.74, an 18.7% improvement, while reducing end-to-end time from 392.54s to 357.02s, a 1.10x speedup. Under stronger learning-rate stress, AdamW degrades to 1885.24 final perplexity at LR=3e-3 and 659.76 at LR=1e-3, whereas LBW-Guard remains trainable at 11.57 and 10.33, respectively. Gradient-clipping baselines do not reproduce this effect. These results support a scoped systems conclusion that stability-sensitive LLM training can benefit from a governance plane above the optimizer. LBW-Guard provides evidence that bounded runtime control can preserve productive compute under stress while remaining distinct from optimizer replacement and local gradient suppression.


2605.18904 2026-05-20 cs.LG cs.AI cs.CL (version update)

Dynamic Model Merging Made Slim

Guodong Du, Wanyu Lin

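Affiliations: The Hong Kong Polytechnic University

AI summary: This paper proposes DiDi-Merging, which uses differentiable rank allocation to balance shared and expert parameters for more efficient dynamic model merging, matching or surpassing existing methods with substantially fewer parameters.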

Abstract

Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy-efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring more than 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.


2605.18864 2026-05-20 cs.LG cs.AI cs.CL (version update)

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

Chanuk Lee, Minki Kang, Sung Ju Hwang

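Affiliations: KAIST

AI summary: This paper proposes SAGE, a framework that reshapes the reverse-KL anchor distribution to enable controllable expansion of empirical support, improving both pass@1 and pass@k on mathematical reasoning benchmarks.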

Comments Preprint

Abstract

Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.
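The anchoring effect attributed to reverse-KL regularization can be seen on a toy discrete distribution. The sketch below is only an illustration with made-up probabilities over three "reasoning modes"; it does not implement SAGE or its guide function q(x,y).

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical reference model: the third reasoning mode is rare.
reference = [0.49, 0.49, 0.02]
# Hypothetical policy that explores the rare mode much more often.
explorer = [0.30, 0.30, 0.40]

reverse_kl = kl(explorer, reference)  # KL(policy || reference)
forward_kl = kl(reference, explorer)  # KL(reference || policy)
```

In this toy case the reverse-KL penalty is the larger of the two, because the policy is charged for every unit of mass it places where the reference assigns little probability. That is the sense in which the regularizer anchors the policy to the reference.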


2605.18852 2026-05-20 cs.LG cs.AI cs.CL (version update)

Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking

Qinwu Xu, Zhuoheng Li, Jessie Salas

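Affiliations: Meta AI

AI summary: This paper proposes a multi-stage framework that combines curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols to make checkpoint selection for multimodal large language models robust. It also highlights data quality, especially OCR readability, as a key determinant of evaluation validity.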

Abstract

Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.
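The abstract does not give the scoring formula, so the following is a hedged sketch of what a percentile-based, tail-penalizing score with subsampling-based confidence could look like. The nearest-rank percentile, the 10th-percentile tail term, the `tail_weight`, and the 80% subsample fraction are all assumptions for illustration.

```python
import math
import random
from typing import Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def tail_aware_score(scores: Sequence[float], low_q: float = 10.0, tail_weight: float = 0.5) -> float:
    """Blend the median with a low percentile so tail failures are penalized."""
    return (1 - tail_weight) * percentile(scores, 50) + tail_weight * percentile(scores, low_q)


def subsample_interval(scores: Sequence[float], rounds: int = 200,
                       fraction: float = 0.8, seed: int = 0) -> Tuple[float, float]:
    """Rough 95% interval of the score over random subsamples (without replacement)."""
    rng = random.Random(seed)
    k = max(1, int(len(scores) * fraction))
    estimates = [tail_aware_score(rng.sample(list(scores), k)) for _ in range(rounds)]
    return percentile(estimates, 2.5), percentile(estimates, 97.5)


# Made-up per-example judge scores for two checkpoints.
steady = [0.8] * 10            # mean 0.80, no failures
spiky = [0.9] * 9 + [0.0]      # mean 0.81, one hard failure
```

Here the spiky checkpoint has the slightly higher mean, yet the tail-aware score ranks the steady checkpoint first, which is the behavior the abstract describes as penalizing tail failures.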


2605.18824 2026-05-20 cs.LG cs.AI cs.CL (version update)

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki, Afshin Cheraghi, Ali Kore, Shayaan Mehdi, Elham Dolatabadi, Arash Afkanpour

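Affiliations: Vector Institute; York University

AI summary: This paper proposes an automated benchmark generation framework that produces evaluation problems with broad coverage, rich metadata, and robustness to contamination, enabling more comprehensive evaluation of foundation models.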

Abstract

Evaluation of foundation models often relies on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.


2605.18812 2026-05-20 cs.LG cs.CL cs.IR (version update)

PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

Varun Kotte

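Affiliations: Independent Researcher

AI summary: This paper proposes PASC, a pipeline-aware conformal prediction method for multi-stage NLP and LLM pipelines that provides joint coverage guarantees across all stages.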

Abstract

Modern NLP and LLM systems are pipelines: named entity recognition (NER) -> entity disambiguation (NED) -> entity typing, retrieval-augmented generation (retriever -> reader), and agentic chains of planner -> tool -> critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER -> NED -> entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.
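The reduction described here, from K stages to one scalar problem, can be sketched with standard split conformal prediction applied to the per-example maximum score. This is a minimal illustration, assuming the per-stage nonconformity scores are already on a comparable scale; the calibration numbers are made up and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def joint_max_scores(stage_scores: Sequence[Sequence[float]]) -> List[float]:
    """stage_scores[i][k] is the nonconformity of calibration example i at stage k."""
    return [max(row) for row in stage_scores]


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]


def all_stages_covered(test_scores: Sequence[float], threshold: float) -> bool:
    """max_k s_k <= t holds exactly when every stage satisfies s_k <= t."""
    return max(test_scores) <= threshold


# Made-up scores for four calibration examples in a three-stage pipeline.
calibration = [
    [0.1, 0.2, 0.3],
    [0.5, 0.1, 0.2],
    [0.2, 0.2, 0.9],
    [0.4, 0.3, 0.1],
]
threshold = conformal_threshold(joint_max_scores(calibration), alpha=0.5)
```

Because one threshold on the maximum bounds every stage at once, a single quantile computation is enough, which matches the abstract's remark about cost relative to a per-stage Bonferroni correction.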


2605.18808 2026-05-20 cs.LG cs.AI cs.CL (version update)

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

Joao Paulo Cavalcante Presa, Savio Salvarino Teles de Oliveira

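Affiliations: Federal University of Goias

AI summary: Using sparse autoencoders, this paper studies the compositional architecture of literary primitives in instruction-tuned LLMs, identifies four feature classes, and examines cross-architectural SAE features that express self, style, and affect.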

Comments 36 pages, 6 figures

Abstract

We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of $10^{-3}$ and the expected number of two-seed false-positive cells across the catalog is negligible, so the observed coverage is not consistent with chance. A cross-architectural asymmetry sits in the strict-versus-soft judge contrast: on the same generations, judges agree more often on Llama outputs than on Gemma outputs because Llama outputs name the target affect more directly while Gemma outputs evoke it through scene and imagery. Both architectures contain self-features that serve simultaneously as register markers and as emotion emitters, including a single most-RLHF-loaded self-feature per architecture that intensifies the institutional Helper-AI persona at one operating regime and produces affect-categorizable output at the same calibrated coefficient. Methodologically, the paper presents a three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) with documented anti-patterns; the total compute is single-GPU and about 15 minutes per emotion-feature discovery cycle.


2605.18799 2026-05-20 cs.LG cs.AI cs.CL (version update)

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai

Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; National University of Singapore; Chinese University of Hong Kong; University of Oxford; Tsinghua University

AI summary: This paper proposes ReCrit, a transition-aware reinforcement learning framework for scientific critic reasoning that rewards correction and robustness, penalizes sycophancy, and improves critic accuracy.

Abstract

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit.
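The four-quadrant decomposition can be written down directly from the two correctness flags. In the sketch below the quadrant mapping follows the abstract's description, while the numeric reward values (including the sign and size of the boundary term) are hypothetical placeholders, not the weights used by ReCrit.

```python
from typing import Dict


def quadrant(initial_correct: bool, critic_correct: bool) -> str:
    """Classify an Initial-to-Critic transition by correctness before and after criticism."""
    if initial_correct and critic_correct:
        return "robustness"   # kept a correct answer under criticism
    if initial_correct and not critic_correct:
        return "sycophancy"   # abandoned a correct answer
    if not initial_correct and critic_correct:
        return "correction"   # fixed a wrong answer
    return "boundary"         # wrong before and after


# Hypothetical reward weights for illustration only.
QUADRANT_REWARD: Dict[str, float] = {
    "correction": 1.0,
    "robustness": 1.0,
    "sycophancy": -1.0,
    "boundary": -0.1,  # weak signal for persistent errors (assumed value)
}


def transition_reward(initial_correct: bool, critic_correct: bool) -> float:
    return QUADRANT_REWARD[quadrant(initial_correct, critic_correct)]
```

A final-answer reward would score robustness and correction identically and sycophancy and boundary identically; keeping the initial flag is what lets the reward separate the four cases.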


2605.18796 2026-05-20 cs.LG cs.CL (version update)

UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

Varun Kotte

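Affiliations: Independent Researcher

AI summary: This paper proposes UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) while reducing ECE from 0.12 to 0.03.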

Comments 9 pages, 2 figures, 4 tables. Code: https://github.com/varunkotte6/ucci

Abstract

LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per-workload threshold tuning. We present UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold. All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
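The calibration step can be sketched with a small pool-adjacent-violators (PAV) implementation of isotonic regression. This is a generic illustration, not the released UCCI code: the calibration data are made up, the step-function lookup is one simple choice, and the escalation threshold is passed in rather than chosen by the constrained cost minimization the abstract describes.

```python
import bisect
from typing import List, Sequence, Tuple


def isotonic_fit(values: Sequence[float]) -> List[float]:
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks: List[List[float]] = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted: List[float] = []
    for s, c in blocks:
        fitted.extend([s / c] * int(c))
    return fitted


def calibrate(uncertainties: Sequence[float], errors: Sequence[int]) -> List[Tuple[float, float]]:
    """Map uncertainty to error probability, non-decreasing in uncertainty."""
    pairs = sorted(zip(uncertainties, errors))
    probs = isotonic_fit([e for _, e in pairs])
    return [(u, p) for (u, _), p in zip(pairs, probs)]


def error_probability(table: Sequence[Tuple[float, float]], u: float) -> float:
    """Step-function lookup: value at the largest calibration point <= u."""
    xs = [x for x, _ in table]
    i = bisect.bisect_right(xs, u) - 1
    return table[max(i, 0)][1]


def should_escalate(table: Sequence[Tuple[float, float]], u: float, threshold: float) -> bool:
    """Send the query to the large model when the calibrated error risk is too high."""
    return error_probability(table, u) > threshold


# Made-up calibration set: 1 means the small model was wrong on that query.
table = calibrate([0.1, 0.2, 0.3, 0.4, 0.5], [0, 1, 0, 1, 1])
```

Because the calibrated score is monotone in the raw uncertainty, a single threshold on it defines a simple routing policy.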


2605.18792 2026-05-20 cs.IR cs.CL (version update)

Trust or Abstain? A Self-Aware RAG Approach

Xi Zhu, Ziqi Wang, Kai Mei, Wujiang Xu, Minghao Guo, Bangji Yang, Jiajun Fan, Dimitris N. Metaxas

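Affiliations: Rutgers University; University of Illinois Urbana-Champaign

AI summary: This paper proposes SABER, a self-aware approach to RAG that builds a knowledge-conflict benchmark and introduces a self-aware belief estimator, improving accuracy and reliability under knowledge conflicts and balancing risk against coverage when abstaining.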

Abstract

Retrieval-augmented generation (RAG) improves large language models (LLMs) by incorporating external evidence, but it also introduces knowledge conflicts when retrieved contextual knowledge (CK) and parametric knowledge (PK) disagree or are both unreliable. Existing approaches mainly coordinate which source to use, without explicitly asking whether each answer path is correct. We argue that faithful RAG requires LLM self-awareness, namely the ability to recognize the limits of its own knowledge and reasoning. To ground this problem, we construct a model-specific, ground-truth-aligned knowledge-conflict benchmark by evaluating LLM backbones on PK-only and CK-conditioned answer paths over approximately 69K query-context instances per backbone, drawn from five conflict-QA datasets. We then introduce SABER, a Self-Aware Belief Estimator for RAG that requires no LLM fine-tuning. SABER combines a self-prior with PK-side and CK-side conditional reasoning representations from multi-trace inference, then estimates reliability beliefs with two lightweight predictors to drive a 4-cell decision over trust PK, trust CK, trust either, or abstain. Across four LLM backbones, SABER improves end-to-end accuracy and conflict-specific faithfulness over ten inference-time and fine-tuning baselines, with the largest gains on conflict-heavy datasets. Under abstention, SABER's risk-coverage curve Pareto-dominates every prompt-based abstainer, providing a tunable balance between coverage and answer risk. Our code is available at https://github.com/xizhu1022/SABER.
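The 4-cell decision can be illustrated as a rule over two reliability beliefs, one for the parametric (PK) answer path and one for the context-conditioned (CK) path. The shared `threshold` rule below is an assumption for illustration; the abstract does not state how SABER turns the two beliefs into a decision.

```python
def decide(pk_belief: float, ck_belief: float, threshold: float = 0.5) -> str:
    """Pick one of four cells from two reliability beliefs in [0, 1].

    Hypothetical rule: a path is trusted when its belief reaches the threshold.
    """
    pk_ok = pk_belief >= threshold
    ck_ok = ck_belief >= threshold
    if pk_ok and ck_ok:
        return "trust either"
    if pk_ok:
        return "trust PK"
    if ck_ok:
        return "trust CK"
    return "abstain"


# Raising the threshold trades coverage for lower answer risk.
lenient = decide(0.6, 0.4, threshold=0.5)
strict = decide(0.6, 0.4, threshold=0.7)
```

Sweeping the threshold over a labeled set is one way to trace a risk-coverage curve of the kind the abstract compares against prompt-based abstainers.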


2605.18772 2026-05-20 cs.IR cs.AI cs.CL (version update)

Improving Retrieval-Augmented Generation without Taxonomy-based Error Categorization

Gongbo Zhang, Yifan Peng, Chunhua Weng

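Affiliations: Columbia University; Weill Cornell Medicine

AI summary: This paper proposes RePAIR, which maps flawed RAG outputs directly to error-mitigating action plans without relying on fine-grained error taxonomies or explicit critic supervision, improving retrieval-augmented generation performance.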

Abstract

Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language model (LLM) outputs by grounding generation in external knowledge. Recent agentic RAG systems extend this paradigm with critic agents that evaluate model responses and iteratively refine outputs. However, most prior work implicitly assumes reliable critic feedback and focuses on planning strategies, while paying limited attention to the robustness of the error-correction process itself, which can be impacted by misaligned error categories and ineffective or incorrect corrections. Here, we hypothesize that RAG performance can be improved without explicit error categorization. We propose RePAIR, a response-action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies or explicit critic supervision. Across multiple benchmarks, RePAIR consistently improves agentic RAG performance.


2605.18769 2026-05-20 cs.IR cs.AI cs.CL (version update)

ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation

Gibson Nkhata, Uttamasha Anjally Oyshi, Quan Mai, Susan Gauch

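Affiliations: University of Arkansas; Walmart Inc.

AI summary: ClusterRAG clusters users by their profile documents and uses collaborative signals from similar users to improve personalized generation for the current user, retrieving at both the cluster and document levels for efficient personalized retrieval-augmented generation.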

Comments 17 pages, 2 figures, to be published in the proceedings of ACL 2026

Abstract

Personalized Retrieval-Augmented Generation (RAG) relies on accurately selecting user-relevant documents. In practice, existing RAG approaches often suffer from high retrieval costs and overlook that collaborative signals from similar users can enhance personalized generation for the current user. We propose ClusterRAG, a cluster-based collaborative filtering approach to personalized retrieval-augmented generation. ClusterRAG represents users through their profile documents, organizes users into semantically coherent clusters using density-based clustering, and performs retrieval at both the cluster and document levels via cluster-level similarity and fine-grained ranking. Extensive experiments on the LaMP benchmark demonstrate that jointly leveraging the target user's profile and profiles from top similar users consistently yields the best performance across diverse tasks. Further analysis shows that ClusterRAG integrates seamlessly with different dense retrievers and rankers, and remains effective when paired with both fine-tuned and zero-shot language models.


2605.18766 2026-05-20 cs.IR cs.AI cs.CL (version update)

Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

Taehee Kim, Seungbin Yang, Jihwan Kim, Jaegul Choo

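Affiliations: KAIST AI

AI summary: This paper proposes an adaptive table retrieval method that adjusts the number of retrieved tables to each query. An adaptive thresholding mechanism and a sliding-window reranking algorithm address the limitations of fixed top-k retrieval and improve retrieval and downstream task performance.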

Comments ACL 2026 Findings

Abstract

Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text-to-SQL. Existing table retrieval approaches select a pre-determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding-window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top-k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at https://github.com/sbY99/Adaptive-Table-Retrieval.
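The contrast with fixed top-k retrieval can be made concrete with a small sketch. The relative-score cutoff in `select_tables` is a hypothetical stand-in for the paper's adaptive thresholding mechanism, which the abstract does not specify, and `sliding_window_rerank` shows the generic back-to-front windowing pattern with a caller-supplied reranker. Scores are assumed to be positive similarities.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_tables(scored: Sequence[Tuple[str, float]], ratio: float = 0.8) -> List[str]:
    """Keep every table scoring within `ratio` of the best one (assumed rule)."""
    if not scored:
        return []
    top = max(score for _, score in scored)
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ranked if score >= ratio * top]


def sliding_window_rerank(items: Sequence[T], rerank_window: Callable[[List[T]], List[T]],
                          window: int = 4, stride: int = 2) -> List[T]:
    """Rerank overlapping windows from the bottom of the list to the top.

    Each pass only sees `window` items, so a large corpus never has to be
    scored by the reranker in a single call.
    """
    ranked = list(items)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        ranked[start:end] = rerank_window(ranked[start:end])
        if start == 0:
            break
        end -= stride
    return ranked


# Made-up similarity scores: the number of tables kept varies per query.
two_needed = select_tables([("orders", 0.90), ("customers", 0.85), ("logs", 0.30)])
one_needed = select_tables([("orders", 0.90), ("customers", 0.40)])
```

With the default window and stride, strong candidates near the bottom of the initial ranking can move up across overlapping windows, though the final order is not guaranteed to be fully sorted.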
