arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.27366 2026-05-27 cs.AI cs.CL cs.LG cs.MA 版本更新

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

MUSE-Autoskill: 通过技能创建、记忆、管理和评估实现自我进化智能体

Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang

发表机构 * ByteDance Inc.（字节跳动公司）； Rochester Institute of Technology（罗切斯特理工学院）

AI总结提出MUSE-Autoskill框架，通过统一的技能生命周期（创建、记忆、管理、评估和优化）使LLM智能体持续提升任务解决能力，实验表明生命周期管理的技能可提高任务成功率、效率、复用性和跨智能体迁移。

Comments 30 pages, 8 figures, 13 tables, working in progress

详情

AI中文摘要

大型语言模型（LLM）智能体依赖可复用技能来解决复杂任务。然而，现有的技能创建方法将技能视为孤立和静态的工件，限制了其可复用性、可靠性和长期改进。我们提出了MUSE-Autoskill智能体（记忆利用技能进化），一个以技能为中心的智能体框架，让智能体通过统一的技能生命周期（创建、记忆、管理、评估和优化）持续提升任务解决能力。我们的框架使智能体能够按需创建技能，跨任务存储和复用技能，高效组织和选择技能，并通过单元测试和运行时反馈评估技能以进行持续优化。我们进一步引入了技能级记忆，为每个技能跨任务积累经验，从而实现更有效的复用和随时间适应。在SkillsBench上的实验提供了初步证据，表明生命周期管理的技能可以提高任务成功率、效率、复用性和跨智能体迁移，突出了将技能视为长期存在、具有经验意识和可测试资产的重要性。

英文摘要

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

URL PDF HTML ☆

赞 0 踩 0

2605.27358 2026-05-27 cs.LG cs.AI cs.CL 版本更新

MobileMoE: Scaling On-Device Mixture of Experts

MobileMoE: 扩展设备端混合专家模型

Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi

发表机构 * Meta AI

AI总结针对设备端部署，提出MobileMoE系列子十亿参数MoE语言模型，通过联合优化架构和四阶段训练，在14个基准上匹配或超越领先的密集模型和MoE模型，并在智能手机上实现高效推理。

详情

AI中文摘要

混合专家（MoE）已成为千亿参数语言模型的事实标准架构，但其在十亿以下规模用于设备端部署的优势尚未得到充分探索。为弥补这一差距，我们提出MobileMoE，一系列设备端MoE语言模型，具有子十亿激活参数（0.3-0.9B激活，1.3-5.3B总参数），为设备端LLM建立了新的帕累托前沿。我们首先制定了一个设备端MoE缩放定律，在移动内存和计算约束下联合优化MoE架构，识别出一个设备端最佳点——具有细粒度和共享专家的适度稀疏性——同时实现内存和计算最优。基于推导出的架构，我们采用四阶段方案训练MobileMoE，包括预训练、中期训练、指令微调和量化感知训练，全部使用开源数据集。在14个基准上，MobileMoE匹配或超越领先的设备端密集LLM，推理FLOPs减少2-4倍，并以最多60%的参数匹配或超越最先进的MoE模型OLMoE-1B-7B。为弥合移动部署的最后一步，我们提供了首个在商用智能手机上的高效MoE推理，并进行了全面的设备端性能分析。在相当的INT4权重内存下，MobileMoE-S的预填充速度比密集基线MobileLLM-Pro快1.8-3.8倍，解码速度快2.2-3.4倍。

英文摘要

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.

URL PDF HTML ☆

赞 0 踩 0

2605.27354 2026-05-27 cs.LG cs.AI cs.CL 版本更新

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

利用稀疏自编码器的模型内部状态指导LLM后训练数据工程

Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

发表机构 * Tsinghua University（清华大学）

AI总结提出SAERL框架，通过稀疏自编码器提取模型内部状态，建模数据多样性、难度和质量，用于强化学习数据工程，提升准确率并减少训练步数。

详情

AI中文摘要

模型内部状态编码了大型语言模型（LLM）处理其训练数据时的丰富信息；然而，后训练数据工程主要依赖外部信号，忽略了模型内部状态中丰富的内在信号。我们提出了SAERL，一个用于LLM强化学习（RL）的数据工程框架。它使用稀疏自编码器（SAE）这一先进的机制可解释性工具提取的模型内部状态，建模三种内在数据属性：多样性、难度和质量。每个属性支撑一个具体的数据工程操作：用于批次多样性控制的SAE空间聚类与适度批次混合、用于从易到难课程排序的难度代理，以及用于数据过滤的质量探针。SAERL在Qwen2.5-Math-1.5B上相比原始GRPO平均准确率提升3.00%，并以减少20%的训练步数达到目标准确率，在模型规模和RL算法上均有一致收益。实验表明，SAE在不同模型家族和规模间有效迁移，作为一种轻量级且可重用的数据工程工具。这些结果证明，模型内部状态是后训练数据工程中强大且实用的信号来源。

英文摘要

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.

URL PDF HTML ☆

赞 0 踩 0

2605.27345 2026-05-27 cs.CL 版本更新

MATCHA: Matching Text via Contrastive Semantic Alignment

MATCHA: 通过对比语义对齐进行文本匹配

Siran Li, Ece Sena Etoglu, Carsten Eickhoff, Seyed Ali Bahrainian

发表机构 * University of Tübingen（图宾根大学）

AI总结针对现有评估指标无法区分语义矛盾的问题，提出MATCHA指标，通过双视角对比学习同时奖励语义一致性和惩罚矛盾，在多个基准上优于ROUGE和BERTScore。

详情

AI中文摘要

可靠的评估对于理解大型语言模型（LLM）的性能至关重要，但当今常用的指标，即词元重叠分数（如ROUGE）和基于嵌入的度量（如BERTScore），常常误判文档的语义相似性。我们的研究表明，词元重叠指标和基于嵌入的指标通常会将几乎相同的分数分配给直接相互矛盾的文本，从而可能掩盖根本性错误。我们引入了MATCHA，一种自动度量指标，它同时奖励与参考的语义一致性并惩罚矛盾。MATCHA采用双视角方法，衡量（i）与黄金文本的接近程度和（ii）与对抗性生成的反事实矛盾的距离。在八个公开基准上，MATCHA在问答、图像字幕生成、自然语言推理、摘要和语义文本相似性任务中，与人工标注相比，优于流行指标。在TruthfulQA数据集（即没有训练集的数据集，其中没有基于嵌入的指标可以局部训练）上，这种在根据参考匹配文本方面的改进相对于ROUGE-L达到18.38%，相对于BERTScore达到20.82%。定量比较和定性人工评估都证实了MATCHA的有效性和正确性，并揭示了现有指标的根本弱点。与23个嵌入模型（包括最先进的模型）作为类似BERTScore的度量相比，MATCHA在仅基于参考区分正确和错误陈述方面仍然是最准确的。我们的代码和指标公开可用（https://github.com/Siran-Li/MATCHA）。

英文摘要

Reliable evaluation is essential for understanding large language model (LLM) performance, yet today's go-to metrics, namely token-overlap scores (e.g., ROUGE) and embedding-based measures (e.g., BERTScore), often misjudge semantic similarity of documents. Our study shows that both token-overlap metrics and embedding-based metrics routinely assign nearly identical scores to texts that directly contradict each other, thereby potentially masking fundamental errors. We introduce MATCHA, an automatic metric that jointly rewards semantic agreement with a reference and penalizes contradictions. MATCHA employs a dual-view perspective that measures (i) proximity to the gold text and (ii) distance from an adversarially generated counterfactual contradiction. In eight public benchmarks, MATCHA outperforms popular metrics, compared with human annotations on question-answering, image caption generation, natural language inference, summarization, and semantic textual similarity tasks. On the TruthfulQA dataset (i.e., a dataset without a training set, where no embedding-based metrics could locally train on), this improvement in terms of matching texts with a reference reaches 18.38% over ROUGE-L and 20.82% over BERTScore. Both quantitative comparison and qualitative human assessments confirm the efficacy and validity of MATCHA and uncover fundamental weaknesses in pre-existing metrics. Compared with 23 embedding models, including top state-of-the-art ones, used as a metric similar to BERTScore, MATCHA remains the most accurate in distinguishing correct from incorrect statements solely based on a reference. Our code and metric are publicly available (https://github.com/Siran-Li/MATCHA).

URL PDF HTML ☆

赞 0 踩 0

2605.27338 2026-05-27 cs.AI cs.CC cs.CL cs.LO 版本更新

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

带有弱约束的2-ASP(Q)程序：复杂性与高效实现

Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca

发表机构 * University of Calabria（卡拉布里亚大学）； Rende Italy（意大利伦德）

AI总结本文研究了带有两个量词和弱约束的ASP(Q)程序（2-ASP(Q)^w）的复杂性，并提出基于CEGAR技术的Casper系统实现策略，实验证明其有效性。

2605.27333 2026-05-27 cs.CL 版本更新

FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

FinHarness：面向金融LLM代理的内联生命周期安全约束框架

Haoxuan Jia, Yang Liu, Bin Chong, Yingguang Yang, Yancheng Chen, Jiayu Liang, Qian Li, Hanning Lu, Kefu Xu, Hao Zheng, Chongyang Zhang, Hao Peng, Philip S. Yu

发表机构 * Peking University（北京大学）； Nanyang Technological University（南洋理工大学）； Tsinghua University（清华大学）； University of Science and Technology of China（中国科学技术大学）； University of Chinese Academy of Sciences（中国科学院大学）； Soochow University（苏州大学）； Beijing University of Posts and Telecommunications（北京邮电大学）； University of Leeds（利兹大学）； Fullive Innovation (Beijing) AI Technology Co., Ltd.（全维创新（北京）人工智能科技有限公司）； Beihang University（北京航空航天大学）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）

AI总结针对金融LLM代理在阻止提示诱导的未授权操作与批准合法多步骤业务流程之间的冲突，提出FinHarness内联安全约束框架，通过查询监控、工具监控和级联模块实现逐步骤风险评估与自适应验证，显著降低攻击成功率并保持良性批准率。

详情

AI中文摘要

探究大型语言模型的文化意识：跨文化美学文体学案例研究

Jiashuo Wang, Fenggang Yu, Jian Wang, Chak Tou Leong, Xiaoyu Shen, Chunpu Xu, Jiawen Duan, Wenjie Li, Johan F. Hoorn

发表机构 * Department of Computing, Hong Kong Polytechnic University（香港理工大学计算机系）； Institute of Digital Twin, Eastern Institute of Technology, Ningbo（宁波东部技术研究所数字孪生研究所）； Department of Language Science and Technology, Hong Kong Polytechnic University（香港理工大学语言科学与技术系）； School of Design, Hong Kong Polytechnic University（香港理工大学设计学院）； Research Institute for Quantum Technology, Hong Kong Polytechnic University（香港理工大学量子技术研究所）； Department of Communication Science, Vrije Universiteit Amsterdam（阿姆斯特丹自由大学传播科学系）

AI总结通过构建C4STYLI基准（包含香港和中国大陆的高度风格化翻译电影片名和广告标语），评估大型语言模型在跨文化美学文体识别和生成方面的能力，发现模型依赖表层语言信息而非风格结构，对香港特定风格结构敏感度有限。

Comments IJCAI 2026 Human-Centred AI track

详情

AI中文摘要

大型语言模型（LLMs）越来越多地部署在多样化的文化背景中，但它们掌握美学文体学（即策略性地使用语言以唤起文化共鸣）的能力仍未得到充分探索。我们整理了C4STYLI，一个包含来自香港和中国大陆的高度风格化翻译电影片名和广告标语的基准，通过行为识别和生产能力的视角评估LLMs。广泛评估表明，LLMs在风格识别上与人类不同，且这种识别能力在不同文本领域有所变化。此外，LLMs中的风格识别和生成性能并不一致。为了进一步检查LLMs在风格识别中是否真正捕捉到风格信息，我们使用逻辑回归探针进行了结构消融。我们发现，在香港背景下，LLMs中的风格识别主要依赖于表层语言信息而非风格结构。这表明对香港特定风格结构的敏感度有限。

英文摘要

Large Language Models (LLMs) are increasingly deployed in diverse cultural contexts, yet their ability to master aesthetic stylistics, i.e., the strategic use of language to evoke cultural resonance, remains underexplored. We curate C4STYLI, a benchmark of highly stylized translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland, to evaluate LLMs via the lens of behavioral recognition and productive competence. Extensive evaluations show that LLMs differ from humans in stylistic recognition, and this recognition ability varies across text domains. In addition, stylistic recognition and generation performance in LLMs are not consistently aligned. To further examine whether LLMs genuinely capture stylistic information in stylistic recognition, we conduct structural ablation with logistic regression probes. We find that, in the Hong Kong setting, stylistic recognition in LLMs relies primarily on surface-level linguistic information rather than stylistic structure. This suggests limited sensitivity to Hong Kong-specific stylistic structure.

URL PDF HTML ☆

赞 0 踩 0

2605.27294 2026-05-27 cs.CL cs.IR 版本更新

Separating Semantic Competition from Context Length in RAG Reading

在RAG阅读中区分语义竞争与上下文长度

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Rohit Alekar, Cien Zhang, Svetlana Karslioglu, Akash Vishwakarma

发表机构 * Meta Platforms（Meta平台）

AI总结通过匹配对照实验，分离出检索增强生成中阅读器的语义竞争效应，证明性能下降部分源于竞争而非仅上下文长度。

Comments 4 pages, 1 figure, 2 tables

详情

AI中文摘要

检索增强生成（RAG）系统即使在检索到正确段落时也可能错误回答。模型仍需阅读检索到的段落，并在看似相关的段落中识别出包含答案的那一个。这种段落阅读模型称为阅读器。它的失败仅仅是因为上下文更长，还是因为其他段落与正确段落真正竞争？我们引入并展示了一种RAG阅读的匹配对照协议：保持段落数量和长度固定，但将强竞争段落替换为不那么竞争的实段。我们在SQuAD上对两个紧凑开放模型应用此对照。这种替换部分恢复了性能，对F1和答案包含的影响最强。对于Phi-2，它恢复了+6.0 EM点、+7.0答案包含点和+0.057 F1。对于Qwen2.5-1.5B，它恢复了+4.5 EM点、+9.0答案包含点和+0.068 F1。为了跟踪性能如何随竞争段落积累而变化，我们还报告了保留曲线，并在曲线未交叉半保留时用右删失半衰期进行总结。这些结果共同表明，该协议分离了与上下文长度不同的竞争效应，尽管该效应对F1和答案包含比精确匹配更清晰，并且也随片段长度变化。

英文摘要

Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. This passage-reading model is called the reader. Does it fail simply because the context is longer or because the other passages genuinely compete with the correct one? We introduce and demonstrate a matched-control protocol for RAG reading: we keep the number and length of passages fixed, but replace hard competitors with less competitive real passages. We apply this control across two compact open models on SQuAD. This replacement partially restores performance, with the strongest effects on F1 and answer inclusion. For Phi-2, this recovers +6.0 EM points, +7.0 answer-inclusion points, and +0.057 F1. For Qwen2.5-1.5B, it recovers +4.5 EM points, +9.0 answer-inclusion points, and +0.068 F1. To track how performance changes as competitors accumulate, we also report retention curves and summarize them with a right-censored half-life when the curves do not cross half-retention. Together, these results show the protocol isolates a competition effect distinct from context length, though the effect is clearer for F1 and answer inclusion than for exact match, and also varies with snippet length.

URL PDF HTML ☆

赞 0 踩 0

2605.27288 2026-05-27 cs.CL cs.AI cs.LG 版本更新

ENPMR-Bench: 情感支持代理的主动记忆检索基准

Xing Fu, Yulin Hu, Mengtong Ji, Haozhen Li, Yixin Sun, Weixiang Zhao, Yanyan Zhao, Bing Qin

发表机构 * Research Center for Social Computing and Interactive Robotics（社会计算与交互机器人研究中心）

AI总结提出ENPMR-Bench基准，基于马斯洛需求层次评估情感支持代理主动推断用户潜在情感需求并检索适当记忆的能力，实验表明当前检索范式存在显著缺陷。

详情

AI中文摘要

记忆增强的语言代理越来越多地部署在情感支持等情感应用中，在这些应用中，理解和响应用户的潜在情感需求至关重要。然而，现有研究通常将记忆视为事实检索的工具，忽视了其在塑造用户情感体验中的作用。在这项工作中，我们引入了ENPMR-Bench，一个用于评估情感需求感知的主动记忆检索（ENPMR）的基准，这是一种核心能力，使代理能够推断用户的潜在情感需求并主动检索适当的记忆以支持共情交互。基于马斯洛需求层次，ENPMR-Bench包括超过1,800个记忆增强对话，并定义了情感需求与支持性记忆类型之间的结构化映射。实验结果表明，当前的检索范式，包括基于嵌入和LLM驱动的方法，都存在显著缺陷，共情得分明显落后于黄金记忆条件。虽然思维链提示在一定程度上改善了推断的情感需求与检索记忆之间的一致性，但性能差距仍然显著。总之，这些发现揭示了当前代理的关键局限性，并指出了通过需求敏感的记忆检索推进个性化情感支持的方向。

英文摘要

Memory-augmented language agents are increasingly deployed in affective applications such as emotional support, where understanding and responding to users' latent emotional needs is critical. However, existing research often treats memory as a tool for factual retrieval, overlooking its role in shaping users' emotional experiences. In this work, we introduce ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval (ENPMR), a core capability that enables agents to infer users' latent emotional needs and proactively retrieve appropriate memories to support empathetic interaction. Grounded in Maslow's hierarchy of needs, ENPMR-Bench includes over 1,800 memory-augmented dialogues and defines structured mappings between emotional needs and supportive memory types. Experimental results demonstrate that current retrieval paradigms, including both embedding-based and LLM-driven approaches, exhibit substantial deficiencies, with empathy scores significantly lagging behind golden memory conditions. While chain-of-thought prompting improves the alignment between inferred emotional needs and retrieved memories to some extent, a notable performance gap remains. Together, these findings reveal critical limitations in current agents and outline directions for advancing personalized emotional support through need-sensitive memory retrieval.

URL PDF HTML ☆

赞 0 踩 0

2605.27239 2026-05-27 cs.CL 版本更新

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora

时间同步性预测情感语料库中的标注质量

Idris Abdulmumin, Mokgadi Penelope Matloga, Tadesse Destaw Belay, Botshelo Kondowe, Letlhogonolo Mohleleng, Hareaipha Nkopo Letsoalo, Shamsuddeen Hassan Muhammad, Vukosi Marivate

发表机构 * Data Science for Social Impact, University of Pretoria（社会影响力数据科学，南非普里特里亚大学）； Instituto Politécnico Nacional（技术学院国家）； Department of African Languages, University of Pretoria（非洲语言系，南非普里特里亚大学）； Imperial College London（伦敦帝国理工学院）

AI总结通过分析Setswana情感数据集，发现标注者间一致性随时间下降的主要原因是时间同步性，即同时标注的样本一致性高，而间隔较长的标注一致性低。

详情

AI中文摘要

当标注活动跨越数周或数月且标注者池较小时，标注质量难以维持。我们提出了一个Setswana情感数据集，包含3,565条推文，由三名母语标注者在八个批次中标注，并考察了标注者间一致性（IAA）随时间下降的原因。尽管总体Randolph自由边际Kappa为$κ= 0.76$，属于“优秀”，但每批次$κ$在整个标注任务中下降了超过32个百分点。通过六项针对性分析，我们发现：(i) 标签混淆集中在负面/中性边界；(ii) 两名标注者表现出与自动驾驶标注一致的运行长度漂移；(iii) $κ$的主要预测因子是时间同步性：一分钟内标注的推文达到$κ= 0.98$，而相隔超过一天标注的推文仅达到$κ= 0.65$。标注速度和推文级语言特征与$κ$无显著关联。我们评估了三种开放多语言编码器和专有模型（GPT-5和Gemini）在三类情感分类任务上的表现；微调相比预训练基线提升了29到43个宏F1分数，其中GPT-5少样本学习总体领先（62.2宏F1）。我们发布了数据集、每条标注的时间戳和分析代码，以支持未来非洲语言NLP资源的可重复质量审计。

英文摘要

Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Despite an aggregate Randolph's free-marginal Kappa of $κ= 0.76$, "excellent," per-batch $κ$ falls by more than 32 points across the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of $κ$ is temporal simultaneity: tweets labeled within one minute achieve $κ= 0.98$, while those labeled more than a day apart reach only $κ= 0.65$. Annotation speed and tweet-level linguistic features show no meaningful association with $κ$. We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.

URL PDF HTML ☆

赞 0 踩 0

2605.27220 2026-05-27 cs.CL cs.IR 版本更新

The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

覆盖幻觉：从检索前路由失败到生产RAG系统中的检索后级联

Zafar Hussain, Kristoffer Nielbo

发表机构 * Aarhus University（奥胡斯大学）

AI总结本文通过丹麦国家百科全书的案例研究，发现合成查询高估了LLM增强的需求（覆盖幻觉），并提出一种检索后级联策略，按成本递增顺序执行工作流，仅在无结果时升级到LLM增强，从而在无需训练开销的情况下提升质量并降低延迟。

详情

AI中文摘要

在现代RAG流水线中，HyDE和查询扩展等查询增强方法被应用于每个查询，导致大量的LLM推理成本和端到端延迟增加。这种开销在实际生产流量中的经验依据仍未得到充分探索。我们以丹麦国家百科全书为案例研究，评估了来自生产流量和合成条件的20,000个查询-工作流对上的五种检索工作流。在该系统中，合成查询表明超过90%的查询需要LLM增强才能实现高检索覆盖率。然而，在我们的生产延迟策略下，只有27.8%的真实用户查询需要LLM增强。我们将这种差距称为覆盖幻觉，并将其归因于合成查询与真实查询分布之间的结构性不匹配。检索前路由无法解决这一差距，因为LLM增强的需求只有在搜索索引后才能揭示，这一结果得到了我们对四种机器学习范式的评估的证实。这种仅从查询无法检测到的覆盖差距，促使我们采用检索后级联策略，该策略按成本递增顺序运行工作流，仅当某一步骤未返回文档时才升级到LLM增强。该级联策略完全无需训练开销或辅助服务基础设施，在质量上比Always-HyDE提高了+0.140综合总体分数，延迟降低了31.8%，并且72.2%的真实用户查询无需LLM增强即可得到服务。

英文摘要

In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval coverage. However, under our production deferral policy, only 27.8% of real user queries need LLM augmentation. We call this gap the Coverage Illusion and attribute it to a structural mismatch between synthetic and real query distributions. Pre-retrieval routing cannot resolve this gap, as the need for LLM augmentation is only revealed after searching the index, a result confirmed by our evaluation of four machine learning paradigms. The coverage gap, undetectable from the query alone, motivates a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when a step returns no documents. Operating entirely without training overhead or secondary serving infrastructure, the cascade improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and serves 72.2% of real user queries without LLM augmentation.

URL PDF HTML ☆

赞 0 踩 0

2605.27204 2026-05-27 cs.CL cs.IR 版本更新

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

GraphReview: 基于LLM的图消息传递的科学论文评估

Pujun Zheng, Wanying Ren, Jiacheng Yao, Guoxiu He, Star X. Zhao

发表机构 * School of Economics and Management, East China Normal University（东华师范大学经济管理学院）； Institute of Big Data, Fudan University（复旦大学大数据研究院）

AI总结提出GraphReview框架，通过图消息传递整合论文内在质量、同期关联和历时关联，利用LLM生成节点先验和边比较证据，结合个性化PageRank进行质量排序、决策预测和审稿生成，在决策和排序指标上平均提升29.7%。

详情

AI中文摘要

科学论文评估通常不仅涉及评估稿件本身，还需要将其与同期研究和先前文献联系起来。然而，现有的基于LLM的方法通常分别建模这些信号，缺乏跨论文传播审稿证据的统一机制。我们提出$ extbf{GraphReview}$，一个基于图的LLM框架，将论文评估形式化为在语义论文图上进行审稿信号的消息传递。该图联合捕捉内在质量、同期论文之间的同步链接以及指向先前工作的历时链接。LLM用于估计节点级质量先验，并通过成对论文比较生成边级比较证据，而个性化PageRank整合审稿信号用于质量排序、决策预测和审稿生成。为了生成更高质量的图证据，我们提出了奖励诱导的最大似然目标来训练LLM骨干网络。实验表明，GraphReview始终优于最强基线，在决策和排序指标上平均提升29.7%，包括准确率提升23.7%，Spearman's $ρ$提升57.6%。它还生成更高质量的审稿文本，并在不同时间段和会议场所中有效泛化。代码可在https://github.com/ECNU-Text-Computing/GraphReview获取。

英文摘要

Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose $\textbf{GraphReview}$, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's $ρ$. It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at https://github.com/ECNU-Text-Computing/GraphReview.

URL PDF HTML ☆

赞 0 踩 0

2605.27195 2026-05-27 cs.CL 版本更新

超越二元：认知评分层级中的语音表征

Serli Kopar, Roshan Prakash Rane, Christian Mychajliw, Lydia Federmann, Gerhard Eschweiler, Daniela Berg, Sam Gijsen, Paula Andrea Perez-Toro, Kerstin Ritter

发表机构 * 1 Hertie Institute for AI in Brain Health, University of Tübingen, Tübingen, Germany 2 Tübingen AI Center, University of Tübingen, Tübingen, Germany 3 Department of Psychology, Humboldt-Universität zu Berlin 4 Geriatric Center, Tübingen University Hospital, Tübingen, Germany 5 Tübingen Center for Mental Health (TüCMH), Department of Psychiatry ； Psychotherapy, Tübingen University Hospital, Tübingen, Germany 6 German Center for Mental Health (DZPG), Partner Site Tübingen, Tübingen, Germany 7 Department of Neurology, University Medical Center Schleswig-Holstein ； Kiel University, Kiel, Germany 8 Center for Neurology, University Hospital Tübingen ； Hertie Institute for Clinical Brain Research, Tübingen, Germany 9 Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany 10 Charit\'e--Universit\"atsmedizin, Department of Psychiatry

AI总结本研究利用5,754份德语神经心理学评估录音，比较手工声学特征与自监督学习嵌入在轻度认知障碍认知评估层级（任务、领域、全局）中的表现，发现任务约束与评估层级之间的关联。

详情

AI中文摘要

本研究考察了轻度认知障碍中语音表征与认知评估层级结构之间的关系。利用5,754份德语神经心理学评估录音，我们在三个评分层级（任务、领域和全局）上评估了六项认知任务。我们比较了手工声学特征与自监督学习（SSL）嵌入。结果表明，尽管SSL表示在较低层级通常优于手工特征，但这种趋势在MCI分类中发生逆转。此外，任务特定约束影响性能：响应自由度较大的任务随着层级增加表现出性能稀释，表明“专家”表示，而高度结构化任务的性能向更高层级增加，表明“通才”表示。这些发现揭示了自动临床语音分析中任务约束与评估层级之间的联系。

英文摘要

This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence performance: tasks with greater response freedom exhibit performance dilution as hierarchical levels increase, suggesting ``specialist'' representations, whereas the performance of highly structured tasks increases toward higher levels, suggesting ``generalist'' representations. These findings show links between task constraints and assessment hierarchy in automated clinical speech analysis.

URL PDF HTML ☆

赞 0 踩 0

2605.27186 2026-05-27 cs.CL 版本更新

MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation

MAIGO: 通过历史清理的在线策略自蒸馏缓解对话丢失

Haoyu Zheng, Yun Zhu, Shu Yuan, Shangming Chen, Qing Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang

发表机构 * Zhejiang University（浙江大学）； Shanghai AI Laboratory（上海人工智能实验室）； Huazhong University of Science and Technology（华中科技大学）； Fuzhou University（福州大学）； Tencent（腾讯）

AI总结针对大语言模型在多轮对话中性能下降（对话丢失）的问题，提出MAIGO方法，通过在线策略自蒸馏和清理历史助手回复来减少自污染，无需验证器或推理时辅助，显著提升多轮对话准确性。

详情

AI中文摘要

大语言模型通常能从完整指定的提示中解决任务，但当相同需求在多轮中展开时，性能会下降，这被称为对话丢失（LiC）差距。我们将这种退化部分归因于自污染：中间助手的回复进入后续上下文，并将早期偏差向前传递。受此机制启发，我们提出了MAIGO，一种在线策略自蒸馏方法，通过使用模型自身策略的历史清理参考来减少这种污染。对于中间轮次，MAIGO移除先前的助手回复，同时保留用户可见的分片前缀；对于回答轮次，它从基于完整用户侧对话的配对全视图参考中蒸馏。一个可靠性权重降低与干净参考不一致的中间轮次样本的权重。MAIGO不需要验证器奖励、状态标签或推理时辅助。在具有确定性验证器的LiC配对视图协议下，MAIGO将Qwen2.5-7B-Instruct的SHARDED准确率从52.8提升至66.1，SHARDED/FULL比率从66.5%提升至84.1%，同时保持FULL准确率在2.3个点以内。这些结果表明，自污染是LiC差距中一个可训练的成分。

英文摘要

Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.

URL PDF HTML ☆

赞 0 踩 0

2605.27168 2026-05-27 cs.CL cs.AI cs.CY 版本更新

Grounding Text Embeddings in Stakeholder Associations

将文本嵌入与利益相关者关联对齐

Jonathan Rystrøm, Sofie Burgos-Thorsen, Zihao Fu, Johan Irving Søltoft, Kenneth C. Enevoldsen, Chris Russell

发表机构 * University of Oxford（牛津大学）； Institute for Wicked Problems（复杂问题研究所）； The Chinese University of Hong Kong（香港中文大学）； Danish Technical University（丹麦技术大学）； Aarhus University（奥胡斯大学）

AI总结提出利益相关者对齐练习方法，通过评估嵌入模型与人类专家的语义距离一致性，发现神经文本嵌入在丹麦政策案例中可靠性显著低于专家（差距19-26个百分点），且该差距在美国联邦AI用例中复现（16个百分点）。

详情

AI中文摘要

LLMs 已经是好导师：面向教学数学辅导的无训练提示优化

Unggi Lee, Minchul Shin, Yeil Jeong, Sookbun Lee, Jeongsu Moon, Kyungtae Joo, Eunjoo Lee, Hoilym Kwon

发表机构 * Korea University Sejong Campus（韩国大学世宗校区）； Gyeonggi Institute of Education（京畿教育学院）； Indiana University Bloomington（印第安纳大学布卢明顿分校）； Opentutorials ； Chosun University（全州大学）； Korea University Korean Studies Center（韩国大学韩学研究中心）

AI总结本研究探索通过API调用优化系统提示的无训练方法，提出5种教育专用方法，在2个OOD基准上评估12种方法，发现所有方法均超越最强RL训练基线，ParetoGrad在事后解决率、泄漏控制和有用性上达到最佳帕累托平衡。

Comments 17 pages, 5 figures

详情

AI中文摘要

将LLMs与数学辅导对齐通常需要基于RL的训练和多GPU基础设施。我们研究无训练提示优化——仅通过API调用演化系统提示——是否可以作为实用替代方案。我们改编了7种已发表方法并提出了5种教育专用方法，在2个OOD基准套件上的5种条件下评估这12种方法。所有12种最佳方法配置均超越了最强的RL训练基线（R_total = 0.633），我们的ParetoGrad在事后解决率、泄漏控制和有用性上实现了最佳帕累托平衡，而非在任何单一组件上占优。使用包含82个代码的教育代码本进行行为分析发现，无训练方法依赖教学知识模式的频率是RL训练模型的2-3倍，同时意图级脚手架减少了约10个百分点。我们还发现一个任务依赖的推理模式效应，在无训练和基于RL的范式中一致。我们的方法仅通过提示和最小计算即可高效开发教学对齐的LLM导师。

英文摘要

Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization-evolving only the system prompt via API calls-can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.

URL PDF HTML ☆

赞 0 踩 0

2605.27083 2026-05-27 cs.CL cs.CR 版本更新

On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning

反事实知识训练在LLM遗忘中的隐藏代价

Xiaotian Ye, Xiaohan Wang, Mengqi Zhang, Shu Wu

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Huazhong University of Science and Technology（华中科技大学）； Shandong University（山东大学）； NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences（神经网络计划、人工智能研究所、中国科学院自动化研究所）

AI总结本文发现反事实微调（CFT）在LLM遗忘中存在知识冲突和幻觉溢出两大问题，并引入扩展基准RWKU+及诊断工具进行系统分析。

详情

AI中文摘要

反事实微调（CFT）已成为大语言模型（LLM）遗忘的一种有前景的范式，通过训练模型生成替代的虚构知识来取代不需要的内容。然而，在这项工作中，我们发现该范式在某些方面仍不如其他范式，并识别出导致这一差距的两个先前被忽视的陷阱：（1）知识冲突，即反事实语料库中的相互不一致导致冲突梯度，破坏参数优化；（2）幻觉溢出，即拟合虚假目标会灌输持久的捏造偏差，增加无关领域的幻觉率。为了系统诊断这些问题，我们引入了RWKU+，这是一个扩展的基准，配备了新颖的权衡指标和梯度级诊断工具。我们的工作进一步讨论了该范式的局限性和开销，旨在为更严格的LLM遗忘研究提供见解和可操作的指导。

英文摘要

Counterfactual tuning (CFT) has emerged as a promising paradigm for Large Language Model (LLM) unlearning by training models to generate alternative fictitious knowledge in place of undesired content. However, in this work, we find that this paradigm still underperforms other paradigms in some aspects, and identify two previously overlooked pitfalls underlying this gap: (1) knowledge conflict, where mutual inconsistencies within counterfactual corpora induce conflicting gradients that disrupt parameter optimization, and (2) hallucination spillover, where fitting false targets instills a persistent fabrication bias, inflating hallucination rates on unrelated domains. To systematically diagnose these issues, we introduce RWKU+, an extended benchmark equipped with novel trade-off metrics and gradient-level diagnostic tools. Our work further discusses the limitations and overhead of the paradigm, aiming to provide insights and actionable guidance for more rigorous LLM unlearning research.

URL PDF HTML ☆

赞 0 踩 0

2605.27072 2026-05-27 cs.CL cs.AI 版本更新

E3: Issue-Level Backtesting for Automated Research Critique

E3: 面向自动化研究评论的问题级回测

Yashwardhan Chaudhuri, Sanyam Jain, Paridhi Mundra

AI总结提出E3自动化评论助手，通过问题级回测协议评估其在识别研究论文技术问题上的表现，相比人类评审和LLM基线实现最高召回率。

详情

AI中文摘要

我们提出E3，一个自动化评论助手，通过识别研究论文中与决策相关的技术问题来增强评审者和工程团队。对于每个问题，E3报告其性质、位置、对贡献的影响以及解决该问题所需的分析或证据，涵盖无根据的主张、缺失的消融实验、弱基线、隐藏假设、有效性威胁和数据泄露风险。为了在没有污染混杂因素的情况下评估E3，我们采用问题级回测协议：语料库仅限于每个自动化来源训练截止日期之后发表的论文，并且对于每篇论文，一个仅观察匿名评审的元裁判将每个问题-来源对标记为“捕获”、“部分”或“遗漏”。应用于100篇ICLR 2026论文和4598个被评判的问题行，将E3与ICLR人类评审以及基于OpenAI的gpt-5.4和Anthropic的claude-opus-4-6构建的两个提示匹配的LLM基线进行比较，使用元裁判gpt-5.5，E3在每个聚合指标上达到最高召回率。包含部分的召回率达到90.2%，比GPT高15.5个百分点，比Claude高17.1个百分点，比人类评审高29.2个百分点，严格召回率保持顺序为65.8%。在人类评审提出的问题上，E3恢复了89.6%；在人类评审遗漏的问题上，它额外发现了1635行被纳入评判联合集，比次优来源多406行。语料库、基线提示、裁判提示模板和评估代码已发布。

英文摘要

We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical concerns in research papers. For each concern, E3 reports its nature, its location, its bearing on the contribution, and the analysis or evidence that would resolve it, covering unsupported claims, missing ablations, weak baselines, hidden assumptions, threats to validity, and leakage risks. To evaluate E3 without contamination confounds we adopt an issue-level backtesting protocol: the corpus is restricted to papers postdating the training cutoff of every automated source, and for each paper a meta-judge that observes only anonymised reviews labels every issue-source pair as Caught, Partial, or Missed. Applied to 100 ICLR 2026 papers and 4598 judged issue rows, comparing E3 against the ICLR human reviews and two prompt-matched LLM baselines built on gpt-5.4 from OpenAI and claude-opus-4-6 from Anthropic, with meta-judge gpt-5.5, E3 attains the highest recall on every aggregate metric. Partial-inclusive recall reaches 90.2 percent, which is 15.5 points over GPT, 17.1 points over Claude, and 29.2 points over the human reviews, and strict recall preserves the ordering at 65.8 percent. On concerns raised by the human reviewers, E3 recovers 89.6 percent; on concerns the human reviewers missed it surfaces 1635 additional rows admitted into the judged union, 406 above the next-best source. Corpus, baseline prompts, judge prompt template, and evaluation code are released.

URL PDF HTML ☆

赞 0 踩 0

2605.27068 2026-05-27 cs.CL cs.AI cs.MA 版本更新

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

QUACK: 多模态社交推理智能体中的沟通知识质疑、理解与审计

Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu

发表机构 * McGill University（麦吉尔大学）； Mila - Quebec AI Institute（魁北克人工智能研究所）； University of Cambridge（剑桥大学）； MBZUAI - Mohamed bin Zayed University of Artificial Intelligence（MBZUAI - 摩苏尔·本·扎耶德人工智能大学）； University of Toronto（多伦多大学）； Salesforce

AI总结提出QUACK框架，通过游戏结果、行为轨迹和话语一致性三级评估，自动审计多模态社交推理智能体语言与感知行为的一致性，发现最强智能体仍有15.1%的空间幻觉和过半无据指控。

详情

AI中文摘要

社交推理游戏已成为探测大型语言模型智能体推理、欺骗、协调和信念建模的热门测试平台。然而，大多数环境仅通过胜率等游戏结果评分，且主要局限于纯文本交互，难以判断智能体的语言是否真正基于其感知和行动，也难以识别其行为背后的失败模式。为填补这一空白，我们引入了QUACK，一个用于审计多模态社交推理中智能体语言基础的开源环境和评估框架。QUACK在三个层面评估智能体：游戏结果、行为轨迹和话语级一致性。其核心的陈述验证流水线从引擎日志重建每个智能体的真实轨迹，并对照检查每个讨论声明，自动标记空间幻觉、无据指控、欺骗崩溃和语言-行动不一致。在同质和跨模型对抗设置下评估三个前沿视觉语言模型，我们发现即使是最强的智能体，其可验证的空间声明中有15.1%是幻觉，且超过一半的指控缺乏有据证据。我们在https://github.com/AAAAA-Academia-Attractions/QUACK发布完整的引擎、评估框架、工具包和日志。

英文摘要

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.

URL PDF HTML ☆

赞 0 踩 0

2605.27066 2026-05-27 cs.CL cs.IR 版本更新

Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search

工业搜索中基于大语言模型的查询驱动事件时间线摘要

Mingyue Wang, Xingyu Xie, Hang Yang, Li Gao, Lixin Su, Ge Chen, Dawei Yin, Daiting Shi

发表机构 * Baidu Inc.（百度公司）

AI总结提出QDET系统，通过多任务微调和强化学习实现查询驱动的事件时间线摘要，在百度搜索中显著提升用户参与度。

Comments Accepted at KDD 2026

详情

DOI: 10.1145/3770855.3818439

AI中文摘要

理解事件如何随时间演变对于处理热门新闻查询的搜索引擎至关重要。我们提出了QDET（查询驱动事件时间线摘要），这是一个部署在百度搜索上的生产系统，用于构建聚焦的事件时间线以解释特定查询事件。与传统的以主题为中心、旨在全面覆盖的方法不同，QDET从每天检索的数百万文档形成的嘈杂候选集中识别并组织与查询密切相关的子事件。QDET包含两个关键创新：（1）多任务监督微调，包含三个辅助任务——时间顺序、因果判断和时间线完成——使紧凑模型在专业领域匹配更大通用模型的性能；（2）基于强化学习的事件简洁摘要，在保持语义质量的同时强制执行严格长度约束，实现了88.2%的长度合规性，并在约束满足上比671B规模模型高出7.7个百分点。我们微调的7B参数模型在时间线摘要上达到76.2%的F1分数，略超DeepSeek-R1-671B的零样本性能（76.1% F1），而仅使用其1%的参数——表明领域特定优化能够以大幅降低的计算成本实现质量相当的生产就绪模型。百度搜索上的在线A/B测试验证了实际效果，与单任务基线相比，点击率提升5.5%，停留时间延长4.6%，探索深度增加4.4%。我们进一步证明时间线理解可迁移到热度预测，确认了对下游任务的有效知识迁移。

英文摘要

Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks-temporal ordering, causal judgment, and timeline completion-that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement learning-based event concise summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B parameter model achieves 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only 1% of its parameters-demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational costs. Online A/B tests on Baidu Search validate real-world effectiveness, showing 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.

URL PDF HTML ☆

赞 0 踩 0

2605.27062 2026-05-27 cs.CL cs.LG 版本更新

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

FalAR: 一个大规模说话人标注的欧洲葡萄牙语议会会议语音语料库

Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad

发表机构 * INESC-ID ； Instituto Superior Técnico（理工学院）

AI总结为弥补欧洲葡萄牙语语音资源不足，构建了FalAR语料库，包含5800小时议会会议语音及说话人标注，实验表明作为预训练数据可使ASR词错误率相对降低14%。

Comments Published in LREC2026

详情

AI中文摘要

自动语音识别（ASR）的最先进性能在很大程度上依赖于大规模标注语料库的可用性。这增加了数据收集工作的需求，特别是对于代表性不足的语言和方言变体。由于欧洲葡萄牙语（EP）的说话人数量较少（约1100万），在目前可用的大规模语音数据资源中，它被巴西葡萄牙语（BP）（约2亿说话人）所掩盖，导致EP用户的语音系统性能不佳。为了弥补这一差距，并遵循其他语言的类似数据收集工作，我们提出了FalAR，一个大规模、说话人标注的欧洲葡萄牙语议会会议语音语料库。FalAR涵盖约20年，包含5800小时的语音数据。此外，4850小时具有说话人身份标注，总共1180个说话人，附带元数据包括年龄、性别、政治派别和议会角色。该语料库使用最先进的EP CAMÕES ASR模型进行转录参考对齐。在本文中，我们描述了数据收集过程以及FalAR语料库的主要特征。此外，我们评估了数据量和对齐准确性对ASR性能的权衡，实验表明，将FalAR作为预训练数据可以使基线模型的词错误率相对降低高达14%。

英文摘要

State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. In addition, 4,850 hours have speaker identity annotations, for a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.

URL PDF HTML ☆

赞 0 踩 0

2605.27050 2026-05-27 cs.CL cs.LG 版本更新

BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

BhashaSetu：一种以数据为中心的低资源机器翻译方法

Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta, Akshita Bhasin, Shrinivas Khedkar

发表机构 * Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai（孟买韦尔马塔·吉贾拜技术学院计算机工程与信息技术系）； Tübingen AI Center, University of Tübingen, Germany（图宾根大学图宾根人工智能中心，德国）

AI总结提出BhashaSetu数据集，通过大规模、多领域、形态感知的英-马拉地语平行语料库，并验证语料库级去重对低资源神经机器翻译质量的关键影响。

详情

AI中文摘要

我们提出了BhashaSetu，一个语言丰富的英语-马拉地语平行数据集，解决了低资源神经机器翻译（NMT）中持续存在的数据限制问题。马拉地语有超过9500万使用者，但在不同领域的高质量平行语料库中仍然代表性不足。我们的数据集包含来自新闻、政治、医疗、文学和文化等异构来源的278万个句子对，并提供了词干化和词形还原表示以支持形态感知分析。我们使用BLEU、spBLEU、chrF++和TER指标对多个最先进的翻译模型进行了基准测试，并使用LoRA对NLLB-200-distilled-600M进行了参数高效微调。我们消融实验的一个关键发现是：语料库级去重是预处理中对下游质量贡献最大的单一因素（去除它会使性能降低1.17 BLEU和2.21 chrF++），这表明对于低资源、形态丰富的语言，有纪律的跨源语料库卫生是一种低成本、高影响力的干预措施。该数据集已公开发布，以促进可重复且语言信息丰富的低资源NMT研究。

英文摘要

We present BhashaSetu, a linguistically enriched English--Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.

URL PDF HTML ☆

赞 0 踩 0

2605.27045 2026-05-27 cs.CL 版本更新

ExTax: Explainable Disinformation Detection via Persuasion, Emotion, and Narrative Role Taxonomies

ExTax：基于说服、情感和叙事角色分类学的可解释虚假信息检测

Shang Luo, Yingguang Yang, Zhenchen Sun, Yang Liu, Bin Chong, Jingru Chen, Yancheng Chen, Jiayu Liang, Kefu Xu, Hao Peng, Philip S. Yu

发表机构 * Peking University（北京大学）； University of Science and Technology of China（中国科学技术大学）； North China University of Science and Technology（华北理工大学）； Tsinghua University（清华大学）； Nanjing University of Aeronautics and Astronautics（南京航空航天大学）； University of Chinese Academy of Sciences（中国科学院大学）； Soochow University（苏州大学）； Beihang University（北航）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）

AI总结提出ExTax框架，统一说服修辞、情感操纵和叙事角色为17维分类空间，通过熵驱动动态标签平滑和多头注意力融合分类与上下文特征，实现可解释的虚假信息检测，在跨域基准上达到0.8456 Macro F1。

详情

AI中文摘要

LLMs的普及加速了高度流畅虚假信息的生成和传播，使得传统的句法语义验证越来越不足。这种欺骗很少仅依赖表面虚假；相反，它常常结合说服性修辞、情感操纵和叙事角色构建，通过多种认知途径影响读者的解读。然而，现有检测器通常强调孤立信号——如句法、外部知识、说服或情感线索——因此难以捕捉虚假信息背后的多方面操纵意图或提供人类可审计的解释。为填补这一空白，我们提出了 extbf{ExTax}，一个面向分类学的可解释虚假信息检测框架。ExTax将说服修辞、情感操纵和叙事角色统一到17维分类空间中，涵盖6种说服修辞策略、5种情感操纵方法和6种叙事角色类别。它从多个前沿LLMs中提取属性，通过熵驱动动态标签平滑协调它们的分歧，并通过异构多头注意力将所得分类表示与上下文编码融合，将每个预测基于可解释的操纵画像。在五个跨领域和跨体裁基准上，ExTax实现了0.8456的整体Macro $F_1$，优于最先进的深度学习和基于LLM的基线。在严重的体裁不平衡下，最强的深度基线从0.9454降至0.6194，而ExTax保持稳健。

英文摘要

The democratization of LLMs has accelerated the generation and circulation of highly fluent disinformation, making traditional syntax-semantic verification increasingly insufficient. Such deception rarely relies solely on surface-level falsity; instead, it often combines persuasive rhetoric, emotional manipulation, and narrative role construction to influence readers' interpretations through multiple cognitive pathways. However, existing detectors typically emphasize isolated signals -- such as syntax, external knowledge, persuasion, or affective cues -- and therefore struggle to capture the multi-faceted manipulative intents underlying disinformation or provide human-auditable explanations. To address this gap, we present \textbf{ExTax}, a taxonomy-aligned framework for explainable disinformation detection. ExTax unifies persuasive rhetoric, emotional manipulation, and narrative roles into a 17-dimensional taxonomic space, covering 6 persuasive-rhetoric strategies, 5 emotional-manipulation methods, and 6 narrative-role categories. It elicits attributes from multiple frontier LLMs, reconciles their disagreements through Entropy-driven Dynamic Label Smoothing, and fuses the resulting taxonomic representations with contextual encodings via Heterogeneous Multi-Head Attention, grounding each prediction in an interpretable manipulation profile. Across five cross-domain and cross-genre benchmarks, ExTax achieves an overall Macro $F_1$ of $0.8456$, outperforming state-of-the-art deep learning and LLM-based baselines. It also remains robust under severe genre imbalance, where the strongest deep baseline degrades from $0.9454$ to $0.6194$.

URL PDF HTML ☆

赞 0 踩 0

2605.27033 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Tracing Computation Density in LLMs

追踪LLMs中的计算密度

Corentin Kervadec, Iuliia Lysova, Iuri Macocco, Marco Baroni, Gemma Boleda

发表机构 * Universitat Pompeu Fabra（庞培法布拉大学）； ICREA

AI总结提出s-Trace方法估计最优子图，发现LLM计算分为早期稀疏核心和后期密集细化两个阶段，且计算量与模型不确定性相关。

详情

AI中文摘要

基于Transformer的大型语言模型（LLMs）由数十亿个参数组成，这些参数排列在深度和宽度都很大的计算图中，但尚不清楚它们是否对所有输入都充分利用了全部容量。我们引入了s-Trace方法，以有效估计最能近似完整模型输出的大小为s的子图。通过这种方法，我们发现各种LLM中的计算组织成两个不同的阶段。一个主要由早期层节点组成的小子图可以重建完整模型输出分布的头部。添加更多节点（主要位于后期层，且越来越多地由注意力头组成）会导致近似完整输出分布的逐步细化。此外，我们发现每个输入所需的计算量与模型不确定性相关，并且更稀疏的子图编码浅层统计信息，例如单字频率。总体而言，我们的结果表明，有效的LLM计算中存在一致的模块化组织，其中稀疏的早期层核心提供粗略预测，然后通过后期层中更密集的计算进一步细化。

英文摘要

Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but it is not clear that they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates a full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. We find moreover that the amount of necessary computation per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.

URL PDF HTML ☆

赞 0 踩 0

2605.27030 2026-05-27 cs.CL 版本更新

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

分享更多，搜索更少：面向高效测试时间扩展的协作并行思考

Xinglin Wang, Hao Lin, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

发表机构 * School of Computer Science, Beijing Institute of Technology（北京理工大学计算机学院）； Xiaohongshu Inc（小红书公司）

AI总结提出一种无需训练的协作并行思考框架，通过在并行分支间共享搜索信息来减少冗余探索，从而在测试时间扩展中实现更优的准确率-延迟帕累托边界。

Comments Preprint

详情

AI中文摘要

测试时间扩展（TTS）通过分配额外的推理计算来探索解空间，从而增强大型语言模型的推理能力。然而，现有的并行TTS方法通常在搜索过程中保持分支隔离：中间发现保持分支私有，无法及时指导其他分支。这种信息隔离导致大量冗余探索，因为分支反复重新发现其他地方已有的信息，并且需要更多搜索步骤来收集做出正确回答所需的完整决策信息。为弥补这一差距，我们提出协作并行思考（CPT），一种无需训练的推理框架，能够在并行分支间实现搜索时信息共享。CPT从正在运行的分支中提取紧凑的中间信息，维护一个去重的查询级信息池，并通过输入上下文广播池条目，使得后续搜索步骤中的每个分支能够重用其他分支的发现，而不是重新发现相同信息。实验上，在HMMT和AIME基准测试上的结果表明，CPT在不同rollout预算和模型规模下，相比强基线建立了更强的准确率-延迟帕累托前沿，突显了搜索时协作作为高效并行TTS的一个有效方向。

英文摘要

Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated during search: intermediate discoveries remain branch-private and cannot guide other branches in time. This information isolation causes substantial redundant exploration, as branches repeatedly rediscover information already found elsewhere and require more search steps to collect complete decision information needed to reach correct answers. To bridge this gap, we propose \textbf{Collaborative Parallel Thinking (CPT)}, a training-free inference framework that enables search-time information sharing across parallel branches. CPT extracts compact intermediate information from ongoing branches, maintains a deduplicated query-level information pool, and broadcasts pool entries through the input context, allowing each branch in subsequent search steps to reuse discoveries made by other branches rather than rediscover the same information. Empirically, experiments on HMMT and AIME benchmarks show that CPT establishes a stronger accuracy--latency Pareto frontier than strong baselines across rollout budgets and model scales, highlighting search-time collaboration as an effective direction for efficient parallel TTS.

URL PDF HTML ☆

赞 0 踩 0

2605.27025 2026-05-27 cs.CL cs.MM 版本更新

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

基于属性的LLM与仇恨言论标注的对齐诊断

Mohammad Amine Jradi, Faeze Ghorbanpour, Alexander Fraser

发表机构 * School of Computation, Information and Technology, TU Munich（计算、信息与技术学院，慕尼黑技术大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）

AI总结通过分析LLM在十个主观属性上的判断与人类标注的对齐情况，发现行为显式维度对齐良好而评价维度系统性反转，并提出基于置信度加权岭回归的属性组合方法，重构连续仇恨言论分数，R²达0.71。

详情

AI中文摘要

仇恨言论标注成本高昂、主观性强且容易产生标注者分歧，使得大规模数据集构建具有挑战性。我们系统分析了大型语言模型（LLM）在十个理论上基于主观属性（如去人性化、暴力和情感）上与人类判断的对齐程度，评估了Llama 3.1和Qwen 2.5的小型及大型变体。我们的分析揭示了所有模型的一致分裂：行为显式维度（侮辱、羞辱、攻击-防御）与人类标注高度相关，而评价维度（尊重、情感、仇恨言论）则系统性反转。人口统计角色条件化降低了模型置信度，但未改善对齐。基于这些发现，我们提出通过置信度加权岭回归组合属性级LLM预测，从测量仇恨言论语料库中重构连续仇恨言论分数，R²达到0.71，优于直接提示基线，表明结构化属性分解比端到端标签预测单独恢复出更丰富且更符合人类对齐的信号。

英文摘要

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus, achieving $R^2$ of up to 0.71 and outperforming direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.

URL PDF HTML ☆

赞 0 踩 0

2605.27016 2026-05-27 cs.CL cs.AI cs.LG stat.ML 版本更新

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

评估不确定性估计器与LLM幻觉的相关性

Yedidia Agnimo, Anna Korba, Annabelle Blangero, Nicolas Chesneau, Karteek Alahari

发表机构 * CREST, ENSAE Institut Polytechnique de Paris（CREST，巴黎高等理工学院）； Ekimetrics France（法国Ekimetrics）； Centre Inria de l’Université Grenoble Alpes（格勒诺布尔阿尔卑斯大学信息研究院）

AI总结通过系统实证研究，评估信息论、基于采样和反思性等不确定性估计器与LLM幻觉之间的关联，发现关联性高度可变且通常较弱，挑战了将不确定性作为幻觉直接信号的做法。

Comments 35 pages, 7 figures, 9 tables

详情

AI中文摘要

大型语言模型（LLM）容易产生幻觉，即与输入或训练数据不符的陈述，阻碍了可靠部署。同时，许多不确定性估计（UE）方法被提出来量化模型置信度，并常被隐含地视为模型失败的代理。然而，不确定性与幻觉之间的关系尚未得到充分表征。我们对不确定性估计器与LLM幻觉之间的关联进行了系统的实证研究。我们不是假设这种关联，而是直接评估它在何时以及在多大程度上成立。我们考虑了多种不确定性估计器，包括信息论、基于采样和反思性估计器，并检查了它们在幻觉设置中的行为。我们的实验涵盖了内在幻觉（违反输入忠实性）和外在幻觉（相对于训练数据的无根据主张），使用了四个互补基准，包括RAGTruth和HalluLens。我们发现，这种关联性高度可变且通常较弱，取决于幻觉类型和所评估的LLM。这些结果挑战了将不确定性作为幻觉直接信号的做法，并阐明了何时它能提供可操作的信息。

英文摘要

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.

URL PDF HTML ☆

赞 0 踩 0

2605.27015 2026-05-27 cs.CL 版本更新

PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions

PersLitEval：波斯文学问题上的细粒度基准与LLM评估

Ruhallah Niazi, Faeze Ghorbanpour, Alexander Fraser

发表机构 * School of Computation, Information and Technology, TU Munich（计算信息与技术学院，慕尼黑工业大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）

AI总结提出PersLitEval基准，包含4514道波斯文学多选题，评估六种LLM在十种提示策略下的表现，发现模型在概念相似性任务上准确率高，但在拼写和构词等正式语言分析上困难，且提示策略显著影响性能。

详情

AI中文摘要

尽管多语言能力令人印象深刻，但大型语言模型（LLM）在非英语语言的文学知识方面仍然缺乏充分评估。我们引入了PersLitEval，这是一个包含4514道波斯文学多选题的基准，涵盖拼写、修辞手法、语法、词汇、构词和概念理解等八个细粒度类别，题目来源于Konkur大学入学考试材料。我们评估了六种LLM在十种提示策略下的表现，揭示了三个难度层级上显著的类别差异：模型在概念相似性任务上准确率较高，但在正式语言分析上表现不佳，其中拼写和构词对所有模型来说都是最难的。提示策略对性能有显著影响，其中带解释的少样本示例效果最佳，尤其是在正式语言类别上。错误分析识别出三种失败模式：语义理解差距、正式语言知识差距以及计数/枚举错误，表明不同类别需要不同的改进策略。

英文摘要

Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice questions across eight fine-grained categories spanning spelling, literary devices, grammar, vocabulary, word formation, and conceptual understanding, sourced from materials for the Konkur university entrance examination. We evaluate six LLMs across ten prompting strategies, revealing striking category-level disparities across three tiers of task difficulty: models reach higher accuracy on conceptual similarity tasks but struggle with formal linguistic analysis, with spelling and word formation proving the hardest across all models. Prompting strategy has a significant impact on performance, with explained few-shot examples yielding the best results, particularly on formal linguistic categories. An error analysis identifies three failure modes: semantic comprehension gaps, formal linguistic knowledge gaps, and counting/enumeration errors, suggesting that different categories require different improvement strategies.

URL PDF HTML ☆

赞 0 踩 0

2605.26999 2026-05-27 cs.CL cs.CR 版本更新

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals

提示注入检测是依赖于场景的：一种基于可解释结构信号的部署感知评估

Akindoyin Akinrele, Shreyank N Gowda

发表机构 * School of Computer Science, University of Nottingham（诺丁汉大学计算机科学学院）

AI总结本研究通过多模型、多场景的实验框架，评估了提示注入检测方法，发现检测性能高度依赖于部署场景和阈值选择，其中基于Transformer的模型表现最佳，结构信号在特定场景下提供适度但一致的改进。

详情

AI中文摘要

提示注入对大型语言模型的安全部署构成严重威胁，然而现有的检测方法通常在有限的设置下进行评估，未能反映真实世界的操作约束。在这项工作中，我们使用多模型和多场景实验框架，对提示注入检测进行了部署感知评估。我们比较了基于词汇、语义、结构和Transformer的检测器，在多个分布外设置、重复数据划分以及排名和阈值部署指标下的表现。我们引入了可解释的结构信号，这些信号捕捉了层次覆盖、系统提示欺骗、角色重定义和逃避模式，并评估了它们在稀疏模型中以及与强编码器基线结合时的贡献。我们的结果表明，检测性能高度依赖于场景，并且对阈值选择敏感，没有单一模型在所有设置中占据主导地位。基于Transformer的模型实现了最强的整体性能，而结构信号在特定场景下提供了适度但一致的改进，并在更困难的场景中改善了低假阳性率行为。这些发现凸显了排名性能与部署有效性之间的差距，并强调了在现实操作约束下评估提示注入防御的重要性。代码将发布。

英文摘要

Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. We introduce interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assess their contribution both within sparse models and in combination with strong encoder baselines. Our results show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings. Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low false positive rate behaviour in harder scenarios. These findings highlight the gap between ranking performance and deployment effectiveness and underscore the importance of evaluating prompt injection defences under realistic operational constraints. Code will be released.

URL PDF HTML ☆

赞 0 踩 0

2605.26978 2026-05-27 cs.CL cs.SD 版本更新

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

PashtoTTS-Bench：低资源非拉丁文字文本转语音的自动化筛选

Hanif Rahman

发表机构 * Independent Researcher（独立研究员）

AI总结针对低资源非拉丁文字TTS评估中单一ASR往返WER的不足，提出INSV报告框架及其自动化筛选子集INSV-A，并实例化为PashtoTTS-Bench基准，通过多指标评估多个TTS系统。

详情

AI中文摘要

对于低资源非拉丁文字语言，当文本转语音（TTS）评估依赖于单一的ASR往返词错误率（WER）时可能会失败。系统可能不产生音频、说出邻近语言、仅在ASR转录中保留目标文字脚本，或者对母语者来说听起来不自然。我们引入了INSV（可懂度、自然度、脚本保真度和验证）报告框架，将这些情况分开。本文报告了INSV-A，即自动化筛选子集：合成完成度、ASR WER/CER、转录脚本保真率和音频语言识别。原生MOS和语音标注已指定但未在此版本中声明。我们将INSV-A实例化为PashtoTTS-Bench，一个针对普什图语TTS的带日期基准。2026年4月至5月的运行评估了Edge GulNawaz、Edge Latifa、OmniVoice clone、OmniVoice auto和一个乌尔都语阴性对照，使用200个FLEURS和200个过滤后的Common Voice 24提示。在独立的omniASR_CTC_300M_v2下，OmniVoice auto的WER最低（FLEURS 24.1%，CV24 27.4%），其次是Edge GulNawaz（32.8%，39.5%）、Edge Latifa（35.6%，47.7%）和OmniVoice clone（45.4%，34.8%）。低于自然语音基线的WER反映了干净的合成音频，不应被解读为优于原生语音。Whisper Large V3在检查的普什图语TTS音频上返回0.0%的普什图语标签，而MMS-LID-4017和SpeechBrain VoxLingua107将普什图语输出与乌尔都语对照区分开。该版本提供了提供者元数据、每句分数、LID审计、失败日志和用于添加系统的脚本。

英文摘要

Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.

URL PDF HTML ☆

赞 0 踩 0

2605.26969 2026-05-27 cs.CL cs.AI 版本更新

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Recon：基于重建指导的推理合成用于用户建模

Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou, Lisa Dunlap, Narges Norouzi, Joseph E. Gonzalez

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出Recon方法，通过动作重建分数评估推理轨迹的预测能力，以改进用户建模中的推理合成，在多个领域优于事后合理化基线。

详情

AI中文摘要

用户建模旨在使用语言模型（LM）从过去的上下文-动作对（例如对话轮次）语料库中模拟个体的行为，从而在行为科学、人机协作和市场研究等环境中模拟用户。最近的方法通过合成推理轨迹来扩充这些语料库，通常通过同时以上下文和动作为条件生成。然而，这种条件构成事后合理化而非推理：轨迹保证证明动作的合理性，但可能不编码潜在的潜在因果决策路径。我们提出Recon，它使用动作重建通过预测能力对推理轨迹进行评分：给定上下文和候选推理，重建模型预测动作，重建保真度决定推理质量。在四个领域，Recon相对于标准事后合理化基线Backward Synthesis实现了54.7%的胜率。此外，我们发现使用来自Recon的奖励训练推理合成模型可提高下游用户建模性能，相对于基线实现了高达70.0%的胜率。我们进一步表明，Recon合成的推理可跨模型迁移，并改善重建模型之外的用户建模。我们的工作表明，事后合理化对于推理合成是不够的，有用且可解释的推理应自然地从上下文中引出动作。

英文摘要

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.

URL PDF HTML ☆

赞 0 踩 0

2605.26958 2026-05-27 cs.CL cs.AI 版本更新

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Tournament-GRPO：面向开放式长文本生成强化学习的群组锦标赛奖励

Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China（中国人民大学）； University of Southern California（南加州大学）； Zhejiang University（浙江大学）； Xiaohongshu Inc.（小红书公司）

AI总结针对开放式长文本生成中缺乏可靠参考答案和自动评估指标的问题，提出Tournament-GRPO框架，通过同一查询生成结果间的多轮锦标赛比较将基于规则的LLM评判转化为相对奖励，在Deep Research Bench上取得4.52分提升。

详情

负责任的基于LLM的人机协商：通过共生脚手架扩展集体智能

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar（卡塔尔西北大学）

AI总结提出一个三层共生人机框架，通过多样性放大、条款级溯源和人类主导批准，在扩展集体智能的同时保持主体性和合法性。

Comments Accepted at the LREC 2026 / 2nd Workshop on Language-driven Deliberation Technology

详情

AI中文摘要

大型语言模型（LLM）可以在以前受轮流发言和引导带宽限制的规模上支持民主协商。最近的研究表明，LLM生成的群体陈述通常比人类中介的输出更受欢迎，而理论分析认为LLM放松了限制集体智能的同时性约束。然而，纯LLM中介存在使多元性崩溃、过度优化一致性以及当参与者无法质疑其如何被代表时损害合法性的风险。我们提出了一个共生的人机框架，分为三个层次：观察与多样性放大、具有条款级溯源的引导、以及人类优先批准。我们的贡献包括：具有显著性加权分级覆盖、多样性和擦除度量；结合交叉编码器相似性与因果剔除诊断的溯源管道；偏好条件权衡控制；公平感知的可争议工作流；对抗性鲁棒性测试；以及基于LLM作为评判者局限性证据的消融设计评估协议。结果是一个可测试的协商技术蓝图，能够在扩展集体智能的同时保持主体性和合法性。

英文摘要

Large language models (LLMs) can support democratic deliberation at scales previously constrained by turn-taking and facilitation bandwidth. Recent work shows that LLM-generated group statements are often preferred over human-mediated outputs, while theoretical analyses argue that LLMs relax the simultaneity constraints limiting collective intelligence. Yet pure LLM mediation risks collapsing pluralism, over-optimizing for agreement, and undermining legitimacy when participants cannot contest how they are represented. We propose a symbiotic human-AI framework organized into three layers: observation and diversity amplification, facilitation with clause-level provenance, and human primacy for ratification. Our contributions include graded coverage, diversity, and erasure metrics with salience-aware weighting; a provenance pipeline combining cross-encoder similarity with causal knockout diagnostics; preference-conditioned trade-off control; equity-aware contestability workflows; adversarial robustness tests; and an evaluation protocol with ablation designs informed by evidence of LLM-as-judge limitations. The result is a testable blueprint for deliberation technology that scales collective intelligence while preserving agency and legitimacy.

URL PDF HTML ☆

赞 0 踩 0

2605.26937 2026-05-27 cs.CL cs.AI 版本更新

Beyond Questions: Evaluating What Large Language Models (Actually) Know

超越问题：评估大型语言模型（实际）知道什么

Luca Giordano, Simon Razniewski

发表机构 * ScaDS.AI Dresden/Leipzig & TU Dresden, Germany（ScaDS.AI 德尔布兰德/莱比锡及德累斯顿技术大学，德国）

AI总结提出开放知识评估新范式，通过开放式提示（如“告诉我关于M.L. King的一切”）评估模型自然表达的知识，并构建BeQu基准测试10,000个实体。

详情

AI中文摘要

大型语言模型（LLM）中的参数化知识是其成功的基石，但仍未被充分理解。现有的知识基准通常依赖于预定义的问题（例如，“M.L. King的出生日期是什么？”），仅评估基准设计者明确选择查询的知识，这是一种有问题的可用性偏差。在本文中，我们引入了开放知识评估，这是一种用于LLM知识基准测试的新范式。它不提出狭隘的问题，而是评估模型在响应开放式引发提示（例如，“告诉我关于M.L. King的一切”）时选择呈现的知识。这将焦点从预定义的答案检索转向表征模型自然表达的知识。我们用BeQu（超越问题）实例化这一范式，这是一个包含10,000个实体并配有用于陈述验证的参考语料库的基准。使用BeQu，我们评估了广泛的语言模型，并分析了推理努力、模型规模、提示格式和知识领域的影响。数据和排行榜可在此工作的GitHub仓库和基准网站上获取。

英文摘要

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?"), evaluating only knowledge that benchmark designers explicitly choose to query, a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.

URL PDF HTML ☆

赞 0 踩 0

2605.26935 2026-05-27 cs.CL 版本更新

DunbaaBERT: From Sacrifice to Semantics

DunbaaBERT: 从牺牲到语义

Iffat Maab, Waleed Jamil, Raphael Schmitt

发表机构 * Research and Development Center for Large Language Models (LLMC), National Institute of Informatics, Tokyo（大型语言模型研发中心（LLMC），信息研究所，东京）； School of Computation, Information and Technology, Technical University of Munich, Germany（计算、信息与技术学院，慕尼黑技术大学，德国）； Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg, Germany（临床医学系，弗赖堡大学医学院及医学中心，德国）

AI总结本文提出DunbaaBERT，一种从零训练的乌尔都语RoBERTa-base模型族，通过不同词汇表大小在17GB语料上预训练，在多项下游任务中达到与强多语言基线相当的性能，并发现较大词汇表并不持续提升效果。

详情

AI中文摘要

大型语言模型在许多自然语言处理任务中取得了强劲性能，但由于资源有限和评估设置碎片化，乌尔都语仍相对未被充分探索。为填补这一空白，我们引入了DunbaaBERT，一个乌尔都语RoBERTa-base模型族，在去重后的17GB乌尔都语语料库上使用32k、52k和96k token的Byte-BPE词汇表从头训练。我们在内在和下游乌尔都语自然语言处理基准上评估DunbaaBERT，涵盖语言可接受性、新闻分类、攻击性语言检测和情感分析，同时分析词汇表大小对性能和效率权衡的影响。在各项基准中，DunbaaBERT变体与强多语言基线相比取得了有竞争力的性能，同时始终保持有利的效率权衡。有趣的是，较大的词汇表并不持续提升下游效果，DunbaaBERT$_{\text{32k}}$反复提供最强的整体效率概况。总体而言，我们的结果表明，尽管模型和训练规模相对紧凑，精心策划的乌尔都语特定编码器模型仍能保持高度竞争力。所有模型均在MIT许可下发布。

英文摘要

Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT$_{\text{32k}}$ repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.

URL PDF HTML ☆

赞 0 踩 0

2605.26934 2026-05-27 cs.CL cs.AI 版本更新

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

推理深度与环境复杂度：逻辑推理任务中RLVR数据分配的受控研究

Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira

发表机构 * Kyoto University（京都大学）； University of Tokyo（东京大学）； NII LLMC（日本信息处理学会LLMC）； RIKEN（理化学研究所）

AI总结通过将推理空间划分为深度和复杂度两个维度，并考虑四种推理形式，在合成知识图谱环境中进行受控实验，发现联合深度-复杂度覆盖优于单轴策略，不同推理家族对RLVR覆盖的反应非均匀，且均匀混合优于分阶段课程。

Comments Pre-print

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）已成为后训练推理模型的核心，但现有研究的一个关键局限在于对推理空间的狭隘视角：难度仅被视为推理深度，奖励集中在正向演绎状态追踪。相反，我们沿两个维度刻画推理空间。难度：除了推理深度，我们研究环境复杂度，即模型必须在干扰项和交互结构中识别正确路径。奖励推理形式：我们考虑现实世界推理核心的四种能力：演绎状态追踪、对隐藏事件或事实的溯因恢复、归纳规则归纳以及类比迁移。为解耦这些因素，我们构建了一个合成知识图谱环境，具有受控的预训练和后训练分布，其中每个实例在深度、复杂度和任务家族上变化。三个发现：联合深度-复杂度覆盖优于单轴策略；推理家族反应非均匀，溯因推理在RL覆盖区域外退化，任务相关性聚类为演绎-溯因对和归纳-类比对；在固定预算下，均匀混合优于分阶段课程。我们还发现，最近的现成模型表现出相同的演绎-溯因不对称性，表明这一差距并非我们受控设置的假象。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. Difficulty. Beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. Rewarded reasoning form. We consider four abilities core to real-world reasoning: deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.

URL PDF HTML ☆

赞 0 踩 0

2605.26924 2026-05-27 cs.CL 版本更新

Learning to Adapt SFT Data for Better Reasoning Generalization

学习适应SFT数据以实现更好的推理泛化

Lisong Sun, Li Wang, Chen Zhang, Jinyang Wu, Kui Zhang, Tianhao Peng, Wenjun Wu

发表机构 * Beihang University（北京航空航天大学）； Tsinghua University（清华大学）； Nanyang Technological University（南洋理工大学）； Hangzhou International Innovation Institute（杭州国际创新研究院）； Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing（北京未来区块链与隐私计算高级创新中心）

AI总结提出DART方法，通过强化学习训练映射器将分布不匹配的SFT数据转化为模型自适应的监督，提升推理泛化能力。

详情

AI中文摘要

大型语言模型（LLMs）取得了显著进展，其中后训练在增强其推理能力方面起着关键作用。在后训练范式中，监督微调（SFT）被广泛使用：它利用外部数据提供密集监督并实现高效训练。然而，当数据分布与目标模型自身分布不匹配时，直接在专家数据上微调可能会损害泛化能力。在这项工作中，我们提出了推理调优的数据适应（DART），它将使用固定且可能分布不匹配的SFT数据集表述为对演示转换的优化问题。DART使用强化学习训练一个映射器模型，将原始SFT数据转换为与目标模型分布和学习偏好更匹配的模型自适应监督。转换后的数据随后用于SFT，使目标模型能够更好地利用外部监督。在多个模型和数据集上的实验表明，DART提高了泛化能力，实现了比直接RL更高的训练效率，并帮助模型超越标准SFT。我们的代码可在https://anonymous.4open.science/r/DART525E50D获取。

英文摘要

Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at https://anonymous.4open.science/r/DART525E50D.

URL PDF HTML ☆

赞 0 踩 0

2605.26918 2026-05-27 cs.CL 版本更新

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

视频模型是教育领域的零样本学习者和推理者吗？EduVideoBench：面向教育视频生成的知识-技能-态度基准

Unggi Lee, Hoyoung Ahn, Yoon Choi, Seonmin Eun, Jahyun Jeong, Seonmin Jin, Harmony Jung, Hye Jin Kim, Chaerin Lee, Hyunji Lee, Jeongjin Lee, Soohwan Lee, Young-Seok Oh, Jaehyeon Park, Sun-ok Ryu, Sunyoung Shin, Yoorim Son, Haeun Park, Yeil Jeong

发表机构 * Korea University Sejong Campus（韩国大学世宗校区）； Cardiff Metropolitan University（卡迪夫 Metropolitan 大学）； Seoul National University（首尔国立大学）； Bugil Academy（Bugil 学院）； Gyeonggi Provincial Office of Education（京畿省教育厅）； Loughborough University（洛桑大学）； Korea National University of Education（韩国教育大学）； Korea University（韩国大学）； Korean Educational Development Institute（韩国教育发展院）； Sungshin Women’s University（顺天堂女子大学）； Seoul National University of Education（首尔国立教育大学）； Korea Institute for Curriculum and Evaluation（韩国课程评价院）； Indiana University Bloomington（印第安纳大学布卢明顿分校）

AI总结提出基于知识-技能-态度框架的教育视频生成基准EduVideoBench，评估五个前沿视频生成模型在教育有效性上的不足，并发现教育有效性是多维度的，单一元素不匹配即可使视频失效。

详情

AI中文摘要

视频生成模型（VGMs）正迅速进入课堂，然而现有基准仅评估感知质量、内在忠实性、通用安全性或将视频作为推理媒介，没有评估输出是否具有教育有效性。在这项工作中，我们提出了EduVideoBench，这是教育领域第一个平衡的基准，基于知识-技能-态度（KSA）框架，使得教学充分性和教育安全性被联合评估，而非作为临时的质量维度。在五个前沿VGMs上，我们的结果显示，在知识、技能和态度方面，它们距离课堂准备就绪还有很大的改进空间。我们辅以专家评论的定性分析，发现教育有效性是多维度的，单个不匹配的元素（如节奏、可读性或符号）可能使原本正确的视频失效。我们希望EduVideoBench能够指导开发教学上合理且课堂安全的VGMs。

英文摘要

Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally valid. In this work, we present EduVideoBench, the first balanced benchmark in the education domain, grounded in the Knowledge-Skills-Attitude (KSA) framework so that pedagogical adequacy and educational safety are evaluated jointly rather than as ad-hoc quality dimensions. Across five frontier VGMs, our results show substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready. We complement this with a qualitative analysis of expert comments, finding that educational validity is multi-component, where a single misaligned element such as pacing, legibility, or notation can invalidate an otherwise correct video. We hope EduVideoBench will guide the development of VGMs that are pedagogically grounded and safe for the classroom.

URL PDF HTML ☆

赞 0 踩 0

2605.26893 2026-05-27 cs.CL cs.AI 版本更新

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

GeoFaith: 时空双视角下的忠实思维链

Weijiang Lv, Wentong Zhao, Jiayu Wang, Yuhao Wu, Jiaheng Wei, Xiaobo Xia

发表机构 * Xidian University（西安电子科技大学）； Xi’an Jiaotong University（西安交通大学）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））； University of Science and Technology of China（中国科学技术大学）

AI总结针对思维链推理中的事后合理化问题，提出基于潜在几何结构和熵动力学的时空框架GeoFaith，通过可扩展的引导流水线构建忠实性检测器并联合优化结果正确性、过程忠实性和轨迹一致性。

详情

AI中文摘要

思维链推理推动了大型语言模型的发展，但基于结果的监督导致了普遍的事后合理化，产生了看似合理但不忠实的推理链。大多数先前的忠实性评估方法要么不可扩展、昂贵，要么不可靠。我们提出GeoFaith，一个利用潜在几何结构和熵动力学来诊断和强制忠实推理的时空框架。我们开发了一个可扩展的引导流水线，将四个领域的步骤级标注从1k扩展到20k样本，训练了一个在标准基准上优于GPT-5的8B忠实性检测器，并设计了一个忠实性感知的强化学习框架，联合优化结果正确性、过程忠实性和轨迹一致性。实验表明，所提出的方法在忠实性检测和下游推理上均取得了优越性能，生成了更短、更可解释的链，且不牺牲准确性。我们的代码将公开提供。

英文摘要

Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc rationalization, producing plausible yet unfaithful reasoning chains. Most prior faithfulness assessment methods are either unscalable, expensive, or unreliable. We propose GeoFaith, a spatio-temporal framework that leverages latent geometric structure and entropy dynamics to diagnose and enforce faithful reasoning. We develop a scalable bootstrapping pipeline expanding step-level annotations from 1k to 20k samples across four domains, train an 8B faithfulness detector outperforming GPT-5 on standard benchmarks, and design a faithfulness-aware reinforcement learning framework jointly optimizing outcome correctness, process faithfulness, and trajectory consistency. Experiments show the proposed method achieves superior performance on both faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without sacrificing accuracy. Our code will be made available publicly.

URL PDF HTML ☆

赞 0 踩 0

2605.26849 2026-05-27 cs.CL 版本更新

Uncertainty-Aware Budget Allocation for Adaptive Test-Time Reasoning

不确定性感知的自适应测试时推理预算分配

Manh Nguyen, Sunil Gupta, Hung Le

发表机构 * Applied Artificial Intelligence Initiative（应用人工智能计划）

AI总结提出不确定性感知预算分配（UAB）框架，通过基于每问题不确定性的凹整数优化重新分配固定采样预算，无需额外推理成本，在多个推理基准上提升准确率高达3-5%。

详情

AI中文摘要

采样多个响应可以改善语言模型的推理能力，但均匀的计算分配效率低下：简单问题被过度采样，而困难问题探索不足。我们提出不确定性感知预算分配（UAB），这是一个凹整数优化框架，基于每问题的不确定性重新分配固定采样预算，且无需额外推理成本。在第一阶段，每个问题生成一个响应；其平均负对数似然（ANLL）直接从输出对数概率中提取，作为难度信号，同时该生成贡献于最终投票。在第二阶段，剩余预算通过边际贪心算法分配，该算法精确求解凹覆盖最大化替代问题：不确定的问题获得更多采样预算，而确定的问题获得更少的额外样本。在六个开源和黑盒模型（参数规模从1.5B到27B）以及五个涵盖数学、逻辑和偏好任务的推理基准上评估，UAB在平均准确率上比基线高出最多3%，在单个基准上高出最多5%，在低资源设置下增益最大，且无需辅助模型或额外的LLM调用。代码公开于 https://github.com/manhitv/UAB。

英文摘要

Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over-sampled while hard questions remain under-explored. We propose Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log-likelihood (ANLL), extracted directly from output log-probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open-weight and black-box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low-resource settings, requiring no auxiliary model or additional LLM call. Code is publicly available at https://github.com/manhitv/UAB.

URL PDF HTML ☆

赞 0 踩 0

2605.26842 2026-05-27 cs.LG cs.CL 版本更新

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

MONA: 基于Nesterov加速的Muon优化器用于可扩展语言模型训练

Jiacheng Li, Jianchao Tan, Hongtao Xu, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai

发表机构 * Meituan（美团）

AI总结提出MONA优化器，通过将Nesterov加速项集成到Muon的梯度处理流程中，实现曲率感知加速，从而帮助逃离尖锐局部最小值，并在1B到68B参数的混合专家预训练中取得更优收敛和下游任务性能。

详情

AI中文摘要

Muon优化器最近为大型语言模型训练提供了一种有希望的AdamW替代方案，利用矩阵正交化产生几何感知更新。然而，与所有一阶方法一样，Muon可能会陷入尖锐的局部最小值。在这项工作中，我们提出了MONA，一种将Muon的正交化框架与曲率感知加速相结合的优化器。MONA直接将加速项添加到Muon的梯度处理流程中。该加速项根据梯度差异的指数移动平均计算得出。我们提供了MONA的详细收敛性分析，表明加速项能够在保持Muon谱范数正则化的同时逃离尖锐最小值。实验上，在从1B到68B参数的三个规模的混合专家预训练中（最大模型在1万亿tokens上训练），MONA在收敛性和下游任务性能上均优于Muon和AdamW。此外，我们在MOE-68B-A3B模型上进行了监督微调，并在通用能力、数学推理和代码生成基准上评估，MONA达到了最先进的性能。

英文摘要

The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term directly into Muon's gradient processing pipeline. This term is calculated from the exponential moving average of gradient differences. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves SOTA performance.

URL PDF HTML ☆

赞 0 踩 0

2605.26840 2026-05-27 cs.CL 版本更新

Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics

通过来自多个不完美指标的偏好学习优化摘要的事实一致性

Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

发表机构 * Intelligent System Laboratory（智能系统实验室）

AI总结提出一种自动化训练流程，通过聚合多个弱事实性指标的分数并映射为偏好，过滤高分歧样本，利用词汇相似摘要对进行偏好学习，从而提升摘要的事实一致性。

Comments EMNLP 2025 Findings

详情

DOI: 10.18653/v1/2025.findings-emnlp.940

AI中文摘要

不是能力问题：LLM 智能体层级间的驾驭敏感性非单调

Yong-eun Cho

发表机构 * KailosLab（凯罗斯实验室）

AI总结通过 432 次实验，发现 LLM 智能体的驾驭敏感性随模型层级非单调变化，且依赖模型类型（聊天 vs. 推理），推翻了“更高能力模型需要更少结构指导”的假设。

Comments 9 pages, 3 figures

详情

AI中文摘要

LLM 智能体部署中的一个普遍假设是，更结构化的驾驭方式普遍能提高可靠性，并且能力更强的模型需要成比例地减少结构指导——这共同暗示了模型能力层级与最优驾驭复杂度之间存在单调反比关系。我们通过一个受控的 432 次实验来检验这一假设，实验跨越了四个能力层级的六个模型，在 HEAT-24（一个基于 git 工作区验证的 24 任务合成基准）上采用了三种驾驭条件（轻量、平衡、严格）。我们的结果从两个方面反驳了单调反比关系。首先，对于评估的前沿聊天模型（Gemini 2.5 Flash），增加驾驭冗长度使 VTSR 降低 29-38 个百分点——这是一个驾驭复杂度悖论。其次，对于评估的前沿推理模型（Qwen3.5-122B，启用扩展思考），严格驾驭实现了最高的 VTSR（91.7%）和最低的延迟，与预测相反。在受限层级内，一个 2B 模型（Gemma4:e2B）在所有驾驭条件下均以 91.7% 的 VTSR 达到了强开放层级的稳定性。由于本研究中每个层级仅由一个模型代表，这些结果应解释为模型特定的观察；驾驭敏感性在所评估的模型中呈现非单调性，并且关键依赖于模型类型（聊天 vs. 推理）。我们引入了一个六标签失败分类法，显示格式违规主导了能力强的模型失败，而错误文件主导了低能力失败，并推导出了实用的层级感知驾驭选择指南。

英文摘要

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.

URL PDF HTML ☆

赞 0 踩 0

2605.25981 2026-05-27 cs.CL 版本更新

When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

LLM 代理何时对表面噪声与语义噪声做出不同处理？一项基于 68 个单元格的测量研究及留出轨迹级验证

Liyun Zhang, Jiayi Guo

发表机构 * School of Information and Software Engineering, UESTC（信息与软件工程学院，电子科技大学）； Jacobs School of Engineering, UC San Diego（工程学院，加州大学圣地亚哥分校）

AI总结本研究通过 68 个单元格的测量实验，发现大语言模型驱动的思维链和 ReAct 代理对语义扰动（如释义、同义词）比表现扰动（如格式、重排序）更敏感，并基于留出模型和轨迹级机制分析提出了“隐蔽发散”解释。

详情

AI中文摘要

我们记录了一个经验现象：在来自七个架构家族的十种大语言模型驱动的思维链和 ReAct 代理中，意义承载扰动（例如，释义、同义词）比同等严重程度的表现扰动（例如，格式、重排序）更频繁地改变最终答案。跨越 GSM8K、MATH 和 HotpotQA 的 68 个单元格（1,530 个原始样本和约 11,150 个变体），在严重性匹配后，不一致性差距平均为 +19.69 个百分点（配对 t=9.58，p<0.0001），其中 64/68 个单元格为正。该差距通过了四次严重性代理审计，并且在排除 qwen 模型时仍然显著（+11.10 个百分点，p<0.0001）。几项压力测试诚实地失败了：在更严格的假设下，聚类自助法显著性消失；可处理性对比无法复制；跨架构生成器交换破坏了每个单元格的排名；第二个 LLM 判断器仅产生中等一致性（κ=0.50）。然后，我们在一个完全留出的第 11 个模型（qwen2.5-14B-Instruct；1,800 条轨迹）上验证了标题效应，并重新测试了一个预先注册的能力×可处理性分区，观察到一个小但正的留出效应（3/4 个单元格为正；合并 Welch t=3.81，p=9.6×10^{-4}）。利用留出轨迹，我们探测了四个轨迹级机制信号。两个先前的机制主张未能复制并被明确撤回。两个新的探测反而支持一种“隐蔽发散”图景：语义扰动通常保留第一个动作，但从后续步骤开始导致中间推理发散，并伴随略微更深的轨迹。我们将此定位为一项带有留出复现和部分轨迹级解释的测量贡献，说明语义扰动如何通过代理推理传播。代码、扰动语料库、原始轨迹和分析脚本已匿名发布以供评审。

英文摘要

We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($κ=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a \emph{stealth-divergence} picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.

URL PDF HTML ☆

赞 0 踩 0

2605.25731 2026-05-27 cs.CL 版本更新

Trait-Aware Policy Optimization for Autoregressive Multi-Trait Essay Scoring

面向自回归多维度作文评分的特质感知策略优化

Zhengyang Wang, Sanwoo Lee, Jiaxin Wang, Chenxi Miao, Weikang Li, Yunfang Wu

发表机构 * Peking University（北京大学）； Baidu Inc.（百度公司）

AI总结提出特质感知策略优化（TAPO）框架，通过分解样本和特质维度的奖励并结合增强提示，提升自回归多维度作文评分性能。

详情

AI中文摘要

多维度作文评分旨在跨多个维度提供写作质量的细粒度评估。然而，如何有效后训练自回归评分模型仍未充分探索。在本文中，我们提出了特质感知策略优化（TAPO），一种专为自回归多维度评分设计的后训练框架。我们的方法沿样本和特质维度分解奖励，结合全局评分一致性、特质级准确性、格式有效性以及跨特质依赖保持。此外，我们在整个训练过程中使用增强提示，通过融入原始提示文本和特质描述，为特质特定分数生成提供更丰富的语义信息。跨多个骨干模型的实验表明，我们的方法在监督微调和标量奖励优化基线上持续提升了多维度评分性能，证明了特质感知后训练在作文评分中的有效性和可迁移性。

英文摘要

Multi-trait essay scoring aims to provide fine-grained evaluation of writing quality across multiple dimensions. However, how to effectively post-train autoregressive scoring models remains underexplored. In this paper, we propose Trait-Aware Policy Optimization (TAPO), a post-training framework tailored to autoregressive multi-trait scoring. Our method decomposes rewards along both the sample and trait dimensions, combining global scoring consistency, trait-level accuracy, format validity, and inter-trait dependency preservation. In addition, we use enhanced prompts throughout training by incorporating original prompt texts and trait descriptions, providing richer semantic information for trait-specific score generation. Experiments across multiple backbone models show that our method consistently improves multi-trait scoring performance over supervised fine-tuning and scalar-reward optimization baselines, demonstrating the effectiveness and transferability of trait-aware post-training for essay scoring.

URL PDF HTML ☆

赞 0 踩 0

2605.25510 2026-05-27 cs.CL 版本更新

The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models

好奇心时代遇上AI时代：大型语言模型中的儿童安全基准测试

Samee Arif, Angana Borah, Rada Mihalcea

发表机构 * University of Michigan（密歇根大学）

AI总结针对7-11岁儿童使用LLM的安全性问题，提出基于发展心理学的KIDBench基准，通过隐式线索和显式年龄指令提升安全性，并开发了儿童安全评估器KIDGuardLlama和响应模型KIDLlama。

详情

AI中文摘要

儿童越来越多地接触大型语言模型（LLM），这可能会使他们接触到发展不适当或需要年龄敏感性安全、指导和界限的回应。现有的LLM安全评估主要关注有害内容规避，并未明确针对面向儿童的安全性。我们引入了KIDBench，这是一个使用基于发展心理学的LLM作为评判标准的基准，用于评估面向7-11岁儿童的LLM安全性。KIDBench包含十个类别的真实儿童查询，包括单轮提示和多轮儿童角色模拟。我们比较了无儿童上下文的无提示、暗示儿童说话者的隐式提示以及显式年龄指令。隐式提示使模型得分提高了9-47%，而显式年龄进一步增加了10-30%的增益。跨语言和文化评估显示，不同语言和国家背景下的安全行为不均匀。多轮模拟显示，面向儿童的响应质量从第一轮到最差轮次可能下降6-24%。除了评估，我们还引入了儿童安全评估器KIDGuardLlama和面向儿童的响应模型KIDLlama，展示了KIDBench如何支持更安全的面向儿童AI。

英文摘要

Children increasingly have access to Large Language Models (LLMs), which may expose them to responses that are developmentally inappropriate or require age-sensitive safety, guidance, and boundaries. Existing LLM safety evaluations largely focus on harmful-content avoidance and do not explicitly target child-facing safety. We introduce KIDBench, a benchmark for evaluating child-facing LLM safety for ages 7-11 using a developmental-psychology-grounded LLM-as-a-Judge rubric. KIDBench contains realistic child queries across ten categories, with single-turn prompts and multi-turn child-actor simulations. We compare no-cues prompts with no child context, implicit-cues prompts that suggest a child speaker, and explicit age instructions. Implicit-cues improve scores by 9-47% across models, while explicit age adds a further 10-30% gain. Cross-lingual and cultural evaluations show uneven safety behavior across languages and country contexts. Multi-turn simulations show that child-facing response quality can degrade by 6-24% from the first to worst turn. Beyond evaluation, we introduce KIDGuardLlama, a child-safety evaluator, and KIDLlama, a child-oriented response model, showing how KIDBench supports safer child-facing AI.

URL PDF HTML ☆

赞 0 踩 0

2605.25480 2026-05-27 cs.CL 版本更新

Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki

检索即推理：通过LLM-Wiki实现自我进化的智能体原生检索

Haoliang Ming, Feifei Li, Xiaoqing Wu, Wenhui Que

发表机构 * Tencent Inc.（腾讯公司）

AI总结提出LLM-Wiki系统，将外部知识组织为可编译、可组合、自进化的Wiki页面，通过工具调用接口实现搜索、阅读和链接跟踪操作，并引入错误簿进行持续自纠正，在多项多跳问答基准上取得最优结果。

Comments 15 pages, 3 figures, 10 tables, 1 algorithm

详情

AI中文摘要

LLM智能体需要的检索行为应更像推理（搜索、阅读、遍历、判断证据是否充分），而非一次性上下文获取。然而，当前的检索增强生成（RAG）系统将外部知识组织为扁平块，通过嵌入相似性检索，暴露出一种不适合迭代推理智能体的“检索即查找”接口。我们提出LLM-Wiki，一种智能体原生检索系统，它将外部知识视为可编译、可组合、自进化的结构而非静态检索索引，从而实现了“检索即推理”范式。LLM-Wiki将文档编译为带有双向链接的结构化Wiki页面，通过标准工具调用接口暴露搜索、阅读和链接跟踪操作，并引入错误簿进行持久的结构和语义自纠正。LLM-Wiki在HotpotQA、MuSiQue和2WikiMultiHopQA上取得了最先进的结果，比HippoRAG 2、LightRAG和GraphRAG高出2.0-8.1个F1点。在AuthTrace上，LLM-Wiki取得了最佳总体准确率，在多文档结构化查询上尤其有显著提升，证实了基于编译的检索在链式多跳推理之外也具有泛化能力。

英文摘要

LLM agents require retrieval to behave less like one-shot context fetching and more like reasoning: searching, reading, traversing, and deciding when evidence is sufficient. Yet current Retrieval-Augmented Generation (RAG) systems organize external knowledge as flat chunks retrieved by embedding similarity, exposing a retrieval-as-lookup interface ill-suited to iterative reasoning agents. We propose LLM-Wiki, an agent-native retrieval system that operationalizes the Retrieval-as-Reasoning paradigm by treating external knowledge as a compilable, composable, and self-evolving structure rather than a static retrieval index. LLM-Wiki compiles documents into structured Wiki pages with bidirectional links, exposes search, read, and link-following operations through standard tool-calling interfaces, and introduces an Error Book for persistent structural and semantic self-correction. LLM-Wiki achieves state-of-the-art results on HotpotQA, MuSiQue, and 2WikiMultiHopQA, outperforming HippoRAG 2, LightRAG, and GraphRAG by 2.0-8.1 F1 points. On AuthTrace, LLM-Wiki achieves the best overall accuracy, with especially strong gains on multi-document structured queries, confirming that compilation-based retrieval generalizes beyond chain-style multi-hop reasoning.

URL PDF HTML ☆

赞 0 踩 0

2605.25382 2026-05-27 cs.CL 版本更新

AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora

AuthTrace: 主题密集的单作者语料库中的证据构建诊断

Xiaoqing Wu, Feifei Li, Haoliang Ming, Wenhui Que

发表机构 * Tencent Inc.（腾讯公司）； Beijing, China（中国北京）

AI总结提出AuthTrace基准，通过扇入梯度诊断主题密集单作者语料库中证据构建系统的召回率、精度和答案正确性，发现证据召回是答案正确性的最强预测因子。

详情

AI中文摘要

证据构建——决定在生成开始前哪些段落到达语言模型的阶段——按范式进行评估，使得从业者无法有原则地诊断哪种组织策略失败、在哪里失败或为什么失败。我们引入了AuthTrace，这是一个基于主题密集的单作者语料库构建的诊断基准，其中近失干扰项与所需证据共享风格、主题和词汇。AuthTrace提供明确的引用证据、精确的扇入注释以及统一的包级协议，用于衡量证据召回率、证据精度和答案正确性。扇入梯度——支持答案所需的源文档数量——作为主要诊断轴，使得能够在检索、记忆、图和结构化证据范式之间进行受控比较。评估两个QA模型上的八个系统，我们发现，在主要读者-判断器对下，证据召回率是答案正确性最强的观察预测因子（r = 0.96）；大多数失败源于缺失证据而非答案合成。扇入进一步揭示了特定范式的崩溃模式：平面检索的退化速度比主题组织的证据构建快2-3倍。这些结果表明，扇入分解是一种可重用的诊断镜头，用于识别证据构建系统失败的位置以及哪种范式最适合给定的工作负载。

英文摘要

Evidence construction--the stage that determines which passages reach the language model before generation begins--is evaluated paradigm by paradigm, leaving practitioners with no principled way to diagnose which organization strategy fails, where, or why. We introduce AuthTrace, a diagnostic benchmark built on thematically dense single-author corpora where near-miss distractors share style, topic, and vocabulary with the required evidence. AuthTrace provides explicit quoted evidence, exact fan-in annotation, and a unified pack-level protocol measuring evidence recall, evidence precision, and answer correctness. A fan-in gradient--the number of source documents required to support the answer--serves as the primary diagnostic axis, enabling controlled comparison across retrieval, memory, graph, and structured-evidence paradigms. Evaluating eight systems across two QA models, we find that evidence recall is the strongest observed predictor of answer correctness under the primary reader-judge pair (r = 0.96); most failures stem from missing evidence rather than answer synthesis. Fan-in further exposes paradigm-specific collapse patterns: flat retrieval degrades 2-3x faster than thematically organized evidence construction. These results show fan-in decomposition to be a reusable diagnostic lens for identifying where evidence-construction systems fail and which paradigm best serves a given workload.

URL PDF HTML ☆

赞 0 踩 0

2605.25281 2026-05-27 cs.CL cs.AI 版本更新

READER: Reasoning-Enhanced AI-Generated Text Detection

READER: 增强推理的AI生成文本检测

Pingfan Su, Kai Ye, Shijin Gong, Erhan Xu, Jin Zhu, Giulia Livieri, Chengchun Shi

发表机构 * School of Mathematics, University of Birmingham（布里斯托尔大学数学学院）

AI总结提出READER方法，通过微调1.5B参数的LLM在结构化推理数据集READ上，结合推理与检测，在分布偏移下优于GPT-5.2等大100-1000倍的模型。

2605.24636 2026-05-27 cs.AI cs.CL 版本更新

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

GlobalDentBench：一个用于评估牙科领域大语言模型临床推理能力并包含专家校准的多国基准

Junjie Zhao, Jingyi Liang, Zhenyang Cai, Jiaming Zhang, Zhenwei Wen, Shuzhi Deng, Wenjing Yi, Chunfeng Luo, Hexian Zhang, Junying Chen, Tianrui Liu, Zhuhui Bai, Zixu Zhang, Pradeep Singh, Xiang Liu, Jianquan Li, Nhan L Tran, Falk Schwendicke, Zuolin Jin, Lijian Jin, Liangyi Chen, Wei-fa Yang, Benyou Wang, Junwen Wang, Shan Jiang

发表机构 * Division of Applied Oral Sciences and Community Dental Care, Faculty of Dentistry, The University of Hong Kong（香港大学牙科学院应用口腔科学与社区牙科护理系）； School of Data Science, The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳）数据科学学院）； School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳）人工智能学院）； Department of Periodontology, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China（南方医科大学深圳口腔医院（平山）牙周科）； Beijing Institute of Collaborative Innovation（北京协同创新研究院）； Department of Orthodontics, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China（南方医科大学深圳口腔医院（平山）正畸科）； Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China（南方医科大学深圳口腔医院（平山））； College of Future Technology, Peking University（北京大学未来技术学院）； Freedom AI ； New Cornerstone Science Laboratory, National Biomedical Imaging Center, State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking-Tsinghua Center for Life Sciences, College of Future Technology, Peking University, Beijing 100871, China（新基石科学实验室、国家生物医学成像中心、膜生物学国家重点实验室、分子医学研究院、北京大学未来技术学院、生命科学中心，北京大学，北京100871，中国）； IDG/McGovern Institute for Brain Research, Peking University, Beijing 100871, China（IDG/ McGovern脑科学研究院，北京大学，北京100871，中国）； Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, The University of Hong Kong（香港大学牙科学院口腔颌面外科系）； Shenzhen Loop Area Institute（深圳环城区域研究所）； Department of Cancer Biology, Mayo Clinic Arizona, 5777 E. Mayo Blvd., IERB-3-504A, Phoenix, Arizona, 85054, USA（梅奥诊所亚利桑那分部癌症生物学部门，5777 E. Mayo Blvd., IERB-3-504A, Phoenix, Arizona, 85054, USA）； Department of Conservative Dentistry, Periodontology and Digital Dentistry, LMU University Hospital, LMU Munich, Munich, Germany（慕尼黑大学医院保守牙科、牙周病学和数字牙科部门，慕尼黑，德国，慕尼黑大学）； Division of Periodontology & Implant Dentistry, Faculty of Dentistry, The University of Hong Kong, Hong Kong, SAR, China（香港大学牙科学院牙周病学与种植牙科系，香港，中国）

AI总结提出首个跨国牙科基准GlobalDentBench，包含14个专科、88个国家的8978道专家验证题目，评估三种推理层次，揭示当前大语言模型在牙科临床推理中性能随复杂度下降且存在高风险。

详情

AI中文摘要

尽管大语言模型（LLMs）在医学领域具有变革潜力，但其在真实临床场景中的推理鲁棒性和安全性仍未得到充分探索，尤其是在牙科领域。本文提出GlobalDentBench，首个跨国牙科基准，其分类体系涵盖六大洲88个国家和地区的14个牙科专科。该基准包含8978道专家验证题目，分为三种格式（选择题、简答题和基于案例的题目），并评估三个递进推理层次：知识回忆（L1）、常规推理（L2）和个体化推理（L3）。为确保数据质量，自动构建框架由六名资深牙医校准，选择题和简答题的专家一致率达到99.98%，更复杂的基于案例的题目达到96.78%。在GlobalDentBench上对12个前沿LLMs的评估显示，随着推理复杂度增加，性能呈急剧阶梯式下降。具体而言，准确率从选择题的81.34%骤降至简答题的64.53%和基于案例的题目的22.34%，同时从L1的74.01%显著下降至L2的55.64%和L3的35.71%。更关键的是，对真实牙科案例的风险分析表明，LLM生成的临床建议中总体不安全率高达31.01%，其中4.51%存在导致不可逆患者伤害的风险，且风险在正畸等专科中尤为突出。这些发现暴露了当前LLMs在医学推理和安全性方面的根本局限性。因此，GlobalDentBench为可信赖的临床AI评估提供了可扩展的基础，强调了在医疗领域安全部署这些模型之前迫切需要严格验证。

英文摘要

While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.

URL PDF HTML ☆

赞 0 踩 0

2605.23910 2026-05-27 cs.CL cs.AI 版本更新

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

基于信息融合的文档分类模式识别：多模态与多视角表示方法的系统综述

Marcin Michał Mirończuk

发表机构 * National Information Processing Institute（国家信息处理研究所）

AI总结本文通过系统综述和元分析，提出了统一框架，量化了多模态和多视角融合在文档分类中的性能提升，并揭示了方法学严谨性不足的问题。

详情

DOI: 10.1016/j.inffus.2026.104247
Journal ref: Information Fusion, 132, 2026, 104247

AI中文摘要

信息融合被广泛用于通过整合多数据源（多模态）或多表示（多视角）来改进文档分类。然而，该领域缺乏统一框架、对其有效性的定量综合以及给实践者的明确指导。本系统综述通过分析139项主要研究来填补这些空白。它引入了一个正式框架来结构化该领域，呈现了定性分析结果以识别关键趋势，并进行了随机效应元分析（据我们所知，这是首次专注于文档分类的元分析）以量化性能提升。我们的元分析显示，多模态融合显著提高了准确率（平均提升+5.28个百分点，$p=0.0016$）——F1分数效应方向为正，但在我们的主要模型中统计上不显著。多视角融合在准确率（+4.67%）、F1分数（+3.08%）和召回率（均$p<0.05$）上提供了一致但适度的提升。关键的是，我们的定性综合揭示了方法学严谨性方面的可重复性挑战：只有11.8%（多模态）和23.3%（多视角）的研究使用统计检验来验证其发现，这削弱了许多结果的可靠性。本综述的主要贡献是一个统一框架、首个定量证据基础以及数据驱动的指南。本综述得出结论，成功的信息融合不依赖于算法复杂性，而在于融合方法与任务上下文的战略对齐以及对更严格验证的承诺。

英文摘要

Information fusion is used widely to improve document classification by the integration of multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$) significantly -- the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67\%), F1-score (+3.08\%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers challenges in reproducibility in methodological rigour: only 11.8\% (multimodal) and 23.3\% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review's primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. This review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.

URL PDF HTML ☆

赞 0 踩 0

2605.22511 2026-05-27 cs.AI cs.CL cs.IR 版本更新

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Search-E1: 自蒸馏驱动搜索增强推理中的自我进化

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai, Lingtao Mao

AI总结提出Search-E1方法，通过交替使用普通GRPO和在线策略自蒸馏（OPSD），让搜索增强智能体无需外部监督或复杂模块即可自我进化，在七个QA基准上以3B模型超越所有开源基线。

详情

AI中文摘要

后训练已成为将语言模型转变为胜任的搜索增强推理智能体的主要方法。近期一系列工作通过在此标准流程之上添加复杂机制来进一步提升性能。这些增强引入了来自更强外部系统的外部监督，附加了诸如过程奖励模型或回顾性评论者等辅助模块，通过树搜索或多阶段课程重构了轨迹生成本身，并利用手工设计的奖励和惩罚来塑造奖励。每项增加都带来了可衡量的提升，但同时也使训练流程更加复杂，并将方法绑定到可能并非总是可用的资源或设计上。我们退一步思考这些机制是否真的必要，并提出了Search-E1，一种自我进化方法，让搜索增强智能体仅通过普通的GRPO与在线策略自蒸馏（OPSD）交替进行来改进。在每轮GRPO之后，策略在其自身的训练问题上进行轨迹生成。然后，一个token级的前向KL目标将策略的推理时分布与其在特权上下文下的自身分布对齐，该特权上下文暴露了更高效的兄弟轨迹。尽管简单，该过程自然地提供了密集的每步监督。在七个QA基准上，Search-E1使用Qwen2.5-3B达到了0.440的平均EM，在两个规模上均超越了所有开源基线。代码和完整版本将很快公开。

英文摘要

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with on-policy self-distillation (OPSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy's inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches 0.440 average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. Code and complete version will be made public soon.

URL PDF HTML ☆

赞 0 踩 0

2605.20530 2026-05-27 cs.AI cs.CL cs.LG cs.SE 版本更新

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

AgentAtlas：超越LLM智能体的结果排行榜

Parsa Mazaheri, Kasra Mazaheri

发表机构 * University of California, Santa Cruz（加州大学圣克鲁兹分校）； Massachusetts Institute of Technology（麻省理工学院）

AI总结提出AgentAtlas框架，通过控制决策分类法和轨迹故障词汇表，将智能体评估从结果成功分离为控制决策质量和轨迹质量，并揭示仅依赖结果排行榜的测量风险。

详情

AI中文摘要

大型语言模型智能体现在可以操作代码库、浏览器、操作系统、日历、文件和工具生态系统，但它们的评估通常将行为简化为最终任务成功。AgentAtlas将智能体评估重新定义为一种诊断词汇和审计协议，用于将结果成功与控制决策质量和轨迹质量分离。本文贡献了：(i) 一个六状态控制决策分类法（行动/询问/拒绝/停止/确认/恢复）；(ii) 一个包含主要错误源和下游影响的轨迹失败词汇表；(iii) 对十五个智能体基准的0/1/2基准覆盖审计；(iv) 一个在合成1,342项数据集上进行的说明性协议研究，使用八种模型在分类法感知和分类法盲提示格式下进行评估。该合成演示不是公开基准发布，不应被视为确定的模型比较。相反，它说明了两个测量风险：当显式标签菜单被移除时，映射标签一致性可能发生显著变化，并且轴选择可能改变表观排名。AgentAtlas旨在帮助基准设计者说明他们覆盖的行为，并帮助评估者诊断仅结果排行榜隐藏的失败。

英文摘要

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control-decision quality and trajectory quality. The paper contributes: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory-failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark-coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342-item set evaluated with eight models under taxonomy-aware and taxonomy-blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome-only leaderboards hide.

URL PDF HTML ☆

赞 0 踩 0

2605.02958 2026-05-27 cs.CR cs.AI cs.CL cs.LG 版本更新

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

追踪拒绝的动态：利用潜在拒绝轨迹进行鲁棒越狱检测

Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

发表机构 * Peking University（北京大学）； Nanyang Technological University（南洋理工大学）； Beijing Jiaotong University（北京交通大学）

AI总结通过因果追踪识别出稀疏的“拒绝轨迹”激活模式，并提出轻量级白盒检测器SALO，基于隐藏状态窗口实现鲁棒越狱检测。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

详情

AI中文摘要

表征工程分析通常使用从终端或池化表示中提取的静态方向来描述拒绝。我们质疑这种观点是否忽略了拒绝是如何在层-标记位置上构建的。通过因果追踪，我们识别出一个 extit{拒绝轨迹}：一种稀疏的上游激活模式，即使当诸如GCG的攻击抑制终端拒绝信号时，该模式也常常持续存在。基于这一观察，我们提出了SALO（稀疏激活定位算子），一种轻量级白盒检测器，它在选定层窗口的原始隐藏状态体积上操作。在Qwen、Llama和Mistral模型上，SALO在固定的XSTest校准工作点下，改进了多个攻击家族的越狱检测。我们进一步分析了静态RepE风格基线、ROI敏感性、自适应GCG攻击和编码输入边界情况，阐明了拒绝轨迹监测的前景和局限性。

英文摘要

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a \textit{Refusal Trajectory}: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point. We further analyze static RepE-style baselines, ROI sensitivity, adaptive GCG attacks, and encoded-input boundary cases, clarifying both the promise and limitations of refusal-trajectory monitoring.

URL PDF HTML ☆

赞 0 踩 0

2605.02035 2026-05-27 cs.CL cs.AI 版本更新

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

VIDA: 多模态机器翻译中视觉依赖歧义的数据集

Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo, Chris Biemann

发表机构 * Department of Informatics, Universität Hamburg（汉堡大学信息学院）； Alibaba Group（阿里巴巴集团）； Alibaba Cloud（阿里云）

AI总结提出VIDA数据集，包含2500个精心策划的实例，用于评估多模态机器翻译中需要视觉证据才能解决的歧义，并引入以歧义消解为中心的指标，实验表明链式思维微调能提升跨分布歧义消解能力。

详情

AI中文摘要

歧义消解是多模态机器翻译（MMT）中的一个关键挑战，模型必须真正利用视觉输入将歧义表达映射到其预期含义。尽管先前的工作提出了面向消歧的基准来评估视觉的作用，但我们观察到现有基准仍受限于任务格式不匹配、歧义覆盖范围狭窄或视觉依赖性验证不足。此外，现有的歧义评估并不适用于开放式翻译中的多种歧义类型。为解决这些局限性，我们提出了VIDA（视觉依赖歧义），一个包含2500个精心策划实例的数据集，其中解析带注释的源语言片段需要视觉证据。我们进一步提出了以消歧为中心的指标，使用LLM作为评判分类器来验证带注释的歧义表达是否在片段级别被正确消解。使用两个最先进的LVLM进行的实验表明，监督微调（SFT）提高了整体翻译质量，而链式思维SFT（CoT-SFT）产生了更强的跨分布歧义消解能力，这表明显式的消歧指导提高了对多种歧义类型的泛化能力。

英文摘要

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.

URL PDF HTML ☆

赞 0 踩 0

2605.26689 2026-05-27 cs.CV cs.CL 版本更新

PinPoint: Prompting with Informative Interior Points

PinPoint: 通过信息性内部点进行提示

Pouya Sadeghi, Shawn He, Pedro Pablo Guerrero Vela, C. Thomas, Alex Wong, Sirisha Rambhatla

发表机构 * University of Waterloo（滑铁卢大学）； Critical ML ； Apple（苹果公司）

AI总结针对指代图像分割中VLM与SAM结合时因提示模糊导致的性能差距，提出无需训练的确定性点选择器PinPoint，通过融合视觉线索选择稳定、信息丰富的内部点，在无训练下达到监督和强化学习方法的性能。

详情

AI中文摘要

现代指代图像分割流程将用于定位的视觉语言模型（VLM）与用于掩码生成的可提示分割器（如Segment Anything Model，SAM）相结合。先前该方案的无训练实例始终落后于微调和强化学习（RL）调优的专家，且不清楚差距来自VLM的定位、SAM的能力还是提示。我们表明差距主要由提示模糊性主导：VLM提出的边界框（bbox）让SAM猜测框内哪些像素属于表达式所指的对象。内部点是自然的消歧器，但它们的落点很重要；先前的工作依赖于朴素采样的点，这些点落在边界、干扰物和背景杂波上，甚至可能比单独使用bbox更差。有监督和RL调优的方法通过训练VLM预测更好的点来缩小这一差距；我们表明这种训练是不必要的。在五个内部点的匹配预算下，用稳定、信息丰富的点选择替换朴素采样，在RefCOCO/+/g上累积交并比（cIoU）提高了12-18个点，且每个模型固定。我们将这一观察转化为PinPoint，一个确定性的、无需训练的点选择器，它融合四个视觉线索为共识图，选择紧凑、空间多样且远离边界的点，并使用冻结的VLM标记每个点。无需任何任务特定训练，PinPoint在相同堆栈上匹配了有监督和RL调优的专家，同时每次查询仅调用两次VLM。

英文摘要

Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM's grounding, SAM's capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.

URL PDF HTML ☆

赞 0 踩 0

2605.26683 2026-05-27 cs.CL cs.AI 版本更新

An In-Vitro Study on Cross-Lingual Generalization in Language Models

语言模型中跨语言泛化的体外研究

Adrian Cosma

发表机构 * Dalle Molle Institute for Artificial Intelligence (IDSIA)（达勒莫利人工智能研究所（IDSIA））

AI总结通过构建两种程序生成的语言，独立控制词汇距离、少数语言比例等变量，研究语言模型跨语言迁移的机制，发现迁移主要取决于分词是否保留可复用的跨语言子结构，且词汇量越小越有利于掩码迁移。

Comments 16 Figures, 1 Table

详情

AI中文摘要

在自然语料中，语言模型的跨语言迁移难以研究，因为词汇重叠、形态、数据不平衡和分词相互纠缠。我们引入了一个体外框架，使用两种程序生成的语言，它们共享相同的本体、类型化语法和组合结构，但表面实现不同。这使我们能够独立改变词汇距离、少数语言比例、分词器训练制度和词汇量大小，同时评估在掩码少数语言条件下的迁移，该条件的词汇形式在训练中从未被观察到。在700次受控运行中，我们发现迁移受分词器平衡或原始词汇相似性的影响较小，而更多地取决于分词是否保留可复用的跨语言子结构。较小的词汇量通常通过保持单词可分解为共享片段来改善掩码迁移，而较大的词汇量可能将形式转化为特定语言的原子。我们进一步表明，迁移是一个阶段性过程：语法和类型级能力先于掩码词汇泛化。最后，我们尝试通过分词器桥梁解释这一机制，并表明桥梁强度与掩码可达性密切相关。

英文摘要

Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.

URL PDF HTML ☆

赞 0 踩 0

2605.26678 2026-05-27 cs.CL 版本更新

AI评估可能扭曲认知：语境在解读学术写作中的重要性

Shang Wu, Randol Yao

发表机构 * UC Irvine（加州大学欧文分校）； MIT（麻省理工学院）

AI总结本文通过构建AI相似度基准，发现忽略国家和领域差异的评估方法会系统性高估或低估某些群体中的AI使用，提出基于具体语境的基准以更准确评估科学写作中的AI使用。

详情

AI中文摘要

本文研究了当评估方法忽略国家和领域的语境差异时，科学写作中AI使用估计可能产生的偏差。利用Dimensions中期刊论文的大规模数据，我们基于人类撰写和LLM重写的摘要之间的差异构建了AI相似度基准。我们表明，合并基准可能混淆已有的风格差异与AI生成的文本，即使在LLM之前的出版物中也会在跨国家-领域组中产生显著扭曲。相比之下，特定国家-领域的基准减轻了这种扭曲，并提供了更可信的比较基线。将这些方法应用于2025年的出版物，结果显示合并基准系统性高估了某些国家和领域的AI使用，同时低估了其他国家和领域的AI使用。这些发现强调了语境感知测量对于准确和公平评估科学中AI使用的重要性。

英文摘要

This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we construct AI-likeness benchmarks based on differences between human-written and LLM-rephrased abstracts. We show that a pooled benchmark may confound pre-existing stylistic variation with AI-generated text, producing substantial distortions across country-field groups even in pre-LLM publications. In contrast, country-field-specific benchmarks attenuate such distortions and provide a more credible baseline for comparison. Applying these methods to publications in 2025 reveals that the pooled benchmark systematically overestimates AI use in certain countries and fields while underestimating it in others. These findings highlight the importance of context-aware measurement for accurate and equitable evaluation of AI use in science.

URL PDF HTML ☆

赞 0 踩 0

2605.26655 2026-05-27 cs.CL cs.LG cs.NE 版本更新

Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

为什么提示优化有效，以及为什么有时无效：一种因果启发的编辑级分析

Shuzhi Gong, Hechuan Wen

发表机构 * The University of Melbourne（墨尔本大学）； The University of Queensland（昆士兰大学）

AI总结本文通过因果推断方法分析自动提示优化在不同任务和模型上的泛化失败原因，发现编辑类型与任务特性之间的系统性交互作用。

Comments 17 pages, 4 figures, 8 tables

详情

AI中文摘要

自动提示优化方法（例如 DSpy、TextGrad）可以显著提升大语言模型（LLM）的性能，然而，它们在不同任务上的泛化能力仍然不足。在实践中，优化后的提示在一个基准上的优势往往无法迁移到另一个基准，即使切换不同的 LLM 骨干网络，这种局限性依然存在。为了探究提示性能中未被充分探索的异质性来源，我们对跨多种优化框架、LLM 骨干网络和 NLP 基准的优化提示进行了因果推断启发的观察性分析。为此，我们基于倾向调整的关联分析以及提示编辑的多种互补表示，识别出一致的任务条件编辑模式。我们发现，增加复杂性和元指令的编辑与数学和多跳推理性能呈负相关，而逐步和元认知的编辑则改善了逻辑和顺序推理任务。这些效应在认知负荷标注、表面文本特征和编辑主题分析中均具有鲁棒性，并且可以跨优化框架泛化。总体而言，这些结果表明，提示优化失败源于编辑族与任务特性之间的系统性交互，而非随机的优化伪影，从而提供了优化器行为的特征级表征，并激励了未来任务条件优化器的设计。

英文摘要

Automated prompt optimization methods (e.g., DSpy, TextGrad) can substantially improve the performance of large language model (LLM), however, their generalization ability across different tasks remains underperformed. In practice, the superiority of the optimized prompt on one benchmark often fails to transfer to another, and this limitation persists even when switching across different LLM backbones. To investigate the underexplored sources of heterogeneity in prompt performance, we conduct a causal inference-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks. To achieve the goal, we build upon the propensity-adjusted associational analysis together with multiple complementary representations of prompt edits, where the consistent task-conditioned edits patterns are identified. We find that complexity-increasing and meta-instructional edits are negatively associated with mathematical and multi-hop reasoning performance, whereas step-by-step and meta-cognitive edits improve logical and sequential reasoning tasks. These effects are robust across cognitive-load annotations, surface-level text features, and edit-motif analyses, and can generalize across optimization frameworks. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random optimization artifacts, providing feature-level characterization of optimizer behavior and motivating future task-conditioned optimizer design.

URL PDF HTML ☆

赞 0 踩 0

2605.26646 2026-05-27 cs.AI cs.CL cs.MA 版本更新

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

UnityMAS-O: 基于LLM的多智能体系统的通用强化学习优化框架

Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China（中国人民大学）； Xiaohongshu Inc.（小红书公司）

AI总结提出UnityMAS-O框架，将多智能体工作流作为优化单元，通过逻辑角色、图轨迹、用户定义奖励和智能体-模型映射四个核心对象解耦逻辑与物理参数，支持灵活的参数共享和奖励分配，在检索增强问答、迭代搜索和反思代码生成任务上验证了多智能体RL对手动工作流的提升效果。

详情

AI中文摘要

基于LLM的多智能体系统将复杂任务分解为交互角色，但大多数仍通过提示、工具和控制规则手动编排，智能体很少通过统一的强化学习接口进行优化。现有的RL后训练框架主要针对单策略优化，缺乏对用户定义的多智能体工作流、结构化交互、角色特定信用分配和可配置参数共享的抽象。我们提出了UnityMAS-O，一个用于基于LLM的多智能体系统的通用RL优化框架。UnityMAS-O将完整工作流视为优化单元，而非单个响应或策略轨迹。它通过四个核心对象表示工作流：逻辑智能体角色、图轨迹、用户定义奖励和智能体-模型映射。这将逻辑智能体与物理模型参数解耦，支持完全共享、完全分离和部分共享，奖励在角色、轮次和轨迹级别分配。UnityMAS-O通过基于Ray的星形拓扑运行时扩展了verl。中央控制器执行工作流、调用工具、记录结构化轨迹并组装奖励；模型本地工作器组负责轨迹生成、缓冲、优势计算和分布式PPO风格更新。用户可以定义智能体、工作流、模型映射和奖励，而无需重写优化基础设施。我们在检索增强问答、迭代智能体搜索和反思代码生成上实例化了UnityMAS-O。在Natural Questions、HotpotQA和保留代码任务上，多智能体RL在优化后改进了手动指定的工作流，对于较小模型和严格代码全通过指标尤其有较大提升。这些结果表明，UnityMAS-O可以作为可复用基础，将多样化的基于LLM的多智能体工作流转化为可训练的多智能体RL系统。

英文摘要

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.

URL PDF HTML ☆

赞 0 踩 0

2605.26645 2026-05-27 cs.CL 版本更新

Bounded Path Context: A Controlled Study of Visible Path History in LLM-Based Knowledge Graph Question Answering

有界路径上下文：基于LLM的知识图谱问答中可见路径历史的受控研究

Xihang Shan, Ye Luo

发表机构 * School of Mathematical Sciences（数学科学学院）； Xiamen University（厦门大学）； School of Informatics（信息学院）

AI总结本文提出有界路径上下文（BPC）方法，通过将完整路径存储在符号记忆中，仅暴露最近K跳路径给关系选择提示，在WebQSP和CWQ数据集上使用Qwen3.5-9B-AWQ模型达到或超过全历史提示的性能，同时减少输入token。

Comments 13 pages, 1 figure, submitted to EMNLP 2026

详情

AI中文摘要

基于LLM的知识图谱问答（KGQA）将图遍历委托给语言模型，将每个问题转化为一系列局部关系选择决策，这些决策在多个波束和跳数上重复。一个常见但未经测试的默认做法是将完整的部分路径序列化到每个路由提示中，尽管控制器已经以精确符号状态维护了该路径。有界路径上下文（BPC）解耦了这两个角色：控制器在符号记忆中保留完整路径用于答案提取和审计，而关系选择提示仅暴露问题、当前实体、候选关系以及最多最后K跳。对K进行受控扫描——固定图邻域、波束预算、深度、解码和答案提取格式——表明，在完整的WebQSP和CWQ测试集上，使用Qwen3.5-9B-AWQ模型，有界历史匹配或超过全历史提示：K=1在WebQSP上达到0.487的答案集F1，而全历史为0.472；K=0在CWQ上达到0.287，而全历史为0.274，同时输入token分别减少9.7%和12.1%。在4B规模下，K=1在两个基准上仍然是最强设置。逐示例分析显示，71-84%的示例不受历史长度影响，而受影响的案例揭示了先前跳数何时起到消歧作用或分散注意力。这些结果表明，路径序列化长度更适合作为可调接口变量，而不是基于LLM的图控制器中的默认假设。

英文摘要

LLM-based knowledge-graph question answering (KGQA) delegates graph traversal to language models, turning each question into a sequence of local relation-selection decisions repeated across beams and hops. A common but untested default is to serialize the complete partial path into every routing prompt, even though the controller already maintains this path as exact symbolic state. Bounded Path Context (BPC) decouples these two roles: the controller retains full paths in symbolic memory for answer extraction and audit, while the relation-selection prompt exposes only the question, the current entity, outgoing relation candidates, and at most the last K hops. A controlled sweep over K -- fixing graph neighborhoods, beam budget, depth, decoding, and answer-extraction format -- shows that bounded histories match or exceed full-history prompting on complete WebQSP and CWQ test sets with Qwen3.5-9B-AWQ: K=1 achieves 0.487 answer-set F1 on WebQSP versus 0.472 for full history, and K=0 reaches 0.287 on CWQ versus 0.274, with 9.7% and 12.1% fewer input tokens respectively. At the 4B scale, K=1 remains the strongest setting on both benchmarks. Per-example analysis reveals that 71-84% of examples are unaffected by history length, while the affected cases expose when prior hops disambiguate versus distract. These results suggest that path serialization length is better treated as a tunable interface variable than as a default assumption in LLM-based graph controllers.

URL PDF HTML ☆

赞 0 踩 0

2605.26620 2026-05-27 cs.CL cs.HC 版本更新

概念隐写术

Zhejian Zhou, Jonathan May

发表机构 * University of Southern California Information Sciences Institute（南加州大学信息科学研究所）

AI总结提出概念隐写术，通过思维链中的高级推理行为模式而非词汇选择嵌入信息，实验表明其对释义防御比标准关键词方法更鲁棒，且不影响推理效用。

详情

AI中文摘要

语言模型（LM）发出的思维链（CoT）驱动了其大部分能力。然而，承载有用推理的同一序列也可以隐蔽地传递信息：一个未对齐的模型可能在其CoT中嵌入隐蔽信息，从而逃避人类监督，这种隐写术形式被称为编码推理。先前的LM隐写术方案在词元或词汇空间操作，而内容保留释义器是近期工作中规范且有效的防御手段。我们引入了概念隐写术，其中CoT的每一步通过高级推理行为模式而非词汇选择来携带信息。在四个模型家族和两个推理领域中，这种后门通信渠道被证明比标准关键词方法对强释义防御具有更一致的鲁棒性，并且将信息编码到CoT中不会影响其在推理过程中的效用。在提高对这一新风险的认识后，我们进一步证明，一个策略感知的释义器可以关闭大部分渠道，强调了确保野外忠实LLM推理的新挑战和推荐防御措施。

英文摘要

Language Models (LMs) emit Chains-of-Thought (CoTs) that drive much of their capability. However, the same sequence that carries useful reasoning can also covertly convey messages: a misaligned model may embed covert information in its CoT that slips through human supervision, a form of steganography known as encoded reasoning. Prior LM steganography schemes operate in the token or lexical space, and a content-preserving paraphraser is the canonical and effective defense in recent work. We introduce conceptual steganography, in which each step of a CoT carries information through patterns of high-level reasoning behavior, rather than through lexical choice. Across four model families and two reasoning domains, this backdoor communication channel is shown to be consistently more robust to a strong paraphrase defense than standard keyword approaches, and the encoding of information into CoTs does not affect their utility in the reasoning process. Having raised awareness of this new risk, we then demonstrate that a strategy-aware paraphraser can close much of the channel, highlighting new challenges and recommended defenses for ensuring faithful LLM reasoning in the wild.

URL PDF HTML ☆

赞 0 踩 0

2605.26533 2026-05-27 cs.CV cs.AI cs.CL cs.LG 版本更新

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

一种用于工业检测中自动缺陷推理与报告生成的混合视觉-语言架构

Malikussaid, Imad Gohar

发表机构 * School of Computing, Telkom University（Telkom大学计算机学院）； Faculty of Engineering and Technology, School of Computing and Artificial Intelligence（工程与技术学院，计算与人工智能学院）

AI总结本文提出一种解耦的边缘可部署管道，结合YOLO26-x-obb检测器、确定性编码模块和QLoRA微调的Qwen-2.5-1.5B模型，实现风电叶片缺陷定位与结构化报告生成，在BLEU-4、幻觉率和专家评分上显著优于零样本VLM基线。

Comments 23 pages, 6 figures, 9 equations, and 6 tables

详情

AI中文摘要

自动化工业检测需要精确的缺陷定位和结构化的维护报告生成；在当前的实践中，这些任务被分开处理，语言解释留给人类专家。本文描述了一种解耦的、边缘可部署的风电叶片检测管道，由三个组件组成，每个组件处理一个不同的子任务。“眼睛”是一个YOLO26-x-obb定向边界框检测器，在数据集原生分辨率下定位缺陷。“桥梁”是一个确定性的、无参数的编码模块，将每个检测到的边界框映射到嵌入结构化提示中的网格参考空间令牌。“大脑”是一个4比特量化的Qwen-2.5-1.5B模型，通过量化低秩适应（QLoRA）在947个合成生成的维护报告上进行适配，从该提示生成结构化的JSON报告。检索增强微调（RAFT）进一步将每个建议基于索引的维护程序。五项消融实验，通过BLEU-4、ROUGE-L、幻觉率（HR）和LLM-as-a-Judge评分标准，将该管道与单一视觉-语言模型（VLM）基线以及移除一个组件的部分配置进行比较。完整系统实现了BLEU-4 0.41、HR=4%和专家评分8.6/10，而零样本VLM基线分别为0.07、65%和3.3/10。在相同的检测证据下，QLoRA适配的1.5B模型在单个T4级GPU上以每秒47个令牌的速度生成比671B参数通用API模型更高质量的报告。结果表明，具有小型领域特定训练语料库的专用解耦架构在此结构化生成任务上优于通用端到端模型。

英文摘要

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes a YOLO26-x-obb oriented bounding-box detector localizes defects at dataset-native resolution. The Bridge a deterministic, parameter-free encoding module maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 0.41, HR=4%, and Expert Score = 8.6/10 compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.

URL PDF HTML ☆

赞 0 踩 0

2605.26498 2026-05-27 cs.CL 版本更新

Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation

Verilog-Evolve：反馈驱动与技能演进的Verilog生成

Zehua Pei, Hui-Ling Zhen, Yu Zhang, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Huawei Technologies Co., Ltd（华为技术有限公司）

AI总结提出Verilog-Evolve框架，通过反馈驱动（功能仿真、Yosys综合、ABC时序代理等）迭代优化Verilog代码，并利用跨会话技能演进提升生成质量，实验表明在VerilogEval和混合精度GEMM任务上提高了功能成功率和下游友好性。

详情

AI中文摘要

大型语言模型（LLMs）改进了从自然语言规范生成Verilog的过程，但大多数流水线仍将生成视为孤立的采样后功能检查。这对于实际的RTL设计是不够的，因为有用的Verilog必须正确、可综合、考虑时序，并对下游硬件目标友好。我们提出Verilog-Evolve，一个用于版本化Verilog细化和跨会话技能演进的反馈驱动框架。对于每个任务，Verilog-Evolve生成多样化的次要候选，通过功能仿真、Yosys综合、ABC时序代理以及可选的GEMM指标的可执行反馈进行评估，然后在可配置评分下将最佳候选提升为主要版本。为了跨任务改进，系统维护模块化技能指导，根据任务和反馈上下文检索技能，并通过创建/改进/跳过决策和验证器报告从记录的历史中演进候选技能。在VerilogEval和混合精度GEMM任务上的实验表明，Verilog-Evolve提高了最终功能成功和晋升稳定性，同时在开源综合、时序代理和网表级GEMM目标下生成更下游友好的RTL。验证门控的技能演进进一步提高了GEMM下游质量，并在评估的技能模式中实现了最佳下游分数和GEMM保留通过率。

英文摘要

Large language models (LLMs) have improved Verilog generation from natural-language specifications, but most pipelines still treat generation as isolated sampling followed by functional checking. This is insufficient for practical RTL design, where useful Verilog must be correct, synthesizable, timing-conscious, and friendly to downstream hardware objectives. We present Verilog-Evolve, a feedback-driven framework for versioned Verilog refinement and cross-session skill evolution. For each task, Verilog-Evolve generates diverse minor candidates, evaluates them with executable feedback from functional simulation, Yosys synthesis, ABC timing proxy, and optional GEMM metrics, then promotes the best candidate into a major version under configurable scoring. To improve across tasks, the system maintains modular skill guidance, retrieves skills according to task and feedback context, and evolves candidate skills from logged histories through create/improve/skip decisions and verifier reports. Experiments on VerilogEval and mixed-precision GEMM tasks show that Verilog-Evolve improves final functional success and promotion stability while producing more downstream-friendly RTL under open-source synthesis, timing-proxy, and netlist-level GEMM objectives. Validation-gated skill evolution further improves GEMM downstream quality and achieves the best downstream score and GEMM held-out pass rate among the evaluated skill modes.

URL PDF HTML ☆

赞 0 踩 0

2605.26494 2026-05-27 cs.AI cs.CL cs.LG 版本更新

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax-M2系列：小激活释放最大现实智能

MiniMax, :, Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang, Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang, Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang, Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang, Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang, Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang, Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu, Jun Tao, Jun Zhang, Junbin Ruan, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kang Xu, Ke Ji, Ke Yang, Kecheng Xiao, Keyu Duan, Keyu Li, Le Han, Letian Ruan, Li Yuan, Lianfei Yu, Liheng Feng, Lijie Mo, Lin Li, Lingye Bao, Lingyu Yang, Lingyuan Zhou, Loki, Lu Chen, Lunbin Ceng, Ming Li, Ming Zhong, Mingliang Tao, Mingyuan Chi, Mujie Lin, Nan Hu, Ningxin Chen, Peiyin Zhu, Peng Gao, Pengcheng Gao, Pengfei Li, Penglin Li, Pengyu Zhao, Qibin Ren, Qidi Xu, Qihan Ren, Qile Li, Qin Wang, Quanliang Chen, Qunhong Ceng, Rong Tian, Rui Dong, Ruitao Leng, Ruize Zhang, Shanqi Liu, Shaoyu Chen, Sheng Jia, Shun Yao, Shuoran Zhao, Shuqi Yu, Sichen Li, Sicheng Pan, Songquan Zhu, Tengfei Li, Tian Xie, Tiancheng Qin, Tianrun Liang, Wei Liu, Weiqi Xu, Weitao Li, Weixiang Chen, Weiyu Cheng, Weiyu Zhang, Wenhu Chen, Wenqian Zhao, Xiancai Chen, Xiangjun Song, Xiangyuan Wang, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xiaojie Wu, Xihao Song, Xingyi Han, Xinyu Guan, Xuan Lu, Xun Zou, Xunhao Lai, Xutong Li, Yan Gong, Yang Wang, Yang Xu, Yangsen Wang, Ye Tang, Yicheng Chen, Yinran Qiu, Yiqi Shi, Yiting Guo, Yiwen Huang, Yixuan Wang, Yongyi Hu, Yu Gao, Yu Zhang, Yuanxiang Ying, Yuanzhen Zhang, Yubo Wang, Yuchen Song, Yufeng Yang, Yuhang Meng, Yuhang Miao, Yuhao Li, Yujie Liu, Yulin Hu, Yunan Huang, Yunji Li, Yunyi Huang, Yusen Zhang, Yusu Hong, Yutao Xie, Yutong Zhang, Yuwen Liao, Yuxuan Shi, Yuze Wenren, Zebin Li, Zehan Li, Zejian Luo, Zeyu Jin, Zeyuan Sun, Zhanpeng Zhou, Zhaochen Su, Zhendong Li, Zhengmao Zhu, Zhengyuan Peng, Zhenhua Fan, Zhi Zhang, Zhichao Xu, Zhiheng Lv, Zhikang Xu, Zhitao He, Zhiwei He, Zhongyuan Li, Zibo Gao, Zijia Wu, Zijian Song, Zijian Zhou, Zijun Sun, Zishan Huang, Ziying Chen, Ziyue Ge

发表机构 * MiniMax

AI总结提出MiniMax-M2系列混合专家语言模型，通过小激活参数实现前沿性能，核心包括智能体驱动数据管道、可扩展强化学习系统Forge及自进化检查点M2.7。

Comments Technical Report. 35 pages, 10 figures, 4 tables

详情

AI中文摘要

我们介绍了MiniMax-M2系列，这是一个基于“小激活可以释放最大现实智能”原则构建的混合专家语言模型家族。旗舰版M2总参数量为229.9B，每个token仅激活9.8B参数。M2系列专为智能体部署而端到端设计，包含三个组成部分：(i) 智能体驱动数据管道，生成大规模、可验证的轨迹，涵盖智能体编码和智能体协作，每个轨迹都基于可执行工作空间和与工件对齐的奖励；(ii) Forge，一个可扩展的智能体原生强化学习系统，适应长程智能体轨迹，并配有窗口FIFO调度、前缀树合并、推理优化以及支持白盒和黑盒智能体的干净训练-推理-智能体解耦；(iii) 最新的M2.7检查点向自我进化迈出了早期一步——自主调试训练运行并修改其自身框架。从M2到M2.7，这种组合将小激活足迹转化为智能体编码、深度搜索、办公任务和推理基准上的前沿性能。

英文摘要

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2605.26492 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

灯塔中的伊莱亚斯，再次？诊断LLM故事中的低多样性

Sil Hamilton, David Mimno

发表机构 * Department of Information Science（信息科学系）

AI总结研究通过采样20000个故事发现，LLM生成的故事中存在高度重复的词汇（如Elias、灯塔等），这些词汇来自偏好数据而非预训练数据，表明小数据集与强对齐算法的结合可能对多样性产生不成比例的影响。

2605.26485 2026-05-27 cs.CV cs.CL 版本更新

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

OmniInteract：面向实时全模态助手的流式交互基准测试

Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li

发表机构 * CUHK MMLab（香港中文大学多模态实验室）； SJTU（上海交通大学）； NTU（国立新加坡大学）； McMaster（麦马斯特大学）； CityUHK（香港城市大学）； JUFE（吉林大学）

AI总结提出OmniInteract基准，通过在线推理音视频流评估全模态大模型的实时交互能力，发现现有模型在流式交互中表现薄弱。

详情

AI中文摘要

我们引入了OmniInteract，一个用于实时全模态大语言模型的流式基准测试，通过音视频流上的原生在线推理进行评估。与离线视频理解或文本提示的流式问答不同，OmniInteract保留了原始音视频流，并要求模型在线处理，无法访问未来内容。用户查询和环境声音嵌入在音频轨道中，要求模型检测多模态触发信号，决定何时响应，并在流展开时回答问题。OmniInteract包含250个视频，具有1430个时间锚定的响应槽：其中1062个1Q1A槽涵盖实时、主动和嵌套场景，368个1QnA槽用于连续任务监控和步骤指导。每个槽包括触发信号、响应窗口和目标答案。我们使用交互感知质量-时效性F1、中断诊断套件和嵌套链完成分数来评估响应正确性、时序、无效输出、中断处理和上下文连续性。实验表明，当前模型在流式交互中仍然薄弱，最佳整体IA-QTF1仅为0.368，最佳1QnA IA-QTF1仅为0.052。在全双工设置下的数学推理进一步研究表明，离线能力不一定能迁移到在线交互中。代码和数据集将在https://github.com/Lucky-Lance/OmniInteract公开。

英文摘要

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.

URL PDF HTML ☆

赞 0 踩 0

2605.26476 2026-05-27 cs.CL cs.IR 版本更新

FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing

FAB-Bench：半导体制造中自适应RAG基准测试框架

Jingbin Qian, Congwen Yi, Min Xia, Wen Wu, Jun Zhu, Jian Guan

AI总结提出FAB-Bench框架，通过六项诊断指标和三种合成策略，评估半导体制造领域RAG系统在不同上下文窗口下的性能，发现注意力稀释是极端上下文长度下性能下降的主要原因。

详情

AI中文摘要

检索增强生成（RAG）已成为知识密集型应用的关键技术，然而在垂直领域评估其性能仍然困难，原因包括领域复杂性、多样的上下文规模以及对专家评估的严重依赖，而专家评估成本高、不一致且不可扩展。我们提出了FAB-Bench，一个用于半导体制造中RAG系统自适应基准测试的端到端框架。FAB-Bench定义了六项诊断指标，衡量事实准确性、上下文利用率、完整性、检索相关性、技术深度和推理一致性。该框架将检索器诊断与生成器级别的推理分析相结合，覆盖4K-32K token的上下文窗口，量化了随着上下文范围扩展，检索精度和生成保真度如何共同演变。从超过1300个生成的候选对中，我们精选了200个查询-答案对的高质量基准，涵盖三种合成策略：大海捞针、文档内多主题和跨文档多跳。在四个LLM和四个RAG框架上的系统评估揭示了三种不同的上下文缩放行为：对数增长、早期饱和和冷启动动态，并确定注意力稀释是极端上下文长度下性能下降的主要机制。在另外三个生产级RAG系统上的跨框架验证确认了评估的可移植性。

英文摘要

Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K-32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics, and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms evaluation portability.

URL PDF HTML ☆

赞 0 踩 0

2605.26463 2026-05-27 cs.CL cs.AI 版本更新

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

迈向无差错的电子健康记录：临床笔记与结构化表格之间的推理密集型一致性验证

Yeonsu Kwon, Jiho Kim, Junseong Choi, Paloma Rabaey, Minseo Kim, Sujeong Im, Jeewon Yang, Jun-Min Lee, Sangji Lee, Jiwon Kim, Hangyul Yoon, Hyunwook Kwon, Edward Choi

发表机构 * KAIST（韩国科学技术院）； Ghent University（根特大学）； Samsung Medical Center（三星医疗中心）； Samsung Changwon Hospital（三星昌原医院）； Asan Medical Center（亚山医疗中心）

AI总结针对电子健康记录中临床笔记与结构化表格数据不一致的问题，提出推理密集型基准EHR-ReasonCon和基于大语言模型的框架EHR-Inspector，通过锚点实体提取、时间引用和表格探索工具实现高效一致性验证。

详情

AI中文摘要

电子健康记录中非结构化临床笔记与结构化表格之间的数据一致性对于患者安全和临床决策至关重要。然而，现有关于笔记-表格一致性验证的工作主要依赖于数值或简单事件的表面匹配。这些方法未能捕捉真实世界EHR文档背后的推理，包括临床解释、事件关系和时间变化。为弥补这一差距，我们引入了EHR-ReasonCon，一个用于笔记-表格一致性验证的推理密集型基准。它基于MIMIC-III构建，并经过专家指导的注释，包含来自临床笔记的8,048个实体，并提供高质量的真实标签。注释协议由专门的表格探索工具支持，以确保系统的证据检索和可靠的一致性评估。我们还提出了EHR-Inspector，一个基于LLM的框架，它分割笔记、提取锚点实体和时间引用，并使用表格探索工具与结构化表格进行一致性验证。在严格和宽松标准下，使用经过专家验证的LLM-as-a-judge指标进行评估，EHR-Inspector在多个模型骨干上实现了最先进的性能。进一步的分析证明了其组件的有效性，并突出了与人工验证的差异。

英文摘要

Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.

URL PDF HTML ☆

赞 0 踩 0

2605.26457 2026-05-27 cs.SE cs.AI cs.CL cs.PL 版本更新

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Verus-SpecGym: 用于评估规范自动形式化的智能体环境

Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck

发表机构 * CMU（卡内基梅隆大学）； Amazon（亚马逊）

AI总结提出 Verus-SpecGym 环境与 Verus-SpecBench 基准，通过执行规范机制和对抗性测试评估 LLM 智能体将非正式编程问题转化为形式规范的能力，发现前沿模型可解决 77.8% 的任务但存在遗漏假设等脆弱性。

Comments Preprint

详情

AI中文摘要

AI 编码智能体越来越多地用于编写真实世界的软件，但确保其输出正确性仍然是一个基本挑战。形式化验证提供了一条有希望的路径：智能体生成代码的同时生成机器检查的证明，保证代码满足形式规范。然而，无法保证形式规范本身与用户意图一致。在这项工作中，我们研究规范自动形式化：LLM 智能体能否将非正式编程问题转化为忠实的形式规范。我们引入了 Verus-SpecBench，一个包含 581 个规范编写任务的基准，这些任务源自针对 Rust 验证器 Verus 的 Codeforces 问题，以及 Verus-SpecGym，一个智能体环境，模型在其中与 Verus、bash 和文件系统交互以开发这些规范。核心挑战在于评估：专家编写的参考规范编写成本高昂，LLM 评判者可能遗漏细微错误。我们通过以下方式解决这一问题：(a) 扩展 Verus 的 exec_spec 机制，使生成的规范可以作为 Rust 代码执行；(b) 针对官方 Codeforces 测试和从 Codeforces "hacks"（即竞争对手编写的用于破解不正确解决方案的边缘情况）中提取的对抗性案例进行测试。在 Verus-SpecBench 上，最强的模型 Gemini 3.1 Pro 解决了 77.8% 的任务，其他前沿模型解决了 51.1-57.8%，而开源模型仅达到 21.5-25.5%。我们对失败模式的分析表明，模型生成的规范可能遗漏重要的输入假设、接受不正确的输出以及拒绝有效的输出。我们还发现，LLM 作为评判者的评估遗漏了我们评估者捕获的 26% 的失败。总体而言，我们的结果表明，规范自动形式化对于前沿智能体来说是可行的，但即使在它们已经能够生成正确代码的问题上仍然脆弱。代码、数据和日志可在 https://github.com/formal-verif-is-cool/verus-spec-gym 获取。

英文摘要

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

URL PDF HTML ☆

赞 0 踩 0

2605.26454 2026-05-27 cs.CL 版本更新

Model Unlearning Objectives Vary for Distinct Language Functions

模型遗忘目标因语言功能而异

Berk Atil, Vipul Gupta, Rebecca J. Passonneau

发表机构 * Pennsylvania State University（宾夕法尼亚州立大学）； Scale AI

AI总结本文提出针对不同语言功能（危险知识遗忘和毒性遗忘）应设计不同的遗忘方法，并分别提出基于余弦的元学习变体RMU和多层目标方法，在多个7-8B模型上取得良好效果。

2605.26445 2026-05-27 cs.CL 版本更新

Curation and Extraction of Drug-Related Entities from Reddit Platform

从Reddit平台策划和提取药物相关实体

Zewei Wang, Zihan Xu, Yishu Wei, Michael Chary, Yifan Peng

发表机构 * Population Health Sciences, Weill Cornell Medicine, New York City, USA（威立·科恩医学中心人口健康科学系，纽约市，美国）； School of Computing and Information Systems, University of Melbourne, Melbourne, Australia（墨尔本大学计算机与信息系统学院，墨尔本，澳大利亚）； Emergency Medicine, Weill Cornell Medicine, New York City, USA（急诊医学，威立·科恩医学中心，纽约市，美国）

AI总结为解决医生对非法药物真实使用情况了解有限的问题，本文构建了ReDose数据集（6435条Reddit帖子），并采用BERT、LLM和RAG模型进行药物、剂量和效果实体提取，其中BiomedBERT在药物实体上F1达0.843，Llama-3 70B优于GPT-4，但效果提取仍具挑战。

Comments Accepted by IEEE International Conference on Healthcare Informatics (ICHI 2026)

详情

AI中文摘要

医生主要通过临床过量案例了解非法药物，这限制了他们对其真实使用情况的理解。与此同时，药物用户在线上分享第一手经验，提供了关于药物剂量和效果的见解。为弥合这一差距，我们引入了ReDose（Reddit药物剂量和效果）数据集，包含6435条关于物质使用的Reddit帖子。一名委员会认证的毒理学家主要注释了训练集和测试集，而两名医学生参与了测试集的注释，标注了药物、剂量和效果实体。我们使用基于BERT、大型语言模型（LLM）和检索增强生成（RAG）模型对6267个注释进行了基准测试。BiomedBERT在药物实体上达到了0.843的F1分数，而Llama-3 70B优于GPT-4（F1=0.79 vs. 0.72）。效果提取仍然具有挑战性，GPT-4的召回率为0.41。ReDose捕捉了患者策划的叙述，以推进从社交媒体中提取医学数据。

向量并非中性：从摘要任务中导出的LLM表示进行敏感信息推断

Weixin Liu, Bowen Qu, Juming Xiong, Congning Ni, Bradley A. Malin, Zhijun Yin

发表机构 * Vanderbilt University（范德比大学）； Vanderbilt University Medical Center（范德比大学医学院）

AI总结研究LLM摘要系统导出向量中的敏感信息泄露风险，提出SurfaceLoRA微调方法降低特定向量的可恢复性，但未针对的池化向量仍存在风险。

Comments 30 pages, 2 figures; preprint

详情

AI中文摘要

大型语言模型（LLM）摘要系统可能将私有输入的紧凑向量表示传递给下游检索、监控、审计或分析工作流。即使源文档保持访问受限，派生向量可能在不同访问控制下处理，仍支持敏感信息推断，造成残留的信息披露风险。我们以临床出院摘要生成为高风险案例研究，使用电子健康记录（EHR）记录的种族作为受控敏感标签审计。我们审计系统可能保留或暴露给下游组件的两个工件：最终提示令牌隐藏状态和均值池化提示表示。我们的结果表明，从一个导出工件降低案例研究敏感标签的可恢复性并不一定能降低另一个工件的可恢复性。作为缓解案例研究，我们引入了SurfaceLoRA，一种针对导出向量的参数高效微调方法，该方法使用连接到指定导出向量的梯度反转鉴别器。在平衡的五向探测协议下，SurfaceLoRA将EHR记录的种族可恢复性从目标最终令牌工件降低到接近随机水平，同时保持摘要效用，但从未经目标池化工件的可恢复性仍然显著更高。这些发现表明，隐私审计和缓解应针对保留或暴露给下游组件的确切向量工件进行。

英文摘要

Large language model (LLM) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows. Even when source documents remain access-restricted, derived vectors may be handled under different access controls and still support sensitive-information inference, creating a residual information-disclosure risk. We study this issue in clinical discharge-summary generation as a high-stakes case study, using electronic health record (EHR)-recorded race as a controlled sensitive-label audit. We audit two artifacts that a system might retain or expose to downstream components: the final prompt-token hidden state and the mean-pooled prompt representation. Our results show that reducing recoverability of the case-study sensitive label from one exported artifact does not necessarily reduce recoverability from another. As a mitigation case study, we introduce SurfaceLoRA, an exported-vector-targeted parameter-efficient fine-tuning method that uses a gradient-reversal discriminator attached to a designated exported vector. Under a balanced five-way probing protocol, SurfaceLoRA reduces EHR-recorded race recoverability from the targeted final-token artifact toward chance while preserving summarization utility, yet recoverability remains substantially higher from untargeted pooled artifacts. These findings show that privacy auditing and mitigation should be performed on the exact vector artifact retained or exposed to downstream components.

URL PDF HTML ☆

赞 0 踩 0

2605.26414 2026-05-27 cs.AI cs.CL cs.LG 版本更新

检索增强生成的上下文优化：梯度下降视角

Mingchen Li, Jiatan Huang, Chuxu Zhang, Liang Zhao, Hong Yu

发表机构 * University of Massachusetts, Amherst（马萨诸塞大学阿默斯特分校）； University of Connecticut（康涅狄格大学）； Emory University（埃默里大学）

AI总结本文从梯度下降视角研究检索增强生成（RAG）作为上下文优化过程，提出一种轻量级前向更新方法，在冻结LLM和检索器的情况下提升生成器对检索证据的利用。

详情

AI中文摘要

上下文学习最近被与线性自注意力模型中的隐式梯度下降联系起来，表明上下文可以诱导前向传递更新。检索增强生成（RAG）也依赖于上下文，但检索到的文档通常被视为静态证据而非适应信号。我们将RAG研究为一种上下文优化过程。首先，我们展示一个线性自注意力层可以在统一的线性化RAG目标上实现一步梯度下降，该目标涵盖基于投影和基于点积的检索接口。这给出了检索增强预测与上下文优化一致的一个精确区域。我们使用这一结果并非作为LLM计算的字面模型，而是作为调整查询与检索证据之间交互的指南。然后，我们测试这种对应关系的边界：在受控的线性扩展下保持稳定，但在非线性架构下变得依赖于特征分布。最后，我们将这一观点转化为一种针对冻结RAG LLM的轻量级方法。该方法保持检索器和骨干网络固定，并预测一个上下文条件更新到生成器侧的证据使用接口。在七个QA基准、两个检索器和两个冻结LLM骨干网络上，这种仅前向的更新改进了共享接口基线，迁移到未见任务，并以更低的每查询成本接近测试时的梯度适应。

英文摘要

In-context learning has recently been linked to implicit gradient descent in linear self-attention models, suggesting that context can induce a forward-pass update. Retrieval-augmented generation (RAG) also relies on context, but retrieved documents are usually treated as static evidence rather than signals for adaptation. We study RAG as an in-context optimization process. First, we show that one linear self-attention layer can implement one gradient-descent step on a unified linearized RAG objective covering both projection-based and dot-product retrieval interfaces. This gives an exact regime where retrieval-augmented prediction and in-context optimization coincide. We use this result not as a literal model of LLM computation, but as a guide for adapting the interaction between queries and retrieved evidence. We then test the boundary of this correspondence: it remains stable under controlled linear extensions, but becomes feature-distribution dependent under nonlinear architectures. Finally, we turn this view into a lightweight method for frozen RAG LLMs. The method keeps the retriever and backbone fixed, and predicts a context-conditioned update to a generator-side evidence-use interface. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, this forward-only update improves a shared-interface baseline, transfers to held-out tasks, and approaches test-time gradient adaptation at much lower per-query cost.

URL PDF HTML ☆

赞 0 踩 0

2605.26355 2026-05-27 cs.LG cs.CL eess.SP 版本更新

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

能量门控注意力与小波位置编码：Transformer注意力的互补归纳偏置

Athanasios Zeris

发表机构 * Independent Researcher（独立研究者）； Athens, Greece（希腊雅典）

AI总结针对标准注意力缺乏能量显著性和尺度选择性局部性两种互补归纳偏置的问题，提出能量门控注意力（EGA）和莫雷特位置编码（MoPE），两者组合在字符级语言建模上实现超加性性能提升。

Comments 10 pages, 1 figure, 3 tables. Part 2 of a five-paper series on spectral methods in transformer attention. Code: https://github.com/AthanasiosZeris/energy-gated-attention

详情

AI中文摘要

标准Transformer注意力计算成对标记相似性，但将所有标记视为同等显著、所有位置视为同等局部，忽略了输入的信息结构。我们识别出标准注意力缺乏两种互补归纳偏置：能量显著性（哪些标记集中了信息能量，通过端到端学习而不需要显式频率分解）和尺度选择性局部性（在每个频率上位置影响的范围，通过Morlet小波编码实现）。我们通过两个简单组件解决这两个问题。能量门控注意力（EGA）通过键标记嵌入的学习能量估计（通过单个线性投影计算）来门控值聚合；它选择关注什么。莫雷特位置编码（MoPE）用学习的高斯窗口小波替换固定的正弦编码，使联合位置-频率定位适应语料库；它指定注意力在每个尺度上操作的位置。在TinyShakespeare上，单独EGA相比标准注意力实现+0.092验证损失改进（相比Phase 1-3基线+0.103）；单独MoPE为-0.032（作为独立编码低于基线）；但它们的组合实现+0.119——超过各部分之和。这种超加性在两个独立训练运行中观察到，是核心实证发现：显著性和局部性是互补归纳偏置，各自填补对方无法单独填补的空白。消融实验证实，结构化谱先验（Morlet小波门控、尺度初始化头、固定正弦PE）始终不如其无约束学习对应物，而互补学习组件交互产生超加性。所有实验都在小规模（≤6M参数、字符级基准、单种子）进行；更大规模的多种子验证是未来工作最重要的方向。

英文摘要

Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and scale-selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1-3 baseline); MoPE alone is -0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 -- more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (<=6M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.

URL PDF HTML ☆

赞 0 踩 0

2605.26352 2026-05-27 cs.CL 版本更新

QAM-W: 通过哈达玛旋转和激活感知缩放实现LLM权重的联合2D码本量化

Preetam Sharma, Kacper Dobek

发表机构 * Independent Research（独立研究）； Institute of Computing Science（计算科学研究所）； Poznan University of Technology（波兹南技术大学）

AI总结提出QAM-W方法，通过L2归一化、块哈达玛旋转和2D坐标配对量化，结合激活感知缩放，在约5.5 bpw下使困惑度接近BF16，优于极坐标编码，并在5-6 bpw范围内保持质量。

详情

AI中文摘要

标量后训练量化器丢弃了权重行内的成对坐标结构。我们引入QAM-W（权重正交幅度调制），一种恢复该结构的编解码器：每行经过L2归一化、块哈达玛旋转、配对为2D坐标，并针对在单位圆高斯上训练的单个Lloyd-Max码本进行量化，同时采用激活感知的每通道缩放。在跨越四个家族（1.1B--13B参数）的五种LLM和八种量化配置的跨模型研究中，激活感知变体在约5.5 bpw下，每个模型的WikiText-2困惑度保持在BF16的±0.4%以内，以少32%的权重比特匹配SmoothQuant W8A8质量包络。联合2D编码在相同比特率下，在ΔPPL上优于极坐标（幅度×相位）编码2--15个百分点，且与BF16的配对KL散度在37个（方法，模型）行上以Spearman ρ=0.99跟踪ΔPPL%，与从编解码器失真到KL散度的单调复合界一致。3.5 bpw变体在量化容忍架构上具有竞争力。在严格的4 bpw下，旋转码本前沿方法QTIP优于QAM-W；贡献在于质量保持的5--6 bpw波段。

英文摘要

Scalar post-training quantizers discard pairwise coordinate structure within weight rows. We introduce QAM-W (Quadrature Amplitude Modulation for Weights), a codec that recovers this structure: each row is L2-normalized, block-Hadamard rotated, paired into 2D coordinates, and quantized against a single Lloyd-Max codebook trained on the unit circular Gaussian, with activation-aware per-channel scaling. In a cross-model study spanning five LLMs from four families (1.1B--13B parameters) and eight quantized configurations, the activation-aware variant at $\approx 5.5$ bpw stays within $\pm 0.4\%$ of BF16 WikiText-2 perplexity on every model, matching the SmoothQuant W8A8 quality envelope at $32\%$ fewer weight bits. Joint 2D coding outperforms polar (amplitude $\times$ phase) coding by 2--15~pp $Δ$PPL at equal bitrate, and paired KL against BF16 tracks $Δ$PPL\% at Spearman $ρ= 0.99$ across 37 (method, model) rows, consistent with a monotone composite bound from codec distortion to KL divergence. A 3.5~bpw variant is competitive on quantization-tolerant architectures. At strict 4~bpw, the rotated-codebook frontier method QTIP outperforms QAM-W; the contribution is the quality-preserving 5--6~bpw band.

URL PDF HTML ☆

赞 0 踩 0

2605.26320 2026-05-27 cs.LG cs.CL 版本更新

MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding

MULTISEISMO: 面向跨模态地震理解的多模态地震数据集与模型

Sai Munikoti, Ian Stewart, Chengping Chai, Lisa Linville, Scott Vasquez, Sameera Horawalavithana, Karl Pazdernik

发表机构 * Pacific Northwest National Laboratory（太平洋西北国家实验室）； Oak Ridge National Laboratory（橡树岭国家实验室）； Sandia National Laboratory（桑迪亚国家实验室）； North Carolina State University（北卡罗来纳州立大学）

AI总结针对地震学中多模态数据整合的缺失，构建了包含超过1.6万次地震事件的结构化多模态数据集MultiSeismo，并开发了专用多模态模型SeisModal，在跨模态地震推理任务上取得了优越性能。

详情

AI中文摘要

通用多模态模型（GMMs）在专业科学领域的应用仍然有限，原因是缺乏整合文本和图像之外多种数据模态的综合性领域特定数据集。在地震学中，理解地震现象需要综合时间序列波形数据、地理图像和上下文元数据，而现有地震数据集缺乏这种多模态整合。我们提出了MultiSeismo，一个大规模结构化多模态地震数据集，包含跨越13年（2010年至2023年）来自不同地理区域的超过1.6万次地震事件。每个事件数据整合了全球台网波形记录、烈度图、人口暴露可视化以及标准JSON格式的全面文本描述。此外，我们开发了MISCE，一个基于原始数据的多模态指令集，用于对GMMs进行监督训练和评估，涵盖从基本信息检索到复杂跨模态分析的地震推理任务。我们利用MISCE微调了一个现有的多模态模型（Unified IO 2），并增强了专门的时间序列编码器，从而得到了SeisModal——首个用于综合地震分析的领域特定多模态模型。在MultiSeismo上对最先进的多模态模型进行评估，揭示了显著挑战，特别是通用模型在处理时间序列数据方面的困难，同时证明了SeisModal在地震多模态推理任务上的优越性能。这些结果证明，MultiSeismo为未来地震学多模态研究提供了严格的基准，并验证了我们领域特定架构调整的成功。

英文摘要

The application of generalist multimodal models (GMMs) to specialized scientific domains remains limited due to the scarcity of comprehensive domain-specific datasets that integrate multiple data modalities beyond text and images. In seismology, understanding earthquake phenomena requires the synthesis of timeseries waveform data, geographical imagery, and contextual metadata, a multimodal integration absent in existing seismic datasets. We present MultiSeismo, a large scale structured multimodal seismic dataset, comprising over 16K seismic events spanning 13 years (2010 to 2023) across diverse geographical regions. Each event data integrates waveform recordings from global station networks, intensity maps, population exposure visualizations, and a comprehensive textual description within a standardized JSON format. We additionally develop MISCE, a multimodal instruction set on top of raw data to enable supervised training and evaluation of GMMs on seismic reasoning tasks ranging from basic information retrieval to complex cross modal analysis. We leverage MISCE to finetune an existing multimodal model (Unified IO 2) enhanced with a specialized timeseries encoder, which yields SeisModal, the first domain specific multimodal model for comprehensive seismic analysis. Evaluation of state of the art multimodal models on MultiSeismo reveals significant challenges, particularly with time-series data processing for general purpose models, while demonstrating SeisModal's superior performance on seismic multimodal reasoning tasks. These results prove that MultiSeismo provides a rigorous benchmark for future multimodal research in seismology and validate the success of our domain specific architectural adaptations.

URL PDF HTML ☆

赞 0 踩 0

2605.26302 2026-05-27 cs.AI cs.CL cs.MA 版本更新

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

你的智能体也在老化：面向部署系统的智能体寿命工程

Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结提出 AgingBench 基准，通过四种老化机制和诊断工具评估部署后智能体的可靠性退化，并指出需要寿命评估、机制级诊断和阶段针对性修复。

详情

AI中文摘要

长寿命AI智能体越来越多地被部署为持久化运行系统，但它们仍然像刚初始化的模型一样被评估。第一天基准测试忽略了一个基本系统问题：智能体在部署后能保持可靠多久？即使模型权重被冻结，智能体的有效状态也在不断变化，因为它压缩交互历史、从不断增长的记忆库中检索、在更新后修正事实，并经历常规维护。因此，可靠性成为整个智能体框架的寿命属性，而不仅仅是基础模型的快照属性。我们引入了AgingBench，一个用于智能体寿命工程的纵向可靠性基准：不仅测量部署的智能体是否退化，还测量退化的形式以及修复应针对何处。AgingBench将智能体老化组织为四种机制：压缩老化、干扰老化、修订老化和维护老化。为了诊断这些故障，AgingBench使用时间依赖图和对偶反事实探针，为记忆管道的写入、检索和利用阶段生成诊断档案。在7个场景、14个模型、多种记忆策略以及运行者控制和自主智能体上，跨越约400次运行（涵盖8到200个会话）的结果表明，智能体老化不是一维的：行为测试可以保持干净，而事实精度下降；派生状态跟踪可能在单个模型内急剧崩溃；相同的错误答案可能需要不同的修复，具体取决于诊断档案指向的内容。这些结果表明，可靠的智能体部署需要寿命评估、机制级诊断和阶段针对性修复，而不仅仅是更强的第一天模型。

英文摘要

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8 - 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

URL PDF HTML ☆

赞 0 踩 0

2605.26293 2026-05-27 cs.CL cs.AI 版本更新

工具模式压缩实现受限上下文预算下的智能体检索增强生成

Furkan Sakizli

发表机构 * Independent Researcher（独立研究者）

AI总结针对智能体RAG系统中工具模式与检索上下文竞争资源的问题，提出工具模式压缩方法，在8K上下文预算下将平均精确匹配率提升20.5个百分点，并验证了压缩模式在超过800个工具时仍可运行。

Comments 12 pages (8 main + 4 appendix), 7 tables, 2 figures. Code and data: https://github.com/SKZL-AI/tscg

详情

DOI: 10.5281/zenodo.20369668

AI中文摘要

配备数十到数百个工具定义的语言模型的智能体RAG系统面临关键资源冲突：工具模式消耗了检索增强生成所需的相同上下文窗口。我们首次系统研究了这种工具-上下文权衡，评估了14个模型（涵盖1.5B-32B本地模型和一个前沿API模型），在三个上下文预算（8K、16K、32K）下使用28个工具定义进行了6,566次受控API调用。应用TSCG保守配置文件压缩（节省44-50%的模式令牌），我们观察到二元启用效应：在8K令牌时，JSON模式工具定义完全溢出上下文窗口，导致接近零的EM（平均2.6%），而压缩模式恢复了RAG功能，所有八个模型平均精确匹配提升20.5个百分点（六个表现出完全启用的模型平均提升24.7个百分点）。在32K（两种格式都适合）时，五个测试模型中的四个显示delta <= 1个百分点，确认该效应纯粹由预算驱动。在HotpotQA（50个多跳问题）上的外部验证显示，在相同溢出场景下EM提升48个百分点。前沿扩展测试表明，JSON模式在大约494个工具时溢出，而压缩模式在超过800个工具时仍可运行。我们的结果确立了工具模式压缩作为受限上下文部署中智能体RAG的必要基础设施层。所有代码、数据和检查点均已公开。

英文摘要

Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta <= 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.

URL PDF HTML ☆

赞 0 踩 0

2605.26133 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

大型语言模型中的预训练数据暴露：成员推断、数据污染及安全影响综述

Ziyi Tong, Feifei Sun, Le Minh Nguyen

发表机构 * Japan Advanced Institute of Science and Technology（日本先进科学研究院）

AI总结本文首次统一综述了大型语言模型中的预训练数据暴露问题，涵盖成员推断和数据污染，形式化定义了暴露级别，回顾了攻击与防御方法，并总结了实证发现及未来研究方向。

Comments accepted by NLDB 2025

详情

DOI: 10.1007/978-3-031-97144-0_14

AI中文摘要

大型语言模型（LLMs）已成为NLP中的主导范式，推动了研究和工业的发展。随着模型规模和预训练数据的增长，由于训练数据集的规模和不可见性，对预训练数据暴露（PDE）的担忧也在增加。PDE指的是确定特定数据是否出现在LLM的预训练语料库中。它对于确保评估完整性和保护隐私至关重要，涉及两个关键领域：数据污染和成员推断。尽管概念上相关，但这些领域通常被孤立研究。本文首次在PDE框架下对两者进行了统一综述。我们形式化了跨暴露级别的PDE，回顾了攻击和防御方法，综合了实证发现，并强调了开放的挑战和未来的研究方向。

英文摘要

Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.

URL PDF HTML ☆

赞 0 踩 0

2605.26132 2026-05-27 cs.CL cs.LG 版本更新

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

自验证蒸馏：你的语言模型秘密地就是它自己的合成数据管道

Tony Lee, Percy Liang

发表机构 * Stanford University（斯坦福大学）

AI总结提出自验证蒸馏算法，让大语言模型仅用无标注种子问题，通过自生成、自验证和自训练提升推理能力，在数学、科学和编程任务上取得显著提升。

详情

AI中文摘要

经过后训练的大语言模型能否仅使用无标注提示，在没有外部教师或工具反馈的情况下进一步提升自己？我们在三个推理领域（数学、科学和编程）中研究这一设置，仅从没有真实解的无标注种子问题开始。我们提出自验证蒸馏，一种简单的后训练精炼算法，其中模型生成这些种子问题的候选解，使用基于提示的自验证进行过滤，并在由此产生的自策展数据集上进行训练。受UQ基准使用多个验证器筛选困难未解问题候选答案的启发，我们将这种基于验证的过滤思想应用于自训练：模型通过三级级联的循环一致性、事实性和正确性检查来过滤自己生成的解，仅当解通过所有阶段且获得一致判断时才被接受。我们发现，在训练数据构建过程中采样更多候选生成并使用更大的验证预算，可以产生更高质量的自策展数据，进而得到更好的推理模型。然后，我们使用自验证蒸馏训练多个规模的Qwen3模型，并在所有三个领域获得收益。对于Qwen3-4B，我们的方法在数学（AIME26和HMMT）上将聚合保留pass@1提升了+16.7个百分点，在科学（GPQA Diamond和HLE）上提升了+11.1个百分点，在编程（LCBv5和LCBv6）上提升了+8.3个百分点，这些收益也扩展到0.6B和8B模型。与我们的仅测试时基线（UQ-TTC）相比，后者通过在推理时花费额外计算来提升性能，自验证蒸馏在大多数设置下实现了更好的性能，同时仅在测试时进行一次推理调用。

英文摘要

Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from tools? We study this setting starting only from unlabeled seed questions with no ground-truth solutions, across three reasoning domains: math, science, and coding. We propose Self-Verified Distillation, a simple post-training refinement algorithm in which the model generates candidate solutions to these seed questions, filters them using prompt-based self-verification, and trains on the resulting self-curated dataset. Inspired by the UQ benchmark's use of multiple validators to screen candidate answers to hard unsolved questions, we adapt this validation-based filtering idea to self-training: the model filters its own generated solutions through a three-stage cascade of cycle-consistency, factuality, and correctness checks, accepting a solution only if it passes all stages with unanimous judge votes. We find that sampling more candidate generations and using a larger verification budget during training data construction produces higher-quality self-curated data and, in turn, better reasoning models. We then train Qwen3 models at multiple scales with Self-Verified Distillation and obtain gains across all three domains. For Qwen3-4B, our method improves aggregate held-out pass@1 by +16.7 points in math (AIME26 and HMMT), +11.1 points in science (GPQA Diamond and HLE), and +8.3 points in coding (LCBv5 and LCBv6), with gains also extending to 0.6B and 8B models. Compared to our test-time-only baseline (UQ-TTC), which improves performance by spending extra compute at inference time, Self-Verified Distillation achieves better performance in most settings while requiring only a single inference call at test time.

URL PDF HTML ☆

赞 0 踩 0

2605.26079 2026-05-27 cs.CL 版本更新

Automated Benchmark Auditing for AI Agents and Large Language Models

AI智能体与大语言模型的自动化基准审计

Junlin Wang, Federico Bianchi, Shang Zhu, Fan Nie, Yongchan Kwon, Bhuwan Dhingra, James Zou

发表机构 * Duke University（杜克大学）； Together AI ； Stanford University（斯坦福大学）

AI总结提出自动化基准审计框架ABA，系统审计基准任务中的隐藏环境依赖、规范缺失和评分逻辑问题，在168个基准中发现25.7%的任务存在关键问题，过滤后模型排名变化且性能提升约10%。

详情

AI中文摘要

现代AI基准的复杂性超出了传统验证方法的能力。由领域专家编写的任务通常包含隐含假设、不完整的环境规范和脆弱的评估逻辑，人工标注无法可靠地捕捉这些问题。我们引入了自动化基准审计（ABA），一个系统审计单个基准任务的智能体框架，揭示隐藏的环境依赖、规范缺失和有限的评分逻辑等问题。我们在前沿LLM基准和之前的NeurIPS出版物上运行ABA，共涵盖九个领域的168个基准。在这些语料中，ABA识别出关键问题，包括模糊的任务设计、执行环境冲突和错误的地面真值，在超过25.7%的评估任务中。这些自动化审计的精确性通过专家评审和独立第三方报告（如上游PR）得到验证。关键的是，我们证明这些有问题的任务严重扭曲了对智能体和LLM的能力评估：过滤掉这些有问题的任务会改变模型排名，并在SWE-bench Verified和Terminal-Bench 2上分别将平均性能提高9.9%和9.6%。我们发布智能体工具和所有任务注释，以支持前沿基准的未来发展。

英文摘要

Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distorts capability assessments for agents and LLMs: filtering out these tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2605.25971 2026-05-27 cs.CL cs.IR cs.MA 版本更新

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

预见与学习：在主动智能体中释放空闲时间计算

Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu, Jianghao Lin, Zixuan Guo, Yan Xu, Yasheng Wang, Weinan Zhang, Yong Yu

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Tencent（腾讯）

AI总结提出ProAct主动智能体架构，利用空闲时间计算预测并满足用户未来需求，通过ProActEval基准测试验证其在任务加速、减少用户努力和降低幻觉率方面的显著优势。

Comments 26 pages, 4 figures; code available at https://github.com/AgentACE-AI/ProAct

详情

AI中文摘要

虽然AI智能体在推理和工具使用方面展现出显著能力，但它们本质上仍然是被动的：仅在用户明确提示后才计算响应。这种范式忽略了一个关键机会：交互之间的空闲时间很大程度上被浪费，使得智能体无法为未来的用户需求做准备。为弥补这一差距，我们引入了ProAct，一种主动智能体架构，利用空闲时间计算来预测并满足可能即将出现的用户需求。通过分析不断演变的对话历史以及持久记忆，ProAct预测即将到来的需求并迭代获取信息，使智能体能够在用户发起查询之前解决知识差距并准备证据。为严格评估主动能力，我们还引入了ProActEval，一个包含40个领域200个场景的综合基准，具有可预测的需求链和多样化的用户认知特征。实验结果表明，与被动基线相比具有显著优势。ProAct通过减少14.8%的必要交互轮次加速任务完成，减少11.7%的用户努力，并在ProActEval上将幻觉率降低28.1%。此外，MemBench评估证实ProAct达到了最先进的反思准确性，突显其持续且稳健的性能。

英文摘要

While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a query. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state-of-the-art reflective accuracy, underscoring its sustained and robust performance.

URL PDF HTML ☆

赞 0 踩 0

2605.25758 2026-05-27 cs.CL 版本更新

查询自适应语义分块用于检索增强生成：一种具有上下文窗口扩展的动态策略

Mudit Rastogi

发表机构 * Independent Researcher（独立研究者）

AI总结提出查询自适应语义分块（QASC）方法，通过将查询融入分块过程，利用句子-查询嵌入余弦相似度、上下文窗口扩展和分块级分数聚合，动态构建相关且连贯的文档块，在F1分数上比固定分块提升18-27%，比语义和智能体分块提升8-12%。

详情

AI中文摘要

检索增强生成（RAG）系统关键依赖于文档分块质量以检索相关上下文。固定分块将文档分割成统一单元，不考虑语义或用户意图，导致精度-召回率权衡无法通过调整块大小解决。语义和智能体方法部分解决了这些限制，但未在分块阶段集成用户查询。我们提出查询自适应语义分块（QASC），通过三种机制将查询融入分割以动态构建块：句子与查询嵌入之间的余弦相似度评分以识别种子句子，围绕种子的上下文窗口扩展以保持连贯性，以及块级分数聚合以确保整体相关性。我们在100篇技术文档上评估QASC，涵盖四种类型的200个查询，并与五种粒度的固定分块、递归分割、语义分块和智能体分块进行比较。QASC实现了0.85的F1分数，相对于固定分块相对提升18-27%，相对于语义和智能体替代方法提升8-12%。消融研究证实每个组件都有意义贡献。三名标注者的人工评估（Cohen kappa = 0.82）证实QASC比现有方法产生更相关和连贯的块。

英文摘要

Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a precision-recall trade-off unresolvable by tuning chunk size alone. Semantic and agentic methods partially address these limitations but do not integrate user queries at the chunking stage. We present Query-Adaptive Semantic Chunking (QASC), which dynamically constructs chunks by integrating queries into segmentation through three mechanisms: cosine similarity scoring between sentence and query embeddings to identify seed sentences, contextual window expansion around seeds to preserve coherence, and chunk-level score aggregation to ensure holistic relevance. We evaluate QASC on 100 technical documents across 200 queries spanning four types, comparing against fixed chunking at five granularities, recursive splitting, semantic chunking, and agentic chunking. QASC achieves an F1-score of 0.85, a relative improvement of 18-27% over fixed chunking and 8-12% over semantic and agentic alternatives. Ablation studies confirm each component contributes meaningfully. Human evaluation by three annotators (Cohen kappa = 0.82) corroborates that QASC produces more relevant and coherent chunks than existing methods.

URL PDF HTML ☆

赞 0 踩 0

2605.21883 2026-05-27 cs.CL 版本更新

Token-weighted Direct Preference Optimization with Attention

基于注意力的令牌加权直接偏好优化

Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie

发表机构 * Cornell University（康奈尔大学）； Vanderbilt University（范德比大学）

AI总结提出Token-weighted DPO (TwDPO)方法，利用注意力机制估计令牌权重，在不增加额外训练成本的情况下提升大语言模型与人类偏好对齐的性能。

详情

AI中文摘要

直接偏好优化（DPO）无需单独奖励模型即可使大语言模型与人类偏好对齐。然而，DPO平等对待响应中的所有令牌，忽略了单个令牌的不同重要性。现有的令牌级PO方法要么使用基于令牌位置的启发式函数，要么使用单独训练模型给出的概率估计来计算令牌权重，这缺乏鲁棒性且增加了额外训练成本。相比之下，我们提出令牌加权DPO（TwDPO）——一种基于令牌加权RL的新型训练目标，以及AttentionPO——TwDPO的一个实例，它利用LLM自身的注意力来估计令牌权重。AttentionPO提示LLM作为成对评判者，并在比较响应时检查模型关注的位置。这种设计使AttentionPO具有内容感知能力，根据响应内容调整权重，并且高效，每个样本仅需额外两次前向传播。实验结果表明，AttentionPO在AlpacaEval、MT-Bench和ArenaHard上显著提升了性能，超越了现有的偏好优化方法。

英文摘要

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existing token-level PO methods compute the token weights using either token-position-based heuristic functions or probability estimates given by a separately trained model, which lacks robustness and incurs extra training cost. In contrast, we propose Token-weighted DPO (TwDPO) -- a novel training objective grounded on token-weighted RL -- and AttentionPO -- an instantiation of TwDPO that uses attention from the LLM itself to estimate token weights. AttentionPO prompts the LLM to serve as a pairwise judge and check where the model attends when comparing the responses. This design makes AttentionPO content-aware, adjusting weights based on response content, and efficient, incurring only two extra forward passes per example. Experiment results show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard, surpassing existing Preference Optimization methods.

URL PDF HTML ☆

赞 0 踩 0

2605.19908 2026-05-27 cs.CL 版本更新

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

作者身份信号在基于编码器的语言模型中出现在哪里？

Francis Kulumba, Guillaume Vimont, Laurent Romary, Florian Cafiero

发表机构 * Inria Paris（巴黎国家信息与自动化研究所）； Sorbonne Université（索邦大学）； IRIF（IRIF研究所）； LRE, EPITA Ecole nationale des chartes – PSL（LRE，EPITA国立档案馆 – 法国社会科学研究院）

AI总结通过机械可解释性工具，研究不同评分机制对基于编码器的作者身份归因模型性能的影响，发现评分机制决定了编码器在何处整合作者身份信号。

Comments 12 pages, 6 figures. Under review

2605.18592 2026-05-27 cs.LG cs.AI cs.CL 版本更新

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS: 一种用于基于评分标准的强化学习的记忆增强评分标准改进系统

Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen

发表机构 * The University of Texas at Dallas（德克萨斯大学达拉斯分校）； Adobe Inc.（Adobe公司）； Department of Computer Science, University of California, Santa Barbara（加州大学圣芭芭拉分校计算机科学系）

AI总结提出AMARIS系统，通过持久化评估记忆存储纵向训练证据来改进评分标准，在科学、医学、指令遵循和创意写作任务上优于静态、局部自适应和无记忆基线方法。

Comments Preprint. Under review

详情

AI中文摘要

基于评分标准的奖励塑形为通过强化学习（RL）微调大语言模型（LLMs）提供了可解释且可编辑的奖励信号，但现有的自适应评分标准方法通常从局部证据（如当前批次或实例级比较）更新标准。这种局部视角丢弃了训练过程中产生的诊断信息，使得难以跟踪重复失败、评估之前的评分标准编辑或在早期标准饱和后提高标准。我们引入了AMARIS，一种记忆增强的评分标准改进系统，它将评分标准更新建立在纵向训练证据之上。AMARIS将轨迹分析、步骤级摘要和评分标准更新记录存储在持久化评估记忆中，然后检索最近和语义相关的历史来修订评分标准。我们在全局和实例特定评分标准设置下，在科学、医学、指令遵循和创意写作任务上评估了AMARIS。AMARIS在静态、局部自适应和无记忆基线上有所改进，例如在GPQA-Diamond上比最强基线高出+2.8分，在IFBench上高出+2.2分，同时分析表明记忆减少了振荡性的评分标准编辑，并支持从早期错误纠正到后期课程推进的进展。AMARIS与正常RL循环异步运行，相对于同步评分标准更新减少了阻塞延迟。

英文摘要

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS, A Memory-Augmented Rubric Improvement System that grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, such as +2.8 points on GPQA-Diamond and +2.2 points on IFBench over the strongest baselines, while analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.

URL PDF HTML ☆

赞 0 踩 0

2605.17774 2026-05-27 cs.CL 版本更新

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

通过QLoRA微调将工具知识内化到小型语言模型中

Yuval Shemla, Ayal Yakobe, Tanmay Agarwal, Dhaval Patel, Kaoutar El Maghraoui

发表机构 * Columbia School of General Studies, Columbia University, NY, USA（哥伦比亚大学泛研学院）； Columbia Engineering, Columbia University, NY, USA（哥伦比亚大学工程学院）； IBM Research, NY, USA（IBM研究院）

AI总结本文研究通过QLoRA参数高效微调将工具知识内化到小型语言模型中，在AssetOpsBench基准上，微调后的Gemma 4 E4B和Qwen3-4B模型在无描述推理下优于有完整工具描述的未微调基线，输入长度减少82.6%，规划分数提升。

详情

AI中文摘要

大型语言模型越来越多地被用作代理系统中的规划组件，但当前的工具使用流程通常需要将完整的工具模式包含在每个提示中，这产生了大量的令牌开销，并限制了较小模型的实用性。本文研究了是否可以通过参数高效微调将工具使用知识内化到小型语言模型中，从而在推理时无需显式的工具描述即可进行结构化规划。使用AssetOpsBench作为主要基准，我们使用8位QLoRA在约1700个工具使用示例上微调了Gemma 4 E4B和Qwen3-4B，这些示例涵盖工具知识、问题到规划的映射以及执行风格的轨迹。我们在无描述推理下评估了生成的模型，其中提示完全省略了工具目录。微调后的模型优于接收完整工具描述的有信息未微调基线，输入长度减少了82.6%，同时提高了结构性和LLM评判的规划分数。在最佳的Gemma运行中，模型达到了0.65的AT-F1和3.88的整体评判分数，而信息基线的分数分别为0.47和2.88。Qwen3-4B达到了3.78的强劲整体评判分数，同时使用的内存比Gemma少62%，运行速度快2.5倍，尽管它在一般多项选择基准上也表现出更大的灾难性遗忘。额外的消融实验表明，LoRA秩控制着质量与保留之间的权衡，其中$r=32$最大化规划质量，而较小的秩保留了更多的一般知识。这些结果表明，对于固定的工具目录，QLoRA微调可以将工具知识从提示上下文转移到模型权重中，从而在保持或提高工具规划质量的同时，大幅减少推理开销。

英文摘要

Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6\% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62\% less memory and running 2.5$\times$ faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality--retention trade-off, with $r=32$ maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.

URL PDF HTML ☆

赞 0 踩 0

2605.17482 2026-05-27 cs.CL cs.LG 版本更新

RSD: A Local Triangulation Audit Primitive for Learned Vector Blocks

RSD：一种用于学习向量块的局部三角剖分审计原语

Seungmin Jin

发表机构 * HSE University（俄罗斯高等经济大学）

AI总结提出RSD（关系语义分解）作为局部三角剖分审计方法，通过拟合单纯形成员关系和坐标极点，结合关系解码器和坐标残差，实现学习向量块的可解释性审计。

Comments 8 pages, 1 figure. Revised version with clarified scope, experiments, and limitations

详情

AI中文摘要

局部XAI审计将有限的学习向量块与弱侧信号进行比较。基线方法如最近邻查找、低秩坐标模型和关系分解揭示了审计的不同部分。我们引入关系语义分解（简称RSD），作为学习向量块的局部三角剖分审计。给定坐标X和一个声明的有界弱亲和代理A，RSD拟合单纯形成员关系S和坐标极点C。它在关系解码器中重用S来解码A，并报告坐标残差R=X-SC。这产生了一个范围限定的审计单元：所选块、代理、解码器类和损失预算的兼容性，以及组件质量和残差读数。合成控制检查单纯形重构、代理解码和固定S残差分解。定理陈述、月份和狗/狼块说明了为什么低代理损失应结合组件质量、残差读数和块大小来解读。

英文摘要

Local XAI audits compare a finite block of learned vectors with a weak side signal. Baselines such as nearest-neighbor lookup, low-rank coordinate models, and relation factorization expose different parts of this audit. We introduce Relational Semantic Decomposition, abbreviated as RSD, as a local triangulation audit for learned vector blocks. Given coordinates X and a declared bounded weak affinity proxy A, RSD fits simplex memberships S and coordinate poles C. It reuses S in a relation decoder for A and reports the coordinate residual R=X-SC. This yields a scoped audit unit: compatibility for the chosen block, proxy, decoder class, and loss budget, plus component mass and residual readouts. Synthetic controls check simplex reconstruction, proxy decoding, and fixed-S residual decomposition. The theorem-statement, month, and dog/wolf blocks illustrate why low proxy loss should be read with component mass, residual readouts, and block size.

URL PDF HTML ☆

赞 0 踩 0

2604.27019 2026-05-27 cs.LG cs.CL cs.CR 版本更新

MetaGraph：金融NLP中GenAI的大规模元分析（2022-2025）

Paolo Pedinotti, Peter Baumann, Nathan Jessurun, Leslie Barrett, Enrico Santus

发表机构 * Bloomberg（贝莱德）

AI总结提出MetaGraph方法，利用本体引导的LLM从科学语料中提取类型化知识图谱，对681篇GenAI在金融领域的论文进行结构化趋势分析，揭示了三个阶段：早期LLM驱动的任务和数据集扩展、对局限性和风险的日益关注、以及向模块化系统导向方法的转变。

Comments 8 pages, appendices, GEM, ACL

2605.06152 2026-05-27 cs.LG cs.CL math.OC stat.ML 版本更新

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

Grokking 还是 Glitching？低精度如何驱动 Slingshot 损失尖峰

Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou

发表机构 * Tsinghua University（清华大学）； The University of Tokyo（东京大学）

AI总结本文证明深度神经网络训练中的 Slingshot 损失尖峰现象是由浮点精度限制导致的数值特征膨胀（NFI）机制引起的，并解释了参数范数快速增长和梯度消失等现象。

Comments 28 pages, 13 figures; ICML 2026 Workshop on High-dimensional Learning Dynamics (Spotlight)

详情

AI中文摘要

深度神经网络在无正则化的长期训练中会出现周期性的损失尖峰，这种现象被称为“Slingshot 机制”。现有工作通常将其归因于内在的优化动力学，但其触发机制仍不清楚。本文证明这种现象是浮点算术精度限制的结果。当训练进入高置信度阶段时，正确类别的 logit 与其他 logit 之间的差异可能超过吸收误差阈值。然后在反向传播中，正确类别的梯度被精确舍入为零，而错误类别的梯度保持非零。这打破了跨类别的梯度零和约束，并在分类器层的参数更新中引入了系统性漂移。我们证明这种漂移与特征形成正反馈循环，导致全局分类器均值和全局特征均值呈指数增长。我们将这种机制称为数值特征膨胀（NFI）。该机制解释了 Slingshot 尖峰前的快速范数增长、随后梯度的重新出现以及由此产生的损失尖峰。我们进一步表明，NFI 并不等同于观察到的损失尖峰：在更实际的任务中，部分吸收可能不会产生可见的尖峰，但它仍然可以打破零和约束并驱动参数范数的快速增长。我们的结果将 Slingshot 重新解释为有限精度训练的一种数值动力学，并为训练后期异常参数增长和 logit 发散提供了可检验的解释。

英文摘要

Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.

URL PDF HTML ☆

赞 0 踩 0

2605.09156 2026-05-27 cs.CL cs.AI 版本更新

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

迷失在翻译中？探索从拉丁语到奥克语的语法性别转变

Ahan Chatterjee, Matthias Schöffel, Matthias Aßenmacher, Marinus Wiedner, Esteban Garces Arias

发表机构 * Bavarian Academy of Sciences (BAdW)（巴伐利亚科学学院）； LMU Munich（慕尼黑大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）； University of Freiburg（弗赖堡大学）

AI总结本文提出一个可解释的深度学习框架，通过词法和上下文层面分析拉丁语到奥克语的语法性别系统从三分（阳性、阴性、中性）到二分（阳性、阴性）的演变，并展示了改进的分词策略和形态特征、词性对性别预测的贡献。

Comments Accepted at NLP4DH @ ACL 2026

详情

AI中文摘要

从拉丁语到罗曼语族的历时演变涉及语法性别系统的重组，在大多数罗曼语中从三分结构（阳性、阴性、中性）变为二分结构（阳性、阴性）。在这项工作中，我们引入了一个可解释的深度学习框架，在词法和上下文层面研究这一现象。首先，我们表明传统的分词策略对于这种低资源历史设置不够稳健，而我们提出的分词器在这些基线上提高了性能。在词法层面，我们评估了形态特征对性别预测的贡献。在上下文层面，我们量化了不同词性类别对语法性别预测的贡献。这些分析共同刻画了性别信息在词元及其句子上下文之间的分布。我们在 \href{https://github.com/ahan-2000/Lost-in-Translation-}{https://github.com/ahan-2000/Lost-in-Translation-} 公开了我们的代码库、数据集和结果。

英文摘要

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at \href{https://github.com/ahan-2000/Lost-in-Translation-}{https://github.com/ahan-2000/Lost-in-Translation-}.

URL PDF HTML ☆

赞 0 踩 0

2605.07990 2026-05-27 cs.CL cs.AI cs.LG cs.SE 版本更新

Tool Calling is Linearly Readable and Steerable in Language Models

语言模型中的工具调用是线性可读且可引导的

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

发表机构 * University College London（伦敦大学学院）； Holistic AI ； Imperial College London（伦敦帝国学院）

AI总结本文发现语言模型内部存在对应工具选择的线性方向，通过干预该方向可切换工具调用，并能提前检测潜在错误，在多个模型和基准上验证了有效性。

Comments 24 pages. ACL ARR May 2026 submission (EMNLP 2026 preferred venue); v2 reflects revised manuscript

详情

AI中文摘要

当工具调用代理选错工具时，失败在执行之前是不可见的：邮件被发送，会议被错过。随着代理承担重要行动，一次糟糕的工具调用可能造成实际损害。目前我们无法在模型内部查看并在错误发生前捕捉它；本文表明我们可以做到。在模型内部，工具的选择由激活空间中的单个方向承载，每对工具对应一个方向。在生成过程中添加该方向会切换模型选择的工具。在涵盖 Gemma 3、Qwen 3、Qwen 2.5 和 Llama 3.1（270M 到 27B）的 12 个指令微调模型和 6 个基础模型上，这在 4B+ 指令微调模型上对 15 个工具的合成基准达到 83-100% 的准确率，在真实 API 基准 τ-bench airline 上达到 77-94%。随后的 JSON 参数自动适应新工具的模式，因此仅翻转名称就足够了。相同的每工具方向还能在错误发生前标记潜在错误：模型在两个工具之间不确定的查询失败率比确定的高 21 倍（Gemma 3 27B）。这不仅仅是主题注入：相同幅度的随机向量给出 0% 的切换率，而在单个领域（共享一个主题的 14 个航空工具）内的探针仍然能在五个 4B-14B 模型上以 top-1 61-89% 的准确率读取模型将调用的工具。即使是基础模型在能够输出工具之前内部已经携带了正确的工具：从模型内部状态读取所选工具（余弦读出）在 BFCL 上恢复 61-82% 的准确率，而基础生成仅为 2-10%，这表明预训练形成了表示，而指令微调后来将其连接到输出。我们的结果涵盖单轮、固定菜单设置；在多轮代理循环中，相同的干预不太稳定（匹配基线的增益或损失高达 30 个百分点，没有一致的方向）。

英文摘要

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark $τ$-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough. The same per-tool directions also flag likely errors before they happen: queries where the model is unsure between two tools fail 21x more often than queries where it is not (Gemma 3 27B). This is not just topic injection: random vectors at the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call at top-1 61-89% across five 4B-14B models. Even base models already carry the right tool internally before they can emit it: reading the chosen tool off the model's internal state (cosine readout) recovers 61-82% accuracy on BFCL while base generation lands at 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; on multi-turn agent loops the same intervention is less stable (matched-baseline gain or loss of up to 30 percentage points with no consistent direction).

URL PDF HTML ☆

赞 0 踩 0

2605.07632 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Post-training makes large language models less human-like

后训练使大型语言模型更不像人类

Marcel Binz, Elif Akata, Abdullah Almaatouq, Mohammed Alsobay, Oleksii Ariasov, Franziska Brändle, David Broska, Jason W. Burton, Nuno Busch, Frederick Callaway, Vanessa Cheung, Brian Christian, Julian Coda-Forno, Can Demircan, Vittoria Dentella, Maria K. Eckstein, Noémi Éltető, Michael Franke, Thomas L. Griffiths, Fritz Günther, Susanne Haridi, Sebastian Hellmann, Stefan Herytash, Linus Hof, Eleanor Holton, Isabelle Hoxha, Zak Hussain, Akshay Jagadish, Elif Kara, Valentin Kriegmair, Evelina Leivada, Li Ji-An, Tobias Ludwig, Maximilian Maier, Marcelo G. Mattar, Marvin Mathony, Alireza Modirshanechi, Robin Na, Mariia Nadverniuk, Antonios Nasioulas, Surabhi S. Nath, Helen Niemeyer, Kate Nussenbaum, Sebastian Olschewski, Thorsten Pachur, Stefano Palminteri, Aliona Petrenco, Camille V. Phaneuf-Hadd, Angelo Pirrone, Manuel Rausch, Laura Raveling, Shashank Reddy, Milena Rmus, Evan M. Russek, Tankred Saanum, Kai Sandbrink, Louis Schiekiera, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Leah H. Somerville, Mikhail S. Spektor, Xin Sui, Christopher Summerfield, Mirko Thalmann, Anna I. Thoma, Taisiia Tikhomirova, Vuong Truong, Polina Tsvilodub, Konstantinos Voudouris, Kristin Witte, Shuchen Wu, Dirk U. Wulff, Hua-Dong Xiong, Songlin Xu, Lance Ying, Xinyu Zhang, Jian-Qiao Zhu, Eric Schulz

发表机构 * Helmholtz Munich（海德堡-慕尼黑亥姆霍兹中心）； Massachusetts Institute of Technology（麻省理工学院）； University of Tübingen（图宾根大学）； University of Oxford（牛津大学）； Stanford（斯坦福大学）

AI总结通过引入Psych-201数据集，发现后训练（将基础模型转化为有用助手的过程）一致地降低了模型与人类行为的对齐度，且这种错位在新模型世代中加剧，而人物诱导技术无法改善个体层面的预测。

详情

AI中文摘要

大型语言模型（LLMs）越来越多地被用作人类参与者的替代品，但目前尚不清楚哪些模型最能捕捉人类行为及其原因。为了解决这个问题，我们引入了Psych-201，这是一个新颖的数据集，使我们能够大规模测量行为对齐。我们发现，后训练——将基础模型转化为有用助手的阶段——在模型家族、规模和目标上一致地降低了与人类行为的对齐度。此外，这种错位在新模型世代中扩大，即使基础模型继续改进。最后，我们发现人物诱导——一种通过将模型条件化为参与者特定信息来引发类人行为的流行技术——并不能改善个体层面的预测。综合来看，我们的结果表明，当前用于将LLMs转化为有用助手的那些过程也使得它们成为人类行为的不太准确的模型。

英文摘要

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

URL PDF HTML ☆

赞 0 踩 0

2605.07053 2026-05-27 cs.CL cs.AI 版本更新

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

GSM-SEM: 生成语义变体增强的基准与框架

Jyotika Singh, Fang Tu, Aziza Mirsaidova, Amit Agarwal, Hitesh Laxmichand Patel, Sandip Ghoshal, Miguel Ballesteros, Karan Dua, Yassine Benajiba, Weiyi Sun, Tao Sheng, Graham Horwood, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结提出GSM-SEM框架，通过修改实体、属性和关系生成语义多样的数学问题变体，降低模型对固定测试集的记忆偏差，并在多个基准上验证性能下降。

详情

AI中文摘要

像GSM8K这样的基准测试是数学推理的流行度量，但由于对固定测试集的记忆，排行榜上的提升可能夸大真实能力。大多数鲁棒性变体应用表面级别的扰动（释义、重命名、数字交换、干扰项），这些扰动在很大程度上保留了底层事实，而静态发布本身可能随着时间的推移成为记忆目标。我们引入了GSM-SEM，一个可重用且随机的框架，用于生成语义多样化的基准变体，其语义方差显著高于先前方法。GSM-SEM通过修改实体、属性和/或关系来扰动问题陈述，经常改变底层事实，并要求模型在新条件下重新计算解决方案，同时约束生成以保留原始计算/答案和近似问题难度。GSM-SEM在每次运行时生成新的变体，无需重新标注，减少了对静态公共基准评估的依赖，从而降低了记忆偏差。我们将GSM-SEM应用于GSM8K和两个现有的变体系列（GSM-Symbolic和GSM-Plus），生成了GSM8K-SEM、GSM-Symbolic-SEM和GSM-Plus-SEM。评估14个SOTA LLM，我们观察到一致的性能下降，当语义扰动与符号/plus变体结合时下降更大（在GSM-SEM的最大严格配置中平均下降率为28%）。我们公开发布这三个SEM变体作为完全人工验证的数据集。最后，为了展示在GSM风格数学问题之外的适用性，我们将GSM-SEM应用于其他基准，包括BigBenchHard、LogicBench和NLR-BIRD。

英文摘要

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

URL PDF HTML ☆

赞 0 踩 0

2509.26619 2026-05-27 cs.CL cs.AI 版本更新

Searching the Internet for Challenging Benchmarks at Scale

在互联网上大规模搜索具有挑战性的基准测试

Wenda Xu, Vilém Zouhar, Parker Riley, Mara Finkelstein, Markus Freitag, Daniel Deutsch

发表机构 * Google（谷歌）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出一种自动框架，将互联网建模为多臂老虎机问题，通过epsilon-greedy策略高效搜索最具挑战性的主题，以构建无需人工筛选的基准测试。

详情

AI中文摘要

许多静态基准测试开始饱和：随着模型快速改进，它们在固定测试集上获得近乎完美的分数，几乎没有剩余空间来暴露模型的真正弱点——即使是专家策划的挑战集在爬山后也会迅速饱和。我们提出一个完全自动化的框架，在互联网上大规模搜索以构建具有挑战性的基准测试，无需人工筛选。关键洞察是将互联网建模为一个广阔的主题空间，并将搜索形式化为多臂老虎机问题，其中每个主题的难度仅通过昂贵的采样和评估查询来揭示。我们的epsilon-greedy策略在仅探索6%的搜索空间的情况下识别出最具挑战性的主题——相比穷举评估成本降低了100倍。我们在机器翻译和知识问答上进行了验证，确认发现的难度在独立指标（GEMBA-SQA和MetricX）、语言和模型上都是稳健的。

英文摘要

Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses -- and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and formalize the search as a multi-armed bandit problem, where each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space -- a 100 times cost reduction over exhaustive evaluation. We validate on machine translation and knowledge question answering, confirming that discovered difficulty is robust across independent metrics (GEMBA-SQA and MetricX), languages, and models.

URL PDF HTML ☆

赞 0 踩 0

2605.01489 2026-05-27 cs.AI cs.CL 版本更新

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

SciResearcher: 面向前沿科学推理的深度研究智能体规模化

Tianshi Zheng, Rui Wang, Xiyun Li, Kelvin Kiu Wai Tam, Newt Nguyen Kim Hue Nam, Wei Fan, Yangqiu Song, Tianqing Fang

发表机构 * HKUST（香港科技大学）； CUHK（香港大学）； Tencent AI Lab（腾讯AI实验室）

AI总结提出SciResearcher框架，通过合成基于学术证据的概念与计算任务并训练智能体，在HLE-Bio/Chem-Gold等基准上达到最优性能。

Comments 23 pages, 6 figures, 15 tables

详情

AI中文摘要

前沿科学推理正迅速成为推动AI智能体在自动化科学发现中的关键基础。深度研究智能体为此挑战提供了有前景的方法。这些模型通过后训练处理信息寻求任务（通常通过知识图谱构建或迭代网页浏览来策划）来发展强大的问题解决能力。然而，这些策略在前沿科学中面临固有局限性，因为领域特定知识分散在稀疏且异构的学术来源中，而问题解决需要远超事实回忆的复杂计算和推理。为弥合这一差距，我们引入了SciResearcher，一个用于前沿科学数据构建的全自动智能体框架。SciResearcher综合了基于学术证据的多样化概念和计算任务，同时激发信息获取、工具集成推理和长程能力。利用策划的数据进行监督微调和智能体强化学习，我们开发了SciResearcher-8B，一个在HLE-Bio/Chem-Gold基准上达到19.46%的智能体基础模型，在其参数规模上建立了新的最先进水平，并超越了多个更大的专有智能体。它在SuperGPQA-Hard-Biology和TRQA-Literature基准上进一步取得了13-15%的绝对提升。总体而言，SciResearcher为前沿科学推理的自动数据构建引入了一种新范式，并为未来的科学智能体提供了一条可扩展的路径。

英文摘要

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

URL PDF HTML ☆

赞 0 踩 0

2605.01188 2026-05-27 cs.CL 版本更新

Compute Optimal Tokenization

计算最优分词

Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

发表机构 * FAIR at Meta（Meta 的 FAIR）； University of Washington（华盛顿大学）

AI总结通过训练988个BLT模型，研究压缩率（每token平均字节数）对缩放趋势的影响，发现计算最优配置下模型参数量与数据字节数成比例，且最优压缩率低于BPE并随计算量下降。

详情

AI中文摘要

缩放定律能够优化数据量和语言模型大小的选择，但数据单元——token——对此关系的影响尚未充分探索。本文系统研究了由压缩率（即每token平均文本字节数）控制的token信息粒度如何影响缩放趋势。我们训练了988个潜在分词模型（BLT），参数规模从50M到7B，这些模型可以设置所需的压缩率。这种灵活性使我们能够研究压缩率远超过使用流行BPE分词器得到的每token 4.57字节的作用。实验表明，在计算最优配置中，模型参数量与以字节为单位的数据大小成比例，而不是通常认为的以token为单位（Kaplan et al., 2020; Hoffmann et al., 2022）。此外，我们发现最优压缩率不同于BPE得到的压缩率，并且随计算量增加而降低。这些发现普遍适用于潜在分词和子词分词，以及英语以外的其他语言，指导语言模型开发者选择分词方案以实现最大计算效率。

英文摘要

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.

URL PDF HTML ☆

赞 0 踩 0

2604.26102 2026-05-27 cs.SE cs.CL 版本更新

SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent

SWE-Edit：重新思考高效SWE智能体的代码编辑

Yikai Zhang, Jiaxin Pei, Kenan Li, Qirui Jin, Maoquan Wang, Jin Pan, Yu Kang, Shengyu Fu, Elsie Nallipogu, Junjie Hu, Yufan Huang, Zijian Jin

发表机构 * Microsoft（微软）； Department of Computer Sciences, University of Wisconsin–Madison（威斯康星大学麦迪逊分校计算机科学系）； Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford University（斯坦福大学人机交互研究所）

AI总结提出SWE-Edit框架，通过将代码编辑分解为查看器和编辑器两个子智能体，解决上下文耦合问题，在SWE-Bench Verified上提升解决率2.1个百分点并降低推理成本17.9%，且通过GRPO训练小模型实现高效编辑格式选择。

详情

AI中文摘要

大型语言模型智能体在软件工程方面取得了显著进展，但当前系统存在上下文耦合问题：标准代码编辑界面将代码检查、修改规划和编辑执行混杂在单个上下文窗口中，迫使智能体将探索性查看与严格格式化的编辑生成交错进行。无关上下文不断累积，编辑可靠性下降。我们提出SWE-Edit，将编辑界面分解为两个专门的子智能体：一个按需提取任务相关代码的查看器，以及一个根据高层自然语言计划执行修改的编辑器——让主智能体专注于推理，同时将上下文密集型操作委托给干净的上下文窗口。在SWE-Bench Verified上，这种分解将解决率提高了2.1个百分点，并将推理成本降低了17.9%，在多个推理模型家族（Kimi-K2、MiniMax-M2.1、GLM-4.7）上均有一致提升。我们进一步证明，有效的编辑格式选择可以训练到小模型中，而无需前沿规模的能力：在Qwen3-8B上使用自适应查找替换/全文件重写策略进行GRPO训练，将编辑成功率提高了12.5个百分点，并使8B开源编辑器在下游SWE-Bench解决率上与GPT-5-nano持平。为支持快速编辑器迭代，我们发布了PR-Edit，这是一个轻量级评估，其分数与SWE-Bench解决率强相关。我们在https://github.com/microsoft/SWE-Edit发布代码。

英文摘要

Large language model agents have made strong progress on software engineering, yet current systems suffer from a context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit execution within a single context window, forcing agents to interleave exploratory viewing with strictly formatted edit generation. Irrelevant context accumulates and edit reliability degrades. We propose SWE-Edit, which decomposes the editing interface into two specialized subagents: a Viewer that extracts task-relevant code on demand, and an Editor that executes modifications from high-level natural language plans -- letting the main agent focus on reasoning while delegating context-intensive operations to clean context windows. On SWE-Bench Verified, this decomposition raises resolve rate by 2.1 pp and cuts inference cost by 17.9%, with consistent gains across multiple reasoning-model families (Kimi-K2, MiniMax-M2.1, GLM-4.7). We further show that effective edit-format selection can be trained into a small model rather than requiring frontier-scale capacity: GRPO training on Qwen3-8B with an adaptive find-replace/whole-file-rewrite policy improves edit success by 12.5 pp and brings an 8B open-source editor to parity with GPT-5-nano on downstream SWE-Bench resolve rate. To enable rapid editor iteration, we release PR-Edit, a lightweight evaluation whose scores correlate strongly with SWE-Bench resolve rate. We release our code at https://github.com/microsoft/SWE-Edit.

URL PDF HTML ☆

赞 0 踩 0

2604.19499 2026-05-27 cs.CL 版本更新

Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

秩湍流Delta与可解释的文体测量Delta方法

Dmitry Pronin, Evgeny Kazartsev

发表机构 * HSE University（俄罗斯高等经济大学）

AI总结本文引入两种新的作者归属度量——秩湍流Delta和Jensen-Shannon Delta，通过将距离函数应用于概率分布来推广Burrows经典Delta，并提出词级分解实现数值可解释性，在四个语料库上验证了方法的有效性。

Comments Published in Digital Scholarship in the Humanities. The version of record is available at https://academic.oup.com/dsh/advance-article-abstract/doi/10.1093/llc/fqag072/8692587 Code available at: https://github.com/DDPronin/Rank-Turbulence-Delta

详情

DOI: 10.1093/llc/fqag072
Journal ref: Digital Scholarship in the Humanities, 2026

AI中文摘要

本文介绍了两种新的作者归属度量——秩湍流Delta和Jensen-Shannon Delta，它们通过应用为概率分布设计的距离函数来推广Burrows经典Delta。我们首先阐述了这些度量的理论基础，对比了词频向量的中心化和非中心化z分数，并将非中心化向量重新解释为概率分布。基于这一表示，我们开发了一种词级分解，使每个Delta距离在数值上可解释，从而促进细读和结果验证。这些方法的有效性在英语、德语、法语和俄语的四个文学语料库上进行了评估。英语、德语和法语数据集来自Project Gutenberg，而俄语基准是SOCIOLIT语料库，包含89位作者的639部作品，时间跨度从18世纪到21世纪。秩湍流Delta达到了与余弦Delta相当的归属准确率；Jensen-Shannon Delta始终匹配或超越经典Burrows Delta的性能。最后，在扩展的SOCIOLIT语料库上重新评估了几种已有的归属算法，提供了它们在显著时间和风格变化下鲁棒性的现实估计。

英文摘要

This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows's classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 639 works by 89 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows's Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus, providing a realistic estimate of their robustness under pronounced temporal and stylistic variation.

URL PDF HTML ☆

赞 0 踩 0

2604.21454 2026-05-27 cs.CL cs.AI 版本更新

Fact4ac在金融虚假信息检测挑战赛中的方法：通过微调和少样本提示的大语言模型实现无参考金融虚假信息检测

Cuong Hoang, Le-Minh Nguyen

发表机构 * KaiNKaiho

AI总结本文提出一种结合零样本/少样本提示和LoRA参数高效微调的大语言模型框架，用于无外部证据的金融虚假信息检测，在公开和私有测试集上分别达到95.4%和96.3%的准确率，获得竞赛第一名。

详情

DOI: 10.36190/2026.37
Journal ref: Proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD 2026), 20th International AAAI Conference on Web and Social Media

AI中文摘要

金融虚假信息的泛滥对市场稳定和投资者信任构成严重威胁，误导市场行为并造成关键信息不对称。检测此类误导性叙述本身具有挑战性，尤其是在现实场景中，外部证据或用于交叉验证的补充参考资料严格不可用。本文介绍了我们在“无参考金融虚假信息检测”共享任务中的获胜方法。该任务基于最近提出的RFC-BENCH框架（Jiang等人，2026），挑战模型仅依赖内部语义理解和上下文一致性而非外部事实核查来判断金融声明的真实性。为应对这一艰巨的评估设置，我们提出了一个综合框架，利用最先进的大语言模型（LLM）的推理能力。我们的方法系统地集成了上下文学习（特别是零样本和少样本提示策略）以及通过低秩适应（LoRA）的参数高效微调（PEFT），以最优方式使模型与金融操纵的微妙语言线索对齐。我们提出的系统表现出卓越效果，成功在两个官方排行榜上均获得第一名。具体来说，我们在公开测试集上达到95.4%的准确率，在私有测试集上达到96.3%的准确率，突显了我们方法的鲁棒性，并有助于加速金融自然语言处理中上下文感知的虚假信息检测。我们的模型（14B和32B）可在https://huggingface.co/KaiNKaiho获取。

英文摘要

The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.

URL PDF HTML ☆

赞 0 踩 0

2604.13502 2026-05-27 cs.CL 版本更新

Using reasoning LLMs to extract SDOH events from clinical notes

使用推理型大语言模型从临床笔记中提取社会健康决定因素事件

Ertan Dogan, Kunyu Yu, Yifan Peng

发表机构 * Department of Population Health Sciences, Weill Cornell Medicine（流行病学与公共卫生系，韦尔·科恩医学中心）

AI总结本研究提出一种基于推理型大语言模型的提示工程方法，通过四个模块（简洁提示、少样本学习、自一致性机制和后处理）从临床笔记中提取结构化SDOH事件，取得0.866的微平均F1分数，展示了简单实现与强性能的平衡。

详情

AI中文摘要

社会健康决定因素（SDOH）指影响个人生活、工作和衰老的环境、行为和社会条件。SDOH对个人健康结果有显著影响，其系统识别和管理可大幅改善患者护理。然而，SDOH信息主要记录在电子健康记录的非结构化临床笔记中，限制了其作为机器可读实体的直接使用。为解决此问题，研究人员采用基于预训练BERT模型的自然语言处理（NLP）技术，展示了有前景的性能，但需要复杂的实现和大量计算资源。在本研究中，我们探索了利用具有高级推理能力的大语言模型（LLM）提取结构化SDOH事件的提示工程策略。我们的方法包含四个模块：1）开发结合既定指南的简洁描述性提示，2）应用精心策划示例的少样本学习，3）使用自一致性机制确保稳健输出，4）后处理进行质量控制。我们的方法达到了0.866的微平均F1分数，展示了与领先模型相比具有竞争力的性能。结果表明，具有推理能力的LLM是SDOH事件提取的有效解决方案，兼具实现简单性和强性能。

英文摘要

Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.

URL PDF HTML ☆

赞 0 踩 0

2603.12564 2026-05-27 cs.CL cs.AI 版本更新

别听我的！多轮对话如何降低LLM的可靠性

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

发表机构 * Vanderbilt University（范德比尔大学）； Vanderbilt University Medical Center（范德比尔大学医学中心）； Intuit AI Research（Intuit人工智能研究）

AI总结提出“坚持或切换”(SoS)框架，通过将问答空间分割为多个顺序呈现来评估LLM在多轮对话中的可靠性，发现对话税导致准确性和拒绝错误建议的能力平均下降30%，并观察到盲目切换现象。

详情

AI中文摘要

大型语言模型（LLM）在静态基准测试中表现出色，但它们在更能反映实际使用的多轮对话中的性能仍未得到充分研究。解决这一差距在医疗保健等高风险环境中至关重要，因为患者和临床医生正在转向LLM聊天机器人来处理他们的医疗咨询。在这里，我们引入了“坚持或切换”（SoS）框架，该框架将问答空间划分为多个顺序呈现，以模拟两种以安全为中心的行为：坚持（即坚持正确的答案选择或拒绝错误的建议）和灵活性（即在引入正确建议时切换到该建议）。在三个临床基准测试中评估了17个LLM，我们观察到普遍存在的对话税，其中将答案空间分割为顺序呈现使端到端准确性和对错误建议的拒绝率平均下降高达30%，在某些模型中达到65%。我们还观察到盲目切换，即模型从初始拒绝转向错误和正确建议的比率几乎相同，达到50%。最后，我们表明，增加模型规模可以缓解其中一些对话效率低下的问题，但会加剧其他问题，例如从初始拒绝中采纳错误建议的倾向更高。我们的研究结果共同表明，静态基准测试所捕获的一般能力并不能推广到多轮对话中。

英文摘要

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the "stick-or-switch" (SoS) framework, which partitions a question-answer space into multiple sequential presentations to model two safety-centric behaviors: conviction (i.e., sticking to a correct answer selection or abstention against incorrect suggestions) and flexibility (i.e., switching to a correct suggestion when it is introduced). Evaluating 17 LLMs across three clinical benchmarks, we observe a pervasive conversation tax, where partitioning an answer-space into sequential presentations reduces end-to-end accuracy and abstention against incorrect suggestions by an average of up to 30%, reaching 65% in certain models. We also observe blind switching, where models transition an initial abstention to incorrect and correct suggestions at near-identical rates reaching 50%. Finally, we show that increasing model scale mitigates some of these conversational inefficacies while exacerbating others, such as a higher propensity to adopt an incorrect suggestion from an initial abstention. Together our findings demonstrate that the general proficiency captured by static benchmarks do not translate over multi-turn dialogues.

URL PDF HTML ☆

赞 0 踩 0

2604.07028 2026-05-27 cs.MA cs.AI cs.CL 版本更新

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

基于特质条件的多智能体系统在迭代法律论证中的战略说服

Philipp D. Siedler

发表机构 * Aleph Alpha Research（Aleph Alpha研究）

AI总结提出战略法庭框架，通过特质条件化的大语言模型智能体模拟多轮法律辩论，发现异质团队表现更优，并引入强化学习特质编排器动态优化辩护策略。

详情

AI中文摘要

在诸如法律、外交和谈判等对抗性领域中的战略互动是通过语言中介的，然而大多数博弈论模型忽略了通过话语运作的说服机制。我们提出了战略法庭框架，这是一个多智能体模拟环境，其中由特质条件化的大语言模型（LLM）智能体组成的控方和辩方团队参与迭代的、基于轮次的法律论证。智能体使用九种可解释的特质进行实例化，这些特质被组织成四种原型，从而能够系统控制修辞风格和战略取向。我们在10个合成法律案例和84个三特质团队配置上评估该框架，使用DeepSeek-R1和Gemini 2.5 Pro进行了超过7,000次模拟试验。我们的结果表明，具有互补特质的异质团队始终优于同质配置，适度的交互深度产生更稳定的判决，并且某些特质（特别是量化和魅力型）对说服成功贡献不成比例。我们进一步引入了一个基于强化学习的特质编排器，该编排器根据案件和对手团队动态生成辩护特质，发现优于静态、人类设计的特质组合的策略。这些发现共同证明了语言可以被视为第一类战略行动空间，并为构建能够在多智能体环境中进行自适应说服的自主智能体提供了基础。

英文摘要

Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7{,}000 simulated trials using DeepSeek-R1 and Gemini~2.5~Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

URL PDF HTML ☆

赞 0 踩 0

2603.27146 2026-05-27 cs.CL 版本更新

Learning to Predict Future-Aligned Research Proposals with Language Models

学习用语言模型预测未来对齐的研究提案

Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出将研究提案生成重构为时间切片科学预测问题，通过未来对齐分数（FAS）评估模型能否预测截止时间后发表的论文方向，并构建时间一致数据集和推理轨迹进行训练，实验表明未来对齐微调显著提升提案质量。

详情

AI中文摘要

大型语言模型（LLMs）越来越多地被用于辅助研究中的构思，但评估LLM生成的研究提案的质量仍然困难：新颖性和合理性难以自动衡量，而大规模人工评估成本高昂。我们通过将提案生成重构为时间切片科学预测问题，提出了一种可验证的替代方案。给定一个研究问题和截止时间前可用的启发论文，模型生成一个结构化提案，并通过其是否预测到截止时间后发表的论文中出现的研究方向来评估。我们通过检索和基于LLM的语义评分，针对保留的未来语料库计算未来对齐分数（FAS）来操作化这一目标。为了训练模型，我们构建了一个时间一致的数据集，包含来自目标及其截止前引用的3,642个实例中的21,835篇论文出现次数，并合成推理轨迹，教授差距识别和灵感借鉴。在Llama-3.1和Qwen2.5模型上，未来对齐微调相比未对齐基线提高了未来对齐（总体FAS最高提升+10.6%），领域专家的人工评估证实了提案质量的改进。最后，我们通过使用代码代理实现两个模型生成的提案来展示实际影响，从新的提示策略中获得MATH 4.17%的准确率提升，并对一种新颖的模型合并方法实现了一致的改进。我们的代码和数据公开在https://github.com/Arthur-Heng/future-aligned-proposals。

英文摘要

Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 21,835 paper occurrences across 3,642 instances from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method. Our code and data are publicly available at https://github.com/Arthur-Heng/future-aligned-proposals.

URL PDF HTML ☆

赞 0 踩 0

2603.28730 2026-05-27 cs.RO cs.CL cs.CV 版本更新

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

SOLE-R1：视频语言推理作为机器人强化学习的唯一奖励

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza

发表机构 * MIT（麻省理工学院）； RAI Institute（机器人智能研究所）

AI总结提出SOLE-R1模型，通过视频语言时空推理生成密集任务进度估计作为唯一奖励信号，实现在无真实奖励、演示或任务特定调优下的零样本在线强化学习。

详情

AI中文摘要

视觉语言模型（VLM）在各种任务中展现出令人印象深刻的能力，这促使人们努力利用这些模型来监督机器人学习。然而，当在强化学习（RL）中用作评估器时，当今最强的模型在部分可观测性和分布偏移下常常失败，使得策略能够利用感知错误而非解决任务。我们提出SOLE-R1（自观察学习器），一种专门设计用于为在线RL提供唯一奖励信号的视频语言推理模型。仅给定原始视频观测和自然语言目标，SOLE-R1执行每时间步的时空思维链（CoT）推理，并生成可直接用作奖励的密集任务进度估计。为了训练SOLE-R1，我们开发了一个大规模视频轨迹和推理合成流水线，生成与连续进度监督对齐的时间基础CoT轨迹。这些数据与基础的空间和多帧时间推理相结合，并使用混合框架训练模型，该框架将监督微调与可验证奖励的RL相结合。在四个不同的仿真环境和真实机器人设置中，SOLE-R1实现了从随机初始化的零样本在线RL：机器人学习之前未见过的操作任务，无需真实奖励、成功指标、演示或任务特定调优。SOLE-R1在24个未见过的任务上成功，并显著优于强视觉语言奖励器，包括Robometer、RoboReward、ReWiND、GPT-5和Gemini-3-Pro，同时对奖励破解表现出明显更强的鲁棒性。我们在匿名页面发布所有模型、数据、代码和演示：https://philip-mit.github.io/sole-r1/

英文摘要

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. We introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including Robometer, RoboReward, ReWiND, GPT-5, and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking. We release all models, data, code, and demos at the anonymous page: https://philip-mit.github.io/sole-r1/

URL PDF HTML ☆

赞 0 踩 0

2601.18987 2026-05-27 cs.CL cs.AI cs.PL 版本更新

LLMs versus the Halting Problem: Characterizing Program Termination Reasoning

LLMs 与停机问题：程序终止推理的特征化

Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O'Hearn

发表机构 * FAIR Team, Meta AI（Meta AI FAIR 团队）； The Hebrew University of Jerusalem, Israel（耶路撒冷希伯来大学）； Bloomberg, New York, USA（彭博社，纽约，美国）； Imperial College London, UK（伦敦帝国理工学院，英国）； University College London, UK（伦敦大学学院，英国）

AI总结本文评估了前沿LLMs在程序终止推理上的能力，发现GPT-5和Claude Sonnet 4.5在C程序终止判断上达到顶级验证工具水平，但无法生成形式化证明，并引入分歧前置条件形式化描述非终止条件。

详情

AI中文摘要

判断程序是否终止是计算机科学中的一个核心问题。图灵的停机问题确立了终止的不可判定性，表明没有算法能普遍确定所有程序和输入的终止性。因此，验证工具近似地处理终止问题，有时无法证明或反驳；这些工具依赖于特定问题的架构，并且通常与特定的编程语言绑定。LLMs的最新进展提出了一个自然的问题：它们在多大程度上能够推理程序终止？我们在2025年国际软件验证竞赛（SV Comp）的一组多样化C程序上评估了前沿LLMs。我们的结果表明，GPT-5和Claude Sonnet 4.5（通过测试时缩放）达到了与顶级验证工具相当的分数。然而，尽管模型通常能正确推断程序是否终止，但它们经常无法构造一个见证作为形式化证明，揭示了语义识别与符号证明生成之间的差距。随着代码长度的增加，性能进一步下降。为了分析这一差距，我们引入了一个分歧前置条件形式化方法，将非终止条件描述为逻辑约束。我们希望这些发现能激励未来在现实世界终止基准测试、结合LLMs与符号验证方法的神经符号方法，以及更广泛地关于LLMs在其他不可判定问题上推理的研究。

英文摘要

Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Hence, verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem specific architectures, and are usually tied to particular programming languages. Recent advances in LLMs raise a natural question: To what extent can they reason about program termination? We evaluate frontier LLMs on a diverse set of C programs from the International Competition on Software Verification (SV Comp) 2025. Our results show that GPT-5 and Claude Sonnet 4.5 achieve scores comparable to top ranked verification tools (with test time scaling). However, while models often correctly infer whether programs terminate, they frequently fail to construct a witness as formal proof, revealing a gap between semantic recognition and symbolic proof generation. Performance further degrades as code length increases. To analyze this gap, we introduce a divergence precondition formulation that characterizes non termination conditions as logical constraints. We hope these findings motivate future research on real-world termination benchmarks, neuro-symbolic approaches that combine LLMs with symbolic verification methods, and, more broadly LLM reasoning on other undecidable problems.

URL PDF HTML ☆

赞 0 踩 0

2603.17218 2026-05-27 cs.CL cs.AI cs.GT 版本更新

学习诊断和纠正错误：大型语言模型中的道德敏感性获取

Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi, Xitong Zhang, Kristen Johnson, Guangliang Liu

发表机构 * University of Mississippi（密西根州立大学）； Nanyang Technological University（南洋理工大学）； Northeastern University（东北大学）； AWS AI Labs（AWS AI实验室）； Johns Hopkins University（约翰霍普金斯大学）； Qualcomm（高通）； Michigan State University（密西根州立大学）； Indiana University Indianapolis（印第安纳大学印第安纳波利斯分校）

AI总结提出一种实用推理方法，通过让大型语言模型诊断和纠正道德错误来获取道德敏感性，并在多个任务上验证了其有效性。

详情

AI中文摘要

道德敏感性是人类道德能力最基础的能力。尽管许多方法旨在使大型语言模型（LLMs）与人类道德价值观对齐，但它们主要关注拟合道德适当文本的分布，而忽视了如何使LLMs获得道德敏感性。在本文中，我们朝着解决以下问题迈出了一步：LLMs如何获得道德敏感性？具体来说，我们提出了一种实用推理方法，通过使LLMs能够诊断和纠正道德错误来促进其道德敏感性的获取。我们的实用推理方法的一个核心优势在于其统一的视角：它不是对语义多样且复杂的表面形式进行道德话语建模，而是提供了一个基于推理负荷设计实用推理过程的原则性框架。实验证据表明，我们的实用方法能够使LLMs获得道德敏感性，并在多个任务中有效泛化。

英文摘要

Moral sensitivity is the most fundamental capability underlying human moral competence. Although many approaches aim to align large language models (LLMs) with human moral values, they primarily focus on fitting the distributions of morally appropriate texts while overlooking how to enable moral sensitivity acquisition in LLMs. In this paper, we take a step toward addressing the question: How can moral sensitivity be acquired in LLMs? Specifically, we propose a pragmatic inference approach that facilitates moral sensitivity acquisition in LLMs by enabling them to diagnose and correct moral errors. A central strength of our pragmatic inference approach lies in its unified perspective: rather than modeling moral discourses across semantically diverse and complex surface forms, it provides a principled framework for designing pragmatic inference procedures grounded in their inferential load. Empirical evidence demonstrates that our pragmatic approach can enable moral sensitivity acquisition in LLMs and generalizes effectively across tasks.

URL PDF HTML ☆

赞 0 踩 0

2603.16654 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Omanic：迈向大语言模型多跳推理的逐步评估

Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

发表机构 * The University of Tokyo（东京大学）； Yale University（耶鲁大学）； Stanford University（斯坦福大学）； Xiaomi EV（小米EV）； Soongsil University（顺天大学）

AI总结针对大语言模型在多跳问答中中间步骤推理失败难以诊断的问题，提出Omanic基准，通过分解为单跳子问题并分析步骤级错误，揭示后期跳数瓶颈、事实知识下限和错误传播，微调后提升多个推理基准性能。

详情

AI中文摘要

仅从最终答案评估大语言模型（LLM）的推理能力可能会掩盖中间步骤的失败，尤其是在没有步骤级标注的多跳问答基准中。为解决这一问题，我们引入了Omanic，一个开放域4跳问答基准，它不仅用于衡量最终答案的准确性，还用于诊断推理在何处中断。Omanic包含10,296个机器生成的训练示例（OmanicSynth）和967个经专家审核的人工标注评估示例（OmanicBench），每个评估问题被分解为单跳子问题、中间答案和结构化图拓扑。对专有和开源LLM的实验表明，Omanic具有挑战性，而逐步分析揭示了后期跳数瓶颈、事实知识下限以及沿推理链的错误传播。在OmanicSynth上微调可迁移到六个推理和数学基准，平均提升7.41分，验证了其作为推理能力迁移监督的有效性。我们在https://huggingface.co/datasets/li-lab/Omanic 发布数据，在https://github.com/XiaojieGu/Omanic 发布代码。

英文摘要

Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, especially in multi-hop QA benchmarks without step-level annotations. To address this gap, we introduce Omanic, an open-domain 4-hop QA benchmark designed not only to measure final-answer accuracy but also to diagnose where reasoning breaks down. Omanic contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench), with each evaluation question decomposed into single-hop sub-questions, intermediate answers, and structured graph topologies. Experiments with proprietary and open-source LLMs show that Omanic is challenging, while step-wise analysis reveals a later-hop bottleneck, factual knowledge floor, and error propagation along reasoning chains. Fine-tuning on OmanicSynth transfers to six reasoning and mathematics benchmarks, yielding a 7.41-point average gain and validating its effectiveness as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

URL PDF HTML ☆

赞 0 踩 0

2603.13853 2026-05-27 cs.CL cs.AI 版本更新

APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation

APEX-Searcher: 通过子目标细化信用分配以增强智能体检索增强生成

Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； MAIS, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所MAIS部）； Wenge Technology Co., Ltd（Wenger科技有限公司）

AI总结针对复杂多跳问答中检索路径模糊和端到端强化学习奖励稀疏的问题，提出APEX-Searcher，通过分离规划与执行的信用分配（规划用RL优化、执行用SFT学习），在多个基准上取得一致提升。

详情

AI中文摘要

检索增强生成（RAG）将大型语言模型（LLMs）与外部知识连接起来，但单轮检索通常不足以应对复杂的多跳问题。为了增强复杂任务的搜索能力，大多数现有工作通过端到端训练将多轮迭代检索与推理过程相结合。虽然这些方法提高了问题解决性能，但它们仍然面临任务推理和模型训练方面的挑战，尤其是模糊的检索执行路径和端到端强化学习（RL）中的稀疏奖励，这可能导致不准确的检索结果和较低的性能。我们将这些失败归因于层次化的信用纠缠：单一的最终奖励同时更新规划和执行，因此模型无法清晰地区分规划错误和检索错误。我们提出APEX-Searcher，它采用了一种细化信用分配的范式：规划通过带有规划级奖励的RL进行优化，而执行则通过SFT学习。大量实验表明，在多跳RAG和任务规划基准上均取得了一致的提升。

英文摘要

Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insufficient for complex multi-hop questions. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL), which can lead to inaccurate retrieval results and lower performance. We attribute these failures to hierarchical credit entanglement: a single final reward updates planning and execution together, so the model cannot clearly separate plan errors from retrieval errors. We propose APEX-Searcher, which uses a Refining Credit Assignment paradigm: planning is optimized by RL with a plan-level reward, while execution is learned by SFT. Extensive experiments show consistent gains in both multi-hop RAG and task planning across benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2603.12754 2026-05-27 cs.CL 版本更新

A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

一种从语义标注语料库学习大规模计算构式语法的方法

Paul Van Eecke, Katrien Beuls

发表机构 * Artificial Intelligence Laboratory（人工智能实验室）； Vrije Universiteit Brussel（布鲁塞尔自由大学）； Faculté d’informatique（计算机科学系）； Université de Namur（纳慕尔大学）

AI总结提出一种从语义标注语料库自动学习大规模、广覆盖计算构式语法的方法，生成包含数万构式的网络，支持开放域文本的框架语义分析并揭示句法-语义使用模式。

Comments Accepted for oral presentation at CoNLL 2026

详情

AI中文摘要

我们提出了一种从语言使用语料库中学习大规模、广覆盖构式语法的方法。该方法从带有成分结构和语义框架标注的话语出发，促进学习可解释的计算构式语法，捕捉句法结构与所表达语义关系之间的复杂关联。生成的语法由在流体构式语法框架内形式化的数万个构式网络组成。这些语法不仅支持开放域文本的框架语义分析，还蕴含了关于学习数据中句法-语义使用模式的大量信息。该方法及学习到的语法有助于基于使用的构式主义语言方法的规模化，因为它们证实了若干基本构式语法猜想的可扩展性，同时为广覆盖语料库中英语论元结构的构式主义研究提供了实用工具。

英文摘要

We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.

URL PDF HTML ☆

赞 0 踩 0

2603.03585 2026-05-27 cs.CL cs.AI 版本更新

Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Belief-Sim：迈向信念驱动的人口统计错误信息易感性模拟

Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas

发表机构 * University of Michigan - Ann Arbor（密歇根大学安娜堡分校）； Texas State University（德克萨斯州立大学）

AI总结提出BeliefSim框架，利用心理学分类和调查先验构建人口信念档案，通过提示条件化和后训练适应，实现基于信念模拟人口统计错误信息易感性，对齐度达92%。

Comments Paper Under Review

2603.03194 2026-05-27 cs.CL cs.SE 版本更新

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

BeyondSWE：当前代码代理能否超越单仓库错误修复？

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

发表机构 * GitHub

AI总结提出BeyondSWE基准测试，评估代码代理在跨仓库、领域特定、依赖迁移和文档生成等复杂软件工程任务上的表现，发现现有代理在利用外部信息进行精确代码修改方面仍存在显著不足。

Comments Benchmark: https://huggingface.co/datasets/AweAI-Team/BeyondSWE. Repo: https://github.com/AweAI-Team/BeyondSWE. Scaffold: https://github.com/AweAI-Team/AweAgent

详情

AI中文摘要

当前的代码代理基准主要评估单个目标仓库内的局部问题解决能力，而许多需要外部知识或更广泛仓库级变更的软件工程任务仍未得到充分测试。我们引入了BeyondSWE，这是一个包含500个实例的基准测试，来自246个真实世界的GitHub仓库，用于评估超越单仓库错误修复的代码代理。BeyondSWE涵盖了四种代表性场景：跨仓库问题解决、领域特定问题解决、依赖驱动的迁移以及文档到仓库的生成，涵盖了更广泛的知识范围和解决范围。我们的评估显示，BeyondSWE远未饱和：基于OpenHands的最佳代理达到了46.12的平均分数，而使用GPT-5.4（xhigh）的最强Codex harness在搜索感知提示下达到了56.65。为了研究外部信息访问是否能缩小这一差距，我们使用SearchSWE作为搜索增强编码的受控诊断基线。搜索访问改善了大多数模型，并对某些任务有显著帮助，但收益仍然有限且不均衡，表明当前代理仍然难以将检索到的信息转化为精确、版本兼容且局部可操作的代码更改。这些结果表明，深度编码搜索仍然是一个开放问题：进展需要代理能够可靠地将外部证据与仓库局部推理和基于执行的验证结合起来。

英文摘要

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, leaving under-tested many software engineering tasks that require external knowledge or broader repository-level changes. We introduce BeyondSWE, a 500-instance benchmark drawn from 246 real-world GitHub repositories to evaluate code agents beyond single-repository bug fixing. BeyondSWE covers four representative settings: cross-repository issue resolution, domain-specific issue resolution, dependency-driven migration, and document-to-repository generation, spanning both broader knowledge scope and broader resolution scope. Our evaluation shows that BeyondSWE remains far from saturated: the best OpenHands-based agent reaches 46.12 average score, while the strongest Codex harness with GPT-5.4 (xhigh) reaches 56.65 under a search-aware prompt. To study whether external information access closes this gap, we use SearchSWE as a controlled diagnostic baseline for search-augmented coding. Search access improves most models and substantially helps some tasks, but the gains remain limited and uneven, showing that current agents still struggle to convert retrieved information into precise, version-compatible, and locally actionable code changes. These results suggest that deep search for coding remains an open problem: progress requires agents that can reliably combine external evidence with repository-local reasoning and execution-based verification.

URL PDF HTML ☆

赞 0 踩 0

2601.09001 2026-05-27 cs.CL 版本更新

Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

熵哨兵：基于STEM解码熵迹的连续LLM准确性监控

Pedro Memoli Buffa, Luciano Del Corro

发表机构 * Departamento de Matematica, FCEyN Universidad de Buenos Aires（数学系，布宜诺斯艾利斯大学）； ELIAS Lab, Departamento de Ingeniería Universidad de San Andres（ELIAS实验室，圣安德烈斯大学）

AI总结提出利用输出熵迹作为推理时信号，通过轻量分类器预测实例正确性并聚合为领域级准确性估计，在STEM推理基准上验证了其用于监控和数据采集的有效性。

详情

AI中文摘要

部署LLM引发两个耦合挑战：(1)监控——在流量和领域漂移时估计模型表现不佳的位置；(2)改进——优先获取数据以缩小最大的性能差距。我们测试推理时信号能否在领域偏移下估计切片级准确性。对于每个响应，我们从最终层下一个词元概率（来自top-$k$ logprobs）计算输出熵迹，并用不同统计量汇总。一个轻量分类器预测实例正确性，平均预测概率得到领域级准确性估计。我们在十个STEM推理基准上进行了详尽的训练/测试组合（$k\in\{1,2,3,4\}$；所有$inom{10}{k}$组合），在来自六个系列（3B--20B）的九个LLM上评估不同分类器模型和特征。估计值通常跟踪保留的基准准确性，并且多个模型显示领域近乎单调的排序，为输出熵迹作为可扩展监控和针对性数据采集的可访问信号提供了证据。

英文摘要

Deploying LLMs raises two coupled challenges: (1) monitoring--estimating where a model underperforms as traffic and domains drift--and (2) improvement--prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-$k$ logprobs) and summarize it with different statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions ($k\in\{1,2,3,4\}$; all $\binom{10}{k}$ combinations), on different classifier models and features across nine LLMs from six families (3B--20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains, providing evidence for output-entropy profiles being an accessible signal for scalable monitoring and for targeted data acquisition.

URL PDF HTML ☆

赞 0 踩 0

2603.01327 2026-05-27 cs.SE cs.CL cs.LG 版本更新

SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution

SWE-Adept：基于LLM的深度代码库分析与结构化问题解决代理框架

Kang He, Kaushik Roy

发表机构 * Electrical and Computer Engineering, Purdue University（电子工程与计算机工程系，普渡大学）

AI总结提出SWE-Adept双代理框架，通过代理引导的深度优先搜索进行代码定位，结合自适应规划和结构化问题解决，在SWE-Bench上提升端到端解决率最多4.3%。

详情

AI中文摘要

大型语言模型（LLM）在独立编程任务上表现出色，但在仓库级软件工程（SWE）中仍面临挑战，这需要（1）深度代码库导航与有效上下文管理以实现准确定位，以及（2）系统化的迭代、测试驱动代码修改以解决问题。为应对这些挑战，我们提出SWE-Adept，一个基于LLM的双代理框架，其中定位代理识别与问题相关的代码位置，解决代理实施相应的修复。对于问题定位，我们引入代理引导的深度优先搜索，选择性遍历代码依赖关系。这最小化了代理上下文窗口中的问题无关内容，提高了定位准确性。对于问题解决，我们采用自适应规划和结构化问题求解。我们为代理配备了用于进度跟踪和基于Git的版本控制的专用工具。这些工具与共享工作记忆交互，该记忆存储按执行步骤索引的代码状态检查点，便于精确的检查点检索。这种设计实现了可靠的代理驱动版本控制操作，用于系统化问题解决，包括分支以探索替代方案和回滚失败的编辑。在SWE-Bench Lite和SWE-Bench Pro上的实验表明，SWE-Adept在问题定位和解决方面均持续优于先前方法，端到端解决率提升最多4.3%。

英文摘要

Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context management for accurate localization, and (2) systematic approaches for iterative, test-driven code modification to resolve issues. To address these challenges, we propose SWE-Adept, an LLM-based two-agent framework where a localization agent identifies issue-relevant code locations and a resolution agent implements the corresponding fixes. For issue localization, we introduce agent-directed depth-first search that selectively traverses code dependencies. This minimizes issue-irrelevant content in the agent's context window and improves localization accuracy. For issue resolution, we employ adaptive planning and structured problem solving. We equip the agent with specialized tools for progress tracking and Git-based version control. These tools interface with a shared working memory that stores code-state checkpoints indexed by execution steps, facilitating precise checkpoint retrieval. This design enables reliable agent-driven version-control operations for systematic issue resolution, including branching to explore alternative solutions and reverting failed edits. Experiments on SWE-Bench Lite and SWE-Bench Pro demonstrate that SWE-Adept consistently outperforms prior approaches in both issue localization and resolution, improving the end-to-end resolve rate by up to 4.3%.

URL PDF HTML ☆

赞 0 踩 0

2602.22190 2026-05-27 cs.LG cs.AI cs.CL 版本更新

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

GUI-Libra：训练原生GUI代理进行推理与行动——基于动作感知监督和部分可验证强化学习

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, Tong Zhang

发表机构 * UIUC（伊利诺伊大学香槟分校）； Microsoft（微软）； UNC-Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结提出GUI-Libra训练方案，通过动作感知SFT和部分可验证RL中的KL正则化，解决GUI代理在长程导航任务中推理与定位冲突及部分可验证性问题，显著提升步骤准确率和任务完成率。

Comments 57 pages, 17 figures

详情

AI中文摘要

开源原生GUI代理在长程导航任务上仍落后于闭源系统。这一差距源于两个限制：缺乏高质量、动作对齐的推理数据，以及直接采用忽视GUI代理独特挑战的通用后训练流程。我们识别出这些流程中的两个基本问题：(i) 带有CoT推理的标准SFT常损害定位能力，(ii) 逐步RLVR式训练面临部分可验证性，即多个动作可能正确但仅有一个示范动作用于验证。这使得离线逐步指标成为在线任务成功的弱预测器。在本工作中，我们提出GUI-Libra，一种定制化训练方案以应对这些挑战。首先，为缓解动作对齐推理数据的稀缺性，我们引入数据构建和过滤流程，并发布精心整理的81K GUI推理数据集。其次，为调和推理与定位，我们提出动作感知SFT，混合推理后动作和直接动作数据，并重新加权token以强调动作和定位。第三，为在部分可验证性下稳定RL，我们识别出RLVR中KL正则化被忽视的重要性，并证明KL信任域对改善离线到在线可预测性至关重要；我们进一步引入成功自适应缩放以降低不可靠负梯度的权重。在多种Web和移动基准测试中，GUI-Libra一致地提升了步骤准确率和端到端任务完成率。我们的结果表明，精心设计的后训练和数据整理可以在无需昂贵在线数据收集的情况下，释放显著更强的任务解决能力。我们发布数据集、代码和模型，以促进对具备推理能力的GUI代理的数据高效后训练的进一步研究。

英文摘要

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-tyle training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.

URL PDF HTML ☆

赞 0 踩 0

2510.07231 2026-05-27 cs.CL cs.AI 版本更新

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

EconCausal: 面向大语言模型的上下文感知经济推理基准

Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim

发表机构 * Graduate School of Data Science, KAIST（韩国科学技术院数据科学研究生院）； College of Business, KAIST（韩国科学技术院商学院）； Data Science for Humanity Group, MPI-SP（马克斯·普朗克所际数据科学为人类集团）； School of Computing, KAIST（韩国科学技术院计算学院）； Division of Social Science, HKUST（香港科技大学社会科学系）

AI总结提出EconCausal基准，包含从顶级经济金融期刊提取的10,490个上下文标注因果三元组，评估大语言模型在指定上下文中推断因果方向及随上下文变化调整判断的能力。

详情

AI中文摘要

社会经济因果效应高度依赖于制度和环境背景。相同的干预措施在不同监管制度、市场条件、时间段或人群中可能产生不同甚至相反的效果。这对大语言模型（LLM）在决策支持角色中提出了挑战：它们能否在指定上下文中推断因果效应的方向，并在上下文变化时修正该判断？为此，我们引入了EconCausal，这是一个大规模基准，包含从顶级经济和金融期刊的2,595项高质量实证研究中提取的10,490个上下文标注因果三元组，通过严格的四阶段流程构建，包括多轮共识、上下文细化和多批评者过滤。跨模型实验表明，LLM往往无法根据上下文调整其预测。虽然顶级模型在固定、显式上下文中达到88%的准确率，但在需要跨上下文修正符号的情况下，准确率下降32.6个百分点（从73.9%降至41.3%），一旦引入误导性的符号证据，准确率降至50%以下。模型还过度承诺于方向性（+/-）符号，仅在13.8%的情况下识别出零效应，且在这些类别上校准不良。数据集和基准公开于 https://anonymous.4open.science/r/econcausal-benchmark-6F12。

英文摘要

Socio-economic causal effects depend heavily on their institutional and environmental contexts. The same intervention can produce different, even opposite, effects across regulatory regimes, market conditions, time periods, or populations. This poses a challenge for large language models (LLMs) in decision-support roles: can they infer the direction of a causal effect under a specified context, and revise that judgment when the context changes? To address this, we introduce EconCausal, a large-scale benchmark of 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies in top-tier economics and finance journals, constructed through a rigorous four-stage pipeline with multi-run consensus, context refinement, and multi-critic filtering. Across models, LLMs often fail to condition their predictions on context. While top models reach 88% accuracy in fixed, explicit contexts, accuracy falls by 32.6~pp on cases that require revising the sign across contexts (73.9% to 41.3%), and drops below 50% once misleading signed evidence is introduced. Models also over-commit to directional (+/-) signs, recognizing null effects only 13.8% of the time while remaining poorly calibrated on these categories. The dataset and benchmark are publicly available at https://anonymous.4open.science/r/econcausal-benchmark-6F12.

URL PDF HTML ☆

赞 0 踩 0

2602.17443 2026-05-27 cs.CL 版本更新

AIDG: A Formal Decomposition of Information Extraction and Containment Asymmetries in Multi-Turn LLM Dialogue

AIDG：多轮LLM对话中信息提取与包含不对称性的形式化分解

Adib Sakhawat, Fardeen Sadab, Rakin Shahriar

发表机构 * Department of Computer Science and Engineering（计算机科学与工程系）

AI总结提出AIDG框架，将多轮对抗对话形式化为部分可观察随机博弈，分解提取与包含角色，揭示防御性能聚类而攻击性能分散等不对称性。

Comments 20 pages, 5 figures, 13 tables. Includes appendix and supplementary materials

详情

AI中文摘要

多轮LLM评估通常报告为单一胜率标量，混淆了不同能力。我们引入AIDG（对抗信息推断游戏），将多轮对抗对话形式化为双人部分可观察随机博弈（POSG），并沿着搜索者（提取）和持有者（包含）角色分解性能。该分解隔离了三种失败模式：合作先验泄漏、约束推理干扰和低效假设空间遍历。在六个前沿LLM的439场游戏中，防御性能紧密聚类（sigma = 1.9 ELO），而攻击性能差异显著（sigma = 53.3 ELO）；确认框架使提取几率比无信息推断高7.75倍（p < 0.00001）；约束违规占推断失败的41.3%，与规模无关（rho = 0.0）。我们将包含优于提取的差距定位为局部可解的防御决策与全局耦合的攻击规划的可测量结果，而非令人惊讶的发现，并使用该分解将差距归因于每个模型。所有设计选择，包括轮次衰减加权和Bradley-Terry评级模型，均源自明确假设。

英文摘要

Multi-turn LLM evaluation is typically reported as a single win-rate scalar, conflating distinct capabilities. We introduce AIDG (Adversarial Information Deduction Game), formalizing multi-turn adversarial dialogue as a two-player partially observable stochastic game (POSG) and decomposing performance along Seeker (extraction) and Holder (containment) roles. The decomposition isolates three failure modes: cooperative-prior leakage, constraint-reasoning interference, and inefficient hypothesis-space traversal. Across 439 games over six frontier LLMs, defensive performance is tightly clustered (sigma = 1.9 ELO) while offensive performance varies substantially (sigma = 53.3 ELO); confirmation framing increases extraction odds 7.75x over uninformed deduction (p < 0.00001); and constraint violations account for 41.3% of deductive failures, uncorrelated with scale (rho = 0.0). We position the containment-over-extraction gap not as a surprising finding but as a measurable consequence of locally resolvable defensive decisions versus globally coupled offensive planning, and use the decomposition to attribute the gap per model. All design choices, including turn-decay weighting and the Bradley-Terry rating model, are derived from explicit assumptions.

URL PDF HTML ☆

赞 0 踩 0

2503.18288 2026-05-27 cs.CL 版本更新

从知识到推理：形式化GlobalHealthAtlas上的专业公共卫生推理

Zhaokun Yan, Shan Xu, Wuzheng Dong, Zhaohan Liu, Lijie Feng, Chengxiao Dai, Chen Tianqi, Binfan Liu, Yunpu Ma, Wenting Wei, Yingting Li, Yi Zhang, Tongning Wu

发表机构 * China Academy of Information and Communications Technology（中国信息通信技术研究院）； CRRC Industrial Academy Co., Ltd.（CRRC工业学院有限公司）； The University of Sydney（悉尼大学）； Ludwig Maximilian University of Munich（慕尼黑路德维希-马克西米利安大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Beijing University of Posts and Telecommunications（北京邮电大学）； Shanghai Institute of Infectious Disease and Biosecurity（上海传染病与生物安全研究院）； School of Public Health, Fudan University（复旦大学公共卫生学院）

AI总结为解决公共卫生推理缺乏结构化监督信号和基准的问题，提出大规模多语言数据集GlobalHealthAtlas（280,210实例，15领域，17语言），并构建LLM辅助的构建与质量控制流水线及领域对齐评估器，支持安全关键型公共卫生推理的LLM训练与评估。

详情

Journal ref: ICML 2026 regular

AI中文摘要

公共卫生推理需要基于科学证据、专家共识和安全约束的群体层面推理。然而，作为一个结构化的机器学习问题，它仍然未被充分探索，且缺乏监督信号和基准。我们引入了GlobalHealthAtlas，一个大规模多语言数据集，包含280,210个实例，涵盖15个公共卫生领域和17种语言。我们进一步提出了一种大语言模型（LLM）辅助的构建和质量控制流水线，包括检索、去重、证据基础检查和标签验证，以提高大规模数据的一致性。最后，我们提出了一个从不同LLM的高置信度判断中提炼的领域对齐评估器，用于评估输出在六个维度上的表现：准确性、推理、完整性、共识一致性、术语规范性和洞察力。这些贡献共同使得LLM在安全关键的公共卫生推理中的可重复训练和评估成为可能，超越了传统的问答基准。我们公开发布了项目代码库、评估器和模型，网址为：https://github.com/Jan8217/GlobalHealthAtlas, https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Evaluator 和 https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Model。

英文摘要

Public health reasoning requires population level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages. We further propose a large language model (LLM) assisted construction and quality control pipeline with retrieval, deduplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain aligned evaluator distilled from high confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety critical public health reasoning beyond conventional QA benchmarks. We publicly release project codebase, evaluator, and model at:: https://github.com/Jan8217/GlobalHealthAtlas, https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Evaluator and https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Model

URL PDF HTML ☆

赞 0 踩 0

2601.20796 2026-05-27 cs.CL cs.LG 版本更新

Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

解析多模态上下文学习：现代Transformer中的模态不对称性与电路动力学

Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata

发表机构 * Technical University of Munich（慕尼黑技术大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； DeepMind（深Mind）； Beijing University of Posts（北京邮电大学）

AI总结通过可控实验，研究现代Transformer中多模态上下文学习的基本机制，发现模态间学习不对称性，并揭示其背后的归纳式电路机制。

Comments ICML 2026 Spotlight

详情

AI中文摘要

基于Transformer的多模态大语言模型通常展现出上下文学习（ICL）能力。受此现象启发，我们提出疑问：Transformer如何从上下文示例中跨模态关联信息？我们通过在合成分类任务上训练的小型Transformer进行可控实验来研究这一问题，从而能够精确操控数据统计和模型架构。我们首先重新审视现代Transformer中单模态ICL的核心原理。虽然多个先前发现得以复现，但我们发现旋转位置编码（RoPE）提高了ICL的数据复杂度阈值。扩展到多模态设置揭示了一个基本的学习不对称性：当在来自主要模态的高多样性数据上预训练时，次要模态中令人惊讶的低数据复杂度就足以使多模态ICL出现。机制分析表明，两种设置都依赖于一种归纳式机制，该机制从匹配的上下文示例中复制标签；多模态训练则跨模态细化和扩展这些电路。我们的发现为理解现代Transformer中的多模态ICL提供了机制基础，并为未来研究引入了一个可控的测试平台。代码可在 https://github.com/YiranHuangIrene/multimodal-icl 获取。

英文摘要

Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increases the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation. Code is available at: https://github.com/YiranHuangIrene/multimodal-icl

URL PDF HTML ☆

赞 0 踩 0

2601.18904 2026-05-27 cs.SD cs.AI cs.CL 版本更新

MetaSICL: Adapting Audiroty LLM via Meta Speech In-Context Learning

MetaSICL: 通过元语音上下文学习适应听觉大语言模型

Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson

发表机构 * University of Illinois Urbana Champaign（伊利诺伊大学厄巴纳-香槟分校）； Tsinghua University（清华大学）

AI总结提出MetaSICL方法，利用高资源语音数据通过元学习增强听觉大语言模型的上下文学习能力，在低资源场景下优于直接微调。

详情

AI中文摘要

听觉大语言模型在广泛的语音和音频理解任务中表现出强大的性能。然而，当应用于低资源任务时，它们常常遇到困难。如果域内标注数据稀缺或与真实测试分布不匹配，直接微调可能不稳定。上下文学习通过基于少量域内示例的条件化来适应听觉大语言模型，提供了一种无需训练、推理时的解决方案。在这项工作中，我们首先表明，$ extit{Vanilla ICL}$ 在选定的模型上提高了跨多种语音和音频任务的零样本性能，这表明这种ICL适应能力可以推广到多模态设置。在此基础上，我们提出了$ extbf{Meta Speech In-Context Learning (MetaSICL)}$，这是一种后训练方法，仅利用来自各种任务的高资源语音数据，旨在增强模型的上下文学习能力。实验表明，我们提出的方法在低资源场景下优于直接微调。

英文摘要

Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource tasks. In case in-domain labeled data are scarce or mismatched with the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that $\textit{Vanilla ICL}$, improves zero-shot performance across diverse speech and audio tasks for selected models which suggest that this ICL adaptation capability can be generalized to multimodal setting. Building on this, we propose $\textbf{Meta Speech In-Context Learning (MetaSICL)}$, a post-training recipe utilizes only high resource speech data from various tasks intending to strengthen model's in-context learning capability. Experiments indicate our proposed method outperforms direct fine-tuning in low-resource scenario.

URL PDF HTML ☆

赞 0 踩 0

2512.01556 2026-05-27 cs.AI cs.CL cs.LG 版本更新

LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems

LEC: 选择性预测与路由系统中基于选择条件风险控制的线性期望约束

Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

发表机构 * University of Electronic Science and Technology of China（电子科技大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Shandong University（山东大学）； Tongji University（同济大学）； City University of Hong Kong（香港城市大学）

AI总结提出LEC框架，通过线性期望约束将选择性预测转化为决策问题，在可交换性假设下利用校准集计算风险约束下的保留最大化阈值，并扩展到双模型路由系统，实现选择条件误差控制。

Comments Accepted by ICML 2026 Regular

详情

AI中文摘要

基础模型常常生成不可靠的答案，而启发式不确定性估计器无法完全区分正确与错误输出，导致用户在没有统计保证的情况下接受错误答案。我们通过选择条件风险控制来解决这个问题，旨在确保接受的预测的错误概率不超过用户指定的风险水平。为此，我们提出了LEC，一个原则性框架，将选择性预测重新定义为由选择和错误指标上的线性期望约束控制的决策问题。该公式直接控制接受错误期望数与接受预测期望数之间的比率，这对应于选择条件下的边际错误概率。在可交换性下，我们推导出一个仅依赖于保留校准集的有限样本充分条件，从而能够计算风险约束下的保留最大化阈值。此外，我们将LEC扩展到双模型路由系统：如果主模型的不确定性超过其校准阈值，则输入被委托给后续模型，同时保持系统级的选择条件误差控制。在封闭式和开放式问答（QA）以及视觉问答（VQA）上的实验表明，LEC在接受的预测中维持了规定的风险水平，并且与基线相比显著提高了样本保留率。

英文摘要

Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrect outputs, causing users to accept erroneous answers without any statistical guarantee. We address this problem through selection-conditioned risk control, aiming to ensure that an accepted prediction has an error probability no larger than a user-specified risk level. To this end, we propose LEC, a principled framework that reframes selective prediction as a decision problem governed by a linear expectation constraint over selection and error indicators. This formulation directly controls the ratio between the expected number of accepted errors and the expected number of accepted predictions, which corresponds to the marginal error probability conditioned on selection. Under exchangeability, we derive a finite-sample sufficient condition that relies only on a held-out calibration set, enabling the computation of a risk-constrained, retention-maximizing threshold. Furthermore, we extend LEC to two-model routing systems: if the primary model's uncertainty exceeds its calibrated threshold, the input is delegated to a subsequent model, while maintaining system-level selection-conditioned error control. Experiments on both closed-ended and open-ended question answering (QA) and vision question answering (VQA) demonstrate that LEC maintains the prescribed risk level in accepted predictions and substantially improves sample retention compared to baselines.

URL PDF HTML ☆

赞 0 踩 0

2601.08267 2026-05-27 cs.CL 版本更新

Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning

Med-CoReasoner: 通过语言感知的协同推理减少医学推理中的语言差异

Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang, Ding Xia, Piyalitt Ittichaiwong, Kanyakorn Veerakanjana, Hyunjae Kim, Qingyu Chen, Edison Marrese Taylor, Kazuma Kobayashi, Akiko Aizawa, Irene Li

发表机构 * The University of Tokyo（东京大学）； ETH Zürich（苏黎世联邦理工学院）； National Institute of Informatics（日本信息处理学会）； Siriraj Informatics and Data Innovation Center（Siriraj信息与数据创新中心）； Yale University（耶鲁大学）

AI总结提出Med-CoReasoner框架，通过并行英语和本地语言推理、结构化概念抽象及概念级对齐与检索，将本地临床知识整合到英语逻辑框架中，以缩小医学推理中的多语言差距，在MultiMed-X基准上平均提升5%的多语言推理性能。

详情

AI中文摘要

尽管推理增强的大语言模型在英语医学任务上表现强劲，但多语言差距仍然存在，本地语言的推理能力明显较弱，限制了全球医疗部署的公平性。为弥合这一差距，我们引入了Med-CoReasoner，一种语言感知的协同推理框架，它引出平行的英语和本地语言推理，将其抽象为结构化概念，并通过概念级对齐和检索将本地临床知识整合到英语逻辑框架中。这种设计结合了英语推理的结构稳健性和本地语言编码的实践基础专业知识。为评估超越选择题设置的多语言医学推理，我们构建了MultiMed-X基准，涵盖七种语言，包含专家标注的长文本问答和自然语言推理任务，每种语言350个实例。在三个基准上的实验表明，Med-CoReasoner平均提高了5%的多语言推理性能，在低资源语言上提升尤为显著。此外，模型蒸馏和专家评估分析进一步证实，Med-CoReasoner产生了临床合理且文化扎根的推理轨迹。

英文摘要

While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.

URL PDF HTML ☆

赞 0 踩 0

2601.08146 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

超越迁移准确率：用于受控低资源适应的忠实电路

Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya

发表机构 * Monash University Indonesia（印度尼西亚墨尔本大学）； Institute Teknologi Bandung（Bandung理工大学）； MBZUAI（MBZUAI研究所）； Boston University（波士顿大学）

AI总结提出基于上下文分解的电路发现方法（CD-T），通过标签平衡激活均值和任务方向相关性评分实现无反事实电路发现，并利用电路目标监督微调（CT-SFT）在低资源跨语言情感迁移中最小化灾难性遗忘，优于全局微调。

详情

AI中文摘要

现有的电路发现方法依赖于具有干净反事实的模板化任务，限制了它们在多样化自然文本上的使用。我们通过标签平衡激活均值和任务方向相关性评分，将上下文分解方法适配到非结构化设置（CD-T），实现了无反事实的电路发现。我们利用这些电路进行电路目标监督微调（CT-SFT），将参数更新限制在任务相关的注意力头和层归一化上。在NusaX跨语言情感迁移上的实验表明，CT-SFT在低资源适应中极具竞争力。虽然非电路稀疏更新和全微调有时通过能力招募达到目标准确率，但CT-SFT独特地最小化灾难性遗忘，保留了源语言和相关任务的性能。在XNLI上的扩展证实了这些发现在更广泛的任务和模型家族中成立，表明电路目标适应提供了一种更安全、基于因果关系的全局微调替代方案。

英文摘要

Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt Contextual Decomposition for Transformers (CD-T) for unstructured settings via label-balanced activation means and task-directional relevance scoring, enabling counterfactual-free circuit discovery. We leverage these circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), restricting parameter updates to task-relevant heads and LayerNorm. Experiments on NusaX cross-lingual sentiment transfer show that CT-SFT is highly competitive for low-resource adaptation. While non-circuit sparse updates and full fine-tuning sometimes match target accuracy through capacity recruitment, CT-SFT uniquely minimizes catastrophic forgetting, preserving source-language and related-task performance. Extensions to XNLI confirm these findings hold across broader tasks and model families, demonstrating that circuit-targeted adaptation provides a safer, causally grounded alternative to global fine-tuning.

URL PDF HTML ☆

赞 0 踩 0

2511.02360 2026-05-27 cs.CV cs.CL 版本更新

LaRe: Latent Refocusing for Multimodal Reasoning

LaRe: 用于多模态推理的潜在重聚焦

Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络与信息安全学院）

AI总结提出LaRe范式，在潜在空间内进行视觉重聚焦，结合语义增强训练，在提升推理准确率的同时大幅减少推理所需token数。

详情

AI中文摘要

思维链推理通过分解复杂任务提升逻辑性能，但其多模态扩展面临权衡。主流的“用图像思考”范式通过显式裁剪图像区域实现视觉重聚焦，但导致计算开销快速增长。新兴的潜在空间推理范式减少了token消耗，但缺乏动态重聚焦能力。我们认为这种权衡源于一个默认前提：有效的视觉重聚焦必须以显式token的形式发生。基于此，我们提出潜在重聚焦（LaRe），一种新的多模态推理范式，其中视觉重聚焦完全在潜在空间内进行。我们进一步设计了一种语义增强训练策略，通过视觉重建目标确保潜在空间的语义结构。实验评估表明，与现有基线相比，LaRe将平均准确率提高了7.6%，同时将推理所需的token数量减少了59.7%。当扩展到8B参数的视觉语言模型骨干时，LaRe实现了与最先进方法相当的性能，证明了我们提出的潜在重聚焦范式在多模态推理中的有效性。

EHRSummarizer：一种隐私感知、FHIR原生的源接地EHR摘要参考架构

Houman Kazemzadeh, Nima Minaifar, Kamyar Naderi, Sho Tabibzadeh

发表机构 * MedLedger365 ； MedConnect365 ； Xylemed ； Kypath Associates Inc.

AI总结提出一种隐私感知、FHIR原生的参考架构EHRSummarizer，通过检索HL7 FHIR R4资源并约束生成源接地摘要，以支持临床病历审查。

Comments 15 pages, 2 figures, 2 tables. Version 2 clarifies missing-data status handling, medication-status ambiguity, controlled narrative-document handling, source-grounded resource grouping, and future source-to-summary traceability

详情

AI中文摘要

临床医生通常需要浏览碎片化的电子健康记录（EHR）界面，以整合患者问题、用药、近期就诊和纵向趋势的连贯图像。本文描述了EHRSummarizer，一种用于结构化EHR摘要的隐私感知、FHIR原生参考架构。该架构检索一组目标性的高收益HL7 FHIR R4资源，将其标准化为临床上下文包，并使用受约束的摘要阶段生成源接地摘要，旨在支持病历审查。该架构进一步阐明了缺失数据状态处理、用药状态模糊性、在可用时对叙述性临床文档的受控使用，以及未来的源到摘要可追溯性。本文描述的是参考架构和原型行为，而非经过验证的临床干预、自主临床决策支持系统或临床获益证据。在合成和测试FHIR环境上的原型演示展示了端到端行为和输出格式；然而，本文未报告临床结果、受控工作流研究或基准结果。我们概述了一个评估计划，重点关注忠实性、遗漏风险、时间正确性、可用性、隐私和操作监控，以指导未来的机构评估。

英文摘要

Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems, medications, recent encounters, and longitudinal trends. This manuscript describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture for structured EHR summarization. The architecture retrieves a targeted set of high-yield HL7 FHIR R4 resources, normalizes them into a clinical context package, and uses a constrained summarization stage to produce source-grounded summaries intended to support chart review. The architecture further clarifies missing-data status handling, medication-status ambiguity, controlled use of narrative clinical documents when available, and future source-to-summary traceability. The manuscript describes a reference architecture and prototype behavior rather than a validated clinical intervention, autonomous clinical decision-support system, or evidence of clinical benefit. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes, controlled workflow studies, or benchmark results. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, privacy, and operational monitoring to guide future institutional assessment.

URL PDF HTML ☆

赞 0 踩 0

2601.00575 2026-05-27 cs.CL 版本更新

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

InfoSynth: 信息引导的大语言模型基准合成

Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

发表机构 * UC Berkeley（加州大学伯克利分校）

AI总结提出基于信息论（KL散度和熵）的InfoSynth框架，自动生成高难度、多样化的Python编程基准，97%的测试用例和解决方案准确。

详情

AI中文摘要

SEAL: 面向知识图谱对话问答的自我演进智能体学习

Hao Wang, Jialun Zhong, Changcheng Wang, Zhujun Nie, Zheng Li, Shunyu Yao, Yanzeng Li, Xinchi Li

发表机构 * Institute of Big Data and Artificial Intelligence, China Telecom Research Institute（大数据与人工智能研究院，中国电信研究院）； Wangxuan Institute of Computer Technology, Peking University（王宣计算机技术研究所，北京大学）； School of Artificial Intelligence, China University of Geosciences (Beijing)（人工智能学院，中国地质大学（北京））； Center for Cognition and Neuroergonomics, State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University（认知与神经工效学中心，认知神经科学与学习国家重点实验室，北京师范大学）； Institute of Artificial Intelligence and Future Networks, Beijing Normal University（人工智能与未来网络研究院，北京师范大学）

AI总结提出SEAL两阶段语义解析框架，通过自我演进智能体学习解决知识图谱对话问答中的指代消解、上下文依赖和复杂逻辑推理问题，在SPICE基准上达到最先进性能。

Comments Accept by NeuroComputing

详情

AI中文摘要

基于知识的对话问答（KBCQA）在解决指代消解、上下文依赖建模和执行复杂逻辑推理方面面临持续挑战。现有方法通常存在不准确性和高昂的计算成本，尤其是在处理大规模知识图谱上的复杂查询时。具体而言，大型语言模型（LLM）倾向于为复杂的多跳或聚合查询生成语法无效或语义错位的逻辑形式，而传统的实体-关系链接方法则面临候选空间指数级增长的问题。为了解决这些限制，我们引入了SEAL，一种基于自我演进智能体学习的新型两阶段语义解析框架。在第一阶段，LLM提取一个捕获核心语义的最小S表达式核心，然后通过智能体校准模块进行修正，以纠正语法不一致性并将实体和关系与知识图谱对齐。第二阶段采用基于问题类型预测的模板补全来构建完全可执行的S表达式。关键的是，SEAL包含一种自我演进机制，将局部和全局记忆与反射模块相结合，能够从对话历史和执行反馈中持续适应，而无需显式重新训练。在SPICE基准上的大量实验表明，SEAL在多跳推理、比较和聚合任务中实现了最先进的性能，验证了在结构准确性和计算效率方面的显著提升。

英文摘要

Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches often suffer from inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. Specifically, large language models (LLMs) tend to generate syntactically invalid or semantically misaligned logical forms for complex multi-hop or aggregation queries, while conventional entity-relation linking methods face an exponentially growing candidate space. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, an LLM extracts a minimal S-expression core capturing the essential semantics, which is then refined by an agentic calibration module to correct syntactic inconsistencies and align entities and relations with the knowledge graph. The second stage employs template-based completion guided by question-type prediction to construct a fully executable S-expression. Crucially, SEAL incorporates a self-evolving mechanism integrating local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance in multi-hop reasoning, comparison, and aggregation tasks, validating notable gains in both structural accuracy and computational efficiency.

URL PDF HTML ☆

赞 0 踩 0

2506.09532 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Athena: 利用数据高效的过程奖励模型增强多模态推理

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

发表机构 * Advanced Micro Devices Inc.（先进微器件公司）； The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））

AI总结提出 Athena-PRM，一种多模态过程奖励模型，通过利用弱和强完成者之间的预测一致性高效生成高质量过程标签，在仅5000样本下显著提升复杂推理问题的逐步评估性能。

Comments TMLR 2026, https://openreview.net/forum?id=unWmplHccF

详情

AI中文摘要

我们提出了 Athena-PRM，一种多模态过程奖励模型（PRM），旨在评估解决复杂推理问题中每一步的奖励分数。开发高性能的PRM通常需要大量的时间和资金投入，主要因为需要推理步骤的逐步标注。传统的自动标注方法，如蒙特卡洛估计，通常会产生噪声标签并带来巨大的计算成本。为了高效生成高质量的过程标注数据，我们提出利用弱和强完成者之间的预测一致性作为识别可靠过程标签的标准。值得注意的是，Athena-PRM 在仅5000个样本的情况下，在各种场景和基准测试中展现出卓越的效果。此外，我们还开发了两种有效策略来提升PRM的性能：ORM初始化和负数据上采样。我们在三个具体场景中验证了我们的方法：测试时扩展的验证、推理步骤正确性的直接评估以及奖励排序微调。我们的 Athena-PRM 在多个基准测试和场景中持续取得优越性能。值得注意的是，当使用 Qwen2.5-VL-7B 作为策略模型时，Athena-PRM 在 WeMath 上提升了10.2个百分点，在 MathVista 上提升了7.1个百分点（测试时扩展）。此外，Athena-PRM 在 VisualProcessBench 上取得了最先进（SoTA）结果，比之前的 SoTA 高出3.9个F1分数，展示了其准确评估推理步骤正确性的强大能力。另外，利用 Athena-PRM 作为奖励模型，我们通过奖励排序微调开发了 Athena-7B，在五个基准测试上以显著优势超越了基线。

英文摘要

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling. Furthermore, Athena-PRM sets the state-of-the-art (SoTA) results in VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score, showcasing its robust capability to accurately assess the correctness of the reasoning step. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning and outperforms baseline with a significant margin on five benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2511.14683 2026-05-27 cs.CL 版本更新

Quadratic Term Correction on Heaps' Law

Heap定律的二次项修正

Oscar Fontanelli, Wentian Li

AI总结针对Heap定律在双对数坐标下仍呈轻微凹形的问题，提出二次函数拟合方法，并通过二十部英文小说验证，发现线性系数略大于1、二次系数约为-0.02，且曲率与“伪方差”相关。

Comments 3 figures

详情

AI中文摘要

Heap或Herdan定律通过幂律函数表征词类型与词例之间的关系，该函数在线性-线性尺度上是凹的，但在双对数尺度上是直线。然而，即使在双对数尺度上，类型-词例曲线仍轻微凹形，使幂律关系失效。作为下一阶近似，我们通过二十部英文小说（部分从其他语言翻译成英文）证明，双对数尺度下的二次函数能完美拟合类型-词例数据。对log(类型)-log(词例)数据同时包含线性和二次项的回归分析一致地得出线性系数略大于1，二次系数约为-0.02。利用“从袋中有放回地随机抽取彩色球”模型，我们证明双对数尺度的曲率等于一个负的“伪方差”。尽管当词例数量较大时，由于伪权重值较大，伪方差计算可能遇到数值不稳定性，但该形式为词例数量较小时提供了曲率的粗略估计。

英文摘要

Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we have shown, by twenty English novels or writings (some are translated from another language to English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and quadratic term consistently lead to a linear coefficient of slightly larger than 1, and a quadratic coefficient around -0.02. Using the ``random drawing colored ball from the bag with replacement" model, we have shown that the curvature of the log-log scale is identical to a ``pseudo-variance" which is negative. Although a pseudo-variance calculation may encounter numeric instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimation of the curvature when the number of tokens is small.

URL PDF HTML ☆

赞 0 踩 0

2510.17759 2026-05-27 cs.CR cs.CL cs.CV cs.LG stat.ML 版本更新

VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

VERA-V：用于破解视觉语言模型的变分推断框架

Qilin Liao, Anamika Lochab, Ruqi Zhang

发表机构 * Department of Computer Science, Purdue University, USA（美国普渡大学计算机科学系）

AI总结提出VERA-V变分推断框架，通过联合后验分布生成隐蔽的文本-图像对抗输入，以系统性地发现视觉语言模型的多模态漏洞，在多个基准上攻击成功率最高提升53.75%。

Comments 18 pages, 7 Figures,

详情

AI中文摘要

视觉语言模型（VLM）通过视觉推理扩展了大语言模型，但其多模态设计也引入了新的、未被充分探索的漏洞。现有的多模态红队方法主要依赖脆弱的模板，专注于单一攻击设置，并且仅暴露了漏洞的一小部分。为了解决这些限制，我们引入了VERA-V，一个变分推断框架，将多模态越狱发现重新表述为学习配对文本-图像提示的联合后验分布。这种概率视角使得能够生成绕过模型防护的隐蔽、耦合的对抗输入。我们训练一个轻量级攻击者来近似后验分布，从而能够高效采样多样化的越狱方法，并提供对漏洞的分布性洞察。VERA-V进一步整合了三种互补策略：（i）基于排版的文本提示，嵌入有害线索；（ii）基于扩散的图像合成，引入对抗信号；（iii）结构化干扰物，分散VLM的注意力。在HarmBench和HADES基准上的实验表明，VERA-V在开源和前沿VLM上均持续优于最先进的基线方法，在GPT-4o上相比最佳基线实现了高达53.75%的攻击成功率（ASR）提升。我们在项目页面提供了代码，地址为：https://github.com/kxwhiowo/VERA-V

英文摘要

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o. We include the code on the project page available here: https://github.com/kxwhiowo/VERA-V

URL PDF HTML ☆

赞 0 踩 0

2510.06843 2026-05-27 cs.CL cs.AI 版本更新

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

自信号驱动的多LLM辩论以实现高效准确的推理

Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu

发表机构 * University of Cambridge（剑桥大学）； Sorbonne Université（索邦大学）； University of Science and Technology of China（中国科学技术大学）； Beihang University（北航大学）； Nanyang Technological University（南洋理工大学）

AI总结提出一种利用模型级置信度和token级语义焦点两种自信号来自适应引导多LLM辩论过程的方法，在提高准确性的同时减少token消耗。

详情

AI中文摘要

大型语言模型（LLMs）在 diverse 应用领域展现了令人印象深刻的能力。最近的工作探索了多LLM智能体辩论（MAD），通过使多个LLM迭代讨论和细化响应来增强性能。然而，现有的MAD方法主要关注利用外部结构（如辩论图）和LLM作为评判者，而忽略了生成过程中出现的自信号（如token logits和注意力）。这种遗漏导致了冗余计算和潜在的性能下降。在本文中，我们将重点转移到多LLM辩论的自信号上，并引入了一种自信号驱动的多LLM辩论（SID），它利用两种类型的自信号：模型级置信度和token级语义焦点，来自适应地引导辩论过程。我们的方法使高置信度智能体能够在模型级别提前退出，并基于注意力机制压缩冗余辩论内容。我们在多个具有挑战性的基准测试上，对各种LLMs和多模态LLMs评估了我们的方法。实验结果表明，我们的方法不仅在准确性上优于现有的MAD技术，而且还减少了token消耗，突显了利用自信号在提高多智能体辩论系统的性能和效率方面的有效性。我们的代码将在~\href{https://github.com/xuhang2019/SID}{ exttt{https://github.com/xuhang2019/SID}} 上提供。

英文摘要

Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at~\href{https://github.com/xuhang2019/SID}{\texttt{https://github.com/xuhang2019/SID}}.

URL PDF HTML ☆

赞 0 踩 0

2510.05864 2026-05-27 cs.CL cs.CY 版本更新

On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs

关于指令微调大语言模型对长输入中有害句子的敏感性

Faeze Ghorbanpour, Alexander Fraser

发表机构 * School of Computation, Information and Technology, TU Munich（计算、信息与技术学院，慕尼黑工业大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心（MCML））

AI总结通过构建长输入并系统变化长度、有害比例、显隐性和位置，研究LLM对稀疏嵌入有害句子的敏感性，发现敏感性非单调、随长度下降、早期位置优先、显性危害更易识别。

详情

AI中文摘要

大型语言模型（LLM）越来越多地处理长输入，但当有害句子稀疏地嵌入其中时，其行为仍知之甚少。我们提出了一种敏感性分析，探究LLM如何提取嵌入在长输入中的有害句子。我们通过组合中性和有害句子构建长输入，并系统变化四个因素：输入长度（600–30,000个token）、有害句子比例（0.01–0.50）、危害实现方式（显性与隐性）以及有害句子在输入中的位置（开头、中间、结尾），从而进行受控压力测试评估。针对有毒、冒犯和仇恨内容，以及LLaMA-3.1、Qwen-2.5和Mistral的实验揭示了一致模式：敏感性相对于有害流行率是非单调的，在中等水平达到峰值；敏感性随输入长度增加而下降；位于输入较早位置的有害句子被更强烈地优先处理；显性危害比隐性危害更可靠地被识别。这些发现提供了在受控压力条件下LLM如何优先处理长输入中有害句子的系统视角，突出了安全相关应用中新兴的优势和持续的挑战。

英文摘要

Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct long inputs by combining neutral and harmful sentences, and systematically vary four factors: input length (600--30,000 tokens), the proportion of harmful sentences (0.01--0.50), harm realization (explicit vs. implicit), and the position of harmful sentences within the input (beginning, middle, end), enabling a controlled stress-test evaluation. Experiments across toxic, offensive, and hate content, and across LLaMA-3.1, Qwen-2.5, and Mistral, reveal consistent patterns: sensitivity is non-monotonic with respect to harmful prevalence, peaking at moderate levels; sensitivity degrades as input length increases; harmful sentences placed earlier in the input are more strongly prioritized; and explicit harm is more reliably identified than implicit harm. These findings provide a systematic view of how LLMs prioritize harmful sentences in long input under controlled stress conditions, highlighting both emerging strengths and remaining challenges for safety-related use.

URL PDF HTML ☆

赞 0 踩 0

2510.05141 2026-05-27 cs.CL 版本更新

To model human linguistic prediction, make LLMs less superhuman

为了模拟人类语言预测，让大语言模型不那么超人类

Byung-Doh Oh, Tal Linzen

发表机构 * Division of Linguistics and Multilingual Studies, Nanyang Technological University, Singapore（南洋理工大学语言学与多语言研究系，新加坡）； Department of Linguistics, New York University, New York, USA（纽约大学语言学系，纽约，美国）； Center for Data Science, New York University, New York, USA（纽约大学数据科学中心，纽约，美国）

AI总结本文指出大语言模型因超人类预测能力而无法解释人类阅读行为，主张通过模拟人类记忆来改进模型，并提出新实验方向。

Comments Accepted to Trends in Cognitive Sciences

2510.01833 2026-05-27 cs.AI cs.CL 版本更新

Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

先规划后行动：面向LLM推理的高层规划引导强化学习

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas

发表机构 * Case Western Reserve University, Cleveland, OH, USA（凯斯西储大学）； Kean University, Union, NJ, USA（凯恩大学）； The Ohio State University, Columbus, OH, USA（俄亥俄州立大学）； Fudan University, Shanghai, China（复旦大学）； Shanghai Artificial Intelligence Laboratory, Shanghai, China（上海人工智能实验室）； The University of Hong Kong, Hong Kong, China（香港大学）； North Carolina State University, Raleigh, NC, USA（北卡罗来纳州立大学）

AI总结提出PTA-GRPO两阶段框架，通过高层规划引导与强化学习联合优化，提升LLM在数学和自然科学推理任务中的准确性和泛化能力。

Comments 19 pages and 5 figures

详情

AI中文摘要

大型语言模型（LLMs）通过思维链（CoT）展现出强大的推理能力，但其token级别的生成倾向于局部决策，缺乏全局规划，常常导致冗余或不准确的推理。现有方法（如基于树的搜索和强化学习）试图解决这一问题，但计算成本高，且仍难以产生可靠的推理轨迹。为应对这些挑战，我们提出先规划后行动增强推理与组相对策略优化（PTA-GRPO），这是一个两阶段框架，旨在联合改进高层规划和细粒度CoT推理。具体而言，在第一阶段，给定LLM负责将CoT推理总结为紧凑的高层指导，然后用于监督微调。接着，我们引入一种指导感知的强化学习方法，联合优化最终输出和指导质量，提升推理效果。我们在数学和自然科学的十个推理基准上，使用五个覆盖多种数据模态的多样化基础模型进行评估。结果表明，PTA-GRPO在模型和任务上持续带来显著改进，展现出强大的有效性和泛化能力。

英文摘要

Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reasoning. Existing methods, such as tree-based search and reinforcement learning (RL), attempt to address this issue but incur high computational costs and still struggle to produce reliable reasoning trajectories. To address these challenges, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to jointly improve high-level planning and fine-grained CoT reasoning. Specifically, in the first stage, a given LLM is responsible for summarizing CoT reasoning into compact high-level guidance, which is then leveraged for supervised fine-tuning. Then, we introduce a guidance-aware reinforcement learning method that jointly optimizes the final output and the quality of guidance, enhancing reasoning effectiveness. We evaluate PTA-GRPO on ten reasoning benchmarks across mathematics and natural sciences, using five diverse base models spanning multiple data modalities. The results show that PTA-GRPO consistently delivers significant improvements across models and tasks, demonstrating strong effectiveness and generalization.

URL PDF HTML ☆

赞 0 踩 0

2510.01336 2026-05-27 cs.CL cs.AI cs.LG 版本更新

HiSpec: Hierarchical Speculative Decoding for LLMs

HiSpec: 分层推测解码用于大语言模型

Avinash Kumar, Sujay Sanghavi, Poulami Das

发表机构 * Department of Electrical and Computer Engineering, The University of Texas at Austin（德克萨斯大学奥斯汀分校电子与计算机工程系）

AI总结提出HiSpec框架，利用早期退出模型进行低开销中间验证，通过重用键值缓存和隐藏状态提高吞吐量，平均加速1.28倍，最高2.01倍，且不损失准确性。

详情

AI中文摘要

推测解码通过使用较小的草稿模型推测令牌，再由较大的目标模型验证，从而加速LLM推理。验证通常是瓶颈（例如，当3B模型为70B目标模型推测时，验证速度比令牌生成慢4倍），但大多数先前工作只关注加速草稿生成。“中间”验证通过早期丢弃不准确的草稿令牌来减少验证时间，但现有方法在引入中间验证器时会产生大量训练开销，增加内存占用以协调中间验证步骤，并依赖近似启发式方法损害准确性。我们提出$\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$，一种高吞吐量推测解码框架，利用早期退出模型进行低开销中间验证。早期退出模型允许令牌通过跳过层遍历提前退出，并经过显式训练，使得选定层的隐藏状态可解释，从而在不显著增加计算和内存开销的情况下，非常适合中间验证。为了进一步提高资源效率，我们设计了一种方法，使HiSpec能够在草稿模型、中间验证器和目标模型之间重用键值缓存和隐藏状态。为了保持准确性，HiSpec定期针对目标模型验证中间验证器接受的草稿令牌。我们在各种代表性基准和模型上的评估表明，与基线单层推测相比，HiSpec平均提高吞吐量1.28倍，最高达2.01倍，且不损失准确性。

英文摘要

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. $\textit{``Intermediate"}$ verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose $\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$, a framework for high-throughput speculative decoding that exploits $\textit{early-exit (EE) models}$ for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.

URL PDF HTML ☆

赞 0 踩 0

2509.26600 2026-05-27 cs.CL cs.AI 版本更新

PersianMedQA: 在波斯语-英语双语医学问答基准上评估大型语言模型

Mohammad Javad Ranjbar Kalahroodi, Amirhossein Sheikholselami, Sepehr Karimi, Sepideh Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

发表机构 * School of Electrical and Computer Engineering, University of Tehran（德黑兰大学电气与计算机工程学院）； Shahid Beheshti University of Medical Sciences（沙希德·贝赫什提医学科学大学）； Institute for Research in Fundamental Sciences (IPM)（基础科学研究所（IPM））

AI总结本文提出PersianMedQA数据集，包含20,785道波斯语医学多选题，用于评估41个LLM在零样本和思维链设置下的双语医学推理能力，发现闭源通用模型表现最佳，而波斯语模型性能较差，且翻译会丢失文化临床线索。

Comments Accepted at LREC 2026 (The Fifteenth Language Resources and Evaluation Conference), Palma, Mallorca, Spain, May 2026

详情

DOI: 10.63317/3yixio7ngbkh

AI中文摘要

大型语言模型（LLM）在广泛的自然语言处理（NLP）基准测试中取得了显著性能，通常超越人类水平。然而，它们在医学等高风险领域中的可靠性，尤其是在低资源语言中，仍未得到充分探索。在这项工作中，我们引入了PersianMedQA，这是一个大规模数据集，包含来自14年伊朗国家医学考试的20,785道经过专家验证的波斯语医学多选题，涵盖23个医学专业，旨在评估LLM在波斯语和英语中的表现。我们对41个最先进的模型进行了基准测试，包括通用型、波斯语和医学LLM，在零样本和思维链（CoT）设置下。我们的结果表明，闭权重通用模型（例如GPT-4.1）持续优于所有其他类别，在波斯语中达到83.09%的准确率，在英语中达到80.7%，而波斯语LLM（例如Dorna）表现显著较差（例如波斯语中34.9%），通常在指令遵循和领域推理方面存在困难。我们还分析了翻译的影响，表明虽然英语性能通常更高，但3-10%的问题只能通过波斯语正确回答，因为翻译丢失了文化和临床上下文线索。最后，我们证明，如果没有强大的领域或语言适应，仅凭模型大小不足以获得稳健的性能。PersianMedQA为评估LLM中的双语和文化基础医学推理提供了基础。该数据集以及双语医学词典可在以下网址获取：https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA 。

英文摘要

Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale dataset of 20,785 expert-validated multiple-choice Persian medical questions from 14 years of Iranian national medical exams, spanning 23 medical specialties and designed to evaluate LLMs in both Persian and English. We benchmark 41 state-of-the-art models, including general-purpose, Persian, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-weight general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.09% accuracy in Persian and 80.7% in English, while Persian LLMs such as Dorna underperform significantly (e.g., 34.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, 3-10% of questions can only be answered correctly in Persian due to cultural and clinical contextual cues that are lost in translation. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating bilingual and culturally grounded medical reasoning in LLMs. The dataset, along with a bilingual medical dictionary, is available: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA .

URL PDF HTML ☆

赞 0 踩 0

2508.05305 2026-05-27 cs.CL 版本更新

SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

SONAR-LLM：在句子嵌入中思考、以令牌说话的自回归Transformer

Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Nikita Kurdiukov, Aysel Mirzoeva, Anna Borisiuk, Andrey Kuznetsov, Anton Razzhigaev

发表机构 * FusionBrain Lab, AXXX（融合大脑实验室，AXXX）； T-Tech ； MSU（莫斯科大学）； DS-NLP Group, AXXX（自然语言处理组，AXXX）

AI总结提出SONAR-LLM，一种在连续SONAR嵌入空间“思考”但通过令牌级交叉熵监督的解码器-only Transformer，融合语义抽象与似然训练信号，在39M至1.3B参数规模上取得有竞争力的生成质量。

2506.23149 2026-05-27 cs.CL 版本更新

AlignEvoSkill: Towards Knowledge-Aware and Task-Aligned Agent Skill Evolution

AlignEvoSkill: 迈向知识感知与任务对齐的智能体技能进化

Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

发表机构 * Singapore Management University（新加坡国立大学）

AI总结提出AlignEvoSkill框架，通过联合建模知识覆盖和任务对齐，从失败轨迹中识别知识标签、检索并适配候选技能，再基于知识覆盖和任务对齐分数筛选高质量技能，在3个基准和4个LLM骨干上相对提升34.7%，实现技能进化新SOTA且成本更低。

详情

AI中文摘要

可重用技能在提升基于LLM的智能体中扮演关键角色，但现有技能进化方法往往无法确保进化后的技能既覆盖任务所需的知识，又与目标任务保持对齐。结果，进化后的技能可能不完整或无关。为解决这一局限，我们提出AlignEvoSkill，一个联合建模知识覆盖和任务对齐的技能进化框架。给定失败的任务轨迹，AlignEvoSkill首先识别与任务相关的知识标签，检索互补的先前技能，并将它们适配为弥补缺失知识的候选技能。然后，它使用基于知识覆盖和任务对齐分数的联合过滤标准选择高质量候选技能。在3个基准和4个LLM骨干上的实验表明，AlignEvoSkill相对于非进化基线实现了34.7%的相对增益，并以更低的成本实现了技能进化的新SOTA。

英文摘要

Reusable skills play a key role in improving LLM-based agents, but existing skill-evolution methods often fail to ensure that evolved skills both cover the knowledge required by the task and remain aligned with the target task. As a result, evolved skills could be incomplete or irrelevant. To address this limitation, we propose AlignEvoSkill, a skill-evolution framework that jointly models knowledge coverage and task alignment. Given failed task trajectories, AlignEvoSkill first identifies task-relevant knowledge tags, retrieves complementary prior skills, and adapts them into candidate skills that address missing knowledge. It then selects high-quality candidates using a joint filtering criterion based on knowledge-coverage and task-alignment scores. Experiments on 3 benchmarks with4 LLM backbones show a 34.7% relative gain of AlignEvoSkill over the non-evolution baseline and achieves a new SOTA in skill evolution with lower cost.

URL PDF HTML ☆

赞 0 踩 0

2506.21443 2026-05-27 cs.CL cs.AI 版本更新

Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

领域知识增强的大语言模型用于欺诈和概念漂移检测

Ali Şenol, Garima Agrawal, Huan Liu

发表机构 * School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU)（计算与增强智能学院（SCAI），亚利桑那州立大学）； Department of Computer Engineering, Tarsus University（计算机工程系，塔鲁斯大学）； Minerva CQ and HumaConn AI Consulting（Minerva CQ和HumaConn人工智能咨询）； School of Computing and Augmented Intelligence (SCAI), Arizona State University（计算与增强智能学院（SCAI），亚利桑那州立大学）

AI总结提出一种领域知识增强的大语言模型框架，通过集成结构化领域知识和漂移检测单元，实现高准确率的欺诈对话检测和概念漂移分类。

详情

DOI: 10.3390/electronics15030534

AI中文摘要

在动态平台上检测欺骗性对话变得越来越困难，原因是语言模式的演变和概念漂移（CD）——即随着时间推移，语义或主题的转变会改变交互的上下文或意图。这些转变可能掩盖恶意意图或模仿正常对话，使得准确分类具有挑战性。尽管大语言模型（LLMs）在自然语言任务中表现出色，但在风险敏感场景中，它们常常面临上下文模糊和幻觉问题。为了解决这些挑战，我们提出了一个领域知识（DK）增强的LLM框架，该框架将预训练的LLM与结构化的、任务特定的见解相结合，以执行欺诈和概念漂移检测。所提出的架构由三个主要组件组成：（1）一个DK-LLM模块，用于检测虚假或欺骗性对话；（2）一个漂移检测单元（OCDD），用于判断是否发生了语义转变；（3）第二个DK-LLM模块，用于将漂移分类为良性或欺诈性。我们首先使用虚假评论数据集验证领域知识的价值，然后将我们的完整框架应用于SEConvo，一个包含多种欺诈和垃圾攻击的多轮对话数据集。结果表明，我们的系统能够高精度地检测虚假对话，并有效分类漂移的性质。在结构化提示的引导下，基于LLaMA的实现达到了98%的分类准确率。与零样本基线的对比研究表明，在高风险NLP应用中，融入领域知识和漂移意识显著提高了性能、可解释性和鲁棒性。

英文摘要

Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD)-i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

URL PDF HTML ☆

赞 0 踩 0

2502.14321 2026-05-27 cs.MA cs.CL 版本更新

Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

超越自我对话：基于LLM的多智能体系统的通信中心综述

Bingyu Yan, Zhibo Zhou, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, Zhoujun Li, Chaozhuo Li, Xiaoming Zhang

发表机构 * School of Cyber Science and Technology, Beihang University, Beijing 100191, China（信息科学与技术学院，北京航空航天大学，北京100191，中国）； The College of Cyber Security/College of Information Science and Technology, Jinan University, Guangzhou, China（网络安全学院/信息科学与技术学院，暨南大学，广州，中国）； School of Computer Science and Engineering, Beihang University, Beijing 100083, China（计算机科学与工程学院，北京航空航天大学，北京100083，中国）； School of Cyber Science and Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China（信息科学与技术学院，北京邮电大学，北京100876，中国）

AI总结本文从通信视角综述基于大语言模型的多智能体系统，提出一个整合系统级和系统内部通信的结构化框架，分析智能体交互机制、挑战及未来方向。

Comments The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: {10.1007/s11704-026-50857-y}

详情

DOI: 10.1007/s11704-026-50857-y

AI中文摘要

基于大语言模型的多智能体系统因其在复杂、协作和智能问题解决方面的潜力，近期获得了显著关注。现有综述通常根据应用领域或架构对基于LLM的多智能体系统（LLM-MAS）进行分类，忽略了通信在协调智能体行为和交互中的核心作用。为弥补这一空白，本文从通信中心的视角对LLM-MAS进行了全面综述。具体而言，我们提出了一个结构化框架，将系统级通信（架构、目标和协议）与系统内部通信（策略、范式、对象和内容）相结合，从而详细探索智能体如何交互、协商并实现集体智能。通过对近期文献的广泛分析，我们识别了多个维度的关键组件，并总结了其优势与局限性。此外，我们强调了当前挑战，包括通信效率、安全漏洞、基准测试不足和可扩展性问题，并指出了有前景的未来研究方向。本综述旨在帮助研究人员和实践者清晰理解LLM-MAS中的通信机制，从而促进鲁棒、可扩展且安全的多智能体系统的设计与部署。

英文摘要

Large language model-based multi-agent systems have recently gained significant attention due to their potential for complex, collaborative, and intelligent problem-solving capabilities. Existing surveys typically categorize LLM-based multi-agent systems (LLM-MAS) according to their application domains or architectures, overlooking the central role of communication in coordinating agent behaviors and interactions. To address this gap, this paper presents a comprehensive survey of LLM-MAS from a communication-centric perspective. Specifically, we propose a structured framework that integrates system-level communication (architecture, goals, and protocols) with system internal communication (strategies, paradigms, objects, and content), enabling a detailed exploration of how agents interact, negotiate, and achieve collective intelligence. Through an extensive analysis of recent literature, we identify key components in multiple dimensions and summarize their strengths and limitations. In addition, we highlight current challenges, including communication efficiency, security vulnerabilities, inadequate benchmarking, and scalability issues, and outline promising future research directions. This review aims to help researchers and practitioners gain a clear understanding of the communication mechanisms in LLM-MAS, thereby facilitating the design and deployment of robust, scalable, and secure multi-agent systems.

URL PDF HTML ☆

赞 0 踩 0

2506.03627 2026-05-27 cs.CL cs.AI 版本更新

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

提示的鲁棒性：增强大型语言模型对抗提示攻击的鲁棒性

Lin Mu, Guowei Chu, Li Ni, Lei Sang, Yiwen Zhang

发表机构 * School of Computer Science and Technology, Anhui University（安徽大学计算机科学与技术学院）

AI总结提出RoP（提示鲁棒性）策略，通过错误校正和引导两个阶段，增强LLM对输入扰动的鲁棒性，在算术、常识和逻辑推理任务上显著提升性能。

Comments Accepted by IEEE Transactions on Artificial Intelligence

详情

AI中文摘要

大型语言模型（LLM）通过有效利用提示策略在各种任务中展现了卓越的性能。然而，它们对输入扰动高度敏感，例如拼写错误或轻微字符顺序错误，这些扰动会显著损害其性能。尽管在提示技术方面取得了进展，如思维链和自动提示生成，但开发一种明确减轻此类扰动负面影响的提示策略仍然是一个开放的挑战。为弥补这一差距，我们提出了提示鲁棒性（RoP），一种旨在增强LLM鲁棒性的新型提示策略。RoP包括两个阶段：错误校正和引导。在错误校正阶段，RoP应用多种扰动方法生成对抗样本，用于生成自动纠正输入错误的提示。在引导阶段，RoP基于校正后的输入生成最优引导提示，引导模型生成更鲁棒和准确的推理。通过在算术、常识和逻辑推理任务上的全面实验，我们证明RoP显著提高了LLM对抗对抗扰动的鲁棒性。至关重要的是，与干净输入场景相比，它仅以最小的精度下降保持了模型准确性，从而将RoP确立为在实际应用中增强LLM鲁棒性的实用且有效的方法。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can significantly impair their performance. Despite advances in prompting techniques such as Chain-of-Thought and automatic prompt generation, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy aimed at enhancing the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are used to generate prompts that correct input errors automatically. In the Guidance stage, RoP generates an optimal guidance prompt based on the corrected input, guiding the model to generate more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Crucially, it preserves model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.

URL PDF HTML ☆

赞 0 踩 0

2505.17163 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

OCR-Reasoning基准：揭示MLLMs在复杂文本丰富图像推理中的真实能力

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

发表机构 * South China University of Technology（华南理工大学）； Huawei Technologies Co., Ltd（华为技术有限公司）

AI总结提出OCR-Reasoning基准，包含1069个人工标注样本，覆盖6种核心推理能力和18个实际推理任务，通过双标注（最终答案和逐步推理过程）评估多模态大语言模型在文本丰富图像推理中的能力，发现最先进模型准确率均低于50%。

Comments ICLR 2026

详情

AI中文摘要

近期多模态慢思考系统在各种视觉推理任务中表现出色。然而，由于缺乏专门且系统的基准，它们在文本丰富图像推理任务中的能力仍未得到充分研究。为填补这一空白，我们提出了OCR-Reasoning，一个新颖的基准，旨在系统评估多模态大语言模型在文本丰富图像推理任务上的表现。具体而言，OCR-Reasoning包含1069个人工标注的示例，涵盖文本丰富视觉场景中的6种核心推理能力和18个实际推理任务。与仅提供最终答案的现有文本丰富图像理解基准不同，本基准额外提供了详细的逐步推理过程。这种双标注使得能够同时评估模型的最终答案和推理过程，从而全面评估文本丰富推理能力。利用该基准，我们对最新的多模态大语言模型进行了全面评估。结果表明，即使是最先进的多模态大语言模型在文本丰富图像推理任务中也面临巨大困难，在我们的基准上没有一个模型的准确率超过50%，这表明文本丰富图像推理的挑战是一个亟待解决的问题。基准和评估脚本可在https://github.com/SCUT-DLVCLab/OCR-Reasoning获取。

英文摘要

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50\% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

URL PDF HTML ☆

赞 0 踩 0

2503.08600 2026-05-27 cs.CL 版本更新

NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

NSF-SciFy：从NSF资助数据库中挖掘科学主张

Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

AI总结提出NSF-SciFy数据集，从40万篇NSF摘要中提取280万条科学主张，通过零样本提示联合提取主张和研究提案，并在三个下游任务中微调语言模型取得显著提升。

Comments ACL 2026. 19 pages, 7 figures, 11 tables

详情

AI中文摘要

我们介绍了NSF-SciFy，这是一个从国家科学基金会（NSF）获奖摘要中提取的科学主张和研究提案的综合数据集。虽然以往的科学主张验证数据集在规模和范围上有限，但NSF-SciFy取得了重大进展，包含来自40万篇摘要的280万条主张，涵盖所有科学和数学学科。我们提出了两个重点子集：NSF-SciFy-MatSci，包含来自材料科学奖项的11.4万条主张；以及NSF-SciFy-20K，包含来自五个NSF理事会的13.5万条主张。使用零样本提示，我们开发了一种可扩展的方法来联合提取科学主张和研究提案。我们通过三个下游任务展示了该数据集的实用性：非技术性摘要生成、主张提取和研究提案提取。在我们的数据集上微调语言模型带来了显著改进，相对增益通常超过100%，特别是在主张和提案提取任务中。我们的错误分析表明，提取的主张具有高精度但召回率较低，这表明有进一步改进方法的机会。NSF-SciFy为大规模主张验证、科学发现追踪和元科学分析开辟了新的研究方向。代码和数据可在 https://github.com/darpa-scify/NSFSciFy 获取。

英文摘要

We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. While previous scientific claim verification datasets have been limited in size and scope, NSF-SciFy represents a significant advance with 2.8 million claims from 400,000 abstracts spanning all science and mathematics disciplines. We present two focused subsets: NSF-SciFy-MatSci with 114,000 claims from materials science awards, and NSF-SciFy-20K with 135,000 claims across five NSF directorates. Using zero-shot prompting, we develop a scalable approach for joint extraction of scientific claims and investigation proposals. We demonstrate the dataset's utility through three downstream tasks: non-technical abstract generation, claim extraction, and investigation proposal extraction. Fine-tuning language models on our dataset yields substantial improvements, with relative gains often exceeding 100%, particularly for claim and proposal extraction tasks. Our error analysis reveals that extracted claims exhibit high precision but lower recall, suggesting opportunities for further methodological refinement. NSF-SciFy enables new research directions in large-scale claim verification, scientific discovery tracking, and meta-scientific analysis. Code and data are available at https://github.com/darpa-scify/NSFSciFy.

URL PDF HTML ☆

赞 0 踩 0

2407.15073 2026-05-27 cs.AI cs.CL 版本更新

Multi-Agent Causal Discovery Using Large Language Models

多智能体因果发现使用大型语言模型

Hao Duong Le, Xin Xia, Haijie Xu, Chen Zhang

发表机构 * Department of Industrial Engineering, Tsinghua University（清华大学工业工程系）

AI总结提出多智能体因果发现框架MAC，通过元融合机制结合自主选择SCD算法的辩论编码模块和基于元数据的对抗性辩论模块，在多个基准上取得最优性能。

详情

AI中文摘要

因果发现旨在识别变量之间的因果关系，是各科学领域的基本问题。传统的统计因果发现（SCD）方法仅依赖观测数据，忽略元数据中可用的上下文信息，而近期基于LLM的方法利用元数据但将大型语言模型（LLM）视为单一智能体，使其判断易受记忆或偏见关联影响。为解决这一差距，我们引入MAC（多智能体因果发现框架），将因果发现转化为多智能体辩论与自主选择SCD算法相结合。MAC通过元融合机制桥接两个互补模块：辩论编码模块（DCM）通过自主选择并执行最合适的SCD算法将初始图基于数据，以及元辩论模块（MDM）通过对抗性的肯定-否定-裁判辩论基于元数据精炼图。在五个基准数据集和三个指标（F1、SHD、NHD）上，MAC在五个统计基线和四个基于LLM的基线中取得了最佳综合性能，在使用Gemini-2.0-Flash时在15个评估点中排名第一10次——包括完美重建地震图——并在三个骨干LLM上保持稳健。

英文摘要

Causal discovery aims to identify causal relationships between variables and is a fundamental problem across the sciences. Traditional statistical causal discovery (SCD) methods rely solely on observational data and ignore the contextual information available in metadata, whereas recent LLM-based methods exploit metadata but treat the large language model (LLM) as a single agent, leaving its judgments vulnerable to memorized or biased associations. To address this gap, we introduce MAC (Multi-Agent Causal Discovery Framework), which casts causal discovery as a multi-agent debate coupled with the autonomous selection of an SCD algorithm. MAC combines two complementary modules, bridged by a Meta Fusion mechanism: a Debate-Coding Module (DCM) that grounds an initial graph in data by autonomously selecting and executing the best-suited SCD algorithm, and a Meta-Debate Module (MDM) that refines the graph through an adversarial Affirmative-Negative-Judge debate over the metadata. Across five benchmark datasets and three metrics (F1, SHD, NHD), MAC achieves the best aggregate performance among five statistical and four LLM-based baselines, ranking first on 10 of 15 evaluation points with Gemini-2.0-Flash -- including a perfect reconstruction of the Earthquake graph -- and remains robust across three backbone LLMs.

URL PDF HTML ☆

赞 0 踩 0

2408.15787 2026-05-27 cs.CL cs.IR 版本更新

Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions

交互式智能体：通过角色扮演的LLM-LLM交互模拟咨询师-来访者心理咨询

Huachuan Qiu, Zhenzhong Lan

发表机构 * School of Engineering, Westlake University（西lake大学工程学院）

AI总结提出Interactive Agents框架，通过LLM-LLM角色扮演生成高质量心理咨询对话数据，解决隐私、成本和扩展性问题，并验证其治疗有效性和模型性能。

Comments Accepted to *SEM2026

详情

AI中文摘要

创建有效的心理健康支持对话系统需要高质量的多轮咨询对话数据，然而收集真实的咨询师-来访者对话面临重大挑战，包括隐私问题、高成本和有限的扩展性。我们提出了 extbf{Interactive Agents}，一个新颖的框架，通过受控的LLM-LLM交互模拟自然的咨询对话。该框架引入了两个关键创新：(1) 一个个性化的来访者智能体，在整个会话过程中保持一致的心理学特征；(2) 一个咨询师智能体，实现了一个基于理论的三阶段治疗模型，包括探索、洞察和行动阶段。通过使用自动评估指标和基于工作联盟清单的专业咨询师评估进行严格评估，我们证明我们的框架生成了具有治疗有效性的对话，其质量与人类生成的会话相当。在我们的合成数据集（SimPsyDial）上微调的模型，在基于LLM的咨询师的标准成对聊天机器人竞技场评估中达到了最先进的性能。我们的框架提供了一种可扩展、保护隐私的方法，用于生成高质量的咨询对话数据，同时保持专业的治疗标准。

英文摘要

Creating effective dialogue systems for mental health support requires high-quality multi-turn counseling dialogue data, yet collecting real counselor-client conversations presents significant challenges, including privacy concerns, high costs, and limited scalability. We present \textbf{Interactive Agents}, a novel framework that simulates naturalistic counseling dialogues through controlled LLM-to-LLM interactions. The framework introduces two key innovations: (1) a personalized client agent that maintains consistent psychological characteristics throughout a session, and (2) a counselor agent that implements a theoretically grounded three-stage therapeutic model comprising the exploration, insight, and action phases. Through rigorous evaluation using both automatic metrics and professional-counselor assessments based on the Working Alliance Inventory, we demonstrate that our framework generates therapeutically valid dialogues that are comparable in quality to human-generated sessions. Models fine-tuned on our proposed synthetic dataset (SimPsyDial) achieve state-of-the-art performance in a standard pairwise chatbot-arena evaluation of LLM-based counselors. Our framework provides a scalable, privacy-preserving method for generating high-quality counseling dialogue data while maintaining professional therapeutic standards.

URL PDF HTML ☆

赞 0 踩 0